Sample records for large sample sets

  1. The prevalence of terraced treescapes in analyses of phylogenetic data sets.

    PubMed

    Dobrin, Barbara H; Zwickl, Derrick J; Sanderson, Michael J

    2018-04-04

    The pattern of data availability in a phylogenetic data set may lead to the formation of terraces, collections of equally optimal trees. Terraces can arise in tree space if trees are scored with parsimony or with partitioned, edge-unlinked maximum likelihood. Theory predicts that terraces can be large, but their prevalence in contemporary data sets has never been surveyed. We selected 26 data sets and phylogenetic trees reported in recent literature and investigated the terraces to which the trees would belong, under a common set of inference assumptions. We examined terrace size as a function of the sampling properties of the data sets, including taxon coverage density (the proportion of taxon-by-gene positions with any data present) and a measure of gene sampling "sufficiency". We evaluated each data set in relation to the theoretical minimum gene sampling depth needed to reduce terrace size to a single tree, and explored the impact of the terraces found among replicate trees during bootstrap resampling. Terraces were identified in nearly all data sets with taxon coverage densities < 0.90. They were not found, however, in high-coverage-density (i.e., ≥ 0.94) transcriptomic and genomic data sets. The terraces could be very large, and size varied inversely with taxon coverage density and with gene sampling sufficiency. Few data sets achieved the theoretical minimum gene sampling depth needed to reduce terrace size to a single tree. Terraces found during bootstrap resampling reduced overall support. If certain inference assumptions apply, trees estimated from empirical data sets often belong to large terraces of equally optimal trees. Terrace size correlates with data set sampling properties. Data sets seldom include enough genes to reduce terrace size to one tree. When bootstrap replicate trees lie on a terrace, statistical support for phylogenetic hypotheses may be reduced. Although some of the published analyses surveyed were conducted with edge-linked inference models (which do not induce terraces), unlinked models have been used and advocated. The present study describes the potential impact of that inference assumption on phylogenetic inference in the context of the kinds of multigene data sets now widely assembled for large-scale tree construction.

  2. Sampling Large Graphs for Anticipatory Analytics

    DTIC Science & Technology

    2015-05-15

    Random area sampling [8] is a “snowball” sampling method in which a set of random seed vertices is selected and areas... Sampling Large Graphs for Anticipatory Analytics. Lauren Edwards, Luke Johnson, Maja Milosavljevic, Vijay Gadepally, Benjamin A. Miller, Lincoln... systems, greater human-in-the-loop involvement, or through complex algorithms. We are investigating the use of sampling to mitigate these challenges.
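
    For illustration, a minimal sketch of the snowball-style "random area sampling" idea mentioned in the snippet: pick a few random seed vertices and grow an area around each by breadth-first expansion. The adjacency-dict representation and the num_seeds/area_size parameters are illustrative assumptions, not the authors' implementation.

    ```python
    import random
    from collections import deque

    def random_area_sample(adj, num_seeds=3, area_size=50, rng=None):
        """Snowball-style sample: BFS outward from random seed vertices
        until roughly `area_size` vertices are collected per seed."""
        rng = rng or random.Random(0)
        sampled = set()
        seeds = rng.sample(list(adj), num_seeds)
        for seed in seeds:
            frontier, visited = deque([seed]), {seed}
            while frontier and len(visited) < area_size:
                v = frontier.popleft()
                for u in adj[v]:
                    if u not in visited:
                        visited.add(u)
                        frontier.append(u)
            sampled |= visited
        return sampled

    # Toy usage: a small ring graph represented as an adjacency dict.
    adj = {i: [(i - 1) % 100, (i + 1) % 100] for i in range(100)}
    print(len(random_area_sample(adj, num_seeds=2, area_size=10)))
    ```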

  3. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

    PubMed

    Yu, Qiang; Wei, Dingbang; Huo, Hongwei

    2018-06-18

    Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
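
    The (l, d) quorum condition defined above is easy to state in code. The brute-force sketch below checks whether a candidate l-mer occurs with at most d mismatches in at least q·t of the input sequences; it illustrates the qPMS definition only, not the SamSelect selection algorithm, and all names and toy sequences are illustrative.

    ```python
    def hamming(a, b):
        """Number of mismatching positions between two equal-length strings."""
        return sum(x != y for x, y in zip(a, b))

    def occurs_with_mismatches(motif, seq, d):
        """True if `motif` appears somewhere in `seq` with at most d mismatches."""
        l = len(motif)
        return any(hamming(motif, seq[i:i + l]) <= d
                   for i in range(len(seq) - l + 1))

    def satisfies_quorum(motif, sequences, d, q):
        """True if `motif` occurs (within d mismatches) in >= q*t sequences."""
        t = len(sequences)
        hits = sum(occurs_with_mismatches(motif, s, d) for s in sequences)
        return hits >= q * t

    seqs = ["ACGTACGTGG", "TTACGAACGT", "GGGGCCCCAA"]
    print(satisfies_quorum("ACGT", seqs, d=1, q=0.6))
    ```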

  4. An evaluation of sampling and full enumeration strategies for Fisher Jenks classification in big data settings

    USGS Publications Warehouse

    Rey, Sergio J.; Stephens, Philip A.; Laura, Jason R.

    2017-01-01

    Large data contexts present a number of challenges to optimal choropleth map classifiers. Application of optimal classifiers to a sample of the attribute space is one proposed solution. The properties of alternative sampling-based classification methods are examined through a series of Monte Carlo simulations. The impacts of spatial autocorrelation, number of desired classes, and form of sampling are shown to have significant impacts on the accuracy of map classifications. Tradeoffs between improved speed of the sampling approaches and loss of accuracy are also considered. The results suggest the possibility of guiding the choice of classification scheme as a function of the properties of large data sets.

  5. Cognitive Sex Differences in Reasoning Tasks: Evidence from Brazilian Samples of Educational Settings

    ERIC Educational Resources Information Center

    Flores-Mendoza, Carmen; Widaman, Keith F.; Rindermann, Heiner; Primi, Ricardo; Mansur-Alves, Marcela; Pena, Carla Couto

    2013-01-01

    Sex differences on the Attention Test (AC), the Raven's Standard Progressive Matrices (SPM), and the Brazilian Cognitive Battery (BPR5), were investigated using four large samples (total N=6780), residing in the states of Minas Gerais and Sao Paulo. The majority of samples used, which were obtained from educational settings, could be considered a…

  6. Large Sample Confidence Limits for Goodman and Kruskal's Proportional Prediction Measure TAU-b

    ERIC Educational Resources Information Center

    Berry, Kenneth J.; Mielke, Paul W.

    1976-01-01

    A Fortran Extended program which computes Goodman and Kruskal's Tau-b, its asymmetrical counterpart, Tau-a, and three sets of confidence limits for each coefficient under full multinomial and proportional stratified sampling is presented. A correction of an error in the calculation of the large sample standard error of Tau-b is discussed.…

  7. The CAMELS data set: catchment attributes and meteorology for large-sample studies

    NASA Astrophysics Data System (ADS)

    Addor, Nans; Newman, Andrew J.; Mizukami, Naoki; Clark, Martyn P.

    2017-10-01

    We present a new data set of attributes for 671 catchments in the contiguous United States (CONUS) minimally impacted by human activities. This complements the daily time series of meteorological forcing and streamflow provided by Newman et al. (2015b). To produce this extension, we synthesized diverse and complementary data sets to describe six main classes of attributes at the catchment scale: topography, climate, streamflow, land cover, soil, and geology. The spatial variations among basins over the CONUS are discussed and compared using a series of maps. The large number of catchments, combined with the diversity of the attributes we extracted, makes this new data set well suited for large-sample studies and comparative hydrology. In comparison to the similar Model Parameter Estimation Experiment (MOPEX) data set, this data set relies on more recent data, it covers a wider range of attributes, and its catchments are more evenly distributed across the CONUS. This study also involves assessments of the limitations of the source data sets used to compute catchment attributes, as well as detailed descriptions of how the attributes were computed. The hydrometeorological time series provided by Newman et al. (2015b, https://doi.org/10.5065/D6MW2F4D) together with the catchment attributes introduced in this paper (https://doi.org/10.5065/D6G73C3Q) constitute the freely available CAMELS data set, which stands for Catchment Attributes and MEteorology for Large-sample Studies.

  8. A large volume particulate and water multi-sampler with in situ preservation for microbial and biogeochemical studies

    NASA Astrophysics Data System (ADS)

    Breier, J. A.; Sheik, C. S.; Gomez-Ibanez, D.; Sayre-McCord, R. T.; Sanger, R.; Rauch, C.; Coleman, M.; Bennett, S. A.; Cron, B. R.; Li, M.; German, C. R.; Toner, B. M.; Dick, G. J.

    2014-12-01

    A new tool was developed for large volume sampling to facilitate marine microbiology and biogeochemical studies. It was developed for remotely operated vehicle and hydrocast deployments, and allows for rapid collection of multiple sample types from the water column and dynamic, variable environments such as rising hydrothermal plumes. It was used successfully during a cruise to the hydrothermal vent systems of the Mid-Cayman Rise. The Suspended Particulate Rosette V2 large volume multi-sampling system allows for the collection of 14 sample sets per deployment. Each sample set can include filtered material, whole (unfiltered) water, and filtrate. Suspended particulate can be collected on filters up to 142 mm in diameter and pore sizes down to 0.2 μm. Filtration is typically at flowrates of 2 L min-1. For particulate material, filtered volume is constrained only by sampling time and filter capacity, with all sample volumes recorded by digital flowmeter. The suspended particulate filter holders can be filled with preservative and sealed immediately after sample collection. Up to 2 L of whole water, filtrate, or a combination of the two, can be collected as part of each sample set. The system is constructed of plastics with titanium fasteners and nickel alloy spring loaded seals. There are no ferrous alloys in the sampling system. Individual sample lines are prefilled with filtered, deionized water prior to deployment and remain sealed unless a sample is actively being collected. This system is intended to facilitate studies concerning the relationship between marine microbiology and ocean biogeochemistry.

  9. Best Practices in Using Large, Complex Samples: The Importance of Using Appropriate Weights and Design Effect Compensation

    ERIC Educational Resources Information Center

    Osborne, Jason W.

    2011-01-01

    Large surveys often use probability sampling in order to obtain representative samples, and these data sets are valuable tools for researchers in all areas of science. Yet many researchers are not formally prepared to appropriately utilize these resources. Indeed, users of one popular dataset were generally found "not" to have modeled…

  10. Software engineering the mixed model for genome-wide association studies on large samples

    USDA-ARS?s Scientific Manuscript database

    Mixed models improve the ability to detect phenotype-genotype associations in the presence of population stratification and multiple levels of relatedness in genome-wide association studies (GWAS), but for large data sets the resource consumption becomes impractical. At the same time, the sample siz...

  11. bigSCale: an analytical framework for big-scale single-cell data.

    PubMed

    Iacono, Giovanni; Mereu, Elisabetta; Guillaumet-Adkins, Amy; Corominas, Roser; Cuscó, Ivon; Rodríguez-Esteban, Gustavo; Gut, Marta; Pérez-Jurado, Luis Alberto; Gut, Ivo; Heyn, Holger

    2018-06-01

    Single-cell RNA sequencing (scRNA-seq) has significantly deepened our insights into complex tissues, with the latest techniques capable of processing tens of thousands of cells simultaneously. Analyzing increasing numbers of cells, however, generates extremely large data sets, extending processing time and challenging computing resources. Current scRNA-seq analysis tools are not designed to interrogate large data sets and often lack sensitivity to identify marker genes. With bigSCale, we provide a scalable analytical framework to analyze millions of cells, which addresses the challenges associated with large data sets. To handle the noise and sparsity of scRNA-seq data, bigSCale uses large sample sizes to estimate an accurate numerical model of noise. The framework further includes modules for differential expression analysis, cell clustering, and marker identification. A directed convolution strategy allows processing of extremely large data sets, while preserving transcript information from individual cells. We evaluated the performance of bigSCale using both a biological model of aberrant gene expression in patient-derived neuronal progenitor cells and simulated data sets, which underlines the speed and accuracy in differential expression analysis. To test its applicability for large data sets, we applied bigSCale to assess 1.3 million cells from the mouse developing forebrain. Its directed down-sampling strategy accumulates information from single cells into index cell transcriptomes, thereby defining cellular clusters with improved resolution. Accordingly, index cell clusters identified rare populations, such as reelin ( Reln )-positive Cajal-Retzius neurons, for which we report previously unrecognized heterogeneity associated with distinct differentiation stages, spatial organization, and cellular function. Together, bigSCale presents a solution to address future challenges of large single-cell data sets. © 2018 Iacono et al.; Published by Cold Spring Harbor Laboratory Press.

  12. Time to stabilization in single leg drop jump landings: an examination of calculation methods and assessment of differences in sample rate, filter settings and trial length on outcome values.

    PubMed

    Fransz, Duncan P; Huurnink, Arnold; de Boode, Vosse A; Kingma, Idsart; van Dieën, Jaap H

    2015-01-01

    Time to stabilization (TTS) is the time it takes for an individual to return to a baseline or stable state following a jump or hop landing. A large variety exists in methods to calculate the TTS. These methods can be described based on four aspects: (1) the input signal used (vertical, anteroposterior, or mediolateral ground reaction force), (2) signal processing (smoothed by sequential averaging, a moving root-mean-square window, or fitting an unbounded third order polynomial), (3) the stable state (threshold), and (4) the definition of when the (processed) signal is considered stable. Furthermore, differences exist with regard to the sample rate, filter settings and trial length. Twenty-five healthy volunteers performed ten 'single leg drop jump landing' trials. For each trial, TTS was calculated according to 18 previously reported methods. Additionally, the effects of sample rate (1000, 500, 200 and 100 samples/s), filter settings (no filter, 40, 15 and 10 Hz), and trial length (20, 14, 10, 7, 5 and 3s) were assessed. The TTS values varied considerably across the calculation methods. The maximum effects of alterations in the processing settings, averaged over calculation methods, were 2.8% (SD 3.3%) for sample rate, 8.8% (SD 7.7%) for filter settings, and 100.5% (SD 100.9%) for trial length. Differences in TTS calculation methods are affected differently by sample rate, filter settings and trial length. The effects of differences in sample rate and filter settings are generally small, while trial length has a large effect on TTS values. Copyright © 2014 Elsevier B.V. All rights reserved.
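
    As a rough illustration of one family of TTS calculations described above (vertical ground reaction force smoothed with a moving RMS window and compared against a stability band), here is a minimal sketch; the window length, the use of the final second as the stable state, and the 5% band are illustrative assumptions, not the paper's definitions.

    ```python
    import numpy as np

    def time_to_stabilization(grf, fs, window_s=0.25, threshold_frac=0.05):
        """TTS sketch: smooth the vertical GRF with a moving RMS window and
        report the first time after which the signal stays within a band of
        +/- threshold_frac around the final (stable) value."""
        n_win = int(window_s * fs)
        mean_sq = np.convolve(grf ** 2, np.ones(n_win) / n_win, mode="same")
        rms = np.sqrt(mean_sq)
        baseline = rms[-int(fs):].mean()          # last second as the stable state
        band = threshold_frac * baseline
        outside = np.abs(rms - baseline) > band
        last_out = np.max(np.nonzero(outside)[0]) if outside.any() else 0
        return last_out / fs

    fs = 1000
    t = np.arange(0, 5, 1 / fs)
    grf = 700 + 300 * np.exp(-3 * t) * np.sin(40 * t)   # synthetic landing signal
    print(round(time_to_stabilization(grf, fs), 3), "s")
    ```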

  13. Decoder calibration with ultra small current sample set for intracortical brain-machine interface

    NASA Astrophysics Data System (ADS)

    Zhang, Peng; Ma, Xuan; Chen, Luyao; Zhou, Jin; Wang, Changyong; Li, Wei; He, Jiping

    2018-04-01

    Objective. Intracortical brain-machine interfaces (iBMIs) aim to restore efficient communication and movement ability for paralyzed patients. However, frequent recalibration is required for consistency and reliability, and every recalibration requires a relatively large current sample set. The aim of this study is to develop an effective decoder calibration method that can achieve good performance while minimizing recalibration time. Approach. Two rhesus macaques implanted with intracortical microelectrode arrays were trained separately on movement and sensory paradigms. Neural signals were recorded to decode reaching positions or grasping postures. A novel principal component analysis-based domain adaptation (PDA) method was proposed to recalibrate the decoder with only an ultra-small current sample set by taking advantage of large historical data, and the decoding performance was compared with that of three other calibration methods for evaluation. Main results. The PDA method closed the gap between historical and current data effectively, and made it possible to take advantage of large historical data for decoder recalibration in current data decoding. Using only an ultra-small current sample set (five trials of each category), the decoder calibrated using the PDA method achieved much better and more robust performance in all sessions than the other three calibration methods in both monkeys. Significance. (1) This study brings transfer learning theory into iBMI decoder calibration for the first time. (2) Unlike most transfer learning studies, the target data in this study were an ultra-small sample set and were transferred to the source data. (3) By taking advantage of historical data, the PDA method was demonstrated to be effective in reducing recalibration time for both the movement paradigm and the sensory paradigm, indicating a viable generalization. By reducing the demand for large current training data, this new method may facilitate the application of intracortical brain-machine interfaces in clinical practice.
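
    The abstract does not give algorithmic detail, so the sketch below shows only a generic PCA-subspace alignment flavor of the idea: learn principal components from the large historical set, project both data sets onto them, shift the small current set toward the historical mean, and then calibrate a simple decoder. This is a plain baseline, not the authors' PDA method; all array shapes, labels, and names are illustrative.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import RidgeClassifier

    rng = np.random.default_rng(0)
    X_hist = rng.normal(size=(2000, 96))            # large historical neural data
    y_hist = rng.integers(0, 4, size=2000)          # e.g. 4 reach targets
    X_curr = rng.normal(loc=0.5, size=(20, 96))     # ultra-small current set
    y_curr = rng.integers(0, 4, size=20)

    # Learn a low-dimensional subspace from the historical data only.
    pca = PCA(n_components=10).fit(X_hist)
    Z_hist = pca.transform(X_hist)

    # Project current data and shift its mean toward the historical mean
    # to reduce the between-session offset.
    Z_curr = pca.transform(X_curr)
    Z_curr -= Z_curr.mean(axis=0) - Z_hist.mean(axis=0)

    # Calibrate the decoder on historical data plus the few aligned current trials.
    decoder = RidgeClassifier().fit(np.vstack([Z_hist, Z_curr]),
                                    np.hstack([y_hist, y_curr]))
    print(decoder.score(Z_curr, y_curr))
    ```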

  14. Factors Affecting Adult Student Dropout Rates in the Korean Cyber-University Degree Programs

    ERIC Educational Resources Information Center

    Choi, Hee Jun; Kim, Byoung Uk

    2018-01-01

    Few empirical studies of adult distance learners' decisions to drop out of degree programs have used large enough sample sizes to generalize the findings or data sets drawn from multiple online programs that address various subjects. Accordingly, in this study, we used a large administrative data set drawn from multiple online degree programs to…

  15. Random sampling of elementary flux modes in large-scale metabolic networks.

    PubMed

    Machado, Daniel; Soons, Zita; Patil, Kiran Raosaheb; Ferreira, Eugénio C; Rocha, Isabel

    2012-09-15

    The description of a metabolic network in terms of elementary (flux) modes (EMs) provides an important framework for metabolic pathway analysis. However, their application to large networks has been hampered by the combinatorial explosion in the number of modes. In this work, we develop a method for generating random samples of EMs without computing the whole set. Our algorithm is an adaptation of the canonical basis approach, where we add an additional filtering step which, at each iteration, selects a random subset of the new combinations of modes. In order to obtain an unbiased sample, all candidates are assigned the same probability of getting selected. This approach avoids the exponential growth of the number of modes during computation, thus generating a random sample of the complete set of EMs within reasonable time. We generated samples of different sizes for a metabolic network of Escherichia coli, and observed that they preserve several properties of the full EM set. It is also shown that EM sampling can be used for rational strain design. A well distributed sample, that is representative of the complete set of EMs, should be suitable to most EM-based methods for analysis and optimization of metabolic networks. Source code for a cross-platform implementation in Python is freely available at http://code.google.com/p/emsampler. dmachado@deb.uminho.pt Supplementary data are available at Bioinformatics online.

  16. Improved variance estimation of classification performance via reduction of bias caused by small sample size.

    PubMed

    Wickenberg-Bolin, Ulrika; Göransson, Hanna; Fryknäs, Mårten; Gustafsson, Mats G; Isaksson, Anders

    2006-03-13

    Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval of the error rate using repeated design and test sets selected from available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore, different methods for small-sample performance estimation, such as a recently proposed procedure called Repeated Random Sampling (RSS), are also expected to result in heavily biased estimates, which in turn translates into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT). Our simulations reveal that repeated designs and tests based on resampling in a fixed bag of samples yield a biased variance estimate. We also demonstrate that it is possible to obtain an improved variance estimate by means of a procedure that explicitly models how this bias depends on the number of samples used for testing. For the special case of repeated designs and tests using new samples for each design and test, we present an exact analytical expression for how the expected value of the bias decreases with the size of the test set. We show that via modeling and subsequent reduction of the small sample bias, it is possible to obtain an improved estimate of the variance of classifier performance between design sets. However, the uncertainty of the variance estimate is large in the simulations performed, indicating that the method in its present form cannot be directly applied to small data sets.
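
    A minimal sketch of the repeated design/test procedure that the bias analysis above starts from: repeatedly split a fixed bag of samples into design and test sets, record the test error, and summarize its mean and variance. The bias-modeling correction that constitutes RIDT is not implemented here; the classifier and simulated data are illustrative.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 200))                 # small-sample, high-dimensional
    y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

    errors = []
    for seed in range(200):                         # repeated design/test cycles
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=15, random_state=seed, stratify=y)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        errors.append(1.0 - clf.score(X_te, y_te))

    # Naive estimates; the paper's point is that the variance term is biased
    # when resampling repeatedly from one fixed bag of samples.
    print("mean error:", np.mean(errors), "variance:", np.var(errors, ddof=1))
    ```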

  17. Sample Selection for Training Cascade Detectors.

    PubMed

    Vállez, Noelia; Deniz, Oscar; Bueno, Gloria

    2015-01-01

    Automatic detection systems usually require large and representative training datasets in order to obtain good detection and false positive rates. Training datasets are such that the positive set has few samples and/or the negative set should represent anything except the object of interest. In this respect, the negative set typically contains orders of magnitude more images than the positive set. However, imbalanced training databases lead to biased classifiers. In this paper, we focus our attention on a negative sample selection method to properly balance the training data for cascade detectors. The method is based on the selection of the most informative false positive samples generated in one stage to feed the next stage. The results show that the proposed cascade detector with sample selection obtains on average better partial AUC and smaller standard deviation than the other compared cascade detectors.
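
    A minimal sketch of the negative-selection idea described above, under the assumption that "most informative false positives" can be approximated by the negatives the current stage scores most confidently as positive; the classifier and simulated data are placeholders, not the paper's cascade implementation.

    ```python
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    def select_hard_negatives(stage_clf, negative_pool, n_keep):
        """Keep the false positives the current stage scores highest; these
        are the most informative negatives for training the next stage."""
        scores = stage_clf.predict_proba(negative_pool)[:, 1]
        order = np.argsort(scores)[::-1]           # highest positive score first
        return negative_pool[order[:n_keep]]

    rng = np.random.default_rng(0)
    positives = rng.normal(loc=1.0, size=(200, 16))
    negative_pool = rng.normal(loc=0.0, size=(5000, 16))   # heavily imbalanced pool

    # Train a first stage on a balanced subset, then mine hard negatives.
    X = np.vstack([positives, negative_pool[:200]])
    y = np.array([1] * 200 + [0] * 200)
    stage1 = GradientBoostingClassifier().fit(X, y)

    hard_negs = select_hard_negatives(stage1, negative_pool, n_keep=200)
    print(hard_negs.shape)
    ```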

  18. The observed clustering of damaging extratropical cyclones in Europe

    NASA Astrophysics Data System (ADS)

    Cusack, Stephen

    2016-04-01

    The clustering of severe European windstorms on annual timescales has substantial impacts on the (re-)insurance industry. Our knowledge of the risk is limited by large uncertainties in estimates of clustering from typical historical storm data sets covering the past few decades. Eight storm data sets are gathered for analysis in this study in order to reduce these uncertainties. Six of the data sets contain more than 100 years of severe storm information to reduce sampling errors, and observational errors are reduced by the diversity of information sources and analysis methods between storm data sets. All storm severity measures used in this study reflect damage, to suit (re-)insurance applications. The shortest storm data set of 42 years provides indications of stronger clustering with severity, particularly for regions off the main storm track in central Europe and France. However, clustering estimates have very large sampling and observational errors, exemplified by large changes in estimates in central Europe upon removal of one stormy season, 1989/1990. The extended storm records place 1989/1990 into a much longer historical context to produce more robust estimates of clustering. All the extended storm data sets show increased clustering between more severe storms from return periods (RPs) of 0.5 years to the longest measured RPs of about 20 years. Further, they contain signs of stronger clustering off the main storm track, and weaker clustering for smaller-sized areas, though these signals are more uncertain as they are drawn from smaller data samples. These new ultra-long storm data sets provide new information on clustering to improve our management of this risk.

  19. Meta-Storms: efficient search for similar microbial communities based on a novel indexing scheme and similarity score for metagenomic data.

    PubMed

    Su, Xiaoquan; Xu, Jian; Ning, Kang

    2012-10-01

    It has long been intriguing scientists to effectively compare different microbial communities (also referred as 'metagenomic samples' here) in a large scale: given a set of unknown samples, find similar metagenomic samples from a large repository and examine how similar these samples are. With the current metagenomic samples accumulated, it is possible to build a database of metagenomic samples of interests. Any metagenomic samples could then be searched against this database to find the most similar metagenomic sample(s). However, on one hand, current databases with a large number of metagenomic samples mostly serve as data repositories that offer few functionalities for analysis; and on the other hand, methods to measure the similarity of metagenomic data work well only for small set of samples by pairwise comparison. It is not yet clear, how to efficiently search for metagenomic samples against a large metagenomic database. In this study, we have proposed a novel method, Meta-Storms, that could systematically and efficiently organize and search metagenomic data. It includes the following components: (i) creating a database of metagenomic samples based on their taxonomical annotations, (ii) efficient indexing of samples in the database based on a hierarchical taxonomy indexing strategy, (iii) searching for a metagenomic sample against the database by a fast scoring function based on quantitative phylogeny and (iv) managing database by index export, index import, data insertion, data deletion and database merging. We have collected more than 1300 metagenomic data from the public domain and in-house facilities, and tested the Meta-Storms method on these datasets. Our experimental results show that Meta-Storms is capable of database creation and effective searching for a large number of metagenomic samples, and it could achieve similar accuracies compared with the current popular significance testing-based methods. Meta-Storms method would serve as a suitable database management and search system to quickly identify similar metagenomic samples from a large pool of samples. ningkang@qibebt.ac.cn Supplementary data are available at Bioinformatics online.

  20. Biases in the OSSOS Detection of Large Semimajor Axis Trans-Neptunian Objects

    NASA Astrophysics Data System (ADS)

    Gladman, Brett; Shankman, Cory; OSSOS Collaboration

    2017-10-01

    The accumulating but small set of large semimajor axis trans-Neptunian objects (TNOs) shows an apparent clustering in the orientations of their orbits. This clustering must either be representative of the intrinsic distribution of these TNOs, or else have arisen as a result of observation biases and/or statistically expected variations for such a small set of detected objects. The clustered TNOs were detected across different and independent surveys, which has led to claims that the detections are therefore free of observational bias. This apparent clustering has led to the so-called “Planet 9” hypothesis that a super-Earth currently resides in the distant solar system and causes this clustering. The Outer Solar System Origins Survey (OSSOS) is a large program that ran on the Canada-France-Hawaii Telescope from 2013 to 2017, discovering more than 800 new TNOs. One of the primary design goals of OSSOS was the careful determination of observational biases that would manifest within the detected sample. We demonstrate the striking and non-intuitive biases that exist for the detection of TNOs with large semimajor axes. The eight large semimajor axis OSSOS detections are an independent data set, of comparable size to the conglomerate samples used in previous studies. We conclude that the orbital distribution of the OSSOS sample is consistent with being detected from a uniform underlying angular distribution.

  21. OSSOS. VI. Striking Biases in the Detection of Large Semimajor Axis Trans-Neptunian Objects

    NASA Astrophysics Data System (ADS)

    Shankman, Cory; Kavelaars, J. J.; Bannister, Michele T.; Gladman, Brett J.; Lawler, Samantha M.; Chen, Ying-Tung; Jakubik, Marian; Kaib, Nathan; Alexandersen, Mike; Gwyn, Stephen D. J.; Petit, Jean-Marc; Volk, Kathryn

    2017-08-01

    The accumulating but small set of large semimajor axis trans-Neptunian objects (TNOs) shows an apparent clustering in the orientations of their orbits. This clustering must either be representative of the intrinsic distribution of these TNOs, or else have arisen as a result of observation biases and/or statistically expected variations for such a small set of detected objects. The clustered TNOs were detected across different and independent surveys, which has led to claims that the detections are therefore free of observational bias. This apparent clustering has led to the so-called “Planet 9” hypothesis that a super-Earth currently resides in the distant solar system and causes this clustering. The Outer Solar System Origins Survey (OSSOS) is a large program that ran on the Canada–France–Hawaii Telescope from 2013 to 2017, discovering more than 800 new TNOs. One of the primary design goals of OSSOS was the careful determination of observational biases that would manifest within the detected sample. We demonstrate the striking and non-intuitive biases that exist for the detection of TNOs with large semimajor axes. The eight large semimajor axis OSSOS detections are an independent data set, of comparable size to the conglomerate samples used in previous studies. We conclude that the orbital distribution of the OSSOS sample is consistent with being detected from a uniform underlying angular distribution.

  22. Estimation of reference intervals from small samples: an example using canine plasma creatinine.

    PubMed

    Geffré, A; Braun, J P; Trumel, C; Concordet, D

    2009-12-01

    According to international recommendations, reference intervals should be determined from at least 120 reference individuals, which often are impossible to achieve in veterinary clinical pathology, especially for wild animals. When only a small number of reference subjects is available, the possible bias cannot be known and the normality of the distribution cannot be evaluated. A comparison of reference intervals estimated by different methods could be helpful. The purpose of this study was to compare reference limits determined from a large set of canine plasma creatinine reference values, and large subsets of this data, with estimates obtained from small samples selected randomly. Twenty sets each of 120 and 27 samples were randomly selected from a set of 1439 plasma creatinine results obtained from healthy dogs in another study. Reference intervals for the whole sample and for the large samples were determined by a nonparametric method. The estimated reference limits for the small samples were minimum and maximum, mean +/- 2 SD of native and Box-Cox-transformed values, 2.5th and 97.5th percentiles by a robust method on native and Box-Cox-transformed values, and estimates from diagrams of cumulative distribution functions. The whole sample had a heavily skewed distribution, which approached Gaussian after Box-Cox transformation. The reference limits estimated from small samples were highly variable. The closest estimates to the 1439-result reference interval for 27-result subsamples were obtained by both parametric and robust methods after Box-Cox transformation but were grossly erroneous in some cases. For small samples, it is recommended that all values be reported graphically in a dot plot or histogram and that estimates of the reference limits be compared using different methods.
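
    A minimal sketch of several of the small-sample estimators compared above (min-max, mean ± 2 SD on native and Box-Cox-transformed values, and nonparametric percentiles); the robust and graphical methods are omitted, and the simulated "creatinine" values are purely illustrative.

    ```python
    import numpy as np
    from scipy import special, stats

    def reference_limit_estimates(x):
        """Several small-sample reference-limit estimates; the robust and
        graphical (cumulative-distribution) methods are omitted here."""
        est = {"min-max": (x.min(), x.max()),
               "mean +/- 2 SD (native)": (x.mean() - 2 * x.std(ddof=1),
                                          x.mean() + 2 * x.std(ddof=1)),
               "2.5th/97.5th percentiles": tuple(np.percentile(x, [2.5, 97.5]))}
        xt, lam = stats.boxcox(x)                   # requires positive values
        lo = xt.mean() - 2 * xt.std(ddof=1)
        hi = xt.mean() + 2 * xt.std(ddof=1)
        est["mean +/- 2 SD (Box-Cox)"] = tuple(
            special.inv_boxcox(np.array([lo, hi]), lam))
        return est

    rng = np.random.default_rng(2)
    creatinine = rng.lognormal(mean=4.6, sigma=0.2, size=27)   # hypothetical values
    for name, (lo, hi) in reference_limit_estimates(creatinine).items():
        print(f"{name:>28}: {lo:7.1f} - {hi:7.1f}")
    ```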

  23. A fast learning method for large scale and multi-class samples of SVM

    NASA Astrophysics Data System (ADS)

    Fan, Yu; Guo, Huiming

    2017-06-01

    A fast learning method for multi-class support vector machine (SVM) classification, based on a binary tree, is presented to address the low learning efficiency of SVMs when processing large-scale multi-class samples. A bottom-up method builds the binary-tree hierarchy, and a sub-classifier is learned from the corresponding samples at each node of that hierarchy. During learning, a first clustering of the training samples produces several class clusters. Central points are first extracted from the clusters that contain only one type of sample. For clusters that contain two types of samples, the numbers of clusters for their positive and negative samples are set according to their degree of mixture, a secondary clustering is performed, and central points are then extracted from the resulting sub-class clusters. Sub-classifiers are obtained by learning from the reduced sample set formed by combining the extracted central points. Simulation experiments show that this fast learning method, based on multi-level clustering, maintains high classification accuracy while greatly reducing the number of training samples and effectively improving learning efficiency.
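
    A minimal sketch of the central idea, reducing each class to cluster centers before SVM training, under the simplifying assumption of a flat (non-tree) reduction; the binary-tree hierarchy, mixture-degree rule, and secondary clustering of the paper are not reproduced, and all parameters are illustrative.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def reduce_by_cluster_centers(X, y, clusters_per_class=20, seed=0):
        """Replace each class's samples by k-means cluster centers so the
        SVM trains on far fewer points."""
        Xr, yr = [], []
        for label in np.unique(y):
            Xc = X[y == label]
            k = min(clusters_per_class, len(Xc))
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
            Xr.append(km.cluster_centers_)
            yr.append(np.full(k, label))
        return np.vstack(Xr), np.concatenate(yr)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=i, size=(2000, 8)) for i in range(4)])
    y = np.repeat(np.arange(4), 2000)

    Xr, yr = reduce_by_cluster_centers(X, y)
    clf = SVC(kernel="rbf").fit(Xr, yr)            # trains on 80 points, not 8000
    print(clf.score(X, y))
    ```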

  24. A Comparison of the Social Competence of Children with Moderate Intellectual Disability in Inclusive versus Segregated School Settings

    ERIC Educational Resources Information Center

    Hardiman, Sharon; Guerin, Suzanne; Fitzsimons, Elaine

    2009-01-01

    This is the first study to compare the social competence of children with moderate intellectual disability in inclusive versus segregated school settings in the Republic of Ireland. A convenience sample was recruited through two large ID services. The sample comprised 45 children across two groups: Group 1 (n = 20; inclusive school) and Group 2 (n…

  25. Caught you: threats to confidentiality due to the public release of large-scale genetic data sets

    PubMed Central

    2010-01-01

    Background Large-scale genetic data sets are frequently shared with other research groups and even released on the Internet to allow for secondary analysis. Study participants are usually not informed about such data sharing because data sets are assumed to be anonymous after stripping off personal identifiers. Discussion The assumption of anonymity of genetic data sets, however, is tenuous because genetic data are intrinsically self-identifying. Two types of re-identification are possible: the "Netflix" type and the "profiling" type. The "Netflix" type needs another small genetic data set, usually with less than 100 SNPs but including a personal identifier. This second data set might originate from another clinical examination, a study of leftover samples or forensic testing. When merged to the primary, unidentified set it will re-identify all samples of that individual. Even with no second data set at hand, a "profiling" strategy can be developed to extract as much information as possible from a sample collection. Starting with the identification of ethnic subgroups along with predictions of body characteristics and diseases, the asthma kids case as a real-life example is used to illustrate that approach. Summary Depending on the degree of supplemental information, there is a good chance that at least a few individuals can be identified from an anonymized data set. Any re-identification, however, may potentially harm study participants because it will release individual genetic disease risks to the public. PMID:21190545

  26. Caught you: threats to confidentiality due to the public release of large-scale genetic data sets.

    PubMed

    Wjst, Matthias

    2010-12-29

    Large-scale genetic data sets are frequently shared with other research groups and even released on the Internet to allow for secondary analysis. Study participants are usually not informed about such data sharing because data sets are assumed to be anonymous after stripping off personal identifiers. The assumption of anonymity of genetic data sets, however, is tenuous because genetic data are intrinsically self-identifying. Two types of re-identification are possible: the "Netflix" type and the "profiling" type. The "Netflix" type needs another small genetic data set, usually with less than 100 SNPs but including a personal identifier. This second data set might originate from another clinical examination, a study of leftover samples or forensic testing. When merged to the primary, unidentified set it will re-identify all samples of that individual. Even with no second data set at hand, a "profiling" strategy can be developed to extract as much information as possible from a sample collection. Starting with the identification of ethnic subgroups along with predictions of body characteristics and diseases, the asthma kids case as a real-life example is used to illustrate that approach. Depending on the degree of supplemental information, there is a good chance that at least a few individuals can be identified from an anonymized data set. Any re-identification, however, may potentially harm study participants because it will release individual genetic disease risks to the public.

  27. Sampling errors in the estimation of empirical orthogonal functions [for climatology studies]

    NASA Technical Reports Server (NTRS)

    North, G. R.; Bell, T. L.; Cahalan, R. F.; Moeng, F. J.

    1982-01-01

    Empirical Orthogonal Functions (EOF's), eigenvectors of the spatial cross-covariance matrix of a meteorological field, are reviewed with special attention given to the necessary weighting factors for gridded data and the sampling errors incurred when too small a sample is available. The geographical shape of an EOF shows large intersample variability when its associated eigenvalue is 'close' to a neighboring one. A rule of thumb indicating when an EOF is likely to be subject to large sampling fluctuations is presented. An explicit example, based on the statistics of the 500 mb geopotential height field, displays large intersample variability in the EOF's for sample sizes of a few hundred independent realizations, a size seldom exceeded by meteorological data sets.
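
    The rule of thumb referred to here is usually quoted as follows: the sampling error of an eigenvalue is roughly δλ ≈ λ·sqrt(2/N) for N independent realizations, and an EOF is likely to be badly mixed with a neighbor when δλ is comparable to the spacing between adjacent eigenvalues. Below is a small numerical illustration on synthetic data; the field size and sample count are arbitrary choices, not the 500 mb analysis of the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_points = 300, 50            # N realizations of a 50-point field
    mixing = rng.normal(size=(n_points, n_points))
    field = rng.normal(size=(n_samples, n_points)) @ mixing   # correlated field

    cov = np.cov(field, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

    # Rule of thumb: delta_lambda ~ lambda * sqrt(2/N).
    err = eigvals * np.sqrt(2.0 / n_samples)
    spacing = eigvals[:-1] - eigvals[1:]
    likely_mixed = err[:-1] >= spacing       # EOF k hard to separate from EOF k+1
    print("leading eigenvalues:", np.round(eigvals[:5], 2))
    print("possibly degenerate pairs among the first five:", likely_mixed[:4])
    ```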

  28. Software engineering the mixed model for genome-wide association studies on large samples.

    PubMed

    Zhang, Zhiwu; Buckler, Edward S; Casstevens, Terry M; Bradbury, Peter J

    2009-11-01

    Mixed models improve the ability to detect phenotype-genotype associations in the presence of population stratification and multiple levels of relatedness in genome-wide association studies (GWAS), but for large data sets the resource consumption becomes impractical. At the same time, the sample size and number of markers used for GWAS is increasing dramatically, resulting in greater statistical power to detect those associations. The use of mixed models with increasingly large data sets depends on the availability of software for analyzing those models. While multiple software packages implement the mixed model method, no single package provides the best combination of fast computation, ability to handle large samples, flexible modeling and ease of use. Key elements of association analysis with mixed models are reviewed, including modeling phenotype-genotype associations using mixed models, population stratification, kinship and its estimation, variance component estimation, use of best linear unbiased predictors or residuals in place of raw phenotype, improving efficiency and software-user interaction. The available software packages are evaluated, and suggestions made for future software development.

  29. Characterization of Large Structural Genetic Mosaicism in Human Autosomes

    PubMed Central

    Machiela, Mitchell J.; Zhou, Weiyin; Sampson, Joshua N.; Dean, Michael C.; Jacobs, Kevin B.; Black, Amanda; Brinton, Louise A.; Chang, I-Shou; Chen, Chu; Chen, Constance; Chen, Kexin; Cook, Linda S.; Crous Bou, Marta; De Vivo, Immaculata; Doherty, Jennifer; Friedenreich, Christine M.; Gaudet, Mia M.; Haiman, Christopher A.; Hankinson, Susan E.; Hartge, Patricia; Henderson, Brian E.; Hong, Yun-Chul; Hosgood, H. Dean; Hsiung, Chao A.; Hu, Wei; Hunter, David J.; Jessop, Lea; Kim, Hee Nam; Kim, Yeul Hong; Kim, Young Tae; Klein, Robert; Kraft, Peter; Lan, Qing; Lin, Dongxin; Liu, Jianjun; Le Marchand, Loic; Liang, Xiaolin; Lissowska, Jolanta; Lu, Lingeng; Magliocco, Anthony M.; Matsuo, Keitaro; Olson, Sara H.; Orlow, Irene; Park, Jae Yong; Pooler, Loreall; Prescott, Jennifer; Rastogi, Radhai; Risch, Harvey A.; Schumacher, Fredrick; Seow, Adeline; Setiawan, Veronica Wendy; Shen, Hongbing; Sheng, Xin; Shin, Min-Ho; Shu, Xiao-Ou; VanDen Berg, David; Wang, Jiu-Cun; Wentzensen, Nicolas; Wong, Maria Pik; Wu, Chen; Wu, Tangchun; Wu, Yi-Long; Xia, Lucy; Yang, Hannah P.; Yang, Pan-Chyr; Zheng, Wei; Zhou, Baosen; Abnet, Christian C.; Albanes, Demetrius; Aldrich, Melinda C.; Amos, Christopher; Amundadottir, Laufey T.; Berndt, Sonja I.; Blot, William J.; Bock, Cathryn H.; Bracci, Paige M.; Burdett, Laurie; Buring, Julie E.; Butler, Mary A.; Carreón, Tania; Chatterjee, Nilanjan; Chung, Charles C.; Cook, Michael B.; Cullen, Michael; Davis, Faith G.; Ding, Ti; Duell, Eric J.; Epstein, Caroline G.; Fan, Jin-Hu; Figueroa, Jonine D.; Fraumeni, Joseph F.; Freedman, Neal D.; Fuchs, Charles S.; Gao, Yu-Tang; Gapstur, Susan M.; Patiño-Garcia, Ana; Garcia-Closas, Montserrat; Gaziano, J. Michael; Giles, Graham G.; Gillanders, Elizabeth M.; Giovannucci, Edward L.; Goldin, Lynn; Goldstein, Alisa M.; Greene, Mark H.; Hallmans, Goran; Harris, Curtis C.; Henriksson, Roger; Holly, Elizabeth A.; Hoover, Robert N.; Hu, Nan; Hutchinson, Amy; Jenab, Mazda; Johansen, Christoffer; Khaw, Kay-Tee; Koh, Woon-Puay; Kolonel, Laurence N.; Kooperberg, Charles; Krogh, Vittorio; Kurtz, Robert C.; LaCroix, Andrea; Landgren, Annelie; Landi, Maria Teresa; Li, Donghui; Liao, Linda M.; Malats, Nuria; McGlynn, Katherine A.; McNeill, Lorna H.; McWilliams, Robert R.; Melin, Beatrice S.; Mirabello, Lisa; Peplonska, Beata; Peters, Ulrike; Petersen, Gloria M.; Prokunina-Olsson, Ludmila; Purdue, Mark; Qiao, You-Lin; Rabe, Kari G.; Rajaraman, Preetha; Real, Francisco X.; Riboli, Elio; Rodríguez-Santiago, Benjamín; Rothman, Nathaniel; Ruder, Avima M.; Savage, Sharon A.; Schwartz, Ann G.; Schwartz, Kendra L.; Sesso, Howard D.; Severi, Gianluca; Silverman, Debra T.; Spitz, Margaret R.; Stevens, Victoria L.; Stolzenberg-Solomon, Rachael; Stram, Daniel; Tang, Ze-Zhong; Taylor, Philip R.; Teras, Lauren R.; Tobias, Geoffrey S.; Viswanathan, Kala; Wacholder, Sholom; Wang, Zhaoming; Weinstein, Stephanie J.; Wheeler, William; White, Emily; Wiencke, John K.; Wolpin, Brian M.; Wu, Xifeng; Wunder, Jay S.; Yu, Kai; Zanetti, Krista A.; Zeleniuch-Jacquotte, Anne; Ziegler, Regina G.; de Andrade, Mariza; Barnes, Kathleen C.; Beaty, Terri H.; Bierut, Laura J.; Desch, Karl C.; Doheny, Kimberly F.; Feenstra, Bjarke; Ginsburg, David; Heit, John A.; Kang, Jae H.; Laurie, Cecilia A.; Li, Jun Z.; Lowe, William L.; Marazita, Mary L.; Melbye, Mads; Mirel, Daniel B.; Murray, Jeffrey C.; Nelson, Sarah C.; Pasquale, Louis R.; Rice, Kenneth; Wiggs, Janey L.; Wise, Anastasia; Tucker, Margaret; Pérez-Jurado, Luis A.; Laurie, Cathy C.; Caporaso, Neil E.; Yeager, Meredith; Chanock, 
Stephen J.

    2015-01-01

    Analyses of genome-wide association study (GWAS) data have revealed that detectable genetic mosaicism involving large (>2 Mb) structural autosomal alterations occurs in a fraction of individuals. We present results for a set of 24,849 genotyped individuals (total GWAS set II [TGSII]) in whom 341 large autosomal abnormalities were observed in 168 (0.68%) individuals. Merging data from the new TGSII set with data from two prior reports (the Gene-Environment Association Studies and the total GWAS set I) generated a large dataset of 127,179 individuals; we then conducted a meta-analysis to investigate the patterns of detectable autosomal mosaicism (n = 1,315 events in 925 [0.73%] individuals). Restricting to events >2 Mb in size, we observed an increase in event frequency as event size decreased. The combined results underscore that the rate of detectable mosaicism increases with age (p value = 5.5 × 10−31) and is higher in men (p value = 0.002) but lower in participants of African ancestry (p value = 0.003). In a subset of 47 individuals from whom serial samples were collected up to 6 years apart, complex changes were noted over time and showed an overall increase in the proportion of mosaic cells as age increased. Our large combined sample allowed for a unique ability to characterize detectable genetic mosaicism involving large structural events and strengthens the emerging evidence of non-random erosion of the genome in the aging population. PMID:25748358

  30. Statistical characterization of a large geochemical database and effect of sample size

    USGS Publications Warehouse

    Zhang, C.; Manheim, F.T.; Hinde, J.; Grossman, J.N.

    2005-01-01

    The authors investigated statistical distributions for concentrations of chemical elements from the National Geochemical Survey (NGS) database of the U.S. Geological Survey. At the time of this study, the NGS data set encompassed 48,544 stream sediment and soil samples from the conterminous United States analyzed by ICP-AES following a 4-acid near-total digestion. This report includes 27 elements: Al, Ca, Fe, K, Mg, Na, P, Ti, Ba, Ce, Co, Cr, Cu, Ga, La, Li, Mn, Nb, Nd, Ni, Pb, Sc, Sr, Th, V, Y and Zn. The goal and challenge for the statistical overview was to delineate chemical distributions in a complex, heterogeneous data set spanning a large geographic range (the conterminous United States), and many different geological provinces and rock types. After declustering to create a uniform spatial sample distribution with 16,511 samples, histograms and quantile-quantile (Q-Q) plots were employed to delineate subpopulations that have coherent chemical and mineral affinities. Probability groupings are discerned by changes in slope (kinks) on the plots. Major rock-forming elements, e.g., Al, Ca, K and Na, tend to display linear segments on normal Q-Q plots. These segments can commonly be linked to petrologic or mineralogical associations. For example, linear segments on K and Na plots reflect dilution of clay minerals by quartz sand (low in K and Na). Minor and trace element relationships are best displayed on lognormal Q-Q plots. These sensitively reflect discrete relationships in subpopulations within the wide range of the data. For example, small but distinctly log-linear subpopulations for Pb, Cu, Zn and Ag are interpreted to represent ore-grade enrichment of naturally occurring minerals such as sulfides. None of the 27 chemical elements could pass the test for either normal or lognormal distribution on the declustered data set. Part of the reason relates to the presence of mixtures of subpopulations and outliers. Random samples of the data set with successively smaller numbers of data points showed that few elements passed standard statistical tests for normality or log-normality until sample size decreased to a few hundred data points. Large sample size enhances the power of statistical tests, and leads to rejection of most statistical hypotheses for real data sets. For large sample sizes (e.g., n > 1000), graphical methods such as histogram, stem-and-leaf, and probability plots are recommended for rough judgement of probability distribution if needed. © 2005 Elsevier Ltd. All rights reserved.
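
    A small illustration of the sample-size effect noted above: with tens of thousands of observations, standard tests reject normality or log-normality for almost any real mixture, while modest random subsamples often pass. The contaminated lognormal data and the use of the D'Agostino-Pearson test are illustrative assumptions, not the survey's actual workflow.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Mildly contaminated lognormal data standing in for a trace-element distribution.
    data = np.concatenate([rng.lognormal(3.0, 0.5, 16000),
                           rng.lognormal(4.5, 0.3, 500)])

    for n in (100, 500, 2000, len(data)):
        sample = rng.choice(data, size=n, replace=False)
        stat, p = stats.normaltest(np.log(sample))   # D'Agostino-Pearson on log values
        verdict = "reject" if p < 0.05 else "do not reject"
        print(f"n={n:>6}: p={p:.3g} -> {verdict} log-normality")
    ```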

  31. Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Matulef, Kevin Michael

    The purpose of this project was to develop sampling-based algorithms to discover hidden structure in massive data sets. Inferring structure in large data sets is an increasingly common task in many critical national security applications. These data sets come from myriad sources, such as network traffic, sensor data, and data generated by large-scale simulations. They are often so large that traditional data mining techniques are time consuming or even infeasible. To address this problem, we focus on a class of algorithms that do not compute an exact answer, but instead use sampling to compute an approximate answer using fewer resources. The particular class of algorithms that we focus on is streaming algorithms, so called because they are designed to handle high-throughput streams of data. Streaming algorithms have only a small amount of working storage - much less than the size of the full data stream - so they must necessarily use sampling to approximate the correct answer. We present two results: * A streaming algorithm called HyperHeadTail, which estimates the degree distribution of a graph (i.e., the distribution of the number of connections for each node in a network). The degree distribution is a fundamental graph property, but prior work on estimating the degree distribution in a streaming setting was impractical for many real-world applications. We improve upon prior work by developing an algorithm that can handle streams with repeated edges, and graph structures that evolve over time. * An algorithm for the task of maintaining a weighted subsample of items in a stream, when the items must be sampled according to their weight, and the weights are dynamically changing. To our knowledge, this is the first such algorithm designed for dynamically evolving weights. We expect it may be useful as a building block for other streaming algorithms on dynamic data sets.
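
    For context on the second result, a minimal sketch of standard weighted reservoir sampling with static weights (the Efraimidis-Spirakis "A-Res" scheme): keep the k items with the largest keys u**(1/w). The report's contribution handles weights that change dynamically, which this sketch does not attempt; the stream contents are illustrative.

    ```python
    import heapq
    import random

    def weighted_reservoir_sample(stream, k, rng=None):
        """A-Res weighted reservoir sampling: keep the k items with the largest
        keys u**(1/w), where u ~ Uniform(0, 1) and w is the item weight.
        Static weights only."""
        rng = rng or random.Random(0)
        heap = []                                   # min-heap of (key, item)
        for item, weight in stream:
            key = rng.random() ** (1.0 / weight)
            if len(heap) < k:
                heapq.heappush(heap, (key, item))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, item))
        return [item for _, item in heap]

    stream = ((f"edge{i}", 1.0 + (i % 5)) for i in range(10000))
    print(weighted_reservoir_sample(stream, k=5))
    ```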

  32. Neuro-genetic system for optimization of GMI samples sensitivity.

    PubMed

    Pitta Botelho, A C O; Vellasco, M M B R; Hall Barbosa, C R; Costa Silva, E

    2016-03-01

    Magnetic sensors are largely used in several engineering areas. Among them, magnetic sensors based on the Giant Magnetoimpedance (GMI) effect are a new family of magnetic sensing devices that have a huge potential for applications involving measurements of ultra-weak magnetic fields. The sensitivity of magnetometers is directly associated with the sensitivity of their sensing elements. The GMI effect is characterized by a large variation of the impedance (magnitude and phase) of a ferromagnetic sample, when subjected to a magnetic field. Recent studies have shown that phase-based GMI magnetometers have the potential to increase the sensitivity by about 100 times. The sensitivity of GMI samples depends on several parameters, such as sample length, external magnetic field, DC level and frequency of the excitation current. However, this dependency is yet to be sufficiently well-modeled in quantitative terms. So, the search for the set of parameters that optimizes the samples sensitivity is usually empirical and very time consuming. This paper deals with this problem by proposing a new neuro-genetic system aimed at maximizing the impedance phase sensitivity of GMI samples. A Multi-Layer Perceptron (MLP) Neural Network is used to model the impedance phase and a Genetic Algorithm uses the information provided by the neural network to determine which set of parameters maximizes the impedance phase sensitivity. The results obtained with a data set composed of four different GMI sample lengths demonstrate that the neuro-genetic system is able to correctly and automatically determine the set of conditioning parameters responsible for maximizing their phase sensitivities. Copyright © 2015 Elsevier Ltd. All rights reserved.

  33. Screening experiments of ecstasy street samples using near infrared spectroscopy.

    PubMed

    Sondermann, N; Kovar, K A

    1999-12-20

    Twelve different sets of confiscated ecstasy samples were analysed using both near infrared spectroscopy in reflectance mode (1100-2500 nm) and high-performance liquid chromatography (HPLC). The sets showed a large variance in composition. A calibration data set was generated based on the theory of factorial designs. It contained 221 N-methyl-3,4-methylenedioxyamphetamine (MDMA) samples, 167 N-ethyl-3,4-methylenedioxyamphetamine (MDE) samples, 111 amphetamine samples and 106 samples without a controlled substance, which will be called placebo samples hereafter. From this data set, PLS-1 models were calculated and were successfully applied for validation of various external laboratory test sets. The transferability of these results to confiscated tablets is demonstrated here. It is shown that differentiation into placebo, amphetamine and ecstasy samples is possible. Analysis of intact tablets is practicable. However, more reliable results are obtained from pulverised samples. This is due to ill-defined production procedures. The use of mathematically pretreated spectra improves the prediction quality of all the PLS-1 models studied. It is possible to improve discrimination between MDE and MDMA with the help of a second model based on raw spectra. Alternative strategies are briefly discussed.
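
    A rough scikit-learn analogue of the PLS-1 screening step described above: regress a 0/1 class indicator on (pretreated) spectra and threshold the prediction. The random matrices stand in for NIR reflectance spectra, the first-derivative pretreatment is only one of many options, and none of this reproduces the paper's calibration design.

    ```python
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    n_train, n_wavelengths = 400, 700              # illustrative 1100-2500 nm grid
    X = rng.normal(size=(n_train, n_wavelengths))  # stand-in for reflectance spectra
    y = rng.integers(0, 2, size=n_train)           # 1 = contains MDMA, 0 = does not

    # Simple spectral pretreatment (first derivative along the wavelength axis).
    X_pre = np.gradient(X, axis=1)

    pls = PLSRegression(n_components=8).fit(X_pre, y.astype(float))
    y_hat = (pls.predict(X_pre).ravel() > 0.5).astype(int)   # threshold the PLS-1 score
    print("training agreement:", (y_hat == y).mean())
    ```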

  34. Validating a large geophysical data set: Experiences with satellite-derived cloud parameters

    NASA Technical Reports Server (NTRS)

    Kahn, Ralph; Haskins, Robert D.; Knighton, James E.; Pursch, Andrew; Granger-Gallegos, Stephanie

    1992-01-01

    We are validating the global cloud parameters derived from the satellite-borne HIRS2 and MSU atmospheric sounding instrument measurements, and are using the analysis of these data as one prototype for studying large geophysical data sets in general. The HIRS2/MSU data set contains a total of 40 physical parameters, filling 25 MB/day; raw HIRS2/MSU data are available for a period exceeding 10 years. Validation involves developing a quantitative sense for the physical meaning of the derived parameters over the range of environmental conditions sampled. This is accomplished by comparing the spatial and temporal distributions of the derived quantities with similar measurements made using other techniques, and with model results. The data handling needed for this work is possible only with the help of a suite of interactive graphical and numerical analysis tools. Level 3 (gridded) data is the common form in which large data sets of this type are distributed for scientific analysis. We find that Level 3 data is inadequate for the data comparisons required for validation. Level 2 data (individual measurements in geophysical units) is needed. A sampling problem arises when individual measurements, which are not uniformly distributed in space or time, are used for the comparisons. Standard 'interpolation' methods involve fitting the measurements for each data set to surfaces, which are then compared. We are experimenting with formal criteria for selecting geographical regions, based upon the spatial frequency and variability of measurements, that allow us to quantify the uncertainty due to sampling. As part of this project, we are also dealing with ways to keep track of constraints placed on the output by assumptions made in the computer code. The need to work with Level 2 data introduces a number of other data handling issues, such as accessing data files across machine types, meeting large data storage requirements, accessing other validated data sets, processing speed and throughput for interactive graphical work, and problems relating to graphical interfaces.

  15. U.S. Food safety and Inspection Service testing for Salmonella in selected raw meat and poultry products in the United States, 1998 through 2003: an establishment-level analysis.

    PubMed

    Eblen, Denise R; Barlow, Kristina E; Naugle, Alecia Larew

    2006-11-01

    The U.S. Food Safety and Inspection Service (FSIS) pathogen reduction-hazard analysis critical control point systems final rule, published in 1996, established Salmonella performance standards for broiler chicken, cow and bull, market hog, and steer and heifer carcasses and for ground beef, chicken, and turkey meat. In 1998, the FSIS began testing to verify that establishments are meeting performance standards. Samples are collected in sets in which the number of samples is defined but varies according to product class. A sample set fails when the number of positive Salmonella samples exceeds the maximum number of positive samples allowed under the performance standard. Salmonella sample sets collected at 1,584 establishments from 1998 through 2003 were examined to identify factors associated with failure of one or more sets. Overall, 1,282 (80.9%) of establishments never had failed sets. In establishments that did experience set failure(s), generally the failed sets were collected early in the establishment testing history, with the exception of broiler establishments where failure(s) occurred both early and late in the course of testing. Small establishments were more likely to have experienced a set failure than were large or very small establishments, and broiler establishments were more likely to have failed than were ground beef, market hog, or steer-heifer establishments. Agency response to failed Salmonella sample sets in the form of in-depth verification reviews and related establishment-initiated corrective actions have likely contributed to declines in the number of establishments that failed sets. A focus on food safety measures in small establishments and broiler processing establishments should further reduce the number of sample sets that fail to meet the Salmonella performance standard.

  16. A posteriori noise estimation in variable data sets. With applications to spectra and light curves

    NASA Astrophysics Data System (ADS)

    Czesla, S.; Molle, T.; Schmitt, J. H. M. M.

    2018-01-01

    Most physical data sets contain a stochastic contribution produced by measurement noise or other random sources along with the signal. Usually, neither the signal nor the noise is accurately known prior to the measurement, so both have to be estimated a posteriori. We have studied a procedure to estimate the standard deviation of the stochastic contribution assuming normality and independence, requiring a sufficiently well-sampled data set to yield reliable results. This procedure is based on estimating the standard deviation in a sample of weighted sums of arbitrarily sampled data points and is identical to the so-called DER_SNR algorithm for specific parameter settings. To demonstrate the applicability of our procedure, we present applications to synthetic data, high-resolution spectra, and a large sample of space-based light curves and, finally, give guidelines for applying the procedure in situations not explicitly considered here to promote its adoption in data analysis.
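
    For a concrete sense of the approach, the sketch below implements the classic DER_SNR recipe that the abstract identifies as a special case of the procedure: the noise standard deviation is estimated from a robust median of second differences of neighbouring points. The function name and the equidistant-sampling assumption are illustrative; this is a minimal sketch, not the authors' generalized estimator for arbitrary sampling.

```python
import numpy as np

def der_snr_noise(flux):
    """Estimate the noise standard deviation of a well-sampled 1-D data set
    (e.g. a spectrum) with the classic DER_SNR recipe: a robust median of
    weighted sums of neighbouring points, insensitive to smooth signal structure.
    Assumes roughly equidistant sampling and normally distributed noise."""
    f = np.asarray(flux, dtype=float)
    f = f[np.isfinite(f)]
    if f.size < 5:
        raise ValueError("DER_SNR needs at least 5 data points")
    # second-difference statistic: 2*f[i] - f[i-2] - f[i+2]
    d = 2.0 * f[2:-2] - f[:-4] - f[4:]
    return 1.482602 / np.sqrt(6.0) * np.median(np.abs(d))

# usage: noisy sine with true sigma = 0.1
rng = np.random.default_rng(1)
x = np.linspace(0, 4 * np.pi, 2000)
y = np.sin(x) + rng.normal(0.0, 0.1, x.size)
print(der_snr_noise(y))   # should be close to 0.1
```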

  17. The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets.

    PubMed

    González-Recio, O; Jiménez-Montero, J A; Alenda, R

    2013-01-01

    In the next few years, with the advent of high-density single nucleotide polymorphism (SNP) arrays and genome sequencing, genomic evaluation methods will need to deal with a large number of genetic variants and an increasing sample size. The boosting algorithm is a machine-learning technique that may alleviate the drawbacks of dealing with such large data sets. This algorithm combines different predictors in a sequential manner with some shrinkage on them; each predictor is applied consecutively to the residuals from the committee formed by the previous ones to form a final prediction based on a subset of covariates. Here, a detailed description is provided and examples using a toy data set are included. A modification of the algorithm called "random boosting" was proposed to increase predictive ability and decrease computation time of genome-assisted evaluation in large data sets. Random boosting uses a random selection of markers to add a subsequent weak learner to the predictive model. These modifications were applied to a real data set composed of 1,797 bulls genotyped for 39,714 SNP. Deregressed proofs of 4 yield traits and 1 type trait from January 2009 routine evaluations were used as dependent variables. A 2-fold cross-validation scenario was implemented. Sires born before 2005 were used as a training sample (1,576 and 1,562 for production and type traits, respectively), whereas younger sires were used as a testing sample to evaluate predictive ability of the algorithm on yet-to-be-observed phenotypes. Comparison with the original algorithm was provided. The predictive ability of the algorithm was measured as Pearson correlations between observed and predicted responses. Further, estimated bias was computed as the average difference between observed and predicted phenotypes. The results showed that the modification of the original boosting algorithm could be run in 1% of the time used with the original algorithm and with negligible differences in accuracy and bias. This modification may be used to speed up the computation of genome-assisted evaluation in large data sets such as those obtained from consortia. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
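
    As an illustration of the committee-of-residual-learners idea with random marker subsets, the sketch below fits shrunken weak learners sequentially to residuals, each restricted to a random fraction of the SNP columns. The choice of ridge regression as the weak learner and all data dimensions are assumptions for the toy example, not the configuration used in the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

def random_boosting(X, y, n_rounds=200, shrinkage=0.1, marker_frac=0.2, seed=0):
    """Minimal sketch of boosting with random marker subsets: each round fits a
    weak learner (here: ridge regression) to the current residuals using only a
    random fraction of the SNP columns; its shrunken prediction joins the committee."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pred = np.zeros(n)
    committee = []                         # list of (column indices, fitted model)
    for _ in range(n_rounds):
        cols = rng.choice(p, size=max(1, int(marker_frac * p)), replace=False)
        model = Ridge(alpha=1.0).fit(X[:, cols], y - pred)
        pred += shrinkage * model.predict(X[:, cols])
        committee.append((cols, model))
    return committee

def predict(committee, X, shrinkage=0.1):
    pred = np.zeros(X.shape[0])
    for cols, model in committee:
        pred += shrinkage * model.predict(X[:, cols])
    return pred

# toy example: 200 "bulls", 1,000 SNP genotypes coded 0/1/2, 10 causal markers
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 1000)).astype(float)
beta = np.zeros(1000); beta[:10] = 1.0
y = X @ beta + rng.normal(0, 1, 200)
committee = random_boosting(X, y)
print(np.corrcoef(predict(committee, X), y)[0, 1])
```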

  18. Threshold Theory Tested in an Organizational Setting: The Relation between Perceived Innovativeness and Intelligence in a Large Sample of Leaders

    ERIC Educational Resources Information Center

    Christensen, Bo T.; Hartmann, Peter V. W.; Rasmussen, Thomas Hedegaard

    2017-01-01

    A large sample of leaders (N = 4257) was used to test the link between leader innovativeness and intelligence. The threshold theory of the link between creativity and intelligence assumes that below a certain IQ level (approximately IQ 120), there is some correlation between IQ and creative potential, but above this cutoff point, there is no…

  19. Exploring Collaborative Culture and Leadership in Large High Schools

    ERIC Educational Resources Information Center

    Jeffers, Michael P.

    2013-01-01

    The purpose of this exploratory study was to analyze how high school principals approached developing a collaborative culture and providing collaborative leadership in a large high school setting. The population sample for this study was 82 principals of large comprehensive high schools of grades 9 through 12 or some combination thereof with…

  20. Evaluating information content of SNPs for sample-tagging in re-sequencing projects.

    PubMed

    Hu, Hao; Liu, Xiang; Jin, Wenfei; Hilger Ropers, H; Wienker, Thomas F

    2015-05-15

    Sample-tagging is designed for identification of accidental sample mix-up, which is a major issue in re-sequencing studies. In this work, we develop a model to measure the information content of SNPs, so that we can optimize a panel of SNPs that approaches the maximal information for discrimination. The analysis shows that as few as 60 optimized SNPs can differentiate the individuals in a population as large as the present world population, and only 30 optimized SNPs are in practice sufficient for labeling up to 100 thousand individuals. In simulated populations of 100 thousand individuals, the average Hamming distances generated by the optimized set of 30 SNPs are larger than 18, and the duality frequency is lower than 1 in 10 thousand. This strategy of sample discrimination proves robust for large sample sizes and across different datasets. The optimized sets of SNPs are designed for whole-exome sequencing, and a program is provided for SNP selection, allowing for customized SNP numbers and genes of interest. A sample-tagging plan based on this framework will improve re-sequencing projects in terms of reliability and cost-effectiveness.
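
    The separability argument can be checked directly: given a genotype matrix for a tag-SNP panel, the pairwise Hamming distances show how far apart any two individuals are. The sketch below does this for a random synthetic panel; the panel size, genotype coding and population size are illustrative assumptions, not the paper's optimized SNP set.

```python
import numpy as np

def pairwise_hamming(genotypes):
    """Pairwise Hamming distances over a panel of tag-SNP genotypes.
    `genotypes` is an (individuals x SNPs) array of genotype codes (e.g. 0/1/2)."""
    g = np.asarray(genotypes)
    return (g[:, None, :] != g[None, :, :]).sum(axis=-1)

# toy panel: 1,000 simulated individuals tagged with 30 biallelic SNPs
rng = np.random.default_rng(0)
panel = rng.integers(0, 3, size=(1000, 30))
d = pairwise_hamming(panel)
off_diag = d[np.triu_indices_from(d, k=1)]
print(off_diag.mean(), off_diag.min())   # average and worst-case separation
```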

  1. Comparison of Two Methods for Estimating the Sampling-Related Uncertainty of Satellite Rainfall Averages Based on a Large Radar Data Set

    NASA Technical Reports Server (NTRS)

    Lau, William K. M. (Technical Monitor); Bell, Thomas L.; Steiner, Matthias; Zhang, Yu; Wood, Eric F.

    2002-01-01

    The uncertainty of rainfall estimated from averages of discrete samples collected by a satellite is assessed using a multi-year radar data set covering a large portion of the United States. The sampling-related uncertainty of rainfall estimates is evaluated for all combinations of 100 km, 200 km, and 500 km space domains, 1 day, 5 day, and 30 day rainfall accumulations, and regular sampling time intervals of 1 h, 3 h, 6 h, 8 h, and 12 h. These extensive analyses are combined to characterize the sampling uncertainty as a function of space and time domain, sampling frequency, and rainfall characteristics by means of a simple scaling law. Moreover, it is shown that both parametric and non-parametric statistical techniques of estimating the sampling uncertainty produce comparable results. Sampling uncertainty estimates, however, do depend on the choice of technique for obtaining them. They can also vary considerably from case to case, reflecting the great variability of natural rainfall, and should therefore be expressed in probabilistic terms. Rainfall calibration errors are shown to affect comparison of results obtained by studies based on data from different climate regions and/or observation platforms.

  2. Dynamic permeability in fault damage zones induced by repeated coseismic fracturing events

    NASA Astrophysics Data System (ADS)

    Aben, F. M.; Doan, M. L.; Mitchell, T. M.

    2017-12-01

    Off-fault fracture damage in upper crustal fault zones changes the fault zone properties and affects various co- and interseismic processes. One of these properties is the permeability of the fault damage zone rocks, which is generally higher than that of the surrounding host rock. This allows large-scale fluid flow through the fault zone that affects fault healing and promotes mineral transformation processes. Moreover, it might play an important role in thermal fluid pressurization during an earthquake rupture. The damage zone permeability is dynamic due to coseismic damaging. It is crucial for earthquake mechanics and for longer-term processes to understand what the dynamic permeability structure of a fault looks like and how it evolves with repeated earthquakes. To better detail coseismically induced permeability, we have performed uniaxial split Hopkinson pressure bar experiments on quartz-monzonite rock samples. Two sample sets were created and analyzed: single-loaded samples subjected to varying loading intensities - with damage varying from apparently intact to pulverized - and samples loaded at a constant intensity but with a varying number of repeated loadings. The first set resembles a dynamic permeability structure created by a single large earthquake. The second set resembles a permeability structure created by several earthquakes. Afterwards, permeability and acoustic velocities were measured as a function of confining pressure. The permeability in both datasets shows a large and non-linear increase over several orders of magnitude (from 10^-20 up to 10^-14 m^2) with an increasing amount of fracture damage. This, combined with microstructural analyses of the varying degrees of damage, suggests a percolation threshold. The percolation threshold does not coincide with the pulverization threshold. With increasing confining pressure, the permeability might drop by up to two orders of magnitude, which supports the possibility of large coseismic fluid pulses over relatively large distances along a fault. Also, a relatively low damage threshold could potentially increase permeability in a large volume of rock, given that previous earthquakes have already damaged these rocks.

  3. Characterization of large structural genetic mosaicism in human autosomes.

    PubMed

    Machiela, Mitchell J; Zhou, Weiyin; Sampson, Joshua N; Dean, Michael C; Jacobs, Kevin B; Black, Amanda; Brinton, Louise A; Chang, I-Shou; Chen, Chu; Chen, Constance; Chen, Kexin; Cook, Linda S; Crous Bou, Marta; De Vivo, Immaculata; Doherty, Jennifer; Friedenreich, Christine M; Gaudet, Mia M; Haiman, Christopher A; Hankinson, Susan E; Hartge, Patricia; Henderson, Brian E; Hong, Yun-Chul; Hosgood, H Dean; Hsiung, Chao A; Hu, Wei; Hunter, David J; Jessop, Lea; Kim, Hee Nam; Kim, Yeul Hong; Kim, Young Tae; Klein, Robert; Kraft, Peter; Lan, Qing; Lin, Dongxin; Liu, Jianjun; Le Marchand, Loic; Liang, Xiaolin; Lissowska, Jolanta; Lu, Lingeng; Magliocco, Anthony M; Matsuo, Keitaro; Olson, Sara H; Orlow, Irene; Park, Jae Yong; Pooler, Loreall; Prescott, Jennifer; Rastogi, Radhai; Risch, Harvey A; Schumacher, Fredrick; Seow, Adeline; Setiawan, Veronica Wendy; Shen, Hongbing; Sheng, Xin; Shin, Min-Ho; Shu, Xiao-Ou; VanDen Berg, David; Wang, Jiu-Cun; Wentzensen, Nicolas; Wong, Maria Pik; Wu, Chen; Wu, Tangchun; Wu, Yi-Long; Xia, Lucy; Yang, Hannah P; Yang, Pan-Chyr; Zheng, Wei; Zhou, Baosen; Abnet, Christian C; Albanes, Demetrius; Aldrich, Melinda C; Amos, Christopher; Amundadottir, Laufey T; Berndt, Sonja I; Blot, William J; Bock, Cathryn H; Bracci, Paige M; Burdett, Laurie; Buring, Julie E; Butler, Mary A; Carreón, Tania; Chatterjee, Nilanjan; Chung, Charles C; Cook, Michael B; Cullen, Michael; Davis, Faith G; Ding, Ti; Duell, Eric J; Epstein, Caroline G; Fan, Jin-Hu; Figueroa, Jonine D; Fraumeni, Joseph F; Freedman, Neal D; Fuchs, Charles S; Gao, Yu-Tang; Gapstur, Susan M; Patiño-Garcia, Ana; Garcia-Closas, Montserrat; Gaziano, J Michael; Giles, Graham G; Gillanders, Elizabeth M; Giovannucci, Edward L; Goldin, Lynn; Goldstein, Alisa M; Greene, Mark H; Hallmans, Goran; Harris, Curtis C; Henriksson, Roger; Holly, Elizabeth A; Hoover, Robert N; Hu, Nan; Hutchinson, Amy; Jenab, Mazda; Johansen, Christoffer; Khaw, Kay-Tee; Koh, Woon-Puay; Kolonel, Laurence N; Kooperberg, Charles; Krogh, Vittorio; Kurtz, Robert C; LaCroix, Andrea; Landgren, Annelie; Landi, Maria Teresa; Li, Donghui; Liao, Linda M; Malats, Nuria; McGlynn, Katherine A; McNeill, Lorna H; McWilliams, Robert R; Melin, Beatrice S; Mirabello, Lisa; Peplonska, Beata; Peters, Ulrike; Petersen, Gloria M; Prokunina-Olsson, Ludmila; Purdue, Mark; Qiao, You-Lin; Rabe, Kari G; Rajaraman, Preetha; Real, Francisco X; Riboli, Elio; Rodríguez-Santiago, Benjamín; Rothman, Nathaniel; Ruder, Avima M; Savage, Sharon A; Schwartz, Ann G; Schwartz, Kendra L; Sesso, Howard D; Severi, Gianluca; Silverman, Debra T; Spitz, Margaret R; Stevens, Victoria L; Stolzenberg-Solomon, Rachael; Stram, Daniel; Tang, Ze-Zhong; Taylor, Philip R; Teras, Lauren R; Tobias, Geoffrey S; Viswanathan, Kala; Wacholder, Sholom; Wang, Zhaoming; Weinstein, Stephanie J; Wheeler, William; White, Emily; Wiencke, John K; Wolpin, Brian M; Wu, Xifeng; Wunder, Jay S; Yu, Kai; Zanetti, Krista A; Zeleniuch-Jacquotte, Anne; Ziegler, Regina G; de Andrade, Mariza; Barnes, Kathleen C; Beaty, Terri H; Bierut, Laura J; Desch, Karl C; Doheny, Kimberly F; Feenstra, Bjarke; Ginsburg, David; Heit, John A; Kang, Jae H; Laurie, Cecilia A; Li, Jun Z; Lowe, William L; Marazita, Mary L; Melbye, Mads; Mirel, Daniel B; Murray, Jeffrey C; Nelson, Sarah C; Pasquale, Louis R; Rice, Kenneth; Wiggs, Janey L; Wise, Anastasia; Tucker, Margaret; Pérez-Jurado, Luis A; Laurie, Cathy C; Caporaso, Neil E; Yeager, Meredith; Chanock, Stephen J

    2015-03-05

    Analyses of genome-wide association study (GWAS) data have revealed that detectable genetic mosaicism involving large (>2 Mb) structural autosomal alterations occurs in a fraction of individuals. We present results for a set of 24,849 genotyped individuals (total GWAS set II [TGSII]) in whom 341 large autosomal abnormalities were observed in 168 (0.68%) individuals. Merging data from the new TGSII set with data from two prior reports (the Gene-Environment Association Studies and the total GWAS set I) generated a large dataset of 127,179 individuals; we then conducted a meta-analysis to investigate the patterns of detectable autosomal mosaicism (n = 1,315 events in 925 [0.73%] individuals). Restricting to events >2 Mb in size, we observed an increase in event frequency as event size decreased. The combined results underscore that the rate of detectable mosaicism increases with age (p value = 5.5 × 10(-31)) and is higher in men (p value = 0.002) but lower in participants of African ancestry (p value = 0.003). In a subset of 47 individuals from whom serial samples were collected up to 6 years apart, complex changes were noted over time and showed an overall increase in the proportion of mosaic cells as age increased. Our large combined sample allowed for a unique ability to characterize detectable genetic mosaicism involving large structural events and strengthens the emerging evidence of non-random erosion of the genome in the aging population. Copyright © 2015 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  4. U.S. Food Safety and Inspection Service testing for Salmonella in selected raw meat and poultry products in the United States, 1998 through 2003: analysis of set results.

    PubMed

    Naugle, Alecia Larew; Barlow, Kristina E; Eblen, Denise R; Teter, Vanessa; Umholtz, Robert

    2006-11-01

    The U.S. Food Safety and Inspection Service (FSIS) tests sets of samples of selected raw meat and poultry products for Salmonella to ensure that federally inspected establishments meet performance standards defined in the pathogen reduction-hazard analysis and critical control point system (PR-HACCP) final rule. In the present report, sample set results are described and associations between set failure and set and establishment characteristics are identified for 4,607 sample sets collected from 1998 through 2003. Sample sets were obtained from seven product classes: broiler chicken carcasses (n = 1,010), cow and bull carcasses (n = 240), market hog carcasses (n = 560), steer and heifer carcasses (n = 123), ground beef (n = 2,527), ground chicken (n = 31), and ground turkey (n = 116). Of these 4,607 sample sets, 92% (4,255) were collected as part of random testing efforts (A sets), and 93% (4,166) passed. However, the percentage of positive samples relative to the maximum number of positive results allowable in a set increased over time for broilers but decreased or stayed the same for the other product classes. Three factors associated with set failure were identified: establishment size, product class, and year. Set failures were more likely early in the testing program (relative to 2003). Small and very small establishments were more likely to fail than large ones. Set failure was less likely in ground beef than in other product classes. Despite an overall decline in set failures through 2003, these results highlight the need for continued vigilance to reduce Salmonella contamination in broiler chicken and continued implementation of programs designed to assist small and very small establishments with PR-HACCP compliance issues.

  5. How large a training set is needed to develop a classifier for microarray data?

    PubMed

    Dobbin, Kevin K; Zhao, Yingdong; Simon, Richard M

    2008-01-01

    A common goal of gene expression microarray studies is the development of a classifier that can be used to divide patients into groups with different prognoses, or with different expected responses to a therapy. These types of classifiers are developed on a training set, which is the set of samples used to train a classifier. The question of how many samples are needed in the training set to produce a good classifier from high-dimensional microarray data is challenging. We present a model-based approach to determining the sample size required to adequately train a classifier. It is shown that sample size can be determined from three quantities: standardized fold change, class prevalence, and number of genes or features on the arrays. Numerous examples and important experimental design issues are discussed. The method is adapted to address ex post facto determination of whether the size of a training set used to develop a classifier was adequate. An interactive web site for performing the sample size calculations is provided. We showed that sample size calculations for classifier development from high-dimensional microarray data are feasible, discussed numerous important considerations, and presented examples.
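
    The model-based sample-size formula itself is not reproduced in this abstract. As a complementary, purely empirical check of whether a training set is large enough, one can plot a cross-validated learning curve and see whether accuracy has plateaued; the sketch below does this with scikit-learn on synthetic high-dimensional data. This is a different, swapped-in technique offered only for illustration, not the authors' method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# high-dimensional toy "microarray": 200 samples, 2,000 features, few informative
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20,
                           n_redundant=0, class_sep=1.0, random_state=0)

# cross-validated accuracy as a function of training-set size
sizes, _, test_scores = learning_curve(
    LogisticRegression(penalty="l2", C=0.1, max_iter=5000),
    X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

for n, acc in zip(sizes, test_scores.mean(axis=1)):
    print(f"training size {n:4d}: cross-validated accuracy {acc:.3f}")
```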

  6. Comparative Characterization of Crofelemer Samples Using Data Mining and Machine Learning Approaches With Analytical Stability Data Sets.

    PubMed

    Nariya, Maulik K; Kim, Jae Hyun; Xiong, Jian; Kleindl, Peter A; Hewarathna, Asha; Fisher, Adam C; Joshi, Sangeeta B; Schöneich, Christian; Forrest, M Laird; Middaugh, C Russell; Volkin, David B; Deeds, Eric J

    2017-11-01

    There is growing interest in generating physicochemical and biological analytical data sets to compare complex mixture drugs, for example, products from different manufacturers. In this work, we compare various crofelemer samples prepared from a single lot by filtration with varying molecular weight cutoffs combined with incubation for different times at different temperatures. The 2 preceding articles describe experimental data sets generated from analytical characterization of fractionated and degraded crofelemer samples. In this work, we use data mining techniques such as principal component analysis and mutual information scores to help visualize the data and determine discriminatory regions within these large data sets. The mutual information score identifies chemical signatures that differentiate crofelemer samples. These signatures, in many cases, would likely be missed by traditional data analysis tools. We also found that supervised learning classifiers robustly discriminate samples with around 99% classification accuracy, indicating that mathematical models of these physicochemical data sets are capable of identifying even subtle differences in crofelemer samples. Data mining and machine learning techniques can thus identify fingerprint-type attributes of complex mixture drugs that may be used for comparative characterization of products. Copyright © 2017 American Pharmacists Association®. All rights reserved.
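
    The crofelemer data sets themselves are not available here, but the kind of workflow described (an unsupervised projection for visualization plus per-feature mutual information scores to locate discriminatory regions) can be sketched on synthetic data with scikit-learn, as below. The data dimensions and labels are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

# stand-in for a physicochemical data matrix: rows = samples, columns = signal
# channels (e.g. chromatogram or spectral positions); labels = preparation group
X, y = make_classification(n_samples=120, n_features=500, n_informative=15,
                           n_redundant=5, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# 1) unsupervised view: project onto the first two principal components
scores = PCA(n_components=2).fit_transform(X)
print("PC1/PC2 ranges:", scores.min(axis=0), scores.max(axis=0))

# 2) supervised view: mutual information between each channel and the group label
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:10]
print("most discriminatory channels:", top, mi[top].round(3))
```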

  7. Viewpoints: Interactive Exploration of Large Multivariate Earth and Space Science Data Sets

    NASA Astrophysics Data System (ADS)

    Levit, C.; Gazis, P. R.

    2006-05-01

    Analysis and visualization of extremely large and complex data sets may be one of the most significant challenges facing earth and space science investigators in the forthcoming decades. While advances in hardware speed and storage technology have roughly kept up with (indeed, have driven) increases in database size, the same is not true of our ability to manage the complexity of these data. Current missions, instruments, and simulations produce so much data of such high dimensionality that they outstrip the capabilities of traditional visualization and analysis software. This problem can only be expected to get worse as data volumes increase by orders of magnitude in future missions and in ever-larger supercomputer simulations. For large multivariate data (more than 10^5 samples or records with more than 5 variables per sample) the interactive graphics response of most existing statistical analysis, machine learning, exploratory data analysis, and/or visualization tools such as Torch, MLC++, Matlab, S++/R, and IDL stutters, stalls, or stops working altogether. Fortunately, the graphics processing units (GPUs) built in to all professional desktop and laptop computers currently on the market are capable of transforming, filtering, and rendering hundreds of millions of points per second. We present a prototype open-source cross-platform application which leverages much of the power latent in the GPU to enable smooth interactive exploration and analysis of large high-dimensional data using a variety of classical and recent techniques. The targeted application is the interactive analysis of large, complex, multivariate data sets, with dimensionalities that may surpass 100 and sample sizes that may exceed 10^6-10^8.

  8. Crowdsourcing for Cognitive Science – The Utility of Smartphones

    PubMed Central

    Brown, Harriet R.; Zeidman, Peter; Smittenaar, Peter; Adams, Rick A.; McNab, Fiona; Rutledge, Robb B.; Dolan, Raymond J.

    2014-01-01

    By 2015, there will be an estimated two billion smartphone users worldwide. This technology presents exciting opportunities for cognitive science as a medium for rapid, large-scale experimentation and data collection. At present, cost and logistics limit most study populations to small samples, restricting the experimental questions that can be addressed. In this study we investigated whether the mass collection of experimental data using smartphone technology is valid, given the variability of data collection outside of a laboratory setting. We presented four classic experimental paradigms as short games, available as a free app and over the first month 20,800 users submitted data. We found that the large sample size vastly outweighed the noise inherent in collecting data outside a controlled laboratory setting, and show that for all four games canonical results were reproduced. For the first time, we provide experimental validation for the use of smartphones for data collection in cognitive science, which can lead to the collection of richer data sets and a significant cost reduction as well as provide an opportunity for efficient phenotypic screening of large populations. PMID:25025865

  9. Crowdsourcing for cognitive science--the utility of smartphones.

    PubMed

    Brown, Harriet R; Zeidman, Peter; Smittenaar, Peter; Adams, Rick A; McNab, Fiona; Rutledge, Robb B; Dolan, Raymond J

    2014-01-01

    By 2015, there will be an estimated two billion smartphone users worldwide. This technology presents exciting opportunities for cognitive science as a medium for rapid, large-scale experimentation and data collection. At present, cost and logistics limit most study populations to small samples, restricting the experimental questions that can be addressed. In this study we investigated whether the mass collection of experimental data using smartphone technology is valid, given the variability of data collection outside of a laboratory setting. We presented four classic experimental paradigms as short games, available as a free app and over the first month 20,800 users submitted data. We found that the large sample size vastly outweighed the noise inherent in collecting data outside a controlled laboratory setting, and show that for all four games canonical results were reproduced. For the first time, we provide experimental validation for the use of smartphones for data collection in cognitive science, which can lead to the collection of richer data sets and a significant cost reduction as well as provide an opportunity for efficient phenotypic screening of large populations.

  10. Transferring genomics to the clinic: distinguishing Burkitt and diffuse large B cell lymphomas.

    PubMed

    Sha, Chulin; Barrans, Sharon; Care, Matthew A; Cunningham, David; Tooze, Reuben M; Jack, Andrew; Westhead, David R

    2015-01-01

    Classifiers based on molecular criteria such as gene expression signatures have been developed to distinguish Burkitt lymphoma and diffuse large B cell lymphoma, which help to explore the intermediate cases where traditional diagnosis is difficult. Transfer of these research classifiers into a clinical setting is challenging because there are competing classifiers in the literature based on different methodology and gene sets with no clear best choice; classifiers based on one expression measurement platform may not transfer effectively to another; and, classifiers developed using fresh frozen samples may not work effectively with the commonly used and more convenient formalin fixed paraffin-embedded samples used in routine diagnosis. Here we thoroughly compared two published high profile classifiers developed on data from different Affymetrix array platforms and fresh-frozen tissue, examining their transferability and concordance. Based on this analysis, a new Burkitt and diffuse large B cell lymphoma classifier (BDC) was developed and employed on Illumina DASL data from our own paraffin-embedded samples, allowing comparison with the diagnosis made in a central haematopathology laboratory and evaluation of clinical relevance. We show that both previous classifiers can be recapitulated using very much smaller gene sets than originally employed, and that the classification result is closely dependent on the Burkitt lymphoma criteria applied in the training set. The BDC classification on our data exhibits high agreement (~95 %) with the original diagnosis. A simple outcome comparison in the patients presenting intermediate features on conventional criteria suggests that the cases classified as Burkitt lymphoma by BDC have worse response to standard diffuse large B cell lymphoma treatment than those classified as diffuse large B cell lymphoma. In this study, we comprehensively investigate two previous Burkitt lymphoma molecular classifiers, and implement a new gene expression classifier, BDC, that works effectively on paraffin-embedded samples and provides useful information for treatment decisions. The classifier is available as a free software package under the GNU public licence within the R statistical software environment through the link http://www.bioinformatics.leeds.ac.uk/labpages/softwares/ or on github https://github.com/Sharlene/BDC.

  11. Predictability of Circulation Transitions (Observed and Modeled): Non-diffusive Dynamics, Markov Chains and Error Growth.

    NASA Astrophysics Data System (ADS)

    Straus, D. M.

    2006-12-01

    The transitions between portions of the state space of the large-scale flow are studied from daily wintertime data over the Pacific North America region using the NCEP reanalysis data set (54 winters) and very large suites of hindcasts made with the COLA atmospheric GCM with observed SST (55 members for each of 18 winters). The partition of the large-scale state space is guided by cluster analysis, whose statistical significance and relationship to SST is reviewed (Straus and Molteni, 2004; Straus, Corti and Molteni, 2006). The global nature of the flow through state space is studied using Markov chains (Crommelin, 2004). In particular, the non-diffusive part of the flow is contrasted in nature (small data sample) and the AGCM (large data sample). The intrinsic error growth associated with different portions of the state space is studied through sets of identical-twin AGCM simulations. The goal is to obtain realistic estimates of predictability times for large-scale transitions that should be useful in long-range forecasting.
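
    The Markov-chain step can be illustrated concretely: given a daily sequence of regime (cluster) labels, a maximum-likelihood transition matrix is simply the row-normalised matrix of transition counts. The sketch below uses synthetic labels; the number of regimes and the sequence length are assumptions for illustration only.

```python
import numpy as np

def transition_matrix(labels, n_states):
    """Maximum-likelihood Markov transition matrix estimated from a sequence of
    daily regime (cluster) labels: counts of i -> j transitions, row-normalised."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(labels[:-1], labels[1:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)

# toy example: a random label sequence of 4 circulation regimes, ~54 winters of days
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=54 * 90)
P = transition_matrix(labels, 4)
print(P.round(2))
# departures of P from its time-reversed counterpart carry the non-diffusive,
# preferred-path component of the flow through state space
```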

  12. A novel method for semen collection and artificial insemination in large parrots (Psittaciformes)

    PubMed Central

    Lierz, Michael; Reinschmidt, Matthias; Müller, Heiner; Wink, Michael; Neumann, Daniel

    2013-01-01

    The paper describes a novel technique for semen collection in large psittacines (patent pending), a procedure which was not routinely possible before. For the first time, a large set of semen samples is now available for analysis as well as for artificial insemination. Semen samples of more than 100 psittacine taxa were collected and analysed; the data demonstrate large differences in spermatological parameters between families, indicating an ecological relationship with breeding behaviour (polygamous versus monogamous birds). Using semen samples for artificial insemination resulted in the production of offspring in various families, such as macaws and cockatoos, for the first time ever. The present technique represents a breakthrough for species conservation programs and will enable future research into the ecology and environmental factors influencing endangered species. PMID:23797622

  13. The topology of large-scale structure. III - Analysis of observations

    NASA Astrophysics Data System (ADS)

    Gott, J. Richard, III; Miller, John; Thuan, Trinh X.; Schneider, Stephen E.; Weinberg, David H.; Gammie, Charles; Polk, Kevin; Vogeley, Michael; Jeffrey, Scott; Bhavsar, Suketu P.; Melott, Adrian L.; Giovanelli, Riccardo; Hayes, Martha P.; Tully, R. Brent; Hamilton, Andrew J. S.

    1989-05-01

    A recently developed algorithm for quantitatively measuring the topology of large-scale structures in the universe was applied to a number of important observational data sets. The data sets included an Abell (1958) cluster sample out to Vmax = 22,600 km/sec, the Giovanelli and Haynes (1985) sample out to Vmax = 11,800 km/sec, the CfA sample out to Vmax = 5000 km/sec, the Thuan and Schneider (1988) dwarf sample out to Vmax = 3000 km/sec, and the Tully (1987) sample out to Vmax = 3000 km/sec. It was found that, when the topology is studied on smoothing scales significantly larger than the correlation length (i.e., smoothing length, lambda, not below 1200 km/sec), the topology is spongelike and is consistent with the standard model in which the structure seen today has grown from small fluctuations caused by random noise in the early universe. When the topology is studied on the scale of lambda of about 600 km/sec, a small shift is observed in the genus curve in the direction of a 'meatball' topology.

  14. The topology of large-scale structure. III - Analysis of observations. [in universe

    NASA Technical Reports Server (NTRS)

    Gott, J. Richard, III; Weinberg, David H.; Miller, John; Thuan, Trinh X.; Schneider, Stephen E.

    1989-01-01

    A recently developed algorithm for quantitatively measuring the topology of large-scale structures in the universe was applied to a number of important observational data sets. The data sets included an Abell (1958) cluster sample out to Vmax = 22,600 km/sec, the Giovanelli and Haynes (1985) sample out to Vmax = 11,800 km/sec, the CfA sample out to Vmax = 5000 km/sec, the Thuan and Schneider (1988) dwarf sample out to Vmax = 3000 km/sec, and the Tully (1987) sample out to Vmax = 3000 km/sec. It was found that, when the topology is studied on smoothing scales significantly larger than the correlation length (i.e., smoothing length, lambda, not below 1200 km/sec), the topology is spongelike and is consistent with the standard model in which the structure seen today has grown from small fluctuations caused by random noise in the early universe. When the topology is studied on the scale of lambda of about 600 km/sec, a small shift is observed in the genus curve in the direction of a 'meatball' topology.

  15. The effects of task difficulty, novelty and the size of the search space on intrinsically motivated exploration.

    PubMed

    Baranes, Adrien F; Oudeyer, Pierre-Yves; Gottlieb, Jacqueline

    2014-01-01

    Devising efficient strategies for exploration in large open-ended spaces is one of the most difficult computational problems of intelligent organisms. Because the available rewards are ambiguous or unknown during the exploratory phase, subjects must act in an intrinsically motivated fashion. However, a vast majority of behavioral and neural studies to date have focused on decision making in reward-based tasks, and the rules guiding intrinsically motivated exploration remain largely unknown. To examine this question we developed a paradigm for systematically testing the choices of human observers in a free play context. Adult subjects played a series of short computer games of variable difficulty, and freely chose which game they wished to sample without external guidance or physical rewards. Subjects performed the task in three distinct conditions in which they sampled from a small or a large choice set (7 vs. 64 possible levels of difficulty), and in which they did or did not have the possibility to sample new games at a constant level of difficulty. We show that despite the absence of external constraints, the subjects spontaneously adopted a structured exploration strategy whereby they (1) started with easier games and progressed to more difficult games, (2) sampled the entire choice set including extremely difficult games that could not be learnt, (3) repeated moderate- and high-difficulty games much more frequently than predicted by chance, and (4) had higher repetition rates and chose higher speeds if they could generate new sequences at a constant level of difficulty. The results suggest that intrinsically motivated exploration is shaped by several factors including task difficulty, novelty and the size of the choice set, and that these come into play to serve two internal goals: maximizing the subjects' knowledge of the available tasks (exploring the limits of the task set) and maximizing their competence (performance and skills) across the task set.

  16. Characteristics and Pathways of Long-Stay Patients in High and Medium Secure Settings in England; A Secondary Publication From a Large Mixed-Methods Study.

    PubMed

    Völlm, Birgit A; Edworthy, Rachel; Huband, Nick; Talbot, Emily; Majid, Shazmin; Holley, Jessica; Furtado, Vivek; Weaver, Tim; McDonald, Ruth; Duggan, Conor

    2018-01-01

    Background: Many patients experience extended stays within forensic care, but the characteristics of long-stay patients are poorly understood. Aims: To describe the characteristics of long-stay patients in high and medium secure settings in England. Method: Detailed file reviews provided clinical, offending and risk data for a large representative sample of 401 forensic patients from 2 of the 3 high secure settings and from 23 of the 57 medium secure settings in England on 1 April 2013. The threshold for long-stay status was defined as 5 years in medium secure care or 10 years in high secure care, or 15 years in a combination of high and medium secure settings. Results: 22% of patients in high security and 18% in medium security met the definition for "long-stay," with 20% staying longer than 20 years. Of the long-stay sample, 58% were violent offenders (22% both sexual and violent), 27% had been convicted for violent or sexual offences whilst in an institutional setting, and 26% had committed a serious assault on staff in the last 5 years. The most prevalent diagnosis was schizophrenia (60%) followed by personality disorder (47%, predominantly antisocial and borderline types); 16% were categorised as having an intellectual disability. Overall, 7% of the long-stay sample had never been convicted of any offence, and 16.5% had no index offence prompting admission. Although some significant differences were found between the high and medium secure samples, there were more similarities than contrasts between these two levels of security. The treatment pathways of these long-stay patients involved multiple moves between settings. An unsuccessful referral to a setting of lower security was recorded over the last 5 years for 33% of the sample. Conclusions: Long-stay patients accounted for one fifth of the forensic inpatient population in England in this representative sample. A significant proportion of this group remain unsettled. High levels of personality pathology and the risk of assaults on staff and others within the care setting are likely to impact on treatment and management. Further research into the treatment pathways of longer stay patients is warranted to understand the complex trajectories of this group.

  17. Characteristics and Pathways of Long-Stay Patients in High and Medium Secure Settings in England; A Secondary Publication From a Large Mixed-Methods Study

    PubMed Central

    Völlm, Birgit A.; Edworthy, Rachel; Huband, Nick; Talbot, Emily; Majid, Shazmin; Holley, Jessica; Furtado, Vivek; Weaver, Tim; McDonald, Ruth; Duggan, Conor

    2018-01-01

    Background: Many patients experience extended stays within forensic care, but the characteristics of long-stay patients are poorly understood. Aims: To describe the characteristics of long-stay patients in high and medium secure settings in England. Method: Detailed file reviews provided clinical, offending and risk data for a large representative sample of 401 forensic patients from 2 of the 3 high secure settings and from 23 of the 57 medium secure settings in England on 1 April 2013. The threshold for long-stay status was defined as 5 years in medium secure care or 10 years in high secure care, or 15 years in a combination of high and medium secure settings. Results: 22% of patients in high security and 18% in medium security met the definition for “long-stay,” with 20% staying longer than 20 years. Of the long-stay sample, 58% were violent offenders (22% both sexual and violent), 27% had been convicted for violent or sexual offences whilst in an institutional setting, and 26% had committed a serious assault on staff in the last 5 years. The most prevalent diagnosis was schizophrenia (60%) followed by personality disorder (47%, predominantly antisocial and borderline types); 16% were categorised as having an intellectual disability. Overall, 7% of the long-stay sample had never been convicted of any offence, and 16.5% had no index offence prompting admission. Although some significant differences were found between the high and medium secure samples, there were more similarities than contrasts between these two levels of security. The treatment pathways of these long-stay patients involved multiple moves between settings. An unsuccessful referral to a setting of lower security was recorded over the last 5 years for 33% of the sample. Conclusions: Long-stay patients accounted for one fifth of the forensic inpatient population in England in this representative sample. A significant proportion of this group remain unsettled. High levels of personality pathology and the risk of assaults on staff and others within the care setting are likely to impact on treatment and management. Further research into the treatment pathways of longer stay patients is warranted to understand the complex trajectories of this group. PMID:29713294

  18. Density-Dependent Quantized Least Squares Support Vector Machine for Large Data Sets.

    PubMed

    Nan, Shengyu; Sun, Lei; Chen, Badong; Lin, Zhiping; Toh, Kar-Ann

    2017-01-01

    Based on the knowledge that input data distribution is important for learning, a data density-dependent quantization scheme (DQS) is proposed for sparse input data representation. The usefulness of the representation scheme is demonstrated by using it as a data preprocessing unit attached to the well-known least squares support vector machine (LS-SVM) for application on big data sets. Essentially, the proposed DQS adopts a single shrinkage threshold to obtain a simple quantization scheme, which adapts its outputs to input data density. With this quantization scheme, a large data set is quantized to a small subset where considerable sample size reduction is generally obtained. In particular, the sample size reduction can save significant computational cost when using the quantized subset for feature approximation via the Nyström method. Based on the quantized subset, the approximated features are incorporated into LS-SVM to develop a data density-dependent quantized LS-SVM (DQLS-SVM), where an analytic solution is obtained in the primal solution space. The developed DQLS-SVM is evaluated on synthetic and benchmark data with particular emphasis on large data sets. Extensive experimental results show that the learning machine incorporating DQS attains not only high computational efficiency but also good generalization performance.
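
    The computational saving from quantization comes in the Nyström step: kernel features are approximated from a small subset of points rather than the full data set. The sketch below illustrates this pipeline with scikit-learn, using a random landmark subset in place of the paper's density-dependent quantized subset and a ridge classifier solved in the primal as a stand-in for the LS-SVM; both substitutions are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# moderately large synthetic data set
X, y = make_classification(n_samples=20000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Nystroem features computed from a small landmark subset (chosen at random here;
# the paper's DQS would instead supply a density-dependent quantized subset)
feat = Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0)
Z_tr = feat.fit_transform(X_tr)
Z_te = feat.transform(X_te)

# least-squares style linear classifier solved in the primal feature space
clf = RidgeClassifier(alpha=1.0).fit(Z_tr, y_tr)
print("test accuracy:", clf.score(Z_te, y_te))
```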

  19. A Phylogenomic Approach Based on PCR Target Enrichment and High Throughput Sequencing: Resolving the Diversity within the South American Species of Bartsia L. (Orobanchaceae)

    PubMed Central

    Tank, David C.

    2016-01-01

    Advances in high-throughput sequencing (HTS) have allowed researchers to obtain large amounts of biological sequence information at speeds and costs unimaginable only a decade ago. Phylogenetics, and the study of evolution in general, is quickly migrating towards using HTS to generate larger and more complex molecular datasets. In this paper, we present a method that utilizes microfluidic PCR and HTS to generate large amounts of sequence data suitable for phylogenetic analyses. The approach uses the Fluidigm Access Array System (Fluidigm, San Francisco, CA, USA) and two sets of PCR primers to simultaneously amplify 48 target regions across 48 samples, incorporating sample-specific barcodes and HTS adapters (2,304 unique amplicons per Access Array). The final product is a pooled set of amplicons ready to be sequenced, and thus, there is no need to construct separate, costly genomic libraries for each sample. Further, we present a bioinformatics pipeline to process the raw HTS reads to either generate consensus sequences (with or without ambiguities) for every locus in every sample or—more importantly—recover the separate alleles from heterozygous target regions in each sample. This is important because it adds allelic information that is well suited for coalescent-based phylogenetic analyses that are becoming very common in conservation and evolutionary biology. To test our approach and bioinformatics pipeline, we sequenced 576 samples across 96 target regions belonging to the South American clade of the genus Bartsia L. in the plant family Orobanchaceae. After sequencing cleanup and alignment, the experiment resulted in ~25,300bp across 486 samples for a set of 48 primer pairs targeting the plastome, and ~13,500bp for 363 samples for a set of primers targeting regions in the nuclear genome. Finally, we constructed a combined concatenated matrix from all 96 primer combinations, resulting in a combined aligned length of ~40,500bp for 349 samples. PMID:26828929

  20. Determining cereal starch amylose content using a dual wavelength iodine binding 96 well plate assay

    USDA-ARS?s Scientific Manuscript database

    Cereal starch amylose/amylopectin (AM/AP) ratios are critical in functional properties for food and industrial applications. Conventional determination of AM/AP of cereal starches are very time consuming and labor intensive making it very difficult to screen large sample sets. Studying these large...

  1. DNA Everywhere. A Guide for Simplified Environmental Genomic DNA Extraction Suitable for Use in Remote Areas

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gabrielle N. Pecora; Francine C. Reid; Lauren M. Tom

    2016-05-01

    Collecting field samples from remote or geographically distant areas can be financially and logistically challenging. With the participation of a local organization where the samples originate, gDNA samples can be extracted in the field and shipped to a research institution for further processing and analysis. The ability to set up gDNA extraction capabilities in the field can drastically reduce cost and time when running long-term microbial studies with a large sample set. The method outlined here is a compact and affordable way of setting up a “laboratory” and extracting and shipping gDNA samples from anywhere in the world. This white paper explains the process of setting up the “laboratory”, choosing and training individuals with no prior scientific experience to perform gDNA extractions, and safe methods for shipping extracts to any research institution. All methods have been validated by the Andersen group at Lawrence Berkeley National Laboratory using the Berkeley Lab PhyloChip.

  2. Object Classification With Joint Projection and Low-Rank Dictionary Learning.

    PubMed

    Foroughi, Homa; Ray, Nilanjan; Hong Zhang

    2018-02-01

    For an object classification system, the most critical obstacles toward real-world applications are often caused by large intra-class variability, arising from different lighting conditions, occlusion, and corruption, in limited sample sets. Most methods in the literature would fail when the training samples are heavily occluded, corrupted or have significant illumination or viewpoint variations. Besides, most of the existing methods, and especially deep learning-based methods, need large training sets to achieve a satisfactory recognition performance. Although using a network pre-trained on a generic large-scale data set and fine-tuning it to the small-sized target data set is a widely used technique, this does not help when the content of the base and target data sets is very different. To address these issues simultaneously, we propose a joint projection and low-rank dictionary learning method using dual graph constraints. Specifically, a structured class-specific dictionary is learned in the low-dimensional space, and the discrimination is further improved by imposing a graph constraint on the coding coefficients that maximizes the intra-class compactness and inter-class separability. We enforce structural incoherence and low-rank constraints on sub-dictionaries to reduce the redundancy among them, and also make them robust to variations and outliers. To preserve the intrinsic structure of the data, we introduce a supervised neighborhood graph into the framework to make the proposed method robust to small-sized and high-dimensional data sets. Experimental results on several benchmark data sets verify the superior performance of our method for object classification on small-sized data sets, which include considerable variation of different kinds and may have high-dimensional feature vectors.

  3. Modification of the Sandwich Estimator in Generalized Estimating Equations with Correlated Binary Outcomes in Rare Event and Small Sample Settings

    PubMed Central

    Rogers, Paul; Stoner, Julie

    2016-01-01

    Regression models for correlated binary outcomes are commonly fit using a Generalized Estimating Equations (GEE) methodology. GEE uses the Liang and Zeger sandwich estimator to produce unbiased standard error estimators for regression coefficients in large sample settings even when the covariance structure is misspecified. The sandwich estimator performs optimally in balanced designs when the number of participants is large, and there are few repeated measurements. The sandwich estimator is not without drawbacks; its asymptotic properties do not hold in small sample settings. In these situations, the sandwich estimator is biased downwards, underestimating the variances. In this project, a modified form for the sandwich estimator is proposed to correct this deficiency. The performance of this new sandwich estimator is compared to the traditional Liang and Zeger estimator as well as alternative forms proposed by Morel, Pan and Mancl and DeRouen. The performance of each estimator was assessed with 95% coverage probabilities for the regression coefficient estimators using simulated data under various combinations of sample sizes and outcome prevalence values with an Independence (IND), Autoregressive (AR) and Compound Symmetry (CS) correlation structure. This research is motivated by investigations involving rare-event outcomes in aviation data. PMID:26998504
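
    For reference, the Liang and Zeger sandwich (robust) covariance estimator that these corrections modify has the familiar bread-meat-bread form; the notation below is the usual GEE notation (up to standard scaling conventions) and is assumed here rather than taken from the paper:

\[
\widehat{\mathrm{Cov}}(\hat{\beta}) \;=\; M_0^{-1}\, M_1\, M_0^{-1},
\qquad
M_0 = \sum_{i=1}^{K} D_i^{\top} V_i^{-1} D_i,
\qquad
M_1 = \sum_{i=1}^{K} D_i^{\top} V_i^{-1} (Y_i - \hat{\mu}_i)(Y_i - \hat{\mu}_i)^{\top} V_i^{-1} D_i,
\]

    where \(D_i = \partial \mu_i / \partial \beta\), \(V_i\) is the working covariance for cluster \(i\), and \(K\) is the number of clusters. Small-sample modifications such as those of Morel, Pan, and Mancl and DeRouen mentioned in the abstract typically inflate the middle "meat" term or adjust the residuals to offset the downward bias that appears when \(K\) is small.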

  4. Predicting Reading and Mathematics from Neural Activity for Feedback Learning

    ERIC Educational Resources Information Center

    Peters, Sabine; Van der Meulen, Mara; Zanolie, Kiki; Crone, Eveline A.

    2017-01-01

    Although many studies use feedback learning paradigms to study the process of learning in laboratory settings, little is known about their relevance for real-world learning settings such as school. In a large developmental sample (N = 228, 8-25 years), we investigated whether performance and neural activity during a feedback learning task…

  5. Progressive Sampling Technique for Efficient and Robust Uncertainty and Sensitivity Analysis of Environmental Systems Models: Stability and Convergence

    NASA Astrophysics Data System (ADS)

    Sheikholeslami, R.; Hosseini, N.; Razavi, S.

    2016-12-01

    Modern earth and environmental models are usually characterized by a large parameter space and high computational cost. These two features prevent effective implementation of sampling-based analyses such as sensitivity and uncertainty analysis, which require running these computationally expensive models several times to adequately explore the parameter/problem space. Therefore, developing efficient sampling techniques that scale with the size of the problem, the computational budget, and users' needs is essential. In this presentation, we propose an efficient sequential sampling strategy, called Progressive Latin Hypercube Sampling (PLHS), which provides an increasingly improved coverage of the parameter space while satisfying pre-defined requirements. The original Latin hypercube sampling (LHS) approach generates the entire sample set in one stage; by contrast, PLHS generates a series of smaller sub-sets (also called 'slices') such that: (1) each sub-set is Latin hypercube and achieves maximum stratification in any one-dimensional projection; (2) the progressive addition of sub-sets remains Latin hypercube; and thus (3) the entire sample set is Latin hypercube. Therefore, it has the capability to preserve the intended sampling properties throughout the sampling procedure. PLHS is deemed advantageous over existing methods, particularly because it nearly avoids over- or under-sampling. Through different case studies, we show that PLHS has multiple advantages over one-stage sampling approaches, including improved convergence and stability of the analysis results with fewer model runs. In addition, PLHS can help to minimize the total simulation time by only running the simulations necessary to achieve the desired level of quality (e.g., accuracy and convergence rate).
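
    The PLHS slicing rules themselves are not reproduced here, but the building block is an ordinary Latin hypercube sample, which can be drawn with SciPy as below. Note that naively stacking independent LHS slices, as in the second stage of the sketch, does not preserve the Latin hypercube property over the union; preserving it is precisely what PLHS adds.

```python
import numpy as np
from scipy.stats import qmc

# one Latin hypercube "slice": every 1-D projection of the n points is stratified
sampler = qmc.LatinHypercube(d=3, seed=0)
slice_1 = sampler.random(n=50)            # 50 points in [0, 1)^3

# naive illustration of a second stage: PLHS would construct this new slice so
# that the *union* of slices remains Latin hypercube; simply drawing another
# independent LHS (as below) does not guarantee that property.
slice_2 = qmc.LatinHypercube(d=3, seed=1).random(n=50)
design = np.vstack([slice_1, slice_2])

# check 1-D stratification of the first slice: exactly one point per 1/50 bin
counts = np.histogram(slice_1[:, 0], bins=50, range=(0, 1))[0]
print(counts.min(), counts.max())         # both equal 1 for a Latin hypercube
```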

  6. A simulative comparison of respondent driven sampling with incentivized snowball sampling--the "strudel effect".

    PubMed

    Gyarmathy, V Anna; Johnston, Lisa G; Caplinskiene, Irma; Caplinskas, Saulius; Latkin, Carl A

    2014-02-01

    Respondent driven sampling (RDS) and incentivized snowball sampling (ISS) are two sampling methods that are commonly used to reach people who inject drugs (PWID). We generated a set of simulated RDS samples on an actual sociometric ISS sample of PWID in Vilnius, Lithuania ("original sample") to assess if the simulated RDS estimates were statistically significantly different from the original ISS sample prevalences for HIV (9.8%), Hepatitis A (43.6%), Hepatitis B (Anti-HBc 43.9% and HBsAg 3.4%), Hepatitis C (87.5%), syphilis (6.8%) and Chlamydia (8.8%) infections and for selected behavioral risk characteristics. The original sample consisted of a large component of 249 people (83% of the sample) and 13 smaller components with 1-12 individuals. Generally, as long as all seeds were recruited from the large component of the original sample, the simulation samples simply recreated the large component. There were no significant differences between the large component and the entire original sample for the characteristics of interest. Altogether 99.2% of 360 simulation sample point estimates were within the confidence interval of the original prevalence values for the characteristics of interest. When population characteristics are reflected in large network components that dominate the population, RDS and ISS may produce samples that have statistically non-different prevalence values, even though some isolated network components may be under-sampled and/or statistically significantly different from the main groups. This so-called "strudel effect" is discussed in the paper. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  7. A simulative comparison of respondent driven sampling with incentivized snowball sampling – the “strudel effect”

    PubMed Central

    Gyarmathy, V. Anna; Johnston, Lisa G.; Caplinskiene, Irma; Caplinskas, Saulius; Latkin, Carl A.

    2014-01-01

    Background Respondent driven sampling (RDS) and Incentivized Snowball Sampling (ISS) are two sampling methods that are commonly used to reach people who inject drugs (PWID). Methods We generated a set of simulated RDS samples on an actual sociometric ISS sample of PWID in Vilnius, Lithuania (“original sample”) to assess if the simulated RDS estimates were statistically significantly different from the original ISS sample prevalences for HIV (9.8%), Hepatitis A (43.6%), Hepatitis B (Anti-HBc 43.9% and HBsAg 3.4%), Hepatitis C (87.5%), syphilis (6.8%) and Chlamydia (8.8%) infections and for selected behavioral risk characteristics. Results The original sample consisted of a large component of 249 people (83% of the sample) and 13 smaller components with 1 to 12 individuals. Generally, as long as all seeds were recruited from the large component of the original sample, the simulation samples simply recreated the large component. There were no significant differences between the large component and the entire original sample for the characteristics of interest. Altogether 99.2% of 360 simulation sample point estimates were within the confidence interval of the original prevalence values for the characteristics of interest. Conclusions When population characteristics are reflected in large network components that dominate the population, RDS and ISS may produce samples that have statistically non-different prevalence values, even though some isolated network components may be under-sampled and/or statistically significantly different from the main groups. This so-called “strudel effect” is discussed in the paper. PMID:24360650

  8. A method for feature selection of APT samples based on entropy

    NASA Astrophysics Data System (ADS)

    Du, Zhenyu; Li, Yihong; Hu, Jinsong

    2018-05-01

    Based on an in-depth study of known APT attack events, this paper proposes a feature selection method for APT samples and a logic expression generation algorithm, IOCG (Indicator of Compromise Generate). The algorithm automatically generates machine-readable IOCs (Indicators of Compromise), addressing the limitations of existing IOCs, whose logical relationships are fixed, whose number of logical items cannot change, and which are large in scale and cannot be generated automatically from samples. At the same time, it reduces the time spent processing redundant and uninformative APT samples, improves the sharing rate of analysis results, and supports an active response to a complex and rapidly changing APT attack landscape. The samples were divided into an experimental set and a training set, and the algorithm was then used to generate the logical expressions of the training set with the IOC_Aware plug-in. The generated expressions and their detection results were then compared. The experimental results show that the algorithm is effective and improves detection performance.

  9. A Hybrid Algorithm for Period Analysis from Multiband Data with Sparse and Irregular Sampling for Arbitrary Light-curve Shapes

    NASA Astrophysics Data System (ADS)

    Saha, Abhijit; Vivas, A. Katherina

    2017-12-01

    Ongoing and future surveys with repeat imaging in multiple bands are producing (or will produce) time-spaced measurements of brightness, resulting in the identification of large numbers of variable sources in the sky. A large fraction of these are periodic variables: compilations of these are of scientific interest for a variety of purposes. Unavoidably, the data sets from many such surveys not only have sparse sampling, but also have embedded frequencies in the observing cadence that beat against the natural periodicities of any object under investigation. Such limitations can make period determination ambiguous and uncertain. For multiband data sets with asynchronous measurements in multiple passbands, we wish to maximally use the information on periodicity in a manner that is agnostic of differences in the light-curve shapes across the different channels. Given large volumes of data, computational efficiency is also at a premium. This paper develops and presents a computationally economic method for determining periodicity that combines the results from two different classes of period-determination algorithms. The underlying principles are illustrated through examples. The effectiveness of this approach for combining asynchronously sampled measurements in multiple observables that share an underlying fundamental frequency is also demonstrated.
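
    The paper's hybrid algorithm is not reproduced in this record. As a rough illustration of the underlying idea, combining period evidence from asynchronously sampled bands without assuming a common light-curve shape, the Python sketch below sums variance-normalized Lomb-Scargle power across bands using SciPy; the function name combined_power and the toy two-band data are illustrative only.

```python
import numpy as np
from scipy.signal import lombscargle

def combined_power(bands, periods):
    """Sum variance-normalized Lomb-Scargle power over asynchronously sampled bands.

    bands   : list of (t, y) pairs, one per passband (epochs may differ per band)
    periods : trial periods to evaluate
    """
    freqs = 2.0 * np.pi / periods                  # lombscargle expects angular frequencies
    total = np.zeros_like(periods)
    for t, y in bands:
        y = y - y.mean()
        power = lombscargle(t, y, freqs)
        total += power / (0.5 * len(y) * y.var())  # put each band on a comparable scale
    return total

# Two bands with different light-curve shapes but a shared 0.6-day period,
# each sampled sparsely and at different epochs.
rng = np.random.default_rng(0)
P = 0.6
t1 = np.sort(rng.uniform(0, 30, 60))
y1 = np.sin(2 * np.pi * t1 / P) + 0.1 * rng.normal(size=t1.size)
t2 = np.sort(rng.uniform(0, 30, 45))
y2 = (np.sin(2 * np.pi * t2 / P) + 0.5 * np.sin(4 * np.pi * t2 / P + 1.0)
      + 0.1 * rng.normal(size=t2.size))

periods = np.linspace(0.2, 2.0, 4000)
score = combined_power([(t1, y1), (t2, y2)], periods)
print("best period:", periods[np.argmax(score)])
```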

  10. TemperSAT: A new efficient fair-sampling random k-SAT solver

    NASA Astrophysics Data System (ADS)

    Fang, Chao; Zhu, Zheng; Katzgraber, Helmut G.

    The set membership problem is of great importance to many applications and, in particular, database searches for target groups. Recently, an approach to speed up set membership searches based on the NP-hard constraint-satisfaction problem (random k-SAT) has been developed. However, the bottleneck of the approach lies in finding the solution to a large SAT formula efficiently and, in particular, a large number of independent solutions is needed to reduce the probability of false positives. Unfortunately, traditional random k-SAT solvers such as WalkSAT are biased when seeking solutions to the Boolean formulas. By porting parallel tempering Monte Carlo to the sampling of binary optimization problems, we introduce a new algorithm (TemperSAT) whose performance is comparable to current state-of-the-art SAT solvers for large k with the added benefit that theoretically it can find many independent solutions quickly. We illustrate our results by comparing to the currently fastest implementation of WalkSAT, WalkSATlm.
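
    TemperSAT itself is not described here in enough detail to reproduce. The sketch below only illustrates the general technique named in the abstract: parallel tempering Monte Carlo over Boolean assignments, with the number of unsatisfied clauses as the energy and replica-exchange moves between temperatures. The helper names (energy, pt_sat_sample), the temperature ladder, and the tiny 3-SAT instance are assumptions for illustration, not part of TemperSAT.

```python
import math
import random

def energy(assign, clauses):
    """Number of unsatisfied clauses; 0 means a satisfying assignment."""
    unsat = 0
    for clause in clauses:
        if not any((assign[abs(l)] > 0) == (l > 0) for l in clause):
            unsat += 1
    return unsat

def pt_sat_sample(clauses, n_vars, betas=(0.2, 0.5, 1.0, 2.0, 4.0), sweeps=2000, seed=0):
    """Parallel tempering over Boolean assignments; returns satisfying assignments found."""
    rng = random.Random(seed)
    reps = [{v: rng.choice([-1, 1]) for v in range(1, n_vars + 1)} for _ in betas]
    energies = [energy(r, clauses) for r in reps]
    solutions = set()
    for _ in range(sweeps):
        # Metropolis single-variable flips within each replica
        for i, beta in enumerate(betas):
            v = rng.randint(1, n_vars)
            reps[i][v] = -reps[i][v]
            new_e = energy(reps[i], clauses)
            if new_e <= energies[i] or rng.random() < math.exp(-beta * (new_e - energies[i])):
                energies[i] = new_e
            else:
                reps[i][v] = -reps[i][v]          # reject: undo the flip
            if energies[i] == 0:
                solutions.add(tuple(sorted(reps[i].items())))
        # replica-exchange moves between neighbouring temperatures
        for i in range(len(betas) - 1):
            delta = (betas[i] - betas[i + 1]) * (energies[i] - energies[i + 1])
            if delta >= 0 or rng.random() < math.exp(delta):
                reps[i], reps[i + 1] = reps[i + 1], reps[i]
                energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return solutions

# Tiny random 3-SAT instance: clauses are tuples of signed variable indices.
clauses = [(1, -2, 3), (-1, 2, 4), (2, -3, -4), (-2, 3, 4), (1, -3, -4)]
print(len(pt_sat_sample(clauses, n_vars=4)), "distinct satisfying assignments found")
```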

  11. Nuclear Forensic Inferences Using Iterative Multidimensional Statistics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Robel, M; Kristo, M J; Heller, M A

    2009-06-09

    Nuclear forensics involves the analysis of interdicted nuclear material for specific material characteristics (referred to as 'signatures') that imply specific geographical locations, production processes, culprit intentions, etc. Predictive signatures rely on expert knowledge of physics, chemistry, and engineering to develop inferences from these material characteristics. Comparative signatures, on the other hand, rely on comparison of the material characteristics of the interdicted sample (the 'questioned sample' in FBI parlance) with those of a set of known samples. In the ideal case, the set of known samples would be a comprehensive nuclear forensics database, a database which does not currently exist. In fact, our ability to analyze interdicted samples and produce an extensive list of precise materials characteristics far exceeds our ability to interpret the results. Therefore, as we seek to develop the extensive databases necessary for nuclear forensics, we must also develop the methods necessary to produce the necessary inferences from comparison of our analytical results with these large, multidimensional sets of data. In the work reported here, we used a large, multidimensional dataset of results from quality control analyses of uranium ore concentrate (UOC, sometimes called 'yellowcake'). We have found that traditional multidimensional techniques, such as principal components analysis (PCA), are especially useful for understanding such datasets and drawing relevant conclusions. In particular, we have developed an iterative partial least squares-discriminant analysis (PLS-DA) procedure that has proven especially adept at identifying the production location of unknown UOC samples. By removing classes which fell far outside the initial decision boundary, and then rebuilding the PLS-DA model, we have consistently produced better and more definitive attributions than with a single pass classification approach. Performance of the iterative PLS-DA method compared favorably to that of classification and regression tree (CART) and k nearest neighbor (KNN) algorithms, with the best combination of accuracy and robustness, as tested by classifying samples measured independently in our laboratories against the vendor QC based reference set.
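
    The report's data and exact decision-boundary criterion are not available, so the sketch below only mimics the iterative PLS-DA idea with scikit-learn: classes are one-hot encoded and fit with PLSRegression as a PLS-DA surrogate, the classes whose latent-space centroids lie farthest from the questioned sample are dropped, and the model is rebuilt on the survivors. The pruning rule, function names, and synthetic composition data are illustrative assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def plsda_rank(X, y, x_query, classes, n_components=2):
    """Fit PLS-DA on the given classes and rank them by latent-space distance to the query."""
    mask = np.isin(y, classes)
    Xs, ys = X[mask], y[mask]
    Y = (ys[:, None] == classes[None, :]).astype(float)      # one-hot class targets
    pls = PLSRegression(n_components=n_components).fit(Xs, Y)
    T, t_q = pls.transform(Xs), pls.transform(x_query.reshape(1, -1))
    d = np.array([np.linalg.norm(T[ys == c].mean(axis=0) - t_q) for c in classes])
    return classes[np.argsort(d)]

def iterative_plsda(X, y, x_query, keep=4, rounds=3, **kw):
    """Repeatedly prune the most distant classes and refit, then attribute the query."""
    classes = np.unique(y)
    for _ in range(rounds):
        ranked = plsda_rank(X, y, x_query, classes, **kw)
        if len(ranked) <= keep:
            return ranked[0]
        classes = ranked[:keep]                               # drop distant classes, rebuild
    return ranked[0]

# Synthetic stand-in for a UOC reference set: 6 "production locations", 8 analytes each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=mu, scale=0.3, size=(30, 8)) for mu in rng.normal(size=(6, 8))])
y = np.repeat(np.arange(6), 30)
x_query = X[5] + rng.normal(scale=0.05, size=8)               # a "questioned sample"
print("attributed to class", iterative_plsda(X, y, x_query))
```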

  12. Examination of the MMPI-2 restructured form (MMPI-2-RF) validity scales in civil forensic settings: findings from simulation and known group samples.

    PubMed

    Wygant, Dustin B; Ben-Porath, Yossef S; Arbisi, Paul A; Berry, David T R; Freeman, David B; Heilbronner, Robert L

    2009-11-01

    The current study examined the effectiveness of the MMPI-2 Restructured Form (MMPI-2-RF; Ben-Porath and Tellegen, 2008) over-reporting indicators in civil forensic settings. The MMPI-2-RF includes three revised MMPI-2 over-reporting validity scales and a new scale to detect over-reported somatic complaints. Participants dissimulated medical and neuropsychological complaints in two simulation samples, and a known-groups sample used symptom validity tests as a response bias criterion. Results indicated large effect sizes for the MMPI-2-RF validity scales, including a Cohen's d of .90 for Fs in a head injury simulation sample, 2.31 for FBS-r, 2.01 for F-r, and 1.97 for Fs in a medical simulation sample, and 1.45 for FBS-r and 1.30 for F-r in identifying poor effort on SVTs. Classification results indicated good sensitivity and specificity for the scales across the samples. This study indicates that the MMPI-2-RF over-reporting validity scales are effective at detecting symptom over-reporting in civil forensic settings.

  13. Automated Classification and Analysis of Non-metallic Inclusion Data Sets

    NASA Astrophysics Data System (ADS)

    Abdulsalam, Mohammad; Zhang, Tongsheng; Tan, Jia; Webler, Bryan A.

    2018-05-01

    The aim of this study is to utilize principal component analysis (PCA), clustering methods, and correlation analysis to condense and examine large, multivariate data sets produced from automated analysis of non-metallic inclusions. Non-metallic inclusions play a major role in defining the properties of steel, and their examination has been greatly aided by automated analysis in scanning electron microscopes equipped with energy dispersive X-ray spectroscopy. The methods were applied to analyze inclusions on two sets of samples: two laboratory-scale samples and four industrial samples from near-finished 4140 alloy steel components with varying machinability. The laboratory samples had well-defined inclusion chemistries, composed of MgO-Al2O3-CaO, spinel (MgO-Al2O3), and calcium aluminate inclusions. The industrial samples contained MnS inclusions as well as (Ca,Mn)S + calcium aluminate oxide inclusions. PCA could be used to reduce inclusion chemistry variables to a 2D plot, which revealed inclusion chemistry groupings in the samples. Clustering methods were used to automatically classify inclusion chemistry measurements into groups, i.e., no user-defined rules were required.
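
    The inclusion data sets themselves are not included in this record; the sketch below runs the same generic pipeline (standardize, project with PCA to two components, group with k-means) on made-up element-fraction vectors standing in for automated SEM/EDS inclusion measurements. All compositions and cluster counts are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for automated SEM/EDS output: one row per inclusion,
# columns are normalized element fractions (e.g. Mg, Al, Ca, Mn, S, O).
rng = np.random.default_rng(0)
spinel  = rng.dirichlet([8, 10, 1, 1, 1, 10], size=80)    # MgO-Al2O3-rich inclusions
ca_alum = rng.dirichlet([1, 10, 8, 1, 1, 10], size=80)    # calcium aluminate inclusions
mns     = rng.dirichlet([1, 1, 1, 10, 10, 2], size=80)    # MnS-type inclusions
X = np.vstack([spinel, ca_alum, mns])

# Standardize, project onto two principal components, then group automatically.
Z = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(Z)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)

for k in range(3):
    print(f"cluster {k}: {np.sum(labels == k)} inclusions, "
          f"mean composition {X[labels == k].mean(axis=0).round(2)}")
```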

  14. A high-throughput microRNA expression profiling system.

    PubMed

    Guo, Yanwen; Mastriano, Stephen; Lu, Jun

    2014-01-01

    As small noncoding RNAs, microRNAs (miRNAs) regulate diverse biological functions, including physiological and pathological processes. The expression and deregulation of miRNA levels contain rich information with diagnostic and prognostic relevance and can reflect pharmacological responses. The increasing interest in miRNA-related research demands global miRNA expression profiling on large numbers of samples. We describe here a robust protocol that supports high-throughput sample labeling and detection on hundreds of samples simultaneously. This method employs 96-well-based miRNA capturing from total RNA samples and on-site biochemical reactions, coupled with bead-based detection in 96-well format for hundreds of miRNAs per sample. With low-cost, high-throughput, high detection specificity, and flexibility to profile both small and large numbers of samples, this protocol can be adapted in a wide range of laboratory settings.

  15. NTS radiological assessment project: comparison of delta-surface interpolation with kriging for the Frenchman Lake region of area 5

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Foley, T.A. Jr.

    The primary objective of this report is to compare the results of delta surface interpolation with kriging on four large sets of radiological data sampled in the Frenchman Lake region at the Nevada Test Site. The results of kriging, described in Barnes, Giacomini, Reiman, and Elliott, are very similar to those using the delta surface interpolant. The other topic studied is reducing the number of sample points while obtaining results similar to those using all of the data. The positive results here suggest that great savings of time and money can be made. Furthermore, the delta surface interpolant is viewed as a contour map and as a three dimensional surface. These graphical representations help in the analysis of the large sets of radiological data.
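
    Neither the delta-surface interpolant nor the original kriging results can be reproduced here. As a generic stand-in for the comparison, the sketch below interpolates synthetic scattered samples two ways, with SciPy's piecewise-linear griddata and with a Gaussian-process regressor playing the role of kriging, and reports how much the two surfaces differ on a regular grid. The synthetic surface and kernel settings are assumptions.

```python
import numpy as np
from scipy.interpolate import griddata
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
pts = rng.uniform(0, 10, size=(200, 2))                            # sampled locations
vals = np.exp(-((pts[:, 0] - 4) ** 2 + (pts[:, 1] - 6) ** 2) / 8)  # smooth synthetic field
vals += 0.02 * rng.normal(size=len(vals))                          # measurement noise

# Regular grid on which both interpolated surfaces are evaluated.
gx, gy = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])

surf_lin = griddata(pts, vals, grid, method="linear")              # piecewise-linear surface
gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0) + WhiteKernel(1e-3),
                              normalize_y=True).fit(pts, vals)
surf_gp = gp.predict(grid)                                         # kriging-style surface

inside = ~np.isnan(surf_lin)                                       # griddata is NaN outside the hull
print("mean |difference| between the two surfaces:",
      np.abs(surf_lin[inside] - surf_gp[inside]).mean())
```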

  16. Modeling Alaska boreal forests with a controlled trend surface approach

    Treesearch

    Mo Zhou; Jingjing Liang

    2012-01-01

    An approach of Controlled Trend Surface was proposed to simultaneously take into consideration large-scale spatial trends and nonspatial effects. A geospatial model of the Alaska boreal forest was developed from 446 permanent sample plots, which addressed large-scale spatial trends in recruitment, diameter growth, and mortality. The model was tested on two sets of...

  17. Rapid Sampling of Hydrogen Bond Networks for Computational Protein Design.

    PubMed

    Maguire, Jack B; Boyken, Scott E; Baker, David; Kuhlman, Brian

    2018-05-08

    Hydrogen bond networks play a critical role in determining the stability and specificity of biomolecular complexes, and the ability to design such networks is important for engineering novel structures, interactions, and enzymes. One key feature of hydrogen bond networks that makes them difficult to rationally engineer is that they are highly cooperative and are not energetically favorable until the hydrogen bonding potential has been satisfied for all buried polar groups in the network. Existing computational methods for protein design are ill-equipped for creating these highly cooperative networks because they rely on energy functions and sampling strategies that are focused on pairwise interactions. To enable the design of complex hydrogen bond networks, we have developed a new sampling protocol in the molecular modeling program Rosetta that explicitly searches for sets of amino acid mutations that can form self-contained hydrogen bond networks. For a given set of designable residues, the protocol often identifies many alternative sets of mutations/networks, and we show that it can readily be applied to large sets of residues at protein-protein interfaces or in the interior of proteins. The protocol builds on a recently developed method in Rosetta for designing hydrogen bond networks that has been experimentally validated for small symmetric systems but was not extensible to many larger protein structures and complexes. The sampling protocol we describe here not only recapitulates previously validated designs with performance improvements but also yields viable hydrogen bond networks for cases where the previous method fails, such as the design of large, asymmetric interfaces relevant to engineering protein-based therapeutics.

  18. Online Low-Rank Representation Learning for Joint Multi-subspace Recovery and Clustering.

    PubMed

    Li, Bo; Liu, Risheng; Cao, Junjie; Zhang, Jie; Lai, Yu-Kun; Liua, Xiuping

    2017-10-06

    Benefiting from global rank constraints, the low-rank representation (LRR) method has been shown to be an effective solution to subspace learning. However, the global mechanism also means that the LRR model is not suitable for handling large-scale data or dynamic data. For large-scale data, the LRR method suffers from high time complexity, and for dynamic data, it has to recompute a complex rank minimization for the entire data set whenever new samples are dynamically added, making it prohibitively expensive. Existing attempts to online LRR either take a stochastic approach or build the representation purely based on a small sample set and treat new input as out-of-sample data. The former often requires multiple runs for good performance and thus takes longer to run, and the latter formulates online LRR as an out-of-sample classification problem and is less robust to noise. In this paper, a novel online low-rank representation subspace learning method is proposed for both large-scale and dynamic data. The proposed algorithm is composed of two stages: static learning and dynamic updating. In the first stage, the subspace structure is learned from a small number of data samples. In the second stage, the intrinsic principal components of the entire data set are computed incrementally by utilizing the learned subspace structure, and the low-rank representation matrix can also be incrementally solved by an efficient online singular value decomposition (SVD) algorithm. The time complexity is reduced dramatically for large-scale data, and repeated computation is avoided for dynamic problems. We further perform theoretical analysis comparing the proposed online algorithm with the batch LRR method. Finally, experimental results on typical tasks of subspace recovery and subspace clustering show that the proposed algorithm performs comparably or better than batch methods including the batch LRR, and significantly outperforms state-of-the-art online methods.

  19. A self-sampling method to obtain large volumes of undiluted cervicovaginal secretions.

    PubMed

    Boskey, Elizabeth R; Moench, Thomas R; Hees, Paul S; Cone, Richard A

    2003-02-01

    Studies of vaginal physiology and pathophysiology sometime require larger volumes of undiluted cervicovaginal secretions than can be obtained by current methods. A convenient method for self-sampling these secretions outside a clinical setting can facilitate such studies of reproductive health. The goal was to develop a vaginal self-sampling method for collecting large volumes of undiluted cervicovaginal secretions. A menstrual collection device (the Instead cup) was inserted briefly into the vagina to collect secretions that were then retrieved from the cup by centrifugation in a 50-ml conical tube. All 16 women asked to perform this procedure found it feasible and acceptable. Among 27 samples, an average of 0.5 g of secretions (range, 0.1-1.5 g) was collected. This is a rapid and convenient self-sampling method for obtaining relatively large volumes of undiluted cervicovaginal secretions. It should prove suitable for a wide range of assays, including those involving sexually transmitted diseases, microbicides, vaginal physiology, immunology, and pathophysiology.

  20. An algorithm for deciding the number of clusters and validating using simulated data with application to exploring crop population structure

    USDA-ARS?s Scientific Manuscript database

    A first step in exploring population structure in crop plants and other organisms is to define the number of subpopulations that exist for a given data set. The genetic marker data sets being generated have become increasingly large over time and commonly are the high-dimension, low sample size (HDL...

  1. Perceived climate in physical activity settings.

    PubMed

    Gill, Diane L; Morrow, Ronald G; Collins, Karen E; Lucey, Allison B; Schultz, Allison M

    2010-01-01

    This study focused on the perceived climate for LGBT youth and other minority groups in physical activity settings. A large sample of undergraduates and a selected sample including student teachers/interns and a campus Pride group completed a school climate survey and rated the climate in three physical activity settings (physical education, organized sport, exercise). Overall, school climate survey results paralleled the results with national samples, revealing high levels of homophobic remarks and low levels of intervention. Physical activity climate ratings were mid-range, but a multivariate analysis of variance (MANOVA) revealed clear differences, with all settings rated more inclusive for racial/ethnic minorities and most exclusive for gays/lesbians and people with disabilities. The results are in line with national surveys and research suggesting sexual orientation and physical characteristics are often the basis for harassment and exclusion in sport and physical activity. The current results also indicate that future physical activity professionals recognize exclusion, suggesting they could benefit from programs that move beyond awareness to skills and strategies for creating more inclusive programs.

  2. Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data.

    PubMed

    Paulson, Joseph N; Chen, Cho-Yi; Lopes-Ramos, Camila M; Kuijjer, Marieke L; Platig, John; Sonawane, Abhijeet R; Fagny, Maud; Glass, Kimberly; Quackenbush, John

    2017-10-03

    Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies typically profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data - critical first steps for any subsequent analysis. We find that analysis of large RNA-Seq data sets requires both careful quality control and the need to account for sparsity due to the heterogeneity intrinsic in multi-group studies. We developed Yet Another RNA Normalization software pipeline (YARN), which includes quality control and preprocessing, gene filtering, and normalization steps designed to facilitate downstream analysis of large, heterogeneous RNA-Seq data sets, and we demonstrate its use with data from the Genotype-Tissue Expression (GTEx) project. An R package instantiating YARN is available at http://bioconductor.org/packages/yarn .

  3. OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets.

    PubMed

    García-Pedrajas, Nicolás; Perez-Rodríguez, Javier; de Haro-García, Aida

    2013-02-01

    In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.

  4. A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets.

    PubMed

    Zuo, Chandler; Chen, Kailei; Keleş, Sündüz

    2017-06-01

    Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on individual analysis of each data set (i.e., peak calling) independently. This approach discards the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared toward multisample investigations have limited applicability in settings that aim to integrate 100s to 1000s of ChIP-seq data sets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, Zuo et al. developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq data sets. Although this versatile framework estimates both the underlying state-space (e.g., bound vs. unbound) and also groups loci with similar patterns together, its Expectation-Maximization-based estimation structure hinders its applicability with large number of loci and samples. We address this limitation by developing MAP-based asymptotic derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploring multiple initialization schemes and flexibility in tuning. Comparison with MBASIC indicates that this speed comes at a relatively insignificant loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq data sets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.
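
    The MAD-Bayes derivation for MBASIC is specific to its state-space model and is not reproduced here. Small-variance asymptotics of this kind typically lead to DP-means-style updates, so the sketch below shows generic DP-means: k-means-like assignment and centroid steps in which a point farther than a penalty threshold from every centroid opens a new cluster. This illustrates the K-means-like optimization referred to in the abstract, not the authors' estimator; the function name dp_means and the synthetic data are assumptions.

```python
import numpy as np

def dp_means(X, lam, n_iter=25):
    """DP-means: k-means-like updates in which a point whose squared distance to every
    centroid exceeds lam opens a new cluster (small-variance limit of a DP mixture)."""
    centroids = [X.mean(axis=0)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step, possibly creating new clusters
        for i, x in enumerate(X):
            d2 = [np.sum((x - c) ** 2) for c in centroids]
            if min(d2) > lam:
                centroids.append(x.copy())
                labels[i] = len(centroids) - 1
            else:
                labels[i] = int(np.argmin(d2))
        # update step; keep an old centroid if its cluster emptied out
        centroids = [X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                     for k in range(len(centroids))]
    return np.array(centroids), labels

# Three well-separated synthetic groups; lam is the new-cluster penalty.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.2, size=(100, 2)) for m in [(0, 0), (3, 3), (0, 4)]])
centers, labels = dp_means(X, lam=1.0)
print(len(centers), "clusters found, sizes:", np.bincount(labels))
```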

  5. A similarity based learning framework for interim analysis of outcome prediction of acupuncture for neck pain.

    PubMed

    Zhang, Gang; Liang, Zhaohui; Yin, Jian; Fu, Wenbin; Li, Guo-Zheng

    2013-01-01

    Chronic neck pain is a common morbid disorder in modern society. Acupuncture has been administered for treating chronic pain as an alternative therapy for a long time, with its effectiveness supported by the latest clinical evidence. However, the potential effective difference in different syndrome types is questioned due to the limits of sample size and statistical methods. We applied machine learning methods in an attempt to solve this problem. Through a multi-objective sorting of subjective measurements, outstanding samples are selected to form the base of our kernel-oriented model. With calculation of similarities between the concerned sample and base samples, we are able to make full use of information contained in the known samples, which is especially effective in the case of a small sample set. To tackle the parameters selection problem in similarity learning, we propose an ensemble version of slightly different parameter setting to obtain stronger learning. The experimental result on a real data set shows that compared to some previous well-known methods, the proposed algorithm is capable of discovering the underlying difference among different syndrome types and is feasible for predicting the effective tendency in clinical trials of large samples.

  6. Choosing the Most Effective Pattern Classification Model under Learning-Time Constraint.

    PubMed

    Saito, Priscila T M; Nakamura, Rodrigo Y M; Amorim, Willian P; Papa, João P; de Rezende, Pedro J; Falcão, Alexandre X

    2015-01-01

    Nowadays, large datasets are common and demand faster and more effective pattern analysis techniques. However, methodologies to compare classifiers usually do not take into account the learning-time constraints required by applications. This work presents a methodology to compare classifiers with respect to their ability to learn from classification errors on a large learning set, within a given time limit. Faster techniques may acquire more training samples, but only when they are more effective will they achieve higher performance on unseen testing sets. We demonstrate this result using several techniques, multiple datasets, and typical learning-time limits required by applications.

  7. Geologic setting of the Apollo 14 samples

    USGS Publications Warehouse

    Swann, G.A.; Trask, N.J.; Hait, M.H.; Sutton, R.L.

    1971-01-01

    The Apollo 14 lunar module landed in a region of the lunar highlands that is part of a widespread blanket of ejecta surrounding the Mare Imbrium basin. Samples were collected from the regolith developed on a nearly level plain, a ridge 100 meters high, and a blocky ejecta deposit around a young crater. Large boulders in the vicinity of the landing site are coherent fragmental rocks, as are some of the returned samples.

  8. Four hundred or more participants needed for stable contingency table estimates of clinical prediction rule performance.

    PubMed

    Kent, Peter; Boyle, Eleanor; Keating, Jennifer L; Albert, Hanne B; Hartvigsen, Jan

    2017-02-01

    To quantify variability in the results of statistical analyses based on contingency tables and discuss the implications for the choice of sample size for studies that derive clinical prediction rules. An analysis of three pre-existing sets of large cohort data (n = 4,062-8,674) was performed. In each data set, repeated random sampling of various sample sizes, from n = 100 up to n = 2,000, was performed 100 times at each sample size and the variability in estimates of sensitivity, specificity, positive and negative likelihood ratios, posttest probabilities, odds ratios, and risk/prevalence ratios for each sample size was calculated. There were very wide, and statistically significant, differences in estimates derived from contingency tables from the same data set when calculated in sample sizes below 400 people, and typically, this variability stabilized in samples of 400-600 people. Although estimates of prevalence also varied significantly in samples below 600 people, that relationship only explains a small component of the variability in these statistical parameters. To reduce sample-specific variability, contingency tables should consist of 400 participants or more when used to derive clinical prediction rules or test their performance. Copyright © 2016 Elsevier Inc. All rights reserved.
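
    The three cohort data sets are not available, but the basic resampling experiment is easy to mimic. The sketch below builds a synthetic cohort with known prevalence, sensitivity, and specificity, repeatedly draws subsamples of increasing size, recomputes the 2 x 2 contingency-table estimates each time, and prints how their spread shrinks with sample size; all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 8000
truth = rng.random(N) < 0.30                       # condition present in 30% of the cohort
# test positive with sensitivity 0.80 and false-positive rate 0.25 (invented values)
test = np.where(truth, rng.random(N) < 0.80, rng.random(N) < 0.25)

def sens_spec(t, x):
    """Sensitivity and specificity from the 2 x 2 table of truth t vs. test result x."""
    tp, fn = np.sum(t & x), np.sum(t & ~x)
    tn, fp = np.sum(~t & ~x), np.sum(~t & x)
    return tp / (tp + fn), tn / (tn + fp)

for n in (100, 200, 400, 800, 1600):
    reps = []
    for _ in range(100):                           # 100 random subsamples per sample size
        idx = rng.choice(N, size=n, replace=False)
        reps.append(sens_spec(truth[idx], test[idx]))
    reps = np.array(reps)
    print(f"n={n:5d}  sensitivity {reps[:, 0].mean():.2f} (SD {reps[:, 0].std():.3f})  "
          f"specificity {reps[:, 1].mean():.2f} (SD {reps[:, 1].std():.3f})")
```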

  9. A place meaning scale for tropical marine settings.

    PubMed

    Wynveen, Christopher J; Kyle, Gerard T

    2015-01-01

    Over the past 20 years, most of the worldwide hectares set aside for environmental protection have been added to marine protected areas. Moreover, these areas are under tremendous pressure from negative anthropogenic impacts. Given this growth and pressure, there is a need to increase the understanding of the connection between people and marine environments in order to better manage the resource. One construct that researchers have used to understand human-environment connections is place meanings. Place meanings reflect the value and significance of a setting to individuals. Most investigations of place meanings have been confined to terrestrial settings. Moreover, most studies have had small sample sizes or have used place attachment scales as a proxy to gage the meanings individuals ascribe to a setting. Hence, it has become necessary to develop a place meaning scale for use with large samples and for use by those who are concerned about the management of marine environments. Therefore, the purpose of this investigation was to develop a scale to measure the importance people associate with the meanings they ascribe to tropical marine settings and empirically test the scale using two independent samples; that is, Great Barrier Reef Marine Park and the Florida Keys National Marine Sanctuary stakeholders.

  10. Adhesion scratch testing - A round-robin experiment

    NASA Technical Reports Server (NTRS)

    Perry, A. J.; Valli, J.; Steinmann, P. A.

    1988-01-01

    Six sets of samples, TiN coated by chemical or physical vapor deposition methods (CVD or PVD) onto cemented carbide or high-speed steel (HSS), and TiC coated by CVD onto cemented carbide, have been scratch tested using three types of commercially available scratch adhesion testers. With the exception of one cemented carbide set, the reproducibility of the critical loads for any given set with a given stylus is excellent, about ±5 percent, and is about ±20 percent for different styli. Any differences in critical loads recorded for any given sample set can be attributed to the condition of the stylus (clean, new, etc.), the instrument used, the stylus itself (friction coefficient, etc.), and the sample set itself. One CVD set showed remarkably large differences in critical loads for different styli, which is thought to be related to a mechanical interaction between stylus and coating that is enhanced by a plastic deformability in the film related to the coating microstructure. The critical load for TiN on HSS increases with coating thickness, and differences in frictional conditions led to a systematic variation in the critical loads depending on the stylus used.

  11. A Place Meaning Scale for Tropical Marine Settings

    NASA Astrophysics Data System (ADS)

    Wynveen, Christopher J.; Kyle, Gerard T.

    2015-01-01

    Over the past 20 years, most of the worldwide hectares set aside for environmental protection have been added to marine protected areas. Moreover, these areas are under tremendous pressure from negative anthropogenic impacts. Given this growth and pressure, there is a need to increase the understanding of the connection between people and marine environments in order to better manage the resource. One construct that researchers have used to understand human-environment connections is place meanings. Place meanings reflect the value and significance of a setting to individuals. Most investigations of place meanings have been confined to terrestrial settings. Moreover, most studies have had small sample sizes or have used place attachment scales as a proxy to gage the meanings individuals ascribe to a setting. Hence, it has become necessary to develop a place meaning scale for use with large samples and for use by those who are concerned about the management of marine environments. Therefore, the purpose of this investigation was to develop a scale to measure the importance people associate with the meanings they ascribe to tropical marine settings and empirically test the scale using two independent samples; that is, Great Barrier Reef Marine Park and the Florida Keys National Marine Sanctuary stakeholders.

  12. Classification of urine sediment based on convolution neural network

    NASA Astrophysics Data System (ADS)

    Pan, Jingjing; Jiang, Cunbo; Zhu, Tiantian

    2018-04-01

    By designing a new convolutional neural network framework, this paper removes the constraints of the original framework, which requires large numbers of training samples of identical size. The input images are shifted and cropped to generate sub-images of a common size, and dropout is then applied to the generated sub-images to increase sample diversity and prevent overfitting. Proper subsets of equal size are randomly selected from the sub-image set, with no two subsets identical, and these subsets are used as inputs to the convolutional neural network. Passing them through the convolution, pooling, fully connected, and output layers yields the classification loss rates on the training and test sets. In an experiment classifying red blood cells, white blood cells, and calcium oxalate crystals, the classification accuracy reached 97% or more.
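
    The paper's exact architecture and data are not given in the abstract. The PyTorch sketch below illustrates the described pipeline only in outline: arbitrarily sized images are shifted and cropped into equally sized sub-images, dropout is applied inside a small convolution/pooling/fully connected network, and a batch of crops is pushed through to obtain class scores. Layer sizes, the class count, and the names random_crops and SedimentNet are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_crops(img, size=32, n=4):
    """Cut n randomly shifted size x size sub-images out of one C x H x W image."""
    c, h, w = img.shape
    crops = []
    for _ in range(n):
        top = torch.randint(0, h - size + 1, (1,)).item()
        left = torch.randint(0, w - size + 1, (1,)).item()
        crops.append(img[:, top:top + size, left:left + size])
    return torch.stack(crops)

class SedimentNet(nn.Module):
    """Small conv -> pool -> dropout -> fully connected classifier for 32 x 32 crops."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.drop = nn.Dropout(0.5)
        self.fc1 = nn.Linear(32 * 8 * 8, 64)
        self.fc2 = nn.Linear(64, n_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # 32x32 -> 16x16
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # 16x16 -> 8x8
        x = self.drop(torch.flatten(x, 1))
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# One arbitrarily sized grayscale image -> a batch of equally sized crops -> class scores.
image = torch.rand(1, 75, 90)                    # original images need not share a size
batch = random_crops(image, size=32, n=8)
logits = SedimentNet()(batch)
print(logits.shape)                              # torch.Size([8, 3])
```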

  13. Mathematics of Web science: structure, dynamics and incentives.

    PubMed

    Chayes, Jennifer

    2013-03-28

    Dr Chayes' talk described how, to a discrete mathematician, 'all the world's a graph, and all the people and domains merely vertices'. A graph is represented as a set of vertices V and a set of edges E, so that, for instance, in the World Wide Web, V is the set of pages and E the directed hyperlinks; in a social network, V is the people and E the set of relationships; and in the autonomous system Internet, V is the set of autonomous systems (such as AOL, Yahoo! and MSN) and E the set of connections. This means that mathematics can be used to study the Web (and other large graphs in the online world) in the following way: first, we can model online networks as large finite graphs; second, we can sample pieces of these graphs; third, we can understand and then control processes on these graphs; and fourth, we can develop algorithms for these graphs and apply them to improve the online experience.

  14. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hero, Alfred O.; Rajaratnam, Bala

    When can reliable inference be drawn in the ‘‘Big Data’’ context? This article presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for ‘‘Big Data.’’ Sample complexity, however, has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.

  15. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining

    DOE PAGES

    Hero, Alfred O.; Rajaratnam, Bala

    2015-12-09

    When can reliable inference be drawn in the ‘‘Big Data’’ context? This article presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for ‘‘Big Data.’’ Sample complexity, however, has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.

  16. Large landslides from oceanic volcanoes

    USGS Publications Warehouse

    Holcomb, R.T.; Searle, R.C.

    1991-01-01

    Large landslides are ubiquitous around the submarine flanks of Hawaiian volcanoes, and GLORIA has also revealed large landslides offshore from Tristan da Cunha and El Hierro. On both of the latter islands, steep flanks formerly attributed to tilting or marine erosion have been reinterpreted as landslide headwalls mantled by younger lava flows. These landslides occur in a wide range of settings and probably represent only a small sample from a large population. They may explain the large volumes of archipelagic aprons and the stellate shapes of many oceanic volcanoes. Large landslides and associated tsunamis pose hazards to many islands. -from Authors

  17. Performance evaluation of an importance sampling technique in a Jackson network

    NASA Astrophysics Data System (ADS)

    Mahdipour, Ebrahim; Masoud Rahmani, Amir; Setayeshi, Saeed

    2014-03-01

    Importance sampling is a technique that is commonly used to speed up Monte Carlo simulation of rare events. However, little is known regarding the design of efficient importance sampling algorithms in the context of queueing networks. The standard approach, which simulates the system using an a priori fixed change of measure suggested by large deviation analysis, has been shown to fail in even the simplest network settings. Estimating probabilities associated with rare events has been a topic of great importance in queueing theory, and in applied probability at large. In this article, we analyse the performance of an importance sampling estimator for a rare event probability in a Jackson network. This article applies strict deadlines to a two-node Jackson network with feedback whose arrival and service rates are modulated by an exogenous finite-state Markov process. We have estimated the probability of network blocking for various sets of parameters, and also the probability of missing the deadline of customers for different loads and deadlines. We have finally shown that the probability of total population overflow may be affected by various deadline values, service rates and arrival rates.
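
    The Markov-modulated two-node network studied in the article is too involved for a short example. The sketch below shows the textbook single-queue version of the same technique: the probability that an M/M/1 queue reaches level N before emptying is estimated by importance sampling under the classical change of measure that swaps arrival and service rates, and compared with the exact gambler's-ruin value. Parameter values are arbitrary, and the function name overflow_prob_is is illustrative.

```python
import numpy as np

def overflow_prob_is(lam, mu, N, n_runs=20000, seed=5):
    """Estimate P(M/M/1 queue reaches level N before emptying | one customer present)
    by importance sampling with arrival and service rates swapped."""
    rng = np.random.default_rng(seed)
    p_up = lam / (lam + mu)                # original up-step probability
    q_up = mu / (lam + mu)                 # IS measure: arrival and service swapped
    est = np.zeros(n_runs)
    for k in range(n_runs):
        level, lr = 1, 1.0
        while 0 < level < N:
            if rng.random() < q_up:        # step up under the IS measure
                level += 1
                lr *= p_up / q_up          # likelihood ratio of an up-step
            else:
                level -= 1
                lr *= (1 - p_up) / (1 - q_up)
        est[k] = lr if level == N else 0.0
    return est.mean(), est.std() / np.sqrt(n_runs)

lam, mu, N = 0.3, 1.0, 20
exact = ((mu / lam) - 1) / ((mu / lam) ** N - 1)   # gambler's-ruin formula
print("IS estimate (value, std. error):", overflow_prob_is(lam, mu, N), "exact:", exact)
```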

  18. Catch of channel catfish with tandem-set hoop nets and gill nets in lentic systems of Nebraska

    USGS Publications Warehouse

    Richters, Lindsey K.; Pope, Kevin L.

    2011-01-01

    Twenty-six Nebraska water bodies representing two ecosystem types (small standing waters and large standing waters) were surveyed during 2008 and 2009 with tandem-set hoop nets and experimental gill nets to determine if similar trends existed in catch rates and size structures of channel catfish Ictalurus punctatus captured with these gears. Gear efficiency was assessed as the number of sets (nets) that would be required to capture 100 channel catfish given observed catch per unit effort (CPUE). Efficiency of gill nets was not correlated with efficiency of hoop nets for capturing channel catfish. Small sample sizes prohibited estimation of proportional size distributions in most surveys; in the four surveys for which sample size was sufficient to quantify length-frequency distributions of captured channel catfish, distributions differed between gears. The CPUE of channel catfish did not differ between small and large water bodies for either gear. While catch rates of hoop nets were lower than rates recorded in previous studies, this gear was more efficient than gill nets at capturing channel catfish. However, comparisons of size structure between gears may be problematic.

  19. Large strain dynamic compression for soft materials using a direct impact experiment

    NASA Astrophysics Data System (ADS)

    Meenken, T.; Hiermaier, S.

    2006-08-01

    Measurement of strain rate dependent material data of low density low strength materials like polymeric foams and rubbers still poses challenges of a different kind to the experimental setup. For instance, in conventional Split Hopkinson Pressure Bar tests the impedance mismatch between the bars and the specimen makes strain measurement almost impossible. Application of viscoelastic bars poses new problems with wave dispersion. Also, maximum achievable strains and strain rates depend directly on the bar lengths, resulting in large experimental setups in order to measure relevant data for automobile crash applications. In this paper a modified SHPB will be presented for testing low impedance materials. High strains can be achieved with nearly constant strain rate. A thin film stress measurement has been applied to the specimen/bar interfaces to investigate the initial sample ring up process. The process of stress homogeneity within the sample was investigated on EPDM and PU rubber.

  20. Two-Phase and Graph-Based Clustering Methods for Accurate and Efficient Segmentation of Large Mass Spectrometry Images.

    PubMed

    Dexter, Alex; Race, Alan M; Steven, Rory T; Barnes, Jennifer R; Hulme, Heather; Goodwin, Richard J A; Styles, Iain B; Bunch, Josephine

    2017-11-07

    Clustering is widely used in MSI to segment anatomical features and differentiate tissue types, but existing approaches are both CPU and memory-intensive, limiting their application to small, single data sets. We propose a new approach that uses a graph-based algorithm with a two-phase sampling method that overcomes this limitation. We demonstrate the algorithm on a range of sample types and show that it can segment anatomical features that are not identified using commonly employed algorithms in MSI, and we validate our results on synthetic MSI data. We show that the algorithm is robust to fluctuations in data quality by successfully clustering data with a designed-in variance using data acquired with varying laser fluence. Finally, we show that this method is capable of generating accurate segmentations of large MSI data sets acquired on the newest generation of MSI instruments and evaluate these results by comparison with histopathology.
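
    The MSI-specific implementation is not reproduced here. The sketch below shows only the generic two-phase idea on synthetic data: a graph-based method (spectral clustering on a nearest-neighbour affinity) is run on a small random sample of "spectra", and the resulting clustering is propagated to the full data set by nearest-centroid assignment. Sample sizes, cluster counts, and the synthetic features are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(6)
# Synthetic "spectra": three tissue-like groups in a 50-dimensional feature space.
X = np.vstack([rng.normal(m, 0.5, size=(2000, 50)) for m in (0.0, 2.0, 4.0)])

# Phase 1: graph-based clustering on a small random sample only.
sample_idx = rng.choice(len(X), size=300, replace=False)
sample = X[sample_idx]
labels_sample = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                                   n_neighbors=10, random_state=0).fit_predict(sample)

# Phase 2: propagate the sample clustering to all spectra via nearest centroid.
centroids = np.vstack([sample[labels_sample == k].mean(axis=0) for k in range(3)])
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels_all = d.argmin(axis=1)
print("cluster sizes over the full data set:", np.bincount(labels_all))
```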

  1. Next-Generation Pathology.

    PubMed

    Caie, Peter D; Harrison, David J

    2016-01-01

    The field of pathology is rapidly transforming from a semiquantitative and empirical science toward a big data discipline. Large data sets from across multiple omics fields may now be extracted from a patient's tissue sample. Tissue is, however, complex, heterogeneous, and prone to artifact. A reductionist view of tissue and disease progression, which does not take this complexity into account, may lead to single biomarkers failing in clinical trials. The integration of standardized multi-omics big data and the retention of valuable information on spatial heterogeneity are imperative to model complex disease mechanisms. Mathematical modeling through systems pathology approaches is the ideal medium to distill the significant information from these large, multi-parametric, and hierarchical data sets. Systems pathology may also predict the dynamical response of disease progression or response to therapy regimens from a static tissue sample. Next-generation pathology will incorporate big data with systems medicine in order to personalize clinical practice for both prognostic and predictive patient care.

  2. Design of Phase II Non-inferiority Trials.

    PubMed

    Jung, Sin-Ho

    2017-09-01

    With the development of inexpensive treatment regimens and less invasive surgical procedures, we are confronted with non-inferiority study objectives. A non-inferiority phase III trial requires a roughly four times larger sample size than that of a similar standard superiority trial. Because of the large required sample size, we often face feasibility issues to open a non-inferiority trial. Furthermore, due to lack of phase II non-inferiority trial design methods, we do not have an opportunity to investigate the efficacy of the experimental therapy through a phase II trial. As a result, we often fail to open a non-inferiority phase III trial and a large number of non-inferiority clinical questions still remain unanswered. In this paper, we want to develop some designs for non-inferiority randomized phase II trials with feasible sample sizes. At first, we review a design method for non-inferiority phase III trials. Subsequently, we propose three different designs for non-inferiority phase II trials that can be used under different settings. Each method is demonstrated with examples. Each of the proposed design methods is shown to require a reasonable sample size for non-inferiority phase II trials. The three different non-inferiority phase II trial designs are used under different settings, but require similar sample sizes that are typical for phase II trials.
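
    None of the three proposed phase II designs is reproduced here. As background for the remark that non-inferiority trials need substantially more patients, the sketch below evaluates the standard normal-approximation sample-size formula for a non-inferiority comparison of two proportions; the notation and the example rates and margin follow textbook convention, not the paper's designs.

```python
from scipy.stats import norm

def n_per_arm_noninferiority(p_control, p_experimental, margin, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a one-sided non-inferiority test of
    two proportions, H0: p_experimental - p_control <= -margin."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    variance = p_control * (1 - p_control) + p_experimental * (1 - p_experimental)
    return (z_a + z_b) ** 2 * variance / (p_experimental - p_control + margin) ** 2

# Equal true response rates of 60% and a 10-percentage-point margin (illustrative values):
print(round(n_per_arm_noninferiority(0.60, 0.60, 0.10)))   # several hundred patients per arm
```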

  3. The Neo Personality Inventory-Revised: Factor Structure and Gender Invariance from Exploratory Structural Equation Modeling Analyses in a High-Stakes Setting

    ERIC Educational Resources Information Center

    Furnham, Adrian; Guenole, Nigel; Levine, Stephen Z.; Chamorro-Premuzic, Tomas

    2013-01-01

    This study presents new analyses of NEO Personality Inventory-Revised (NEO-PI-R) responses collected from a large British sample in a high-stakes setting. The authors show the appropriateness of the five-factor model underpinning these responses in a variety of new ways. Using the recently developed exploratory structural equation modeling (ESEM)…

  4. Personality Disorders in Substance Abusers: A Comparison of Patients Treated in a Prison Unit and Patients Treated in Inpatient Treatment

    ERIC Educational Resources Information Center

    Stefansson, Ragnar; Hesse, Morten

    2008-01-01

    A large body of literature has shown a high prevalence of personality disorders in substance abusers. We compared a sample of substance abusers treated in a prison setting with substance abusers treated in a non-prison inpatient setting rated with the Millon Clinical Multiaxial Inventory-III. Base-rate scores indicated a prevalence of 95% of…

  5. Understanding the role of conscientiousness in healthy aging: where does the brain come in?

    PubMed

    Patrick, Christopher J

    2014-05-01

    In reviewing this impressive series of articles, I was struck by 2 points in particular: (a) the fact that the empirically oriented articles focused on analyses of data from very large samples, with the articles by Friedman, Kern, Hampson, and Duckworth (2014) and Kern, Hampson, Goldberg, and Friedman (2014) highlighting an approach to merging existing data sets through use of "metric bridges" to address key questions not addressable through 1 data set alone, and (b) the fact that the articles as a whole included limited mention of neuroscientific (i.e., brain research) concepts, methods, and findings. One likely reason for the lack of reference to brain-oriented work is the persisting gap between smaller sample size lab-experimental and larger sample size multivariate-correlational approaches to psychological research. As a strategy for addressing this gap and bringing a distinct neuroscientific component to the National Institute on Aging's conscientiousness and health initiative, I suggest that the metric bridging approach highlighted by Friedman and colleagues could be used to connect existing large-scale data sets containing both neurophysiological variables and measures of individual difference constructs to other data sets containing richer arrays of nonphysiological variables, including data from longitudinal or twin studies focusing on personality and health-related outcomes (e.g., Terman Life Cycle study and Hawaii longitudinal studies, as described in the article by Kern et al., 2014). (PsycINFO Database Record (c) 2014 APA, all rights reserved).

  6. Gravel Transport Measured With Bedload Traps in Mountain Streams: Field Data Sets to be Published

    NASA Astrophysics Data System (ADS)

    Bunte, K.; Swingle, K. W.; Abt, S. R.; Ettema, R.; Cenderelli, D. A.

    2017-12-01

    Direct, accurate measurements of coarse bedload transport exist for only a few streams worldwide, because the task is laborious and requires a suitable device. However, sets of accurate field data would be useful for reference with unsampled sites and as a basis for model developments. The authors have carefully measured gravel transport and are compiling their data sets for publication. To ensure accurate measurements of gravel bedload in wadeable flow, the designed instrument consisted of an unflared aluminum frame (0.3 x 0.2 m) large enough for entry of cobbles. The attached 1 m or longer net with a 4 mm mesh held large bedload volumes. The frame was strapped onto a ground plate anchored onto the channel bed. This setup avoided involuntary sampler particle pick-up and enabled long sampling times, integrating over fluctuating transport. Beveled plates and frames facilitated easy particle entry. Accelerating flow over smooth plates compensated for deceleration within the net. Spacing multiple frames by 1 m enabled sampling much of the stream width. Long deployment, and storage of sampled bedload away from the frame's entrance, were attributes of traps rather than samplers; hence the name "bedload traps". The authors measured gravel transport with 4-6 bedload traps per cross-section at 10 mountain streams in CO, WY, and OR, accumulating 14 data sets (>1,350 samples). In 10 data sets, measurements covered much of the snowmelt high-flow season yielding 50-200 samples. Measurement time was typically 1 hour but ranged from 3 minutes to 3 hours, depending on transport intensity. Measuring back-to-back provided 6 to 10 samples over a 6 to 10-hour field day. Bedload transport was also measured with a 3-inch Helley-Smith sampler. The data set provides fractional (0.5 phi) transport rates in terms of particle mass and number for each bedload trap in the cross-section, the largest particle size, as well as total cross-sectional gravel transport rates. Ancillary field data include stage, discharge, long-term flow records if available, surface and subsurface sediment sizes, as well as longitudinal and cross-sectional site surveys. Besides transport relations, incipient motion conditions, hysteresis, and lateral variation, the data provide a reliable modeling basis to test insights and hypotheses regarding bedload transport.

  7. Mining big data sets of plankton images: a zero-shot learning approach to retrieve labels without training data

    NASA Astrophysics Data System (ADS)

    Orenstein, E. C.; Morgado, P. M.; Peacock, E.; Sosik, H. M.; Jaffe, J. S.

    2016-02-01

    Technological advances in instrumentation and computing have allowed oceanographers to develop imaging systems capable of collecting extremely large data sets. With the advent of in situ plankton imaging systems, scientists must now commonly deal with "big data" sets containing tens of millions of samples spanning hundreds of classes, making manual classification untenable. Automated annotation methods are now considered to be the bottleneck between collection and interpretation. Typically, such classifiers learn to approximate a function that predicts a predefined set of classes for which a considerable amount of labeled training data is available. The requirement that the training data span all the classes of concern is problematic for plankton imaging systems since they sample such diverse, rapidly changing populations. These data sets may contain relatively rare, sparsely distributed, taxa that will not have associated training data; a classifier trained on a limited set of classes will miss these samples. The computer vision community, leveraging advances in Convolutional Neural Networks (CNNs), has recently attempted to tackle such problems using "zero-shot" object categorization methods. Under a zero-shot framework, a classifier is trained to map samples onto a set of attributes rather than a class label. These attributes can include visual and non-visual information such as what an organism is made out of, where it is distributed globally, or how it reproduces. A second stage classifier is then used to extrapolate a class. In this work, we demonstrate a zero-shot classifier, implemented with a CNN, to retrieve out-of-training-set labels from images. This method is applied to data from two continuously imaging, moored instruments: the Scripps Plankton Camera System (SPCS) and the Imaging FlowCytobot (IFCB). Results from simulated deployment scenarios indicate zero-shot classifiers could be successful at recovering samples of rare taxa in image sets. This capability will allow ecologists to identify trends in the distribution of difficult to sample organisms in their data.
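
    The CNN attribute predictor is omitted; the sketch below illustrates only the second stage of a zero-shot classifier as described in the abstract: a predicted attribute vector is matched, by cosine similarity, against per-class attribute signatures, including a class for which no training images exist. The attribute names, signatures, and class labels are invented for illustration.

```python
import numpy as np

# Per-class attribute signatures (binary here): e.g. has-spines, is-chain-forming,
# is-photosynthetic, has-flagella. One class never appears in the training images.
class_attributes = {
    "diatom_chain":     np.array([0, 1, 1, 0]),
    "dinoflagellate":   np.array([0, 0, 1, 1]),
    "copepod":          np.array([1, 0, 0, 0]),
    "rare_radiolarian": np.array([1, 1, 0, 0]),   # no training data for this class
}

def zero_shot_label(attribute_scores, class_attributes):
    """Assign the class whose attribute signature is most similar (cosine) to the
    attribute vector predicted by the first-stage image model."""
    names = list(class_attributes)
    sigs = np.vstack([class_attributes[n] for n in names]).astype(float)
    a = np.asarray(attribute_scores, dtype=float)
    sims = sigs @ a / (np.linalg.norm(sigs, axis=1) * np.linalg.norm(a) + 1e-12)
    return names[int(np.argmax(sims))]

# A first-stage CNN (not shown) might output these attribute probabilities for one image:
predicted_attributes = [0.9, 0.8, 0.1, 0.05]
print(zero_shot_label(predicted_attributes, class_attributes))   # -> "rare_radiolarian"
```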

  8. A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification.

    PubMed

    Jiang, Wenyu; Simon, Richard

    2007-12-20

    This paper first provides a critical review on some existing methods for estimating the prediction error in classifying microarray data where the number of genes greatly exceeds the number of specimens. Special attention is given to the bootstrap-related methods. When the sample size n is small, we find that all the reviewed methods suffer from either substantial bias or variability. We introduce a repeated leave-one-out bootstrap (RLOOB) method that predicts for each specimen in the sample using bootstrap learning sets of size ln. We then propose an adjusted bootstrap (ABS) method that fits a learning curve to the RLOOB estimates calculated with different bootstrap learning set sizes. The ABS method is robust across the situations we investigate and provides a slightly conservative estimate for the prediction error. Even with small samples, it does not suffer from large upward bias as the leave-one-out bootstrap and the 0.632+ bootstrap, and it does not suffer from large variability as the leave-one-out cross-validation in microarray applications. Copyright (c) 2007 John Wiley & Sons, Ltd.
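
    The RLOOB and ABS estimators are defined in the paper itself; the sketch below implements only the plain leave-one-out bootstrap they build on: each specimen is predicted by classifiers trained on bootstrap learning sets that happen to exclude it, and the errors are averaged. The toy "microarray-like" data and the k-nearest-neighbour classifier are placeholder assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def loo_bootstrap_error(X, y, make_clf, n_boot=100, seed=7):
    """Leave-one-out bootstrap: average error of each sample under classifiers
    trained on bootstrap resamples that do not contain that sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errs, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)               # bootstrap learning set
        out = np.setdiff1d(np.arange(n), idx)          # samples left out of this resample
        if len(out) == 0 or len(np.unique(y[idx])) < 2:
            continue
        clf = make_clf().fit(X[idx], y[idx])
        errs[out] += clf.predict(X[out]) != y[out]
        counts[out] += 1
    valid = counts > 0
    return np.mean(errs[valid] / counts[valid])

# Small "microarray-like" toy problem: few specimens, many features.
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 500))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 1.0                                   # weak signal in 5 of 500 genes
print(loo_bootstrap_error(X, y, lambda: KNeighborsClassifier(3)))
```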

  9. Advantages and challenges in automated apatite fission track counting

    NASA Astrophysics Data System (ADS)

    Enkelmann, E.; Ehlers, T. A.

    2012-04-01

    Fission track thermochronometer data are often a core element of modern tectonic and denudation studies. Soon after the development of the fission track methods, interest emerged in developing an automated counting procedure to replace the time-consuming labor of counting fission tracks under the microscope. Automated track counting became feasible in recent years with increasing improvements in computer software and hardware. One such example used in this study is the commercial automated fission track counting procedure from Autoscan Systems Pty, which has been highlighted in several venues. We conducted experiments that are designed to reliably and consistently test the ability of this fully automated counting system to recognize fission tracks in apatite and a muscovite external detector. Fission tracks were analyzed in samples with a step-wise increase in sample complexity. The first set of experiments used a large (mm-size) slice of Durango apatite cut parallel to the prism plane. Second, samples with 80-200 μm apatite grains of the Fish Canyon Tuff were analyzed. This second sample set is characterized by complexities often found in apatites in different rock types. In addition to the automated counting procedure, the same samples were also analyzed using conventional counting procedures. We found for all samples that the fully automated fission track counting procedure using the Autoscan System yields a larger scatter in the fission track densities measured compared to conventional (manual) track counting. This scatter typically resulted from the false identification of tracks due to surface and mineralogical defects, regardless of the image filtering procedure used. Large differences between track densities analyzed with the automated counting persisted between different grains analyzed in one sample as well as between different samples. As a result of these differences, a manual correction of the fully automated fission track counts is necessary for each individual surface area and grain counted. This manual correction procedure significantly increases (up to four times) the time required to analyze a sample with the automated counting procedure compared to the conventional approach.

  10. The structure of Turkish trait-descriptive adjectives.

    PubMed

    Somer, O; Goldberg, L R

    1999-03-01

    This description of the Turkish lexical project reports some initial findings on the structure of Turkish personality-related variables. In addition, it provides evidence on the effects of target evaluative homogeneity vs. heterogeneity (e.g., samples of well-liked target individuals vs. samples of both liked and disliked targets) on the resulting factor structures, and thus it provides a first test of the conclusions reached by D. Peabody and L. R. Goldberg (1989) using English trait terms. In 2 separate studies, and in 2 types of data sets, clear versions of the Big Five factor structure were found. And both studies replicated and extended the findings of Peabody and Goldberg; virtually orthogonal factors of relatively equal size were found in the homogeneous samples, and a more highly correlated set of factors with relatively large Agreeableness and Conscientiousness dimensions was found in the heterogeneous samples.

  11. Multiplatform sampling (ship, aircraft, and satellite) of a Gulf Stream warm core ring

    NASA Technical Reports Server (NTRS)

    Smith, Raymond C.; Brown, Otis B.; Hoge, Frank E.; Baker, Karen S.; Evans, Robert H.

    1987-01-01

    The purpose of this paper is to demonstrate that the need to measure distributions of physical and biological properties of the ocean synoptically over large areas and over long time periods can be met by remote sensing using contemporaneous buoy, ship, aircraft, and satellite (i.e., multiplatform) sampling strategies. A mapping of sea surface temperature and chlorophyll fields in a Gulf Stream warm core ring using the multiplatform approach is described. Sampling capabilities of each sensing system are discussed as background for the data collected by means of these three dissimilar methods. Commensurate space/time sample sets from each sensing system are compared, and their relative accuracies in space and time are determined. The three-dimensional composite maps derived from the data set provide a synoptic perspective unobtainable from single platforms alone.

  12. Gibbs sampling on large lattice with GMRF

    NASA Astrophysics Data System (ADS)

    Marcotte, Denis; Allard, Denis

    2018-02-01

    Gibbs sampling is routinely used to sample truncated Gaussian distributions. These distributions naturally occur when associating latent Gaussian fields to category fields obtained by discrete simulation methods like multipoint, sequential indicator simulation and object-based simulation. The latent Gaussians are often used in data assimilation and history matching algorithms. When the Gibbs sampling is applied on a large lattice, the computing cost can become prohibitive. The usual practice of using local neighborhoods is unsatisfying as it can diverge and it does not reproduce exactly the desired covariance. A better approach is to use Gaussian Markov Random Fields (GMRF), which make it possible to compute the conditional distributions at any point without having to compute and invert the full covariance matrix. As the GMRF is locally defined, it allows simultaneous updating of all points that do not share neighbors (coding sets). We propose a new simultaneous Gibbs updating strategy on coding sets that can be efficiently computed by convolution and applied with an acceptance/rejection method in the truncated case. We study empirically the speed of convergence and the effects of the choice of boundary conditions, the correlation range and the GMRF smoothness. We show that the convergence is slower in the Gaussian case on the torus than for the finite case studied in the literature. However, in the truncated Gaussian case, we show that short scale correlation is quickly restored and the conditioning categories at each lattice point imprint the long scale correlation. Hence our approach makes it realistic to apply Gibbs sampling on large 2D or 3D lattices with the desired GMRF covariance.
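
    The coding-set idea can be illustrated with a simple first-order GMRF on a 2D lattice, where the two checkerboard colors form coding sets and each can be updated in a single vectorized draw; the precision parameters below are illustrative, and the paper's convolution-based updates and the truncation/acceptance step are not reproduced.

      import numpy as np

      def gibbs_sweep(x, beta=0.24, kappa=1.0, rng=None):
          """One Gibbs sweep over the two checkerboard coding sets of a 2D lattice (in place)."""
          rng = np.random.default_rng() if rng is None else rng
          n, m = x.shape
          ii, jj = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
          for parity in (0, 1):
              mask = (ii + jj) % 2 == parity
              # sum of the four nearest neighbours (periodic boundaries for simplicity)
              nb = (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
                    np.roll(x, 1, 1) + np.roll(x, -1, 1))
              cond_mean = beta * nb              # conditional mean given the neighbours
              cond_sd = 1.0 / np.sqrt(kappa)     # conditional standard deviation
              draw = rng.normal(cond_mean, cond_sd)
              x[mask] = draw[mask]               # update a whole coding set at once
          return x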

  13. GARN: Sampling RNA 3D Structure Space with Game Theory and Knowledge-Based Scoring Strategies.

    PubMed

    Boudard, Mélanie; Bernauer, Julie; Barth, Dominique; Cohen, Johanne; Denise, Alain

    2015-01-01

    Cellular processes involve large numbers of RNA molecules. The functions of these RNA molecules and their binding to molecular machines are highly dependent on their 3D structures. One of the key challenges in RNA structure prediction and modeling is predicting the spatial arrangement of the various structural elements of RNA. As RNA folding is generally hierarchical, methods involving coarse-grained models hold great promise for this purpose. We present here a novel coarse-grained method for sampling, based on game theory and knowledge-based potentials. This strategy, GARN (Game Algorithm for RNa sampling), is often much faster than previously described techniques and generates large sets of solutions closely resembling the native structure. GARN is thus a suitable starting point for the molecular modeling of large RNAs, particularly those with experimental constraints. GARN is available from: http://garn.lri.fr/.

  14. Comparison of taxon-specific versus general locus sets for targeted sequence capture in plant phylogenomics.

    PubMed

    Chau, John H; Rahfeldt, Wolfgang A; Olmstead, Richard G

    2018-03-01

    Targeted sequence capture can be used to efficiently gather sequence data for large numbers of loci, such as single-copy nuclear loci. Most published studies in plants have used taxon-specific locus sets developed individually for a clade using multiple genomic and transcriptomic resources. General locus sets can also be developed from loci that have been identified as single-copy and have orthologs in large clades of plants. We identify and compare a taxon-specific locus set and three general locus sets (conserved ortholog set [COSII], shared single-copy nuclear [APVO SSC] genes, and pentatricopeptide repeat [PPR] genes) for targeted sequence capture in Buddleja (Scrophulariaceae) and outgroups. We evaluate their performance in terms of assembly success, sequence variability, and resolution and support of inferred phylogenetic trees. The taxon-specific locus set had the most target loci. Assembly success was high for all locus sets in Buddleja samples. For outgroups, general locus sets had greater assembly success. Taxon-specific and PPR loci had the highest average variability. The taxon-specific data set produced the best-supported tree, but all data sets showed improved resolution over previous non-sequence capture data sets. General locus sets can be a useful source of sequence capture targets, especially if multiple genomic resources are not available for a taxon.

  15. Automated Sample Preparation for Radiogenic and Non-Traditional Metal Isotopes: Removing an Analytical Barrier for High Sample Throughput

    NASA Astrophysics Data System (ADS)

    Field, M. Paul; Romaniello, Stephen; Gordon, Gwyneth W.; Anbar, Ariel D.; Herrmann, Achim; Martinez-Boti, Miguel A.; Anagnostou, Eleni; Foster, Gavin L.

    2014-05-01

    MC-ICP-MS has dramatically improved the analytical throughput for high-precision radiogenic and non-traditional isotope ratio measurements, compared to TIMS. The generation of large data sets, however, remains hampered by tedious manual drip chromatography required for sample purification. A new, automated chromatography system reduces the laboratory bottleneck and expands the utility of high-precision isotope analyses in applications where large data sets are required: geochemistry, forensic anthropology, nuclear forensics, medical research and food authentication. We have developed protocols to automate ion exchange purification for several isotopic systems (B, Ca, Fe, Cu, Zn, Sr, Cd, Pb and U) using the new prepFAST-MC™ (ESI, Nebraska, Omaha). The system is not only inert (all-fluoropolymer flow paths), but is also very flexible and can easily facilitate different resins, samples, and reagent types. When programmed, precise and accurate user-defined volumes and flow rates are implemented to automatically load samples, wash the column, condition the column and elute fractions. Unattended, the automated, low-pressure ion exchange chromatography system can process up to 60 samples overnight. Excellent reproducibility, reliability and recovery, with low blanks and carryover for samples in a variety of different matrices, have been demonstrated, giving accurate and precise isotope ratios within analytical error for several isotopic systems (B, Ca, Fe, Cu, Zn, Sr, Cd, Pb and U). This illustrates the potential of the new prepFAST-MC™ (ESI, Nebraska, Omaha) as a powerful tool in radiogenic and non-traditional isotope research.

  16. A mixture model-based approach to the clustering of microarray expression data.

    PubMed

    McLachlan, G J; Bean, R W; Peel, D

    2002-03-01

    This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. EMMIX-GENE is available at http://www.maths.uq.edu.au/~gjm/emmix-gene/
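
    The gene-screening step can be sketched as follows, with Gaussian mixtures standing in for the t mixtures used by EMMIX-GENE and an illustrative threshold; this is not the package itself.

      import numpy as np
      from sklearn.mixture import GaussianMixture

      def lr_statistic(values):
          """Likelihood ratio statistic for one vs. two mixture components for a single gene."""
          v = np.asarray(values, dtype=float).reshape(-1, 1)
          ll1 = GaussianMixture(1, random_state=0).fit(v).score(v) * len(v)
          ll2 = GaussianMixture(2, n_init=5, random_state=0).fit(v).score(v) * len(v)
          return 2.0 * (ll2 - ll1)

      def select_genes(expression, threshold=8.0):
          """expression: genes x tissues matrix; keep genes with large statistics."""
          stats = np.array([lr_statistic(row) for row in expression])
          ranked = np.argsort(stats)[::-1]       # rank genes by decreasing statistic
          return ranked[stats[ranked] > threshold], stats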

  17. Fog collecting biomimetic surfaces: Influence of microstructure and wettability.

    PubMed

    Azad, M A K; Ellerbrok, D; Barthlott, W; Koch, K

    2015-01-19

    We analyzed the fog collection efficiency of three different sets of samples: replica (with and without microstructures), copper wire (smooth and microgrooved) and polyolefin mesh (hydrophilic, superhydrophilic and hydrophobic). The collection efficiency of the samples was compared in each set separately to investigate the influence of microstructures and/or the wettability of the surfaces on fog collection. Under the controlled experimental conditions chosen here, large differences in efficiency were found. We found that microstructured plant replica samples collected 2-3 times more water than unstructured (smooth) samples. Copper wire samples showed similar results. Moreover, water droplets dripped faster from microgrooved wires than from smooth wires. The superhydrophilic mesh tested here proved more efficient than any of the other mesh samples with different wettability. The amount of fog collected by the superhydrophilic mesh was about 5 times higher than that of the hydrophilic (untreated) mesh and about 2 times higher than that of the hydrophobic mesh.

  18. Bedload Rating and Flow Competence Curves Vary With Watershed and Bed Material Parameters

    NASA Astrophysics Data System (ADS)

    Bunte, K.; Abt, S. R.

    2003-12-01

    Bedload transport rating curves and flow competence curves (largest bedload size for specified flow) are usually not known for streams unless a large number of bedload samples has been collected and analyzed. However, this information is necessary for assessing instream flow needs and stream responses to watershed effects. This study therefore analyzed whether bedload transport rating and flow competence curves were related to stream parameters. Bedload transport rating curves and flow competence curves were obtained from extensive bedload sampling in six gravel- and cobble-bed mountain streams. Samples were collected using bedload traps and a large net sampler, both of which provide steep and relatively well-defined bedload rating and flow competence curves due to a long sampling duration, a large sampler opening and a large sampler capacity. The sampled streams have snowmelt regimes, steep (1-9%) gradients, and watersheds that are mainly forested and relatively undisturbed with basin area sizes of 8 to 105 km2. The channels are slightly incised and can contain flows of more than 1.5 times bankfull with little overbank flow. Exponents of bedload rating and flow competence curves obtained from these measurements were found to systematically increase with basin area size and decrease with the degree of channel armoring. By contrast, coefficients of bedload rating and flow competence curves decreased with basin size and increased with armoring. All of these relationships were well-defined (0.86 < r2 < 0.99). Data sets from other studies in coarse-bedded streams fit the indicated trend if the sampling device used allows measuring bedload transport rates over a wide range and if bedload supply is somewhat low. The existence of a general positive trend between bedload rating curve exponents and basin area, and a negative trend between coefficients and basin area, is confirmed by a large data set of bedload rating curves obtained from Helley-Smith samples. However, in this case, the trends only become visible as basin area sizes span a wide range (1 - 10,000 km2). The well-defined relationships obtained from the bedload trap and the large net sampler suggest that exponents and coefficients of bedload transport rating curves (and flow competence curves) are predictable from an easily obtainable parameter such as basin size. However, the relationships of bedload rating curve exponents and coefficients with basin size and armoring appear to be influenced by the sampling device used and the watershed sediment production.
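
    A bedload rating curve of the kind discussed here is commonly written as a power law, Qb = a * Q**b; the toy fit below (with made-up numbers) shows how an exponent and coefficient can be recovered from paired discharge and transport measurements, purely for illustration.

      import numpy as np

      def fit_rating_curve(discharge, bedload):
          """Return (coefficient a, exponent b) from a log-log least-squares fit."""
          q = np.log(np.asarray(discharge, dtype=float))
          qb = np.log(np.asarray(bedload, dtype=float))
          b, log_a = np.polyfit(q, qb, 1)        # slope = exponent, intercept = ln(a)
          return np.exp(log_a), b

      # illustrative only (synthetic numbers, not data from the study)
      a, b = fit_rating_curve([1.2, 2.0, 3.5, 5.0], [0.01, 0.08, 0.6, 2.1])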

  19. Diagnosing intramammary infections: evaluation of definitions based on a single milk sample.

    PubMed

    Dohoo, I R; Smith, J; Andersen, S; Kelton, D F; Godden, S

    2011-01-01

    Criteria for diagnosing intramammary infections (IMI) have been debated for many years. Factors that may be considered in making a diagnosis include the organism of interest being found on culture, the number of colonies isolated, whether or not the organism was recovered in pure or mixed culture, and whether or not concurrent evidence of inflammation existed (often measured by somatic cell count). However, research using these criteria has been hampered by the lack of a "gold standard" test (i.e., a perfect test against which the criteria can be evaluated) and the need for very large data sets of culture results to have sufficient numbers of quarters with infections with a variety of organisms. This manuscript used 2 large data sets of culture results to evaluate several definitions (sets of criteria) for classifying a quarter as having, or not having, an IMI by comparing the results from a single culture to a gold standard diagnosis based on a set of 3 milk samples. The first consisted of 38,376 milk samples from which 25,886 triplicate sets of milk samples taken 1 wk apart were extracted. The second consisted of 784 quarters that were classified as infected or not based on a set of 3 milk samples collected at 2-d intervals. From these quarters, a total of 3,136 additional samples were evaluated. A total of 12 definitions (named A to L) based on combinations of the number of colonies isolated, whether or not the organism was recovered in pure or mixed culture, and the somatic cell count were evaluated for each organism (or group of organisms) with sufficient data. The sensitivity (ability of a definition to detect IMI) and the specificity (Sp; ability of a definition to correctly classify noninfected quarters) were both computed. For all species, except Staphylococcus aureus, the sensitivity of all definitions was <90% (and in many cases <50%). Consequently, if identifying as many existing infections as possible is important, then the criteria for considering a quarter positive should be a single colony (from a 0.01-mL milk sample) isolated (definition A). With the exception of "any organism" and coagulase-negative staphylococci, all Sp estimates were over 94% in the daily data and over 97% in the weekly data, suggesting that for most species, definition A may be acceptable. For coagulase-negative staphylococci, definition B (2 colonies from a 0.01-mL milk sample) raised the Sp to 92 and 95% in the daily and weekly data, respectively. For "any organism," using definition B raised the Sp to 88 and 93% in the 2 data sets, respectively. The final choice of definition will depend on the objectives of the study or control program for which the sample was collected. Copyright © 2011 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
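
    The sensitivity and specificity reported above reduce to simple counts once each quarter carries a definition-based call and a gold-standard call from the triplicate samples; a generic sketch (field names hypothetical) is:

      def sensitivity_specificity(records):
          """records: iterable of (definition_positive, gold_standard_infected) booleans."""
          tp = fp = tn = fn = 0
          for test_pos, infected in records:
              if infected:
                  tp += bool(test_pos)
                  fn += not test_pos
              else:
                  fp += bool(test_pos)
                  tn += not test_pos
          se = tp / (tp + fn) if (tp + fn) else float("nan")  # ability to detect IMI
          sp = tn / (tn + fp) if (tn + fp) else float("nan")  # ability to clear noninfected quarters
          return se, sp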

  20. Differential relationships between set-shifting abilities and dimensions of insight in schizophrenia.

    PubMed

    Diez-Martin, J; Moreno-Ortega, M; Bagney, A; Rodriguez-Jimenez, R; Padilla-Torres, D; Sanchez-Morla, E M; Santos, J L; Palomo, T; Jimenez-Arriero, M A

    2014-01-01

    To assess insight in a large sample of patients with schizophrenia and to study its relationship with set shifting as an executive function. The insight of a sample of 161 clinically stable, community-dwelling patients with schizophrenia was evaluated by means of the Scale to Assess Unawareness of Mental Disorder (SUMD). Set shifting was measured using the Trail-Making Test time required to complete part B minus the time required to complete part A (TMT B-A). Linear regression analyses were performed to investigate the relationships of TMT B-A with different dimensions of general insight. Regression analyses revealed a significant association between TMT B-A and two of the SUMD general components: 'awareness of mental disorder' and 'awareness of the efficacy of treatment'. The 'awareness of social consequences' component was not significantly associated with set shifting. Our results show a significant relation between set shifting and insight, but not in the same manner for the different components of the SUMD general score. Copyright © 2013 S. Karger AG, Basel.

  1. Downslope coarsening in aeolian grainflows of the Navajo Sandstone

    NASA Astrophysics Data System (ADS)

    Loope, David B.; Elder, James F.; Sweeney, Mark R.

    2012-07-01

    Downslope coarsening in grainflows has been observed on present-day dunes and generated in labs, but few previous studies have examined vertical sorting in ancient aeolian grainflows. We studied the grainflow strata of the Jurassic Navajo Sandstone in the southern Utah portion of its outcrop belt from Zion National Park (west) to Coyote Buttes and The Dive (east). At each study site, thick sets of grainflow-dominated cross-strata that were deposited by large transverse dunes comprise the bulk of the Navajo Sandstone. We studied three stratigraphic columns, one per site, composed almost exclusively of aeolian cross-strata. For each column, samples were obtained from one grainflow stratum in each consecutive set of the column, for a total of 139 samples from thirty-two sets of cross-strata. To investigate grading perpendicular to bedding within individual grainflows, we collected fourteen samples from four superimposed grainflow strata at The Dive. Samples were analyzed with a Malvern Mastersizer 2000 laser diffraction particle analyser. The median grain size of grainflow samples ranges from fine sand (164 μm) to coarse sand (617 μm). Using Folk and Ward criteria, samples are well-sorted to moderately-well-sorted. All but one of the twenty-eight sets showed at least slight downslope coarsening, but in general, downslope coarsening was not as well-developed or as consistent as that reported in laboratory subaqueous grainflows. Because coarse sand should be quickly sequestered within preserved cross-strata when bedforms climb, grain-size studies may help to test hypotheses for the stacking of sets of cross-strata.

  2. Characterization and Analysis of Liquid Waste from Marcellus Shale Gas Development.

    PubMed

    Shih, Jhih-Shyang; Saiers, James E; Anisfeld, Shimon C; Chu, Ziyan; Muehlenbachs, Lucija A; Olmstead, Sheila M

    2015-08-18

    Hydraulic fracturing of shale for gas production in Pennsylvania generates large quantities of wastewater, the composition of which has been inadequately characterized. We compiled a unique data set from state-required wastewater generator reports filed in 2009-2011. The resulting data set, comprising 160 samples of flowback, produced water, and drilling wastes, analyzed for 84 different chemicals, is the most comprehensive available to date for Marcellus Shale wastewater. We analyzed the data set using the Kaplan-Meier method to deal with the high prevalence of nondetects for some analytes, and compared wastewater characteristics with permitted effluent limits and ambient monitoring limits and capacity. Major-ion concentrations suggested that most wastewater samples originated from dilution of brines, although some of our samples were more concentrated than any Marcellus brines previously reported. One problematic aspect of this wastewater was the very high concentrations of soluble constituents such as chloride, which are poorly removed by wastewater treatment plants; the vast majority of samples exceeded relevant water quality thresholds, generally by 2-3 orders of magnitude. We also examine the capacity of regional regulatory monitoring to assess and control these risks.

  3. Comparative Initial and Sustained Engagement in Web-based Training by Behavioral Healthcare Providers in New York State.

    PubMed

    Talley, Rachel; Chiang, I-Chin; Covell, Nancy H; Dixon, Lisa

    2018-06-01

    Improved dissemination is critical to implementation of evidence-based practice in community behavioral healthcare settings. Web-based training modalities are a promising strategy for dissemination of evidence-based practice in community behavioral health settings. Initial and sustained engagement of these modalities in large, multidisciplinary community provider samples is not well understood. This study evaluates comparative engagement and user preferences by provider type in a web-based training platform in a large, multidisciplinary community sample of behavioral health staff in New York State. Workforce make-up among platform registrants was compared to the general NYS behavioral health workforce. Training completion by functional job type was compared to characterize user engagement and preferences. Frequently completed modules were classified by credit and requirement incentives. High initial training engagement across professional role was demonstrated, with significant differences in initial and sustained engagement by professional role. The most frequently completed modules across functional job types contained credit or requirement incentives. The analysis demonstrated that high engagement of a web-based training in a multidisciplinary provider audience can be achieved without tailoring content to specific professional roles. Overlap between frequently completed modules and incentives suggests a role for incentives in promoting engagement of web-based training. These findings further the understanding of strategies to promote large-scale dissemination of evidence-based practice in community behavioral health settings.

  4. HiQuant: Rapid Postquantification Analysis of Large-Scale MS-Generated Proteomics Data.

    PubMed

    Bryan, Kenneth; Jarboui, Mohamed-Ali; Raso, Cinzia; Bernal-Llinares, Manuel; McCann, Brendan; Rauch, Jens; Boldt, Karsten; Lynn, David J

    2016-06-03

    Recent advances in mass-spectrometry-based proteomics are now facilitating ambitious large-scale investigations of the spatial and temporal dynamics of the proteome; however, the increasing size and complexity of these data sets is overwhelming current downstream computational methods, specifically those that support the postquantification analysis pipeline. Here we present HiQuant, a novel application that enables the design and execution of a postquantification workflow, including common data-processing steps, such as assay normalization and grouping, and experimental replicate quality control and statistical analysis. HiQuant also enables the interpretation of results generated from large-scale data sets by supporting interactive heatmap analysis and also the direct export to Cytoscape and Gephi, two leading network analysis platforms. HiQuant may be run via a user-friendly graphical interface and also supports complete one-touch automation via a command-line mode. We evaluate HiQuant's performance by analyzing a large-scale, complex interactome mapping data set and demonstrate a 200-fold improvement in the execution time over current methods. We also demonstrate HiQuant's general utility by analyzing proteome-wide quantification data generated from both a large-scale public tyrosine kinase siRNA knock-down study and an in-house investigation into the temporal dynamics of the KSR1 and KSR2 interactomes. Download HiQuant, sample data sets, and supporting documentation at http://hiquant.primesdb.eu .

  5. Multiplexed resequencing analysis to identify rare variants in pooled DNA with barcode indexing using next-generation sequencer.

    PubMed

    Mitsui, Jun; Fukuda, Yoko; Azuma, Kyo; Tozaki, Hirokazu; Ishiura, Hiroyuki; Takahashi, Yuji; Goto, Jun; Tsuji, Shoji

    2010-07-01

    We have recently found that multiple rare variants of the glucocerebrosidase gene (GBA) confer a robust risk for Parkinson disease, supporting the 'common disease-multiple rare variants' hypothesis. To develop an efficient method of identifying rare variants in a large number of samples, we applied multiplexed resequencing using a next-generation sequencer to identification of rare variants of GBA. Sixteen sets of pooled DNAs from six pooled DNA samples were prepared. Each set of pooled DNAs was subjected to polymerase chain reaction to amplify the target gene (GBA) covering 6.5 kb, pooled into one tube with barcode indexing, and then subjected to extensive sequence analysis using the SOLiD System. Individual samples were also subjected to direct nucleotide sequence analysis. With the optimization of data processing, we were able to extract all the variants from 96 samples with acceptable rates of false-positive single-nucleotide variants.

  6. ANALYZING CORRELATIONS BETWEEN STREAM AND WATERSHED ATTRIBUTES

    EPA Science Inventory

    Bivariate correlation analysis has been widely used to explore relationships between stream and watershed attributes that have all been measured on the same set of watersheds or sampling locations. Researchers routinely test H0: ρ = 0 for each correlation in a large table and then ...

  7. Using Realist Synthesis to Develop an Evidence Base from an Identified Data Set on Enablers and Barriers for Alcohol and Drug Program Implementation

    ERIC Educational Resources Information Center

    Hunter, Barbara; MacLean, Sarah; Berends, Lynda

    2012-01-01

    The purpose of this paper is to show how "realist synthesis" methodology (Pawson, 2002) was adapted to review a large sample of community based projects addressing alcohol and drug use problems. Our study drew on a highly varied sample of 127 projects receiving funding from a national non-government organisation in Australia between 2002…

  8. Scalable population estimates using spatial-stream-network (SSN) models, fish density surveys, and national geospatial database frameworks for streams

    Treesearch

    Daniel J. Isaak; Jay M. Ver Hoef; Erin E. Peterson; Dona L. Horan; David E. Nagel

    2017-01-01

    Population size estimates for stream fishes are important for conservation and management, but sampling costs limit the extent of most estimates to small portions of river networks that encompass 100s–10 000s of linear kilometres. However, the advent of large fish density data sets, spatial-stream-network (SSN) models that benefit from nonindependence among samples,...

  9. Investigation of rare and low-frequency variants using high-throughput sequencing with pooled DNA samples

    PubMed Central

    Wang, Jingwen; Skoog, Tiina; Einarsdottir, Elisabet; Kaartokallio, Tea; Laivuori, Hannele; Grauers, Anna; Gerdhem, Paul; Hytönen, Marjo; Lohi, Hannes; Kere, Juha; Jiao, Hong

    2016-01-01

    High-throughput sequencing using pooled DNA samples can facilitate genome-wide studies on rare and low-frequency variants in a large population. Some major questions concerning the pooling sequencing strategy are whether rare and low-frequency variants can be detected reliably, and whether estimated minor allele frequencies (MAFs) can represent the actual values obtained from individually genotyped samples. In this study, we evaluated MAF estimates using three variant detection tools with two sets of pooled whole exome sequencing (WES) and one set of pooled whole genome sequencing (WGS) data. Both GATK and Freebayes displayed high sensitivity, specificity and accuracy when detecting rare or low-frequency variants. For the WGS study, 56% of the low-frequency variants in Illumina array have identical MAFs and 26% have one allele difference between sequencing and individual genotyping data. The MAF estimates from WGS correlated well (r = 0.94) with those from Illumina arrays. The MAFs from the pooled WES data also showed high concordance (r = 0.88) with those from the individual genotyping data. In conclusion, the MAFs estimated from pooled DNA sequencing data reflect the MAFs in individually genotyped samples well. The pooling strategy can thus be a rapid and cost-effective approach for the initial screening in large-scale association studies. PMID:27633116
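
    The kind of comparison reported above can be sketched in a few lines: estimate minor allele frequencies from pooled read counts and correlate them with frequencies from individually genotyped samples (variable names are hypothetical, and the variant-calling step itself is not shown).

      import numpy as np

      def pooled_maf(alt_reads, total_reads):
          """Per-variant MAF estimate from pooled sequencing depths."""
          freq = np.asarray(alt_reads, dtype=float) / np.asarray(total_reads, dtype=float)
          return np.minimum(freq, 1.0 - freq)    # fold to the minor allele

      def concordance(pool_maf, genotyped_maf):
          """Pearson correlation between pooled and individually genotyped MAFs."""
          return np.corrcoef(pool_maf, genotyped_maf)[0, 1]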

  10. MicroRNA signatures in B-cell lymphomas

    PubMed Central

    Di Lisio, L; Sánchez-Beato, M; Gómez-López, G; Rodríguez, M E; Montes-Moreno, S; Mollejo, M; Menárguez, J; Martínez, M A; Alves, F J; Pisano, D G; Piris, M A; Martínez, N

    2012-01-01

    Accurate lymphoma diagnosis, prognosis and therapy still require additional markers. We explore the potential relevance of microRNA (miRNA) expression in a large series that included all major B-cell non-Hodgkin lymphoma (NHL) types. The data generated were also used to identify miRNAs differentially expressed in Burkitt lymphoma (BL) and diffuse large B-cell lymphoma (DLBCL) samples. A series of 147 NHL samples and 15 controls were hybridized on a human miRNA one-color platform containing probes for 470 human miRNAs. Each lymphoma type was compared against the entire set of NHLs. BL was also directly compared with DLBCL, and 43 preselected miRNAs were analyzed in a new series of routinely processed samples of 28 BLs and 43 DLBCLs using quantitative reverse transcription-polymerase chain reaction. A signature of 128 miRNAs enabled the characterization of lymphoma neoplasms, reflecting the lymphoma type, cell of origin and/or discrete oncogene alterations. Comparative analysis of BL and DLBCL yielded 19 differentially expressed miRNAs, which were confirmed in a second confirmation series of 71 paraffin-embedded samples. The set of differentially expressed miRNAs found here expands the range of potential diagnostic markers for lymphoma diagnosis, especially when differential diagnosis of BL and DLBCL is required. PMID:22829247

  11. Estimating clinical chemistry reference values based on an existing data set of unselected animals.

    PubMed

    Dimauro, Corrado; Bonelli, Piero; Nicolussi, Paola; Rassu, Salvatore P G; Cappio-Borlino, Aldo; Pulina, Giuseppe

    2008-11-01

    In an attempt to standardise the determination of biological reference values, the International Federation of Clinical Chemistry (IFCC) has published a series of recommendations on developing reference intervals. The IFCC recommends the use of an a priori sampling of at least 120 healthy individuals. However, such a high number of samples and laboratory analysis is expensive, time-consuming and not always feasible, especially in veterinary medicine. In this paper, an alternative (a posteriori) method is described and is used to determine reference intervals for biochemical parameters of farm animals using an existing laboratory data set. The method used was based on the detection and removal of outliers to obtain a large sample of animals likely to be healthy from the existing data set. This allowed the estimation of reliable reference intervals for biochemical parameters in Sarda dairy sheep. This method may also be useful for the determination of reference intervals for different species, ages and gender.
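
    A simplified sketch of such an a posteriori approach (the trimming rule here, Tukey fences, is an assumption rather than the authors' exact procedure): iteratively remove outliers from the unselected laboratory values, then take the central 95% of what remains as the reference interval.

      import numpy as np

      def reference_interval(values, k=1.5, max_iter=10):
          """Trim outliers iteratively, then return the 2.5th-97.5th percentile interval."""
          x = np.asarray(values, dtype=float)
          for _ in range(max_iter):
              q1, q3 = np.percentile(x, [25, 75])
              lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
              kept = x[(x >= lo) & (x <= hi)]
              if len(kept) == len(x):            # stop once no further values are excluded
                  break
              x = kept
          return tuple(np.percentile(x, [2.5, 97.5]))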

  12. Design of shared instruments to utilize simulated gravities generated by a large-gradient, high-field superconducting magnet.

    PubMed

    Wang, Y; Yin, D C; Liu, Y M; Shi, J Z; Lu, H M; Shi, Z H; Qian, A R; Shang, P

    2011-03-01

    A high-field superconducting magnet can provide both high magnetic fields and large field gradients, which can be used as a special environment for research or practical applications in materials processing, life science studies, physical and chemical reactions, etc. To make full use of a superconducting magnet, shared instruments (the operating platform, sample holders, temperature controller, and observation system) must be prepared as prerequisites. This paper introduces the design of a set of sample holders and a temperature controller in detail with an emphasis on validating the performance of the force and temperature sensors in the high magnetic field.

  14. Use of portable electronic devices in a hospital setting and their potential for bacterial colonization.

    PubMed

    Khan, Amber; Rao, Amitha; Reyes-Sacin, Carlos; Hayakawa, Kayoko; Szpunar, Susan; Riederer, Kathleen; Kaye, Keith; Fishbain, Joel T; Levine, Diane

    2015-03-01

    Portable electronic devices are increasingly being used in the hospital setting. As with other fomites, these devices represent a potential reservoir for the transmission of pathogens. We conducted a convenience sampling of devices in 2 large medical centers to identify bacterial colonization rates and potential risk factors. Copyright © 2015 Association for Professionals in Infection Control and Epidemiology, Inc. Published by Elsevier Inc. All rights reserved.

  15. Visual Word Recognition Across the Adult Lifespan

    PubMed Central

    Cohen-Shikora, Emily R.; Balota, David A.

    2016-01-01

    The current study examines visual word recognition in a large sample (N = 148) across the adult lifespan and across a large set of stimuli (N = 1187) in three different lexical processing tasks (pronunciation, lexical decision, and animacy judgments). Although the focus of the present study is on the influence of word frequency, a diverse set of other variables are examined as the system ages and acquires more experience with language. Computational models and conceptual theories of visual word recognition and aging make differing predictions for age-related changes in the system. However, these have been difficult to assess because prior studies have produced inconsistent results, possibly due to sample differences, analytic procedures, and/or task-specific processes. The current study confronts these potential differences by using three different tasks, treating age and word variables as continuous, and exploring the influence of individual differences such as vocabulary, vision, and working memory. The primary finding is remarkable stability in the influence of a diverse set of variables on visual word recognition across the adult age spectrum. This pattern is discussed in reference to previous inconsistent findings in the literature and implications for current models of visual word recognition. PMID:27336629

  16. The use of single-date MODIS imagery for estimating large-scale urban impervious surface fraction with spectral mixture analysis and machine learning techniques

    NASA Astrophysics Data System (ADS)

    Deng, Chengbin; Wu, Changshan

    2013-12-01

    Urban impervious surface information is essential for urban and environmental applications at the regional/national scales. As a popular image processing technique, spectral mixture analysis (SMA) has rarely been applied to coarse-resolution imagery due to the difficulty of deriving endmember spectra using traditional endmember selection methods, particularly within heterogeneous urban environments. To address this problem, we derived endmember signatures through a least squares solution (LSS) technique with known abundances of sample pixels, and integrated these endmember signatures into SMA for mapping large-scale impervious surface fraction. In addition, with the same sample set, we carried out objective comparative analyses among SMA (i.e. fully constrained and unconstrained SMA) and machine learning (i.e. Cubist regression tree and Random Forests) techniques. Analysis of results suggests three major conclusions. First, with the extrapolated endmember spectra from stratified random training samples, the SMA approaches performed relatively well, as indicated by small MAE values. Second, Random Forests yields more reliable results than Cubist regression tree, and its accuracy is improved with increased sample sizes. Finally, comparative analyses suggest a tentative guide for selecting an optimal approach for large-scale fractional imperviousness estimation: unconstrained SMA might be a favorable option with a small number of samples, while Random Forests might be preferred if a large number of samples are available.
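
    A minimal sketch of the two steps described above, under simplifying assumptions: endmember spectra are derived by least squares from sample pixels with known fractions, and new pixels are then unmixed with those endmembers. The solver below is unconstrained; fully constrained SMA adds sum-to-one and nonnegativity constraints that are omitted here.

      import numpy as np

      def derive_endmembers(sample_fractions, sample_spectra):
          """sample_fractions: (n_samples, n_endmembers); sample_spectra: (n_samples, n_bands)."""
          endmembers, *_ = np.linalg.lstsq(sample_fractions, sample_spectra, rcond=None)
          return endmembers                      # shape (n_endmembers, n_bands)

      def unmix(pixel_spectra, endmembers):
          """Estimate fractions for pixels of shape (n_pixels, n_bands)."""
          fractions, *_ = np.linalg.lstsq(endmembers.T, pixel_spectra.T, rcond=None)
          return fractions.T                     # impervious fraction = column of its endmember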

  17. Constructing DNA Barcode Sets Based on Particle Swarm Optimization.

    PubMed

    Wang, Bin; Zheng, Xuedong; Zhou, Shihua; Zhou, Changjun; Wei, Xiaopeng; Zhang, Qiang; Wei, Ziqi

    2018-01-01

    Following the completion of the human genome project, a large amount of high-throughput bio-data was generated. To analyze these data, massively parallel sequencing, namely next-generation sequencing, was rapidly developed. DNA barcodes, attached at the beginning or end of sequencing reads, are used to identify which sample each sequence belongs to. Constructing DNA barcode sets provides the candidate DNA barcodes for this application. To increase the accuracy of DNA barcode sets, a particle swarm optimization (PSO) algorithm has been modified and used to construct the DNA barcode sets in this paper. Compared with existing results, some lower bounds for DNA barcode sets are improved. The results show that the proposed algorithm is effective in constructing DNA barcode sets.
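
    The combinatorial constraint behind such barcode sets is typically a minimum pairwise Hamming distance, so that sequencing errors do not reassign reads to the wrong sample; the helper below only checks that property and does not reproduce the PSO construction (additional constraints such as GC content may also apply).

      def hamming(a, b):
          """Number of positions at which two equal-length barcodes differ."""
          return sum(x != y for x, y in zip(a, b))

      def is_valid_barcode_set(barcodes, d):
          """True if all barcodes share one length and every pair differs in at least d positions."""
          if len({len(b) for b in barcodes}) > 1:
              return False
          n = len(barcodes)
          return all(hamming(barcodes[i], barcodes[j]) >= d
                     for i in range(n) for j in range(i + 1, n))

      # example: is_valid_barcode_set(["ACGT", "AGTC", "CTAG"], d=3) -> True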

  18. Gas sorption and barrier properties of polymeric membranes from molecular dynamics and Monte Carlo simulations.

    PubMed

    Cozmuta, Ioana; Blanco, Mario; Goddard, William A

    2007-03-29

    It is important for many industrial processes to design new materials with improved selective permeability properties. Besides diffusion, the molecule's solubility contributes largely to the overall permeation process. This study presents a method to calculate solubility coefficients of gases such as O2, H2O (vapor), N2, and CO2 in polymeric matrices from simulation methods (Molecular Dynamics and Monte Carlo) using first principle predictions. The generation and equilibration (annealing) of five polymer models (polypropylene, polyvinyl alcohol, polyvinyl dichloride, polyvinyl chloride-trifluoroethylene, and polyethylene terephtalate) are extensively described. For each polymer, the average density and Hansen solubilities over a set of ten samples compare well with experimental data. For polyethylene terephtalate, the average properties between a small (n = 10) and a large (n = 100) set are compared. Boltzmann averages and probability density distributions of binding and strain energies indicate that the smaller set is biased in sampling configurations with higher energies. However, the sample with the lowest cohesive energy density from the smaller set is representative of the average of the larger set. Density-wise, low molecular weight polymers tend to have on average lower densities. Infinite molecular weight samples do however provide a very good representation of the experimental density. Solubility constants calculated with two ensembles (grand canonical and Henry's constant) are equivalent within 20%. For each polymer sample, the solubility constant is then calculated using the faster (10x) Henry's constant ensemble (HCE) from 150 ps of NPT dynamics of the polymer matrix. The influence of various factors (bad contact fraction, number of iterations) on the accuracy of Henry's constant is discussed. To validate the calculations against experimental results, the solubilities of nitrogen and carbon dioxide in polypropylene are examined over a range of temperatures between 250 and 650 K. The magnitudes of the calculated solubilities agree well with experimental results, and the trends with temperature are predicted correctly. The HCE method is used to predict the solubility constants at 298 K of water vapor and oxygen. The water vapor solubilities follow more closely the experimental trend of permeabilities, both ranging over 4 orders of magnitude. For oxygen, the calculated values do not follow entirely the experimental trend of permeabilities, most probably because at this temperature some of the polymers are in the glassy regime and thus are diffusion dominated. Our study also concludes large confidence limits are associated with the calculated Henry's constants. By investigating several factors (terminal ends of the polymer chains, void distribution, etc.), we conclude that the large confidence limits are intimately related to the polymer's conformational changes caused by thermal fluctuations and have to be regarded--at least at microscale--as a characteristic of each polymer and the nature of its interaction with the solute. Reducing the mobility of the polymer matrix as well as controlling the distribution of the free (occupiable) volume would act as mechanisms toward lowering both the gas solubility and the diffusion coefficients.

  19. Analysis of training sample selection strategies for regression-based quantitative landslide susceptibility mapping methods

    NASA Astrophysics Data System (ADS)

    Erener, Arzu; Sivas, A. Abdullah; Selcuk-Kestel, A. Sevtap; Düzgün, H. Sebnem

    2017-07-01

    All of the quantitative landslide susceptibility mapping (QLSM) methods require two basic data types, namely, a landslide inventory and factors that influence landslide occurrence (landslide influencing factors, LIF). Depending on the type of landslides, the nature of triggers and the LIF, the accuracy of the QLSM methods differs. Moreover, how to balance the number of 0's (nonoccurrence) and 1's (occurrence) in the training set obtained from the landslide inventory, and how to select which of the 1's and 0's to include in QLSM models, play a critical role in the accuracy of the QLSM. Although the performance of various QLSM methods has been investigated extensively in the literature, the challenge of training set construction has not been adequately investigated for the QLSM methods. In order to tackle this challenge, in this study three different training set selection strategies, along with the original data set, are used for testing the performance of three different regression methods, namely Logistic Regression (LR), Bayesian Logistic Regression (BLR) and Fuzzy Logistic Regression (FLR). The first sampling strategy is proportional random sampling (PRS), which takes into account a weighted selection of landslide occurrences in the sample set. The second method, namely non-selective nearby sampling (NNS), includes randomly selected sites and their surrounding neighboring points at certain preselected distances to include the impact of clustering. Selective nearby sampling (SNS) is the third method, which concentrates on the group of 1's and their surrounding neighborhood. A randomly selected group of landslide sites and their neighborhood are considered in the analyses similar to NNS parameters. It is found that the LR-PRS, FLR-PRS and BLR-whole-data set-ups, in that order, yield the best fits among the alternatives. The results indicate that in QLSM based on regression models, avoidance of spatial correlation in the data set is critical for the model's performance.
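
    The balancing problem described above can be illustrated with a generic sampler that draws a chosen proportion of 0's (non-occurrences) per 1 (occurrence) before model fitting; this mirrors the general idea of the strategies compared in the study, not their exact spatial rules.

      import numpy as np

      def balanced_training_indices(labels, ratio=1.0, rng=None):
          """labels: array of 0/1 landslide occurrence; ratio: number of 0's drawn per 1."""
          rng = np.random.default_rng(0) if rng is None else rng
          labels = np.asarray(labels)
          ones = np.flatnonzero(labels == 1)
          zeros = np.flatnonzero(labels == 0)
          n_zero = min(len(zeros), int(round(ratio * len(ones))))
          picked_zeros = rng.choice(zeros, size=n_zero, replace=False)
          return np.concatenate([ones, picked_zeros])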

  20. Assessment and improvement of statistical tools for comparative proteomics analysis of sparse data sets with few experimental replicates.

    PubMed

    Schwämmle, Veit; León, Ileana Rodríguez; Jensen, Ole Nørregaard

    2013-09-06

    Large-scale quantitative analyses of biological systems are often performed with few replicate experiments, leading to multiple nonidentical data sets due to missing values. For example, mass spectrometry driven proteomics experiments are frequently performed with few biological or technical replicates due to sample-scarcity or due to duty-cycle or sensitivity constraints, or limited capacity of the available instrumentation, leading to incomplete results where detection of significant feature changes becomes a challenge. This problem is further exacerbated for the detection of significant changes on the peptide level, for example, in phospho-proteomics experiments. In order to assess the extent of this problem and the implications for large-scale proteome analysis, we investigated and optimized the performance of three statistical approaches by using simulated and experimental data sets with varying numbers of missing values. We applied three tools, including standard t test, moderated t test, also known as limma, and rank products for the detection of significantly changing features in simulated and experimental proteomics data sets with missing values. The rank product method was improved to work with data sets containing missing values. Extensive analysis of simulated and experimental data sets revealed that the performance of the statistical analysis tools depended on simple properties of the data sets. High-confidence results were obtained by using the limma and rank products methods for analyses of triplicate data sets that exhibited more than 1000 features and more than 50% missing values. The maximum number of differentially represented features was identified by using limma and rank products methods in a complementary manner. We therefore recommend combined usage of these methods as a novel and optimal way to detect significantly changing features in these data sets. This approach is suitable for large quantitative data sets from stable isotope labeling and mass spectrometry experiments and should be applicable to large data sets of any type. An R script that implements the improved rank products algorithm and the combined analysis is available.
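
    A toy sketch of the rank-product idea for replicate data with missing values: rank features within each replicate while ignoring NaNs, then combine the available ranks per feature by a geometric mean. This is a simplification for illustration, not the improved algorithm or the R script described above.

      import numpy as np

      def rank_products(data):
          """data: features x replicates array of fold-changes, with NaN marking missing values."""
          data = np.asarray(data, dtype=float)
          ranks = np.full(data.shape, np.nan)
          for j in range(data.shape[1]):
              col = data[:, j]
              ok = ~np.isnan(col)
              order = np.argsort(col[ok])        # smallest value receives rank 1
              r = np.empty(ok.sum())
              r[order] = np.arange(1, ok.sum() + 1)
              ranks[ok, j] = r
          # geometric mean of the ranks observed for each feature
          return np.exp(np.nanmean(np.log(ranks), axis=1))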

  1. A data set from flash X-ray imaging of carboxysomes

    NASA Astrophysics Data System (ADS)

    Hantke, Max F.; Hasse, Dirk; Ekeberg, Tomas; John, Katja; Svenda, Martin; Loh, Duane; Martin, Andrew V.; Timneanu, Nicusor; Larsson, Daniel S. D.; van der Schot, Gijs; Carlsson, Gunilla H.; Ingelman, Margareta; Andreasson, Jakob; Westphal, Daniel; Iwan, Bianca; Uetrecht, Charlotte; Bielecki, Johan; Liang, Mengning; Stellato, Francesco; Deponte, Daniel P.; Bari, Sadia; Hartmann, Robert; Kimmel, Nils; Kirian, Richard A.; Seibert, M. Marvin; Mühlig, Kerstin; Schorb, Sebastian; Ferguson, Ken; Bostedt, Christoph; Carron, Sebastian; Bozek, John D.; Rolles, Daniel; Rudenko, Artem; Foucar, Lutz; Epp, Sascha W.; Chapman, Henry N.; Barty, Anton; Andersson, Inger; Hajdu, Janos; Maia, Filipe R. N. C.

    2016-08-01

    Ultra-intense femtosecond X-ray pulses from X-ray lasers permit structural studies on single particles and biomolecules without crystals. We present a large data set on inherently heterogeneous, polyhedral carboxysome particles. Carboxysomes are cell organelles that vary in size and facilitate up to 40% of Earth’s carbon fixation by cyanobacteria and certain proteobacteria. Variation in size hinders crystallization. Carboxysomes appear icosahedral in the electron microscope. A protein shell encapsulates a large number of Rubisco molecules in paracrystalline arrays inside the organelle. We used carboxysomes with a mean diameter of 115±26 nm from Halothiobacillus neapolitanus. A new aerosol sample-injector allowed us to record 70,000 low-noise diffraction patterns in 12 min. Every diffraction pattern is a unique structure measurement and high-throughput imaging allows sampling the space of structural variability. The different structures can be separated and phased directly from the diffraction data and open a way for accurate, high-throughput studies on structures and structural heterogeneity in biology and elsewhere.

  2. Quadtree of TIN: a new algorithm of dynamic LOD

    NASA Astrophysics Data System (ADS)

    Zhang, Junfeng; Fei, Lifan; Chen, Zhen

    2009-10-01

    Currently, real-time visualization of large-scale digital elevation models mainly employs either the regular GRID structure based on a quadtree or triangle simplification methods based on an irregular triangulated network (TIN). Compared with GRID, TIN is a more refined means of representing the terrain surface in the computer, but the data structure of the TIN model is complex, and it is difficult to realize a view-dependent level-of-detail (LOD) representation quickly. GRID is a simple way to realize terrain LOD, but it produces a larger triangle count. A new algorithm, which takes advantage of the merits of both methods, is presented in this paper. This algorithm combines TIN with a quadtree structure to realize view-dependent LOD control over irregular sampling point sets, and it controls the level of detail through the distance to the viewpoint and the geometric error of the terrain. Experiments indicate that this approach can generate an efficient quadtree triangulation hierarchy over any irregular sampling point set and achieve dynamic, view-dependent multi-resolution rendering of large-scale terrain in real time.
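
    The usual view-dependent refinement test behind such quadtree LOD schemes can be sketched as follows; the node structure and tolerance are illustrative assumptions, not the paper's exact criteria.

      import math

      def needs_refinement(node_error, node_center, viewpoint, tolerance=1.0):
          """Refine when the node's geometric error, scaled by viewing distance, is too large."""
          distance = math.dist(node_center, viewpoint)
          return node_error / max(distance, 1e-9) > tolerance

      def select_nodes(node, viewpoint, emit):
          """Recursively pick quadtree nodes to render (node is a hypothetical structure)."""
          if node.children and needs_refinement(node.error, node.center, viewpoint):
              for child in node.children:
                  select_nodes(child, viewpoint, emit)
          else:
              emit(node)                         # render this node's local triangulation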

  3. Computational tools for exact conditional logistic regression.

    PubMed

    Corcoran, C; Mehta, C; Patel, N; Senchaudhuri, P

    Logistic regression analyses are often challenged by the inability of unconditional likelihood-based approximations to yield consistent, valid estimates and p-values for model parameters. This can be due to sparseness or separability in the data. Conditional logistic regression, though useful in such situations, can also be computationally unfeasible when the sample size or number of explanatory covariates is large. We review recent developments that allow efficient approximate conditional inference, including Monte Carlo sampling and saddlepoint approximations. We demonstrate through real examples that these methods enable the analysis of significantly larger and more complex data sets. We find in this investigation that for these moderately large data sets Monte Carlo seems a better alternative, as it provides unbiased estimates of the exact results and can be executed in less CPU time than can the single saddlepoint approximation. Moreover, the double saddlepoint approximation, while computationally the easiest to obtain, offers little practical advantage. It produces unreliable results and cannot be computed when a maximum likelihood solution does not exist. Copyright 2001 John Wiley & Sons, Ltd.

  4. Concordant integrative gene set enrichment analysis of multiple large-scale two-sample expression data sets.

    PubMed

    Lai, Yinglei; Zhang, Fanni; Nayak, Tapan K; Modarres, Reza; Lee, Norman H; McCaffrey, Timothy A

    2014-01-01

    Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment. We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets. We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis based on the first two data sets was conducted to compare our result with a previous published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and also all three data sets. Both results showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method. This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.

  5. Quantitative Tools for Examining the Vocalizations of Juvenile Songbirds

    PubMed Central

    Wellock, Cameron D.; Reeke, George N.

    2012-01-01

    The singing of juvenile songbirds is highly variable and not well stereotyped, a feature that makes it difficult to analyze with existing computational techniques. We present here a method suitable for analyzing such vocalizations, windowed spectral pattern recognition (WSPR). Rather than performing pairwise sample comparisons, WSPR measures the typicality of a sample against a large sample set. We also illustrate how WSPR can be used to perform a variety of tasks, such as sample classification, song ontogeny measurement, and song variability measurement. Finally, we present a novel measure, based on WSPR, for quantifying the apparent complexity of a bird's singing. PMID:22701474

  6. Search for Gamma-Ray Emission from Galactic Novae using Fermi-LAT Pass 8

    NASA Astrophysics Data System (ADS)

    Buson, Sara; Franckowiak, Anna; Cheung, Teddy; Jean, Pierre; Fermi-LAT Collaboration

    2016-01-01

    Recently, Galactic novae have been identified as a new class of GeV gamma-ray emitters, with six detected so far in Fermi Large Area Telescope (Fermi-LAT) data. Based on optical observations, we have compiled a catalog of ~70 Galactic novae that peaked in the optical during the operations of the Fermi mission. Based on the properties of known gamma-ray novae, we developed a search procedure that we apply to all novae in the catalog to detect these slow transient sources or set flux upper limits using the Fermi-LAT Pass 8 data set. This is the first time a large sample of Galactic novae has been uniformly studied.

  7. An interferometric fiber optic hydrophone with large upper limit of dynamic range

    NASA Astrophysics Data System (ADS)

    Zhang, Lei; Kan, Baoxi; Zheng, Baichao; Wang, Xuefeng; Zhang, Haiyan; Hao, Liangbin; Wang, Hailiang; Hou, Zhenxing; Yu, Wenpeng

    2017-10-01

    Interferometric fiber optic hydrophones based on heterodyne detection are used to measure the point where a missile drops into the sea. The signal generated by the impact can be too large to detect, so the upper limit of dynamic range (ULODR) of the fiber optic hydrophone must be raised. In this article we analyze the factors that influence the ULODR of a heterodyne-detection fiber optic hydrophone; it is determined by the sampling frequency fsam and the heterodyne frequency Δf. When the two satisfy the Nyquist sampling criterion, with fsam at least twice Δf, the ULODR is limited by the heterodyne frequency. To enlarge the ULODR, we deliberately relax this criterion and propose a fiber optic hydrophone whose heterodyne frequency exceeds the sampling frequency. Simulation and experiment give consistent results: with a sampling frequency of 100 kHz, the ULODR of the large-heterodyne-frequency hydrophone is 2.6 times that of the small-heterodyne-frequency design. Once the heterodyne frequency exceeds the sampling frequency, the ULODR is set by the sampling frequency. With the sampling frequency set to 2 MHz, the ULODR of the heterodyne-detection fiber optic hydrophone is boosted to 1000 rad at 1 kHz, and this large-heterodyne hydrophone can be used to locate the drop position of a missile in the sea.

  8. Eucalyptus plantations for energy production in Hawaii. 1980 annual report, January 1980-December 1980

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Whitesell, C. D.

    1980-01-01

    In 1980, 200 acres of eucalyptus trees were planted for a research and development biomass energy plantation, bringing the total area under cultivation to 300 acres. Of this total acreage, 90 acres or 30% was planted in experimental plots. The remaining 70% of the cultivated area was closely monitored to determine the economic cost/benefit ratio of large scale biomass energy production. In the large scale plantings, standard field practices were set up for all phases of production: nursery, clearing, planting, weed control and fertilization. These practices were constantly evaluated for potential improvements in efficiency and reduced cost. Promising experimental treatments were implemented on a large scale to test their effectiveness under field production conditions. In the experimental areas all scheduled data collection in 1980 has been completed and most measurements have been keypunched and analyzed. Soil samples and leaf samples have been analyzed for nutrient concentrations. Crop logging procedures have been set up to monitor tree growth through plant tissue analysis. An intensive computer search on biomass, nursery practices, harvesting equipment and herbicide applications has been completed through the services of the US Forest Service.

  9. MicroRNAs for Detection of Pancreatic Neoplasia

    PubMed Central

    Vila-Navarro, Elena; Vila-Casadesús, Maria; Moreira, Leticia; Duran-Sanchon, Saray; Sinha, Rupal; Ginés, Àngels; Fernández-Esparrach, Glòria; Miquel, Rosa; Cuatrecasas, Miriam; Castells, Antoni; Lozano, Juan José; Gironella, Meritxell

    2017-01-01

    Objective: The aim of our study was to analyze the miRNome of pancreatic ductal adenocarcinoma (PDAC) and its preneoplastic lesion intraductal papillary mucinous neoplasm (IPMN), to find new microRNA (miRNA)-based biomarkers for early detection of pancreatic neoplasia. Background: Effective early detection methods for PDAC are needed. miRNAs are good biomarker candidates. Methods: Pancreatic tissues (n = 165) were obtained from patients with PDAC, IPMN, or from control individuals (C), from Hospital Clínic of Barcelona. Biomarker discovery was done using next-generation sequencing in a discovery set of 18 surgical samples (11 PDAC, 4 IPMN, 3 C). MiRNA validation was carried out by quantitative reverse transcriptase PCR in 2 different sets of samples. Set 1—52 surgical samples (24 PDAC, 7 IPMN, 6 chronic pancreatitis, 15 C), and set 2—95 endoscopic ultrasound-guided fine-needle aspirations (60 PDAC, 9 IPMN, 26 C). Results: In all, 607 and 396 miRNAs were significantly deregulated in PDAC and IPMN versus C. Of them, 40 miRNAs commonly overexpressed in both PDAC and IPMN were selected for further validation. Among them, significant up-regulation of 31 and 30 miRNAs was confirmed by quantitative reverse transcriptase PCR in samples from set 1 and set 2, respectively. Conclusions: miRNome analysis shows that PDAC and IPMN have differential miRNA profiles with respect to C, with a large number of deregulated miRNAs shared by both neoplastic lesions. Indeed, we have identified and validated 30 miRNAs whose expression is significantly increased in PDAC and IPMN lesions. The feasibility of detecting these miRNAs in endoscopic ultrasound-guided fine-needle aspiration samples makes them good biomarker candidates for early detection of pancreatic cancer. PMID:27232245

  10. Set size and culture influence children's attention to number.

    PubMed

    Cantrell, Lisa; Kuwabara, Megumi; Smith, Linda B

    2015-03-01

    Much research evidences a system in adults and young children for approximately representing quantity. Here we provide evidence that the bias to attend to discrete quantity versus other dimensions may be mediated by set size and culture. Preschool-age English-speaking children in the United States and Japanese-speaking children in Japan were tested in a match-to-sample task where number was pitted against cumulative surface area in both large and small numerical set comparisons. Results showed that children from both cultures were biased to attend to the number of items for small sets. Large set responses also showed a general attention to number when ratio difficulty was easy. However, relative to the responses for small sets, attention to number decreased for both groups; moreover, both U.S. and Japanese children showed a significant bias to attend to total amount for difficult numerical ratio distances, although Japanese children shifted attention to total area at relatively smaller set sizes than U.S. children. These results add to our growing understanding of how quantity is represented and how such representation is influenced by context--both cultural and perceptual. Copyright © 2014 Elsevier Inc. All rights reserved.

  11. Validity of the SAT for Predicting First-Year Grades: 2008 SAT Validity Sample. Statistical Report No. 2011-5

    ERIC Educational Resources Information Center

    Patterson, Brian F.; Mattern, Krista D.

    2011-01-01

    The findings for the 2008 sample are largely consistent with the previous reports. SAT scores were found to be correlated with FYGPA (r = 0.54), with a magnitude similar to HSGPA (r = 0.56). The best set of predictors of FYGPA remains SAT scores and HSGPA (r = 0.63), as the addition of the SAT sections to the correlation of HSGPA alone with FYGPA…

  12. Astronaut John Young photographed collecting lunar samples

    NASA Technical Reports Server (NTRS)

    1972-01-01

    Astronaut John W. Young, commander of the Apollo 16 lunar landing mission, is photographed collecting lunar samples near North Ray crater during the third Apollo 16 extravehicular activity (EVA-3) at the Descartes landing site. This picture was taken by Astronaut Charles M. Duke Jr., lunar module pilot. Young is using the lunar surface rake and a set of tongs. The Lunar Roving Vehicle is parked in the field of large boulders in the background.

  13. Survey of Large Methane Emitters in North America

    NASA Astrophysics Data System (ADS)

    Deiker, S.

    2017-12-01

    It has been theorized that methane emissions in the oil and gas industry follow log-normal or "fat tail" distributions, with large numbers of small sources for every very large source. Such distributions would have significant policy and operational implications. Unfortunately, by their very nature such distributions would require large sample sizes to verify. Until recently, such large-scale studies would be prohibitively expensive. The largest public study to date sampled 450 wells, an order of magnitude too low to effectively constrain these models. During 2016 and 2017, Kairos Aerospace conducted a series of surveys using the LeakSurveyor imaging spectrometer, mounted on light aircraft. This small, lightweight instrument was designed to rapidly locate large emission sources. The resulting survey covers over three million acres of oil and gas production. This includes over 100,000 wells, thousands of storage tanks and over 7,500 miles of gathering lines. This data set allows us to now probe the distribution of large methane emitters. Results of this survey, and implications for methane emission distribution, methane policy and LDAR will be discussed.

  14. Tiny videos: a large data set for nonparametric video retrieval and frame classification.

    PubMed

    Karpenko, Alexandre; Aarabi, Parham

    2011-03-01

    In this paper, we present a large database of over 50,000 user-labeled videos collected from YouTube. We develop a compact representation called "tiny videos" that achieves high video compression rates while retaining the overall visual appearance of the video as it varies over time. We show that frame sampling using affinity propagation-an exemplar-based clustering algorithm-achieves the best trade-off between compression and video recall. We use this large collection of user-labeled videos in conjunction with simple data mining techniques to perform related video retrieval, as well as classification of images and video frames. The classification results achieved by tiny videos are compared with the tiny images framework [24] for a variety of recognition tasks. The tiny images data set consists of 80 million images collected from the Internet. These are the largest labeled research data sets of videos and images available to date. We show that tiny videos are better suited for classifying scenery and sports activities, while tiny images perform better at recognizing objects. Furthermore, we demonstrate that combining the tiny images and tiny videos data sets improves classification precision in a wider range of categories.
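    The frame-sampling step described above can be sketched with an off-the-shelf affinity propagation implementation; the feature matrix below is a random stand-in for per-frame descriptors, and the parameter choices are illustrative rather than those used in the paper.

    ```python
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    def select_exemplar_frames(frame_features, damping=0.9):
        """Pick representative frames by exemplar-based clustering, in the
        spirit of the frame-sampling step described above. frame_features is
        a hypothetical (n_frames, n_features) array, e.g. downsampled pixels
        or colour histograms per frame."""
        ap = AffinityPropagation(damping=damping, random_state=0).fit(frame_features)
        return ap.cluster_centers_indices_  # indices of the exemplar frames

    # Usage sketch with random stand-in features for a 200-frame clip.
    frames = np.random.rand(200, 64)
    print(select_exemplar_frames(frames))
    ```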

  15. Estimation of the rain signal in the presence of large surface clutter

    NASA Technical Reports Server (NTRS)

    Ahamad, Atiq; Moore, Richard K.

    1994-01-01

    The principal limitation for the use of a spaceborne imaging SAR as a rain radar is the surface-clutter problem. Signals may be estimated in the presence of noise by averaging large numbers of independent samples. This method was applied to obtain an estimate of the rain echo by averaging a set of N(sub c) samples of the clutter in a separate measurement and subtracting the clutter estimate from the combined estimate. The number of samples required for successful estimation (within 10-20%) for off-vertical angles of incidence appears to be prohibitively large. However, by appropriately degrading the resolution in both range and azimuth, the required number of samples can be obtained. For vertical incidence, the number of samples required for successful estimation is reasonable. In estimating the clutter it was assumed that the surface echo is the same outside the rain volume as it is within the rain volume. This may be true for the forest echo, but for convective storms over the ocean the surface echo outside the rain volume is very different from that within. It is suggested that the experiment be performed with vertical incidence over forest to overcome this limitation.
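    The averaging-and-subtraction estimator described above reduces to a few lines; the sketch below assumes hypothetical arrays of received power samples and is only meant to make the estimation step concrete.

    ```python
    import numpy as np

    def estimate_rain_power(combined, clutter_only):
        """Estimate the rain echo power by averaging independent samples of
        the combined (rain + clutter) return and subtracting an independent
        clutter estimate, as outlined above. Inputs are hypothetical arrays
        of power samples; estimator variance shrinks roughly as 1/N, which is
        why N must be large when the surface clutter dominates."""
        rain_hat = np.mean(combined) - np.mean(clutter_only)
        return max(rain_hat, 0.0)  # clip negative estimates caused by noise
    ```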

  16. Characterization and electron-energy-loss spectroscopy on NiV and NiMo superlattices

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mahmood, S.H.

    1986-01-01

    NiV superlattices with periods ranging from 15 to 80 A, and NiMo superlattices with periods from 14 to 110 A, were studied using X-ray Diffraction (XRD), Electron Diffraction (ED), Energy-Dispersive X-Ray (EDX) microanalysis, and Electron Energy Loss Spectroscopy (EELS). Both of these systems have sharp superlattice-to-amorphous (S-A) transitions at a period of about 17 A. Superlattices with periods near the S-A boundary were found to have large local variations in the in-plane grain sizes. Except for a few isolated regions, the chemical composition of the samples was found to be uniform. In samples prepared at Argonne National Laboratory (ANL), most places studied with EELS showed changes in the EELS spectrum with decreasing period. An observed growth in a plasmon peak at approximately 10 eV in both NiV and NiMo as the period decreased down to 19 A is attributed to excitation of interface plasmons. Consistent with this attribution, the peak height shrank in the amorphous samples. The width of this peak is consistent with the theory. The shift in this peak down to 9 eV with decreasing period in NiMo is not understood.

  17. The ESO Diffuse Interstellar Band Large Exploration Survey (EDIBLES)

    NASA Astrophysics Data System (ADS)

    Cami, J.; Cox, N. L.; Farhang, A.; Smoker, J.; Elyajouri, M.; Lallement, R.; Bacalla, X.; Bhatt, N. H.; Bron, E.; Cordiner, M. A.; de Koter, A.; Ehrenfreund, P.; Evans, C.; Foing, B. H.; Javadi, A.; Joblin, C.; Kaper, L.; Khosroshahi, H. G.; Laverick, M.; Le Petit, F.; Linnartz, H.; Marshall, C. C.; Monreal-Ibero, A.; Mulas, G.; Roueff, E.; Royer, P.; Salama, F.; Sarre, P. J.; Smith, K. T.; Spaans, M.; van Loon, J. T.; Wade, G.

    2018-03-01

    The ESO Diffuse Interstellar Band Large Exploration Survey (EDIBLES) is a Large Programme that is collecting high-signal-to-noise (S/N) spectra with UVES of a large sample of O and B-type stars covering a large spectral range. The goal of the programme is to extract a unique sample of high-quality interstellar spectra from these data, representing different physical and chemical environments, and to characterise these environments in great detail. An important component of interstellar spectra is the diffuse interstellar bands (DIBs), a set of hundreds of unidentified interstellar absorption lines. With the detailed line-of-sight information and the high-quality spectra, EDIBLES will derive strong constraints on the potential DIB carrier molecules. EDIBLES will thus guide the laboratory experiments necessary to identify these interstellar “mystery molecules”, and turn DIBs into powerful diagnostics of their environments in our Milky Way Galaxy and beyond. We present some preliminary results showing the unique capabilities of the EDIBLES programme.

  18. Scanning electron microscope comparative surface evaluation of glazed-lithium disilicate ceramics under different irradiation settings of Nd:YAG and Er:YAG lasers.

    PubMed

    Viskic, Josko; Jokic, Drazen; Jakovljevic, Suzana; Bergman, Lana; Ortolan, Sladana Milardovic; Mestrovic, Senka; Mehulic, Ketij

    2018-01-01

    To evaluate the surface of glazed lithium disilicate dental ceramics after irradiation under different irradiation settings of Nd:YAG and Er:YAG lasers using a scanning electron microscope (SEM). Three glazed-press lithium disilicate ceramic discs were treated with HF, Er:YAG, and Nd:YAG, respectively. The laser-setting variables tested were laser mode, repetition rate (Hz), power (W), time of exposure (seconds), and laser energy (mJ). Sixteen different variable settings were tested for each laser type, and all the samples were analyzed by SEM at 500× and 1000× magnification. Surface analysis of the HF-treated sample showed a typical surface texture with a homogenously rough pattern and exposed ceramic crystals. Er:YAG showed no effect on the surface under any irradiation setting. The surface of Nd:YAG-irradiated samples showed cracking, melting, and resolidifying of the ceramic glaze. These changes became more pronounced as the power increased. At the highest power setting (2.25 W), craters on the surface with large areas of melted or resolidified glaze surrounded by globules were visible. However, there was little to no exposure of ceramic crystals or visible regular surface roughening. Neither Er:YAG nor Nd:YAG dental lasers exhibited adequate surface modification for bonding of orthodontic brackets on glazed lithium disilicate ceramics compared with the control treated with 9.5% HF.

  19. Tracing the trajectory of skill learning with a very large sample of online game players.

    PubMed

    Stafford, Tom; Dewar, Michael

    2014-02-01

    In the present study, we analyzed data from a very large sample (N = 854,064) of players of an online game involving rapid perception, decision making, and motor responding. Use of game data allowed us to connect, for the first time, rich details of training history with measures of performance from participants engaged for a sustained amount of time in effortful practice. We showed that lawful relations exist between practice amount and subsequent performance, and between practice spacing and subsequent performance. Our methodology allowed an in situ confirmation of results long established in the experimental literature on skill acquisition. Additionally, we showed that greater initial variation in performance is linked to higher subsequent performance, a result we link to the exploration/exploitation trade-off from the computational framework of reinforcement learning. We discuss the benefits and opportunities of behavioral data sets with very large sample sizes and suggest that this approach could be particularly fecund for studies of skill acquisition.

  20. A hard-to-read font reduces the framing effect in a large sample.

    PubMed

    Korn, Christoph W; Ries, Juliane; Schalk, Lennart; Oganian, Yulia; Saalbach, Henrik

    2018-04-01

    How can apparent decision biases, such as the framing effect, be reduced? Intriguing findings within recent years indicate that foreign language settings reduce framing effects, which has been explained in terms of deeper cognitive processing. Because hard-to-read fonts have been argued to trigger deeper cognitive processing, so-called cognitive disfluency, we tested whether hard-to-read fonts reduce framing effects. We found no reliable evidence for an effect of hard-to-read fonts on four framing scenarios in a laboratory (final N = 158) and an online study (N = 271). However, in a preregistered online study with a rather large sample (N = 732), a hard-to-read font reduced the framing effect in the classic "Asian disease" scenario (in a one-sided test). This suggests that hard-to-read fonts can modulate decision biases, albeit with rather small effect sizes. Overall, our findings stress the importance of large samples for the reliability and replicability of modulations of decision biases.

  1. Deriving photometric redshifts using fuzzy archetypes and self-organizing maps - I. Methodology

    NASA Astrophysics Data System (ADS)

    Speagle, Joshua S.; Eisenstein, Daniel J.

    2017-07-01

    We propose a method to substantially increase the flexibility and power of template fitting-based photometric redshifts by transforming a large number of galaxy spectral templates into a corresponding collection of 'fuzzy archetypes' using a suitable set of perturbative priors designed to account for empirical variation in dust attenuation and emission-line strengths. To bypass widely separated degeneracies in parameter space (e.g. the redshift-reddening degeneracy), we train self-organizing maps (SOMs) on large 'model catalogues' generated from Monte Carlo sampling of our fuzzy archetypes to cluster the predicted observables in a topologically smooth fashion. Subsequent sampling over the SOM then allows full reconstruction of the relevant probability distribution functions (PDFs). This combined approach enables the multimodal exploration of known variation among galaxy spectral energy distributions with minimal modelling assumptions. We demonstrate the power of this approach to recover full redshift PDFs using discrete Markov chain Monte Carlo sampling methods combined with SOMs constructed from Large Synoptic Survey Telescope ugrizY and Euclid YJH mock photometry.
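    A rough sketch of the SOM-training step is shown below using the third-party minisom package; the mock colour catalogue, map size and training length are placeholder assumptions, and the subsequent reconstruction of redshift PDFs over the map is omitted.

    ```python
    import numpy as np
    from minisom import MiniSom  # third-party package "minisom"

    # Hypothetical model catalogue: rows are simulated colours drawn from
    # perturbed templates ("fuzzy archetypes"); 9 bands stand in for ugrizY+YJH.
    model_colours = np.random.rand(5000, 9)

    som = MiniSom(x=30, y=30, input_len=model_colours.shape[1],
                  sigma=1.5, learning_rate=0.5, random_seed=0)
    som.random_weights_init(model_colours)
    som.train_random(model_colours, num_iteration=20000)

    # Map an observed object onto the SOM; neighbouring cells index models
    # whose redshifts could then be aggregated into a PDF (not shown here).
    best_cell = som.winner(np.random.rand(9))
    print(best_cell)
    ```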

  2. Photometric Redshifts for the Large-Area Stripe 82X Multiwavelength Survey

    NASA Astrophysics Data System (ADS)

    Tasnim Ananna, Tonima; Salvato, Mara; Urry, C. Megan; LaMassa, Stephanie M.; STRIPE 82X

    2016-06-01

    The Stripe 82X survey currently includes 6000 X-ray sources in 31.3 square degrees of XMM-Newton and Chandra X-ray coverage, most of which are AGN. Using a maximum-likelihood approach, we identified optical and infrared counterparts in the SDSS, VHS K-band and WISE W1-band catalogs. 1200 objects which had different best associations in different catalogs were checked by eye. Our most recent paper provided the multiwavelength catalogs for this sample. More than 1000 counterparts have spectroscopic redshifts, either from SDSS spectroscopy or our own follow-up program. Using the extensive multiwavelength data in this field, we provide photometric redshift estimates for most of the remaining sources, which are 80-90% accurate according to the training set. Our sample has a large number of candidates that are very faint in optical and bright in IR. We expect a large fraction of these objects to be the obscured AGN sample we need to complete the census on black hole growth at a range of redshifts.

  3. A comparison of shoreline seines with fyke nets for sampling littoral fish communities in floodplain lakes

    USGS Publications Warehouse

    Clark, S.J.; Jackson, J.R.; Lochmann, S.E.

    2007-01-01

    We compared shoreline seines with fyke nets in terms of their ability to sample fish species in the littoral zone of 22 floodplain lakes of the White River, Arkansas. Lakes ranged in size from less than 0.5 to 51.0 ha. Most contained large amounts of coarse woody debris within the littoral zone, thus making seining in shallow areas difficult. We sampled large lakes (>2 ha) using three fyke nets; small lakes (<2 ha) were sampled using two fyke nets. Fyke nets were set for 24 h. Large lakes were sampled with an average of 11 seine hauls/lake and small lakes were sampled with an average of 3 seine hauls/lake, but exact shoreline seining effort varied among lakes depending on the amount of open shoreline. Fyke nets collected more fish and produced greater species richness and diversity measures than did seining. Species evenness was similar for the two gear types. Two species were unique to seine samples, whereas 13 species and 3 families were unique to fyke-net samples. Although fyke nets collected more fish and more species than did shoreline seines, neither gear collected all the species present in the littoral zone of floodplain lakes. These results confirm the need for a multiple-gear approach to fully characterize the littoral fish assemblages in floodplain lakes. © Copyright by the American Fisheries Society 2007.

  4. Large Stratospheric IDPs: Chemical Compostion and Comparison with Smaller Stratospheric IDPs

    NASA Astrophysics Data System (ADS)

    Flynn, G. J.; Bajt, S.; Sutton, S. R.; Klock, W.

    1995-09-01

    Six large stratospheric IDPs, each greater than 35 microns, previously analyzed using the X-Ray Microprobe at the National Synchrotron Light Source showed an average volatile content consistent with CI or CM meteorites [1]. Seven additional large IDPs, ranging from 37x33 to 50x44 microns in size and having chondritic major element abundances, have been analyzed using the same instrument. Each of these 7 IDPs is depleted in Ca compared to CI (Avg. Ca = 0.48xCI), a feature also observed in the first set of 6, suggesting most or all of these IDPs are hydrated. The average trace element content of these 7 large IDPs is similar to the previous set of 6 (see Figure 1), though Mn and Cu are about 70% higher in this set. The average composition of these large IDPs is distinctly different from that of smaller IDPs (generally 10 to 20 microns), which show enrichments of the volatiles Cu, Zn, Ga, Ge, and Se by factors of 1.5 to 3 over CI [2]. This suggests large IDPs which are strong enough to resist fragmentation on collection are chemically different from typical smaller IDPs. This may reflect a difference in the source(s) being sampled by the two types of IDPs. A subgroup of the smaller IDPs (9 of 51 particles) have a composition similar to CI meteorites and these large IDPs [2]. Bromine is enriched in most of these large IDPs. Two Br-rich IDPs (Br >300 ppm) and one Br-poor IDP (Br ~5 ppm) were each analyzed twice. The two Br-rich IDPs showed about a factor of two Br loss between the first and second analyses, presumably due to sample heating during the first analysis. This suggests some of the Br is very weakly bound in these Br-rich IDPs, a possible signature of Br surface contamination. However, the Br contents measured in the second analyses were still ~50xCI. No loss of Cu, Zn, Ga, Ge or Se was detected in these IDPs, suggesting these elements are in more retentive sites. The Br-poor IDP (Br ~1.5xCI) showed no Br loss in the second analysis. Only one of these IDPs, L2008G10, showed a large Zn depletion (Zn/Fe <0.01xCI). This was accompanied by low contents of Ga, Ge and Br (see Figure 1). This pattern of Zn, Ge, Br and Ga depletions was previously seen in smaller IDPs which were severely heated, presumably on atmospheric entry [2]. Sulfur and K are also low in L2008G10, suggesting these elements are also lost during heating, but the Se content is 0.8xCI. A second particle, L2009C8, has a Zn/Fe=0.26xCI, possibly indicating less severe heating. The low fraction of severely heated IDPs, only one in this set of 7 and none in the set of 6 [1] suggests a very low atmospheric entry velocity for these large IDPs [3]. References: [1] Flynn G. J. et al. (1995) LPS XXVI, 407-408. [2] Flynn G. J. et al. (1993) LPS XXIV, 495-496. [3] Flynn G. J., this volume. Figure 1: Average Fe and CI normalized element abundances in 7 large IDPs, 6 different large IDPs [1], 51 smaller IDPs [2], and the single low-Zn IDP, L2008G10, included in the set of 7 large IDPs.

  5. Legal & ethical compliance when sharing biospecimen.

    PubMed

    Klingstrom, Tomas; Bongcam-Rudloff, Erik; Reichel, Jane

    2018-01-01

    When obtaining samples from biobanks, resolving ethical and legal concerns is a time-consuming task where researchers need to balance the needs of privacy, trust and scientific progress. The Biobanking and Biomolecular Resources Research Infrastructure-Large Prospective Cohorts project has resolved numerous such issues through intense communication between involved researchers and experts in its mission to unite large prospective study sets in Europe. To facilitate efficient communication, it is useful for nonexperts to have an at least basic understanding of the regulatory system for managing biological samples. Laws regulating research oversight are based on national law and normally share core principles founded on international charters. In interview studies among donors, chief concerns are privacy, efficient sample utilization and access to information generated from their samples. Despite a lack of clear evidence regarding which concern takes precedence, scientific as well as public discourse has largely focused on privacy concerns and the right of donors to control the usage of their samples. It is therefore important to proactively deal with ethical and legal issues to avoid complications that delay or prevent samples from being accessed. To help biobank professionals avoid making unnecessary mistakes, we have developed this basic primer covering the relationship between ethics and law, the concept of informed consent and considerations for returning findings to donors. © The Author 2017. Published by Oxford University Press.

  6. Legal & ethical compliance when sharing biospecimen

    PubMed Central

    Klingstrom, Tomas; Bongcam-Rudloff, Erik; Reichel, Jane

    2018-01-01

    Abstract When obtaining samples from biobanks, resolving ethical and legal concerns is a time-consuming task where researchers need to balance the needs of privacy, trust and scientific progress. The Biobanking and Biomolecular Resources Research Infrastructure-Large Prospective Cohorts project has resolved numerous such issues through intense communication between involved researchers and experts in its mission to unite large prospective study sets in Europe. To facilitate efficient communication, it is useful for nonexperts to have an at least basic understanding of the regulatory system for managing biological samples. Laws regulating research oversight are based on national law and normally share core principles founded on international charters. In interview studies among donors, chief concerns are privacy, efficient sample utilization and access to information generated from their samples. Despite a lack of clear evidence regarding which concern takes precedence, scientific as well as public discourse has largely focused on privacy concerns and the right of donors to control the usage of their samples. It is therefore important to proactively deal with ethical and legal issues to avoid complications that delay or prevent samples from being accessed. To help biobank professionals avoid making unnecessary mistakes, we have developed this basic primer covering the relationship between ethics and law, the concept of informed consent and considerations for returning findings to donors. PMID:28460118

  7. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data.

    PubMed

    Bhaskar, Anand; Wang, Y X Rachel; Song, Yun S

    2015-02-01

    With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions. © 2015 Bhaskar et al.; Published by Cold Spring Harbor Laboratory Press.
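    The role of automatic differentiation can be illustrated with a toy Poisson likelihood over a frequency spectrum, as below; the expected-SFS function is a smooth stand-in rather than the coalescent-based expression derived in the paper, so only the exact-gradient machinery is being demonstrated.

    ```python
    import jax
    import jax.numpy as jnp

    def expected_sfs(log_params, n_bins):
        """Toy stand-in for the analytic expected frequency spectrum; the real
        method derives this under the coalescent for piecewise-exponential
        histories. Here it is just a smooth positive function of the
        parameters so that exact gradients can be illustrated."""
        a, b = jnp.exp(log_params)
        i = jnp.arange(1, n_bins + 1)
        return a / i + b / i**2

    def neg_log_lik(log_params, observed_sfs):
        mu = expected_sfs(log_params, observed_sfs.shape[0])
        return jnp.sum(mu - observed_sfs * jnp.log(mu))  # Poisson, up to a constant

    grad_fn = jax.grad(neg_log_lik)           # exact gradients via autodiff
    obs = jnp.array([120., 55., 40., 22., 18., 11., 9., 8., 6., 5.])
    print(grad_fn(jnp.zeros(2), obs))
    ```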

  8. A Structure-Adaptive Hybrid RBF-BP Classifier with an Optimized Learning Strategy

    PubMed Central

    Wen, Hui; Xie, Weixin; Pei, Jihong

    2016-01-01

    This paper presents a structure-adaptive hybrid RBF-BP (SAHRBF-BP) classifier with an optimized learning strategy. SAHRBF-BP is composed of a structure-adaptive RBF network and a BP network in cascade, where the number of RBF hidden nodes is adjusted adaptively according to the distribution of the sample space; the adaptive RBF network is used for nonlinear kernel mapping and the BP network is used for nonlinear classification. The optimized learning strategy is as follows: firstly, a potential function is introduced into the training sample space to adaptively determine the number of initial RBF hidden nodes and their parameters, and a heterogeneous-sample repulsive force is designed to further optimize the parameters of each generated RBF hidden node; the optimized structure-adaptive RBF network then performs the adaptive nonlinear mapping of the sample space. Next, the number of adaptively generated RBF hidden nodes determines the number of subsequent BP input nodes, and the overall SAHRBF-BP classifier is built up; finally, different training sample sets are used to train the BP network parameters in SAHRBF-BP. Experiments comparing SAHRBF-BP with other algorithms on different data sets show its superiority. In particular, on most low-dimensional data sets with large numbers of samples, the classification performance of SAHRBF-BP exceeds that of other SLFN training algorithms. PMID:27792737
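    A simplified hybrid of the same flavour can be assembled from standard components, as in the sketch below: cluster centres stand in for the potential-function and repulsive-force procedure used to place RBF hidden nodes, and a generic multilayer perceptron plays the role of the BP stage. This is an illustration of the cascade idea, not the published algorithm.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neural_network import MLPClassifier
    from sklearn.datasets import make_classification

    def rbf_features(X, centres, gamma=1.0):
        """Gaussian RBF activations of X with respect to a set of centres."""
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    # Hypothetical data; KMeans stands in for the paper's adaptive placement
    # of RBF hidden nodes.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    centres = KMeans(n_clusters=25, n_init=10, random_state=0).fit(X).cluster_centers_

    # BP stage: a small MLP trained on the RBF-mapped features.
    bp = MLPClassifier(hidden_layer_sizes=(25,), max_iter=1000, random_state=0)
    bp.fit(rbf_features(X, centres), y)
    print(bp.score(rbf_features(X, centres), y))
    ```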

  9. Set Up of an Automatic Water Quality Sampling System in Irrigation Agriculture

    PubMed Central

    Heinz, Emanuel; Kraft, Philipp; Buchen, Caroline; Frede, Hans-Georg; Aquino, Eugenio; Breuer, Lutz

    2014-01-01

    We have developed a high-resolution automatic sampling system for continuous in situ measurements of stable water isotopic composition and nitrogen solutes along with hydrological information. The system facilitates concurrent monitoring of a large number of water and nutrient fluxes (ground, surface, irrigation and rain water) in irrigated agriculture. For this purpose we couple an automatic sampling system with a Wavelength-Scanned Cavity Ring Down Spectrometry System (WS-CRDS) for stable water isotope analysis (δ2H and δ18O), a reagentless hyperspectral UV photometer (ProPS) for monitoring nitrate content and various water level sensors for hydrometric information. The automatic sampling system consists of different sampling stations equipped with pumps, a switch cabinet for valve and pump control and a computer operating the system. The complete system is operated via internet-based control software, allowing supervision from nearly anywhere. The system is currently set up at the International Rice Research Institute (Los Baños, The Philippines) in a diversified rice growing system to continuously monitor water and nutrient fluxes. Here we present the system's technical set-up and provide initial proof-of-concept with results for the isotopic composition of different water sources and nitrate values from the 2012 dry season. PMID:24366178

  10. Distribution-Preserving Stratified Sampling for Learning Problems.

    PubMed

    Cervellera, Cristiano; Maccio, Danilo

    2017-06-09

    The need for extracting a small sample from a large amount of real data, possibly streaming, arises routinely in learning problems, e.g., for storage, to cope with computational limitations, obtain good training/test/validation sets, and select minibatches for stochastic gradient neural network training. Unless we have reasons to select the samples in an active way dictated by the specific task and/or model at hand, it is important that the distribution of the selected points is as similar as possible to the original data. This is obvious for unsupervised learning problems, where the goal is to gain insights on the distribution of the data, but it is also relevant for supervised problems, where the theory explains how the training set distribution influences the generalization error. In this paper, we analyze the technique of stratified sampling from the point of view of distances between probabilities. This allows us to introduce an algorithm, based on recursive binary partition of the input space, aimed at obtaining samples that are distributed as much as possible as the original data. A theoretical analysis is proposed, proving the (greedy) optimality of the procedure together with explicit error bounds. An adaptive version of the algorithm is also introduced to cope with streaming data. Simulation tests on various data sets and different learning tasks are also provided.
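    The core idea of recursively partitioning the input space and sampling each cell in proportion to its mass can be sketched as below; the stopping rules and proportional allocation are simplified relative to the published algorithm, which also provides optimality bounds and a streaming variant.

    ```python
    import numpy as np

    def stratified_sample(X, n_samples, depth=0, max_depth=8, rng=None):
        """Recursively split the data on the median of its widest dimension
        and draw points from each cell in proportion to its size. A simplified
        sketch of distribution-preserving stratified sampling."""
        rng = np.random.default_rng(0) if rng is None else rng
        if n_samples <= 0 or len(X) == 0:
            return np.empty((0, X.shape[1]))
        if n_samples == 1 or depth >= max_depth or len(X) <= 2:
            idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
            return X[idx]
        dim = int(np.argmax(X.max(0) - X.min(0)))       # widest dimension
        cut = np.median(X[:, dim])
        left, right = X[X[:, dim] <= cut], X[X[:, dim] > cut]
        n_left = int(round(n_samples * len(left) / len(X)))  # proportional allocation
        return np.vstack([
            stratified_sample(left, n_left, depth + 1, max_depth, rng),
            stratified_sample(right, n_samples - n_left, depth + 1, max_depth, rng),
        ])

    sample = stratified_sample(np.random.rand(10000, 3), n_samples=200)
    print(sample.shape)
    ```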

  11. Parallel k-Means Clustering for Quantitative Ecoregion Delineation Using Large Data Sets

    Treesearch

    Jitendra Kumar; Richard T. Mills; Forrest M Hoffman; William W Hargrove

    2011-01-01

    Identification of geographic ecoregions has long been of interest to environmental scientists and ecologists for identifying regions of similar ecological and environmental conditions. Such classifications are important for predicting suitable species ranges, for stratification of ecological samples, and to help prioritize habitat preservation and remediation efforts....

  12. Turbidity Threshold sampling in watershed research

    Treesearch

    Rand Eads; Jack Lewis

    2003-01-01

    Abstract - When monitoring suspended sediment for watershed research, reliable and accurate results may be a higher priority than in other settings. Timing and frequency of data collection are the most important factors influencing the accuracy of suspended sediment load estimates, and, in most watersheds, suspended sediment transport is dominated by a few, large...

  13. Iron metabolism in African American women in the second and third trimesters of high-risk pregnancies

    USDA-ARS?s Scientific Manuscript database

    Objective: To examine iron metabolism during the second and third trimesters in African American women with high-risk pregnancies. Design: Longitudinal pilot study. Setting: Large, university-based, urban Midwestern U.S. medical center. Participants: Convenience sample of 32 African American wome...

  14. Explaining Charter School Effectiveness. NBER Working Paper No. 17332

    ERIC Educational Resources Information Center

    Angrist, Joshua D.; Pathak, Parag A.; Walters, Christopher R.

    2011-01-01

    Estimates using admissions lotteries suggest that urban charter schools boost student achievement, while charter schools in other settings do not. We explore student-level and school-level explanations for these differences using a large sample of Massachusetts charter schools. Our results show that urban charter schools boost achievement well…

  15. CTEPP-OH DATA CHILD DAY CARE CENTER WEEKLY MENUS

    EPA Science Inventory

    This data set contains information on the weekly day care menus for CTEPP-OH. The day care centers provided menus up to three months prior to field sampling.

    The Children’s Total Exposure to Persistent Pesticides and Other Persistent Pollutant (CTEPP) study was one of the larg...

  16. Descriptive Statistical Attributes of Special Education Data Sets

    ERIC Educational Resources Information Center

    Felder, Valerie

    2013-01-01

    Micceri (1989) examined the distributional characteristics of 440 large-sample achievement and psychometric measures. All the distributions were found to be nonnormal at alpha = 0.01. Micceri indicated three factors that might contribute to a non-Gaussian error distribution in the population. The first factor is subpopulations within a target…

  17. 9 CFR 381.412 - Reference amounts customarily consumed per eating occasion.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... appropriate national food consumption surveys. (2) The Reference Amounts are calculated for an infant or child... are based on data set forth in appropriate national food consumption surveys. Such Reference Amounts... child under 4 years of age. (3) An appropriate national food consumption survey includes a large sample...

  18. 9 CFR 381.412 - Reference amounts customarily consumed per eating occasion.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... appropriate national food consumption surveys. (2) The Reference Amounts are calculated for an infant or child... are based on data set forth in appropriate national food consumption surveys. Such Reference Amounts... child under 4 years of age. (3) An appropriate national food consumption survey includes a large sample...

  19. 9 CFR 317.312 - Reference amounts customarily consumed per eating occasion.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... appropriate national food consumption surveys. (2) The Reference Amounts are calculated for an infant or child... are based on data set forth in appropriate national food consumption surveys. Such Reference Amounts... child under 4 years of age. (3) An appropriate national food consumption survey includes a large sample...

  20. 9 CFR 317.312 - Reference amounts customarily consumed per eating occasion.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... appropriate national food consumption surveys. (2) The Reference Amounts are calculated for an infant or child... are based on data set forth in appropriate national food consumption surveys. Such Reference Amounts... child under 4 years of age. (3) An appropriate national food consumption survey includes a large sample...

  1. 9 CFR 381.412 - Reference amounts customarily consumed per eating occasion.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... appropriate national food consumption surveys. (2) The Reference Amounts are calculated for an infant or child... are based on data set forth in appropriate national food consumption surveys. Such Reference Amounts... child under 4 years of age. (3) An appropriate national food consumption survey includes a large sample...

  2. 9 CFR 317.312 - Reference amounts customarily consumed per eating occasion.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... appropriate national food consumption surveys. (2) The Reference Amounts are calculated for an infant or child... are based on data set forth in appropriate national food consumption surveys. Such Reference Amounts... child under 4 years of age. (3) An appropriate national food consumption survey includes a large sample...

  3. A Web-Hosted R Workflow to Simplify and Automate the Analysis of 16S NGS Data

    EPA Science Inventory

    Next-Generation Sequencing (NGS) produces large data sets that include tens-of-thousands of sequence reads per sample. For analysis of bacterial diversity, 16S NGS sequences are typically analyzed in a workflow that containing best-of-breed bioinformatics packages that may levera...

  4. Tiered Evaluation in Large Ensemble Settings.

    ERIC Educational Resources Information Center

    Scott, David

    1998-01-01

    Discusses the use of a tiered evaluation system (TES) that allows students to work at different levels, enables teachers to assess progress objectively, and presents students with appropriate challenges in the music ensembles. Focuses on how TES works and its advantages, considers the challenges and flexibility of TES, and provides samples. (CMK)

  5. Development of a 96-well plate iodine binding assay for amylose content determination

    USDA-ARS?s Scientific Manuscript database

    Cereal starch amylose/amylopectin (AM/AP) ratios are critical in functional properties for food and industrial applications. Conventional methods for the determination of AM/AP of cereal starches are very time consuming and labor intensive making it very difficult to screen large sample sets. Stud...

  6. Robust estimation of microbial diversity in theory and in practice

    PubMed Central

    Haegeman, Bart; Hamelin, Jérôme; Moriarty, John; Neal, Peter; Dushoff, Jonathan; Weitz, Joshua S

    2013-01-01

    Quantifying diversity is of central importance for the study of structure, function and evolution of microbial communities. The estimation of microbial diversity has received renewed attention with the advent of large-scale metagenomic studies. Here, we consider what the diversity observed in a sample tells us about the diversity of the community being sampled. First, we argue that one cannot reliably estimate the absolute and relative number of microbial species present in a community without making unsupported assumptions about species abundance distributions. The reason for this is that sample data do not contain information about the number of rare species in the tail of species abundance distributions. We illustrate the difficulty in comparing species richness estimates by applying Chao's estimator of species richness to a set of in silico communities: they are ranked incorrectly in the presence of large numbers of rare species. Next, we extend our analysis to a general family of diversity metrics (‘Hill diversities'), and construct lower and upper estimates of diversity values consistent with the sample data. The theory generalizes Chao's estimator, which we retrieve as the lower estimate of species richness. We show that Shannon and Simpson diversity can be robustly estimated for the in silico communities. We analyze nine metagenomic data sets from a wide range of environments, and show that our findings are relevant for empirically-sampled communities. Hence, we recommend the use of Shannon and Simpson diversity rather than species richness in efforts to quantify and compare microbial diversity. PMID:23407313
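    For reference, the sketch below computes the bias-corrected Chao1 richness estimate together with Shannon and Simpson diversity from a single vector of species counts; these are the standard formulas, applied here only to make the comparison between richness and the higher-order Hill-type indices concrete.

    ```python
    import numpy as np

    def diversity_indices(counts):
        """Chao1 lower-bound richness plus Shannon and Simpson diversity from
        a vector of per-species counts in one sample. Standard formulas; note
        how Chao1 depends only on singleton and doubleton counts, the part of
        the abundance distribution the sample informs least."""
        counts = np.asarray(counts, dtype=float)
        counts = counts[counts > 0]
        s_obs = counts.size
        f1 = np.sum(counts == 1)                       # singletons
        f2 = np.sum(counts == 2)                       # doubletons
        chao1 = s_obs + (f1 * (f1 - 1)) / (2 * (f2 + 1))   # bias-corrected form
        p = counts / counts.sum()
        shannon = -np.sum(p * np.log(p))
        simpson = 1.0 - np.sum(p ** 2)
        return chao1, shannon, simpson

    print(diversity_indices([50, 30, 10, 5, 2, 1, 1, 1]))
    ```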

  7. ARACNe-AP: Gene Network Reverse Engineering through Adaptive Partitioning inference of Mutual Information. | Office of Cancer Genomics

    Cancer.gov

    The accurate reconstruction of gene regulatory networks from large scale molecular profile datasets represents one of the grand challenges of Systems Biology. The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) represents one of the most effective tools to accomplish this goal. However, the initial Fixed Bandwidth (FB) implementation is both inefficient and unable to deal with sample sets providing largely uneven coverage of the probability density space.
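    For context, the sketch below computes mutual information between two expression profiles with a simple fixed equal-width binning, i.e. the kind of fixed-bandwidth estimate that adaptive-partitioning approaches such as ARACNe-AP are designed to improve on when samples cover the density unevenly; it is not the ARACNe-AP estimator itself.

    ```python
    import numpy as np
    from sklearn.metrics import mutual_info_score

    def mi_fixed_bins(x, y, n_bins=10):
        """Mutual information (in nats) between two expression profiles using
        fixed equal-width bins. Illustrative baseline only; adaptive
        partitioning instead refines bins where the data warrant it."""
        x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins))
        y_binned = np.digitize(y, np.histogram_bin_edges(y, bins=n_bins))
        return mutual_info_score(x_binned, y_binned)

    x = np.random.randn(500)
    print(mi_fixed_bins(x, 0.7 * x + 0.3 * np.random.randn(500)))
    ```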

  8. National Databases for Neurosurgical Outcomes Research: Options, Strengths, and Limitations.

    PubMed

    Karhade, Aditya V; Larsen, Alexandra M G; Cote, David J; Dubois, Heloise M; Smith, Timothy R

    2017-08-05

    Quality improvement, value-based care delivery, and personalized patient care depend on robust clinical, financial, and demographic data streams of neurosurgical outcomes. The neurosurgical literature lacks a comprehensive review of large national databases. To assess the strengths and limitations of various resources for outcomes research in neurosurgery. A review of the literature was conducted to identify surgical outcomes studies using national data sets. The databases were assessed for the availability of patient demographics and clinical variables, longitudinal follow-up of patients, strengths, and limitations. The number of unique patients contained within each data set ranged from thousands (Quality Outcomes Database [QOD]) to hundreds of millions (MarketScan). Databases with both clinical and financial data included PearlDiver, Premier Healthcare Database, Vizient Clinical Data Base and Resource Manager, and the National Inpatient Sample. Outcomes collected by databases included patient-reported outcomes (QOD); 30-day morbidity, readmissions, and reoperations (National Surgical Quality Improvement Program); and disease incidence and disease-specific survival (Surveillance, Epidemiology, and End Results-Medicare). The strengths of large databases included large numbers of rare pathologies and multi-institutional nationally representative sampling; the limitations of these databases included variable data veracity, variable data completeness, and missing disease-specific variables. The improvement of existing large national databases and the establishment of new registries will be crucial to the future of neurosurgical outcomes research. Copyright © 2017 by the Congress of Neurological Surgeons

  9. Candidate-based proteomics in the search for biomarkers of cardiovascular disease

    PubMed Central

    Anderson, Leigh

    2005-01-01

    The key concept of proteomics (looking at many proteins at once) opens new avenues in the search for clinically useful biomarkers of disease, treatment response and ageing. As the number of proteins that can be detected in plasma or serum (the primary clinical diagnostic samples) increases towards 1000, a paradoxical decline has occurred in the number of new protein markers approved for diagnostic use in clinical laboratories. This review explores the limitations of current proteomics protein discovery platforms, and proposes an alternative approach, applicable to a range of biological/physiological problems, in which quantitative mass spectrometric methods developed for analytical chemistry are employed to measure limited sets of candidate markers in large sets of clinical samples. A set of 177 candidate biomarker proteins with reported associations to cardiovascular disease and stroke are presented as a starting point for such a ‘directed proteomics’ approach. PMID:15611012

  10. Cluster randomised crossover trials with binary data and unbalanced cluster sizes: application to studies of near-universal interventions in intensive care.

    PubMed

    Forbes, Andrew B; Akram, Muhammad; Pilcher, David; Cooper, Jamie; Bellomo, Rinaldo

    2015-02-01

    Cluster randomised crossover trials have been utilised in recent years in the health and social sciences. Methods for analysis have been proposed; however, for binary outcomes, these have received little assessment of their appropriateness. In addition, methods for determination of sample size are currently limited to balanced cluster sizes both between clusters and between periods within clusters. This article aims to extend this work to unbalanced situations and to evaluate the properties of a variety of methods for analysis of binary data, with a particular focus on the setting of potential trials of near-universal interventions in intensive care to reduce in-hospital mortality. We derive a formula for sample size estimation for unbalanced cluster sizes, and apply it to the intensive care setting to demonstrate the utility of the cluster crossover design. We conduct a numerical simulation of the design in the intensive care setting and for more general configurations, and we assess the performance of three cluster summary estimators and an individual-data estimator based on binomial-identity-link regression. For settings similar to the intensive care scenario involving large cluster sizes and small intra-cluster correlations, the sample size formulae developed and analysis methods investigated are found to be appropriate, with the unweighted cluster summary method performing well relative to the more optimal but more complex inverse-variance weighted method. More generally, we find that the unweighted and cluster-size-weighted summary methods perform well, with the relative efficiency of each largely determined systematically from the study design parameters. Performance of individual-data regression is adequate with small cluster sizes but becomes inefficient for large, unbalanced cluster sizes. When outcome prevalences are 6% or less and the within-cluster-within-period correlation is 0.05 or larger, all methods display sub-nominal confidence interval coverage, with the less prevalent the outcome the worse the coverage. As with all simulation studies, conclusions are limited to the configurations studied. We confined attention to detecting intervention effects on an absolute risk scale using marginal models and did not explore properties of binary random effects models. Cluster crossover designs with binary outcomes can be analysed using simple cluster summary methods, and sample size in unbalanced cluster size settings can be determined using relatively straightforward formulae. However, caution needs to be applied in situations with low prevalence outcomes and moderate to high intra-cluster correlations. © The Author(s) 2014.
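    A rough simulation of the design and of the unweighted cluster summary estimator is sketched below; the data-generating model (a normal cluster effect on the event probability) and all parameter values are illustrative assumptions, not taken from the paper.

    ```python
    import numpy as np

    def unweighted_cluster_summary(n_clusters=20, p_control=0.10, effect=-0.01,
                                   sd_cluster=0.02, mean_size=500, seed=0):
        """Simulate a two-period cluster randomised crossover trial with a
        binary outcome and estimate the risk difference with an unweighted
        cluster summary: average the within-cluster difference in event
        proportions, treated period minus control period."""
        rng = np.random.default_rng(seed)
        diffs = []
        for c in range(n_clusters):
            base = np.clip(p_control + rng.normal(0, sd_cluster), 0.001, 0.999)
            n1, n2 = rng.poisson(mean_size, size=2) + 1
            treated_first = (c % 2 == 0)              # alternate AB / BA sequences
            p_first = np.clip(base + (effect if treated_first else 0.0), 0, 1)
            p_second = np.clip(base + (0.0 if treated_first else effect), 0, 1)
            prop_first = rng.binomial(n1, p_first) / n1
            prop_second = rng.binomial(n2, p_second) / n2
            diffs.append((prop_first - prop_second) if treated_first
                         else (prop_second - prop_first))
        diffs = np.asarray(diffs)
        return diffs.mean(), diffs.std(ddof=1) / np.sqrt(n_clusters)

    print(unweighted_cluster_summary())
    ```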

  11. A new small-angle X-ray scattering set-up on the crystallography beamline I711 at MAX-lab.

    PubMed

    Knaapila, M; Svensson, C; Barauskas, J; Zackrisson, M; Nielsen, S S; Toft, K N; Vestergaard, B; Arleth, L; Olsson, U; Pedersen, J S; Cerenius, Y

    2009-07-01

    A small-angle X-ray scattering (SAXS) set-up has recently been developed at beamline I711 at the MAX II storage ring in Lund (Sweden). An overview of the required modifications is presented here together with a number of application examples. The accessible q range in a SAXS experiment is 0.009-0.3 A(-1) for the standard set-up but depends on the sample-to-detector distance, detector offset, beamstop size and wavelength. The SAXS camera has been designed to have a low background and has three collinear slit sets for collimating the incident beam. The standard beam size is about 0.37 mm x 0.37 mm (full width at half-maximum) at the sample position, with a flux of 4 x 10(10) photons s(-1) and lambda = 1.1 A. The vacuum is of the order of 0.05 mbar in the unbroken beam path from the first slits until the exit window in front of the detector. A large sample chamber with a number of lead-throughs allows different sample environments to be mounted. This station is used for measurements on weakly scattering proteins in solutions and also for colloids, polymers and other nanoscale structures. A special application supported by the beamline is the effort to establish a micro-fluidic sample environment for structural analysis of samples that are only available in limited quantities. Overall, this work demonstrates how a cost-effective SAXS station can be constructed on a multipurpose beamline.

  12. Conformational sampling with stochastic proximity embedding and self-organizing superimposition: establishing reasonable parameters for their practical use.

    PubMed

    Tresadern, Gary; Agrafiotis, Dimitris K

    2009-12-01

    Stochastic proximity embedding (SPE) and self-organizing superimposition (SOS) are two recently introduced methods for conformational sampling that have shown great promise in several application domains. Our previous validation studies aimed at exploring the limits of these methods and have involved rather exhaustive conformational searches producing a large number of conformations. However, from a practical point of view, such searches have become the exception rather than the norm. The increasing popularity of virtual screening has created a need for 3D conformational search methods that produce meaningful answers in a relatively short period of time and work effectively on a large scale. In this work, we examine the performance of these algorithms and the effects of different parameter settings at varying levels of sampling. Our goal is to identify search protocols that can produce a diverse set of chemically sensible conformations and have a reasonable probability of sampling biologically active space within a small number of trials. Our results suggest that both SPE and SOS are extremely competitive in this regard and produce very satisfactory results with as few as 500 conformations per molecule. The results improve even further when the raw conformations are minimized with a molecular mechanics force field to remove minor imperfections and any residual strain. These findings provide additional evidence that these methods are suitable for many everyday modeling tasks, both high- and low-throughput.

  13. Discerning some Tylenol brands using attenuated total reflection Fourier transform infrared data and multivariate analysis techniques.

    PubMed

    Msimanga, Huggins Z; Ollis, Robert J

    2010-06-01

    Principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA) were used to classify acetaminophen-containing medicines using their attenuated total reflection Fourier transform infrared (ATR-FT-IR) spectra. Four formulations of Tylenol (Arthritis Pain Relief, Extra Strength Pain Relief, 8 Hour Pain Relief, and Extra Strength Pain Relief Rapid Release) along with 98% pure acetaminophen were selected for this study because of the similarity of their spectral features, with correlation coefficients ranging from 0.9857 to 0.9988. Before acquiring spectra for the predictor matrix, the effects on spectral precision with respect to sample particle size (determined by sieve size opening), force gauge of the ATR accessory, sample reloading, and between-tablet variation were examined. Spectra were baseline corrected and normalized to unity before multivariate analysis. Analysis of variance (ANOVA) was used to study spectral precision. The large particles (35 mesh) showed large variance between spectra, while fine particles (120 mesh) indicated good spectral precision based on the F-test. Force gauge setting did not significantly affect precision. Sample reloading using the fine particle size and a constant force gauge setting of 50 units also did not compromise precision. Based on these observations, data acquisition for the predictor matrix was carried out with the fine particles (sieve size opening of 120 mesh) at a constant force gauge setting of 50 units. After removing outliers, PCA successfully classified the five samples in the first and second components, accounting for 45.0% and 24.5% of the variances, respectively. The four-component PLS-DA model (R² = 0.925 and Q² = 0.906) gave good test spectra predictions with an overall average of 0.961 ± 7.1% RSD versus the expected 1.0 prediction for the 20 test spectra used.
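
    For readers who want to reproduce this kind of workflow on their own spectra, the sketch below shows a generic PCA score computation plus PLS-DA classification in Python with scikit-learn. It is not the authors' procedure: the spectra are random placeholders, the preprocessing (baseline correction, normalization to unity) is assumed to have been done already, and the class sizes and component counts are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import label_binarize

# X: baseline-corrected, unit-normalized spectra (n_spectra x n_wavenumbers); y: class labels
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 600))           # placeholder spectra; replace with real data
y = np.repeat(np.arange(5), 20)           # five products, 20 spectra each (assumed)

# exploratory PCA: scores on the first two components for a score plot
scores = PCA(n_components=2).fit_transform(X)
print("PCA score matrix:", scores.shape)

# PLS-DA: regress a one-hot class matrix on the spectra, predict by the largest column
Y = label_binarize(y, classes=np.arange(5))
pls = PLSRegression(n_components=4).fit(X, Y)
pred = pls.predict(X).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```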

  14. Desiderata for Healthcare Integrated Data Repositories Based on Architectural Comparison of Three Public Repositories

    PubMed Central

    Huser, Vojtech; Cimino, James J.

    2013-01-01

    Integrated data repositories (IDRs) are indispensable tools for numerous biomedical research studies. We compare three large IDRs (Informatics for Integrating Biology and the Bedside (i2b2), HMO Research Network’s Virtual Data Warehouse (VDW) and Observational Medical Outcomes Partnership (OMOP) repository) in order to identify common architectural features that enable efficient storage and organization of large amounts of clinical data. We define three high-level classes of underlying data storage models and we analyze each repository using this classification. We look at how a set of sample facts is represented in each repository and conclude with a list of desiderata for IDRs that deal with the information storage model, terminology model, data integration and value-sets management. PMID:24551366

  15. Desiderata for healthcare integrated data repositories based on architectural comparison of three public repositories.

    PubMed

    Huser, Vojtech; Cimino, James J

    2013-01-01

    Integrated data repositories (IDRs) are indispensable tools for numerous biomedical research studies. We compare three large IDRs (Informatics for Integrating Biology and the Bedside (i2b2), HMO Research Network's Virtual Data Warehouse (VDW) and Observational Medical Outcomes Partnership (OMOP) repository) in order to identify common architectural features that enable efficient storage and organization of large amounts of clinical data. We define three high-level classes of underlying data storage models and we analyze each repository using this classification. We look at how a set of sample facts is represented in each repository and conclude with a list of desiderata for IDRs that deal with the information storage model, terminology model, data integration and value-sets management.

  16. Complex extreme learning machine applications in terahertz pulsed signals feature sets.

    PubMed

    Yin, X-X; Hadjiloucas, S; Zhang, Y

    2014-11-01

    This paper presents a novel approach to the automatic classification of very large data sets composed of terahertz pulse transient signals, highlighting their potential use in biochemical, biomedical, pharmaceutical and security applications. Two different types of THz spectra are considered in the classification process. Firstly, a binary classification study of poly-A and poly-C ribonucleic acid samples is performed. This is then contrasted with a difficult multi-class classification problem of spectra from six different powder samples that, although fairly indistinguishable in the optical spectrum, possess a few discernible spectral features in the terahertz part of the spectrum. Classification is performed using a complex-valued extreme learning machine algorithm that takes into account features in both the amplitude as well as the phase of the recorded spectra. Classification speed and accuracy are contrasted with that achieved using a support vector machine classifier. The study systematically compares the classifier performance achieved after adopting different Gaussian kernels when separating amplitude and phase signatures. The two signatures are presented as feature vectors for both training and testing purposes. The study confirms the utility of complex-valued extreme learning machine algorithms for classification of the very large data sets generated with current terahertz imaging spectrometers. The classifier can take into consideration heterogeneous layers within an object as would be required within a tomographic setting and is sufficiently robust to detect patterns hidden inside noisy terahertz data sets. The proposed study opens up the opportunity for the establishment of complex-valued extreme learning machine algorithms as new chemometric tools that will assist the wider proliferation of terahertz sensing technology for chemical sensing, quality control, security screening and clinical diagnosis. Furthermore, the proposed algorithm should also be very useful in other applications requiring the classification of very large datasets. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
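
    The paper uses a complex-valued extreme learning machine; as a simplified, real-valued stand-in, the sketch below shows the core ELM idea of a fixed random hidden layer with output weights obtained in closed form by least squares. The data, hidden-layer size and tanh activation are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def elm_train(X, Y, n_hidden=200, seed=0):
    """Basic real-valued extreme learning machine: random hidden layer,
    output weights solved in closed form by least squares."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))      # fixed random input weights
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                           # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, Y, rcond=None)     # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return (np.tanh(X @ W + b) @ beta).argmax(axis=1)

# toy use: X = feature vectors (e.g. amplitude/phase features), Y = one-hot labels
X = np.random.default_rng(2).normal(size=(120, 50))
y = np.repeat(np.arange(6), 20)                      # six classes of "powders" (assumed)
Y = np.eye(6)[y]
model = elm_train(X, Y)
print("training accuracy:", (elm_predict(X, *model) == y).mean())
```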

  17. Moisture content and gas sampling device

    NASA Technical Reports Server (NTRS)

    Krieg, H. C., Jr. (Inventor)

    1985-01-01

    An apparatus is described for measuring minute quantities of moisture and other contaminants within sealed enclosures such as electronic assemblies which may be subject to large external atmospheric pressure variations. An array of vacuum quality valves is arranged to permit cleansing of the test apparatus of residual atmospheric components from a vacuum source. This purging operation evacuates a gas sample bottle, which is then connected by valve settings to provide the drive for withdrawing a gas sample from the sealed enclosure under test into the sample bottle through a colorimetric detector tube (Drager tube) which indicates moisture content. The sample bottle may be disconnected and its contents (drawn from the test enclosure) separately subjected to mass spectrograph analysis.

  18. Laboratory and airborne techniques for measuring fluorescence of natural surfaces

    NASA Technical Reports Server (NTRS)

    Stoertz, G. E.; Hemphill, W. R.

    1972-01-01

    Techniques are described for obtaining fluorescence spectra from samples of natural surfaces that can be used to predict spectral regions in which these surfaces would emit solar-stimulated or laser-stimulated fluorescence detectable by a remote sensor. Scattered or reflected stray light caused large errors in spectrofluorometer analysis of natural sample surfaces. Most spurious light components can be eliminated by recording successive fluorescence spectra for each sample, using identical instrument settings, first with an appropriate glass or gelatin filter on the excitation side of the sample, and subsequently with the same filter on the emission side of the sample. This technique appears more accurate than any alternative technique for testing the fluorescence of natural surfaces.

  19. Effects of sample survey design on the accuracy of classification tree models in species distribution models

    USGS Publications Warehouse

    Edwards, T.C.; Cutler, D.R.; Zimmermann, N.E.; Geiser, L.; Moisen, Gretchen G.

    2006-01-01

    We evaluated the effects of probabilistic (hereafter DESIGN) and non-probabilistic (PURPOSIVE) sample surveys on resultant classification tree models for predicting the presence of four lichen species in the Pacific Northwest, USA. Models derived from both survey forms were assessed using an independent data set (EVALUATION). Measures of accuracy as gauged by resubstitution rates were similar for each lichen species irrespective of the underlying sample survey form. Cross-validation estimates of prediction accuracies were lower than resubstitution accuracies for all species and both design types, and in all cases were closer to the true prediction accuracies based on the EVALUATION data set. We argue that greater emphasis should be placed on calculating and reporting cross-validation accuracy rates rather than simple resubstitution accuracy rates. Evaluation of the DESIGN and PURPOSIVE tree models on the EVALUATION data set shows significantly lower prediction accuracy for the PURPOSIVE tree models relative to the DESIGN models, indicating that non-probabilistic sample surveys may generate models with limited predictive capability. These differences were consistent across all four lichen species, with 11 of the 12 possible species and sample survey type comparisons having significantly lower accuracy rates. Some differences in accuracy were as large as 50%. The classification tree structures also differed considerably both among and within the modelled species, depending on the sample survey form. Overlap in the predictor variables selected by the DESIGN and PURPOSIVE tree models ranged from only 20% to 38%, indicating the classification trees fit the two evaluated survey forms on different sets of predictor variables. The magnitude of these differences in predictor variables throws doubt on ecological interpretation derived from prediction models based on non-probabilistic sample surveys. © 2006 Elsevier B.V. All rights reserved.
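
    The gap between resubstitution and cross-validation accuracy discussed above can be illustrated with any classification tree implementation; a minimal scikit-learn sketch on synthetic presence/absence-style data (an assumption, not the lichen data) is shown below.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# stand-in for presence/absence observations with environmental predictors
X, y = make_classification(n_samples=400, n_features=10, n_informative=4, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
resub = tree.score(X, y)                                   # resubstitution accuracy (optimistic)
cv = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean()
print(f"resubstitution: {resub:.2f}   10-fold cross-validation: {cv:.2f}")
```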

  20. Bayesian Modal Estimation of the Four-Parameter Item Response Model in Real, Realistic, and Idealized Data Sets.

    PubMed

    Waller, Niels G; Feuerstahler, Leah

    2017-01-01

    In this study, we explored item and person parameter recovery of the four-parameter model (4PM) in over 24,000 real, realistic, and idealized data sets. In the first analyses, we fit the 4PM and three alternative models to data from three Minnesota Multiphasic Personality Inventory-Adolescent form factor scales using Bayesian modal estimation (BME). Our results indicated that the 4PM fits these scales better than simpler Item Response Theory (IRT) models. Next, using the parameter estimates from these real data analyses, we estimated 4PM item parameters in 6,000 realistic data sets to establish minimum sample size requirements for accurate item and person parameter recovery. Using a factorial design that crossed discrete levels of item parameters, sample size, and test length, we also fit the 4PM to an additional 18,000 idealized data sets to extend our parameter recovery findings. Our combined results demonstrated that 4PM item parameters and parameter functions (e.g., item response functions) can be accurately estimated using BME in moderate to large samples (N ⩾ 5,000) and person parameters can be accurately estimated in smaller samples (N ⩾ 1,000). In the supplemental files, we report annotated R code that shows how to estimate 4PM item and person parameters with the mirt package (Chalmers, 2012).
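
    For reference, the item response function of the 4PM adds an upper asymptote to the familiar three-parameter logistic model. The short sketch below simply evaluates that function; the parameter values are illustrative, not estimates from the MMPI-A scales.

```python
import numpy as np

def irf_4pm(theta, a, b, c, d):
    """Four-parameter logistic item response function:
    a = discrimination, b = difficulty, c = lower asymptote (guessing),
    d = upper asymptote (slipping)."""
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 9)
print(irf_4pm(theta, a=1.5, b=0.0, c=0.15, d=0.95))   # illustrative parameter values
```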

  1. Sampling from complex networks using distributed learning automata

    NASA Astrophysics Data System (ADS)

    Rezvanian, Alireza; Rahmati, Mohammad; Meybodi, Mohammad Reza

    2014-02-01

    A complex network provides a framework for modeling many real-world phenomena in the form of a network. In general, a complex network is treated as a graph of real-world phenomena such as biological networks, ecological networks, technological networks, information networks and particularly social networks. Recently, major studies have been reported on the characterization of social networks, reflecting a growing trend towards analysing online social networks as dynamic, complex, large-scale graphs. Because real networks are large and access to them is limited, the network model is characterized from an appropriate part of the network obtained by sampling. In this paper, a new sampling algorithm based on distributed learning automata is proposed for sampling from complex networks. In the proposed algorithm, a set of distributed learning automata cooperate with each other in order to take appropriate samples from the given network. To investigate the performance of the proposed algorithm, several simulation experiments are conducted on well-known complex networks. Experimental results are compared with those of several sampling methods in terms of different measures. The experimental results demonstrate the superiority of the proposed algorithm over the others.

  2. Navigating complex sample analysis using national survey data.

    PubMed

    Saylor, Jennifer; Friedmann, Erika; Lee, Hyeon Joo

    2012-01-01

    The National Center for Health Statistics conducts the National Health and Nutrition Examination Survey and other national surveys with probability-based complex sample designs. Goals of national surveys are to provide valid data for the population of the United States. Analyses of data from population surveys present unique challenges in the research process but are valuable avenues to study the health of the United States population. The aim of this study was to demonstrate the importance of using complex data analysis techniques for data obtained with complex multistage sampling design and provide an example of analysis using the SPSS Complex Samples procedure. Illustration of challenges and solutions specific to secondary data analysis of national databases are described using the National Health and Nutrition Examination Survey as the exemplar. Oversampling of small or sensitive groups provides necessary estimates of variability within small groups. Use of weights without complex samples accurately estimates population means and frequency from the sample after accounting for over- or undersampling of specific groups. Weighting alone leads to inappropriate population estimates of variability, because they are computed as if the measures were from the entire population rather than a sample in the data set. The SPSS Complex Samples procedure allows inclusion of all sampling design elements, stratification, clusters, and weights. Use of national data sets allows use of extensive, expensive, and well-documented survey data for exploratory questions but limits analysis to those variables included in the data set. The large sample permits examination of multiple predictors and interactive relationships. Merging data files, availability of data in several waves of surveys, and complex sampling are techniques used to provide a representative sample but present unique challenges. In sophisticated data analysis techniques, use of these data is optimized.
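
    The point made above, that weights alone recover the population mean but not its design-based variance, can be illustrated with a small Taylor-linearization sketch. The code below is a generic with-replacement stratum/PSU variance estimator on synthetic data; it is not the SPSS Complex Samples procedure or the NHANES design, and all variable names are hypothetical.

```python
import numpy as np

def weighted_mean_with_design_se(y, w, stratum, psu):
    """Survey-weighted mean with a Taylor-linearized, with-replacement variance
    estimate that uses strata and primary sampling units (PSUs). The weighted
    mean alone reproduces the point estimate; ignoring the clustering when
    computing its standard error would typically understate the uncertainty."""
    ybar = np.sum(w * y) / np.sum(w)
    u = w * (y - ybar)                       # linearized contributions
    var = 0.0
    for h in np.unique(stratum):
        m = stratum == h
        psu_tot = np.array([u[m & (psu == c)].sum() for c in np.unique(psu[m])])
        n_h = len(psu_tot)
        if n_h > 1:
            var += n_h / (n_h - 1) * np.sum((psu_tot - psu_tot.mean()) ** 2)
    return ybar, np.sqrt(var) / np.sum(w)

# toy data: 2 strata x 3 PSUs x 50 respondents with unequal sampling weights
rng = np.random.default_rng(3)
stratum = np.repeat([1, 1, 1, 2, 2, 2], 50)
psu = np.repeat(np.arange(6), 50)
w = rng.uniform(0.5, 3.0, size=300)
y = rng.normal(loc=stratum, scale=1.0)
print(weighted_mean_with_design_se(y, w, stratum, psu))
```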

  3. Establishing an academic biobank in a resource-challenged environment.

    PubMed

    Soo, Cassandra Claire; Mukomana, Freedom; Hazelhurst, Scott; Ramsay, Michele

    2017-05-24

    Past practices of informal sample collections and spreadsheets for data and sample management fall short of best-practice models for biobanking, and are neither cost effective nor efficient to adequately serve the needs of large research studies. The biobank of the Sydney Brenner Institute for Molecular Bioscience serves as a bioresource for institutional, national and international research collaborations. It provides high-quality human biospecimens from African populations, secure data and sample curation and storage, as well as monitored sample handling and management processes, to promote both non-communicable and infectious-disease research. Best-practice guidelines have been adapted to align with a low-resource setting and have been instrumental in the development of a quality-management system, including standard operating procedures and a quality-control regimen. Here, we provide a summary of 10 important considerations for initiating and establishing an academic research biobank in a low-resource setting. These include addressing ethical, legal, technical, accreditation and/or certification concerns and financial sustainability.

  4. Establishing an academic biobank in a resource-challenged environment

    PubMed Central

    Soo, C C; Mukomana, F; Hazelhurst, S; Ramsay, M

    2018-01-01

    Past practices of informal sample collections and spreadsheets for data and sample management fall short of best-practice models for biobanking, and are neither cost effective nor efficient to adequately serve the needs of large research studies. The biobank of the Sydney Brenner Institute for Molecular Bioscience serves as a bioresource for institutional, national and international research collaborations. It provides high-quality human biospecimens from African populations, secure data and sample curation and storage, as well as monitored sample handling and management processes, to promote both non-communicable and infectious-disease research. Best-practice guidelines have been adapted to align with a low-resource setting and have been instrumental in the development of a quality-management system, including standard operating procedures and a quality-control regimen. Here, we provide a summary of 10 important considerations for initiating and establishing an academic research biobank in a low-resource setting. These include addressing ethical, legal, technical, accreditation and/or certification concerns and financial sustainability. PMID:28604319

  5. Impact of hindcast length on estimates of seasonal climate predictability.

    PubMed

    Shi, W; Schaller, N; MacLeod, D; Palmer, T N; Weisheimer, A

    2015-03-16

    It has recently been argued that single-model seasonal forecast ensembles are overdispersive, implying that the real world is more predictable than indicated by estimates of so-called perfect model predictability, particularly over the North Atlantic. However, such estimates are based on relatively short forecast data sets comprising just 20 years of seasonal predictions. Here we study longer 40 year seasonal forecast data sets from multimodel seasonal forecast ensemble projects and show that sampling uncertainty due to the length of the hindcast periods is large. The skill of forecasting the North Atlantic Oscillation during winter varies within the 40 year data sets with high levels of skill found for some subperiods. It is demonstrated that while 20 year estimates of seasonal reliability can show evidence of overdispersive behavior, the 40 year estimates are more stable and show no evidence of overdispersion. Instead, the predominant feature on these longer time scales is underdispersion, particularly in the tropics. Key points: predictions can appear overdispersive due to hindcast length sampling error; longer hindcasts are more robust and underdispersive, especially in the tropics; twenty hindcasts are an inadequate sample size to assess seasonal forecast skill.

  6. Predicting vehicle fuel consumption patterns using floating vehicle data.

    PubMed

    Du, Yiman; Wu, Jianping; Yang, Senyan; Zhou, Liutong

    2017-09-01

    The status of energy consumption and air pollution in China is serious. It is important to analyze and predict the fuel consumption of various types of vehicles under different influencing factors. In order to fully describe the relationship between fuel consumption and these factors, massive amounts of floating vehicle data were used. The fuel consumption pattern and congestion pattern based on large samples of historical floating vehicle data were explored, drivers' information and vehicle parameters across different group classifications were probed, and average velocity and average fuel consumption were analyzed in the temporal and spatial dimensions. A fuel consumption forecasting model was established using a back-propagation neural network. Part of the sample set was used to train the forecasting model and the remainder was used as input for prediction. Copyright © 2017. Published by Elsevier B.V.
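
    A back-propagation network of the kind described is essentially a multilayer perceptron regressor. The hedged sketch below fits one to synthetic stand-in features (the speed-like and categorical-like columns are assumptions, not the floating vehicle data) and evaluates it on a held-out split.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# placeholder features, e.g. average speed, time of day, road class, vehicle type code
rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 4))
fuel = 8.0 - 1.5 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, fuel, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0))
model.fit(X_train, y_train)                       # back-propagation training
print("held-out R^2:", round(model.score(X_test, y_test), 3))
```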

  7. SORL1 variants and risk of late-onset Alzheimer's disease.

    PubMed

    Li, Yonghong; Rowland, Charles; Catanese, Joseph; Morris, John; Lovestone, Simon; O'Donovan, Michael C; Goate, Alison; Owen, Michael; Williams, Julie; Grupe, Andrew

    2008-02-01

    A recent study reported significant association of late-onset Alzheimer's disease (LOAD) with multiple single nucleotide polymorphisms (SNPs) and haplotypes in SORL1, a neuronal sortilin-related receptor protein known to be involved in the trafficking and processing of amyloid precursor protein. Here we attempted to validate this finding in three large, well characterized case-control series. Approximately 2000 samples from the three series were individually genotyped for 12 SNPs, including the 10 reported significant SNPs and 2 that constitute the reported significant haplotypes. A total of 25 allelic and haplotypic association tests were performed. One SNP rs2070045 was marginally replicated in the three sample sets combined (nominal P=0.035); however, this result does not remain significant when accounting for multiple comparisons. Further validation in other sample sets will be required to assess the true effects of SORL1 variants in LOAD.
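
    The basic analysis step described above, an allelic association test per SNP followed by an adjustment for the 25 comparisons, can be sketched as follows. The allele counts and SNP names are invented for illustration and are not the SORL1 data.

```python
import numpy as np
from scipy.stats import chi2_contingency

def allelic_test(case_counts, control_counts):
    """2x2 allelic chi-square test: rows = case/control, columns = allele counts."""
    table = np.array([case_counts, control_counts])
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    return p

# illustrative allele counts for a few hypothetical SNPs
snps = {"snpA": ([420, 380], [390, 410]),
        "snpB": ([450, 350], [395, 405]),
        "snpC": ([405, 395], [400, 400])}
n_tests = 25                                    # total number of tests performed
for name, (cases, controls) in snps.items():
    p = allelic_test(cases, controls)
    verdict = "significant" if p < 0.05 / n_tests else "not significant"
    print(f"{name}: nominal p={p:.3f}, {verdict} after Bonferroni correction")
```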

  8. Assessing Agreement between Multiple Raters with Missing Rating Information, Applied to Breast Cancer Tumour Grading

    PubMed Central

    Ellis, Ian O.; Green, Andrew R.; Hanka, Rudolf

    2008-01-01

    Background We consider the problem of assessing inter-rater agreement when there are missing data and a large number of raters. Previous studies have shown only ‘moderate’ agreement between pathologists in grading breast cancer tumour specimens. We analyse a large but incomplete data-set consisting of 24177 grades, on a discrete 1–3 scale, provided by 732 pathologists for 52 samples. Methodology/Principal Findings We review existing methods for analysing inter-rater agreement for multiple raters and demonstrate two further methods. Firstly, we examine a simple non-chance-corrected agreement score based on the observed proportion of agreements with the consensus for each sample, which makes no allowance for missing data. Secondly, treating grades as lying on a continuous scale representing tumour severity, we use a Bayesian latent trait method to model cumulative probabilities of assigning grade values as functions of the severity and clarity of the tumour and of rater-specific parameters representing boundaries between grades 1–2 and 2–3. We simulate from the fitted model to estimate, for each rater, the probability of agreement with the majority. Both methods suggest that there are differences between raters in terms of rating behaviour, most often caused by consistent over- or under-estimation of the grade boundaries, and also considerable variability in the distribution of grades assigned to many individual samples. The Bayesian model addresses the tendency of the agreement score to be biased upwards for raters who, by chance, see a relatively ‘easy’ set of samples. Conclusions/Significance Latent trait models can be adapted to provide novel information about the nature of inter-rater agreement when the number of raters is large and there are missing data. In this large study there is substantial variability between pathologists and uncertainty in the identity of the ‘true’ grade of many of the breast cancer tumours, a fact often ignored in clinical studies. PMID:18698346
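
    The simple non-chance-corrected score described first, each rater's proportion of agreement with the per-sample consensus grade while skipping missing ratings, can be computed directly; a small sketch with invented grades follows. The Bayesian latent trait model is not reproduced here.

```python
import numpy as np

def consensus_agreement(grades):
    """Per-rater proportion of agreement with the per-sample modal (consensus)
    grade, ignoring missing entries (NaN). grades: raters x samples array."""
    n_raters, n_samples = grades.shape
    consensus = np.full(n_samples, np.nan)
    for j in range(n_samples):
        col = grades[:, j]
        vals = col[~np.isnan(col)]
        if vals.size:
            u, c = np.unique(vals, return_counts=True)
            consensus[j] = u[c.argmax()]
    scores = []
    for i in range(n_raters):
        row = grades[i]
        seen = ~np.isnan(row) & ~np.isnan(consensus)
        scores.append(np.mean(row[seen] == consensus[seen]) if seen.any() else np.nan)
    return np.array(scores)

# toy data: 5 raters x 8 samples on a 1-3 scale with missing ratings
g = np.array([[1, 2, 2, np.nan, 3, 2, 1, 2],
              [1, 2, 3, 2, 3, 2, np.nan, 2],
              [2, 2, 2, 2, 3, 3, 1, 2],
              [1, np.nan, 2, 2, 2, 2, 1, 3],
              [1, 2, 2, 2, 3, 2, 1, 2]], dtype=float)
print(consensus_agreement(g))
```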

  9. From cacti to carnivores: Improved phylotranscriptomic sampling and hierarchical homology inference provide further insight into the evolution of Caryophyllales.

    PubMed

    Walker, Joseph F; Yang, Ya; Feng, Tao; Timoneda, Alfonso; Mikenas, Jessica; Hutchison, Vera; Edwards, Caroline; Wang, Ning; Ahluwalia, Sonia; Olivieri, Julia; Walker-Hale, Nathanael; Majure, Lucas C; Puente, Raúl; Kadereit, Gudrun; Lauterbach, Maximilian; Eggli, Urs; Flores-Olvera, Hilda; Ochoterena, Helga; Brockington, Samuel F; Moore, Michael J; Smith, Stephen A

    2018-03-01

    The Caryophyllales contain ~12,500 species and are known for their cosmopolitan distribution, convergence of trait evolution, and extreme adaptations. Some relationships within the Caryophyllales, like those of many large plant clades, remain unclear, and phylogenetic studies often recover alternative hypotheses. We explore the utility of broad and dense transcriptome sampling across the order for resolving evolutionary relationships in Caryophyllales. We generated 84 transcriptomes and combined these with 224 publicly available transcriptomes to perform a phylogenomic analysis of Caryophyllales. To overcome the computational challenge of ortholog detection in such a large data set, we developed an approach for clustering gene families that allowed us to analyze >300 transcriptomes and genomes. We then inferred the species relationships using multiple methods and performed gene-tree conflict analyses. Our phylogenetic analyses resolved many clades with strong support, but also showed significant gene-tree discordance. This discordance is not only a common feature of phylogenomic studies, but also represents an opportunity to understand processes that have structured phylogenies. We also found taxon sampling influences species-tree inference, highlighting the importance of more focused studies with additional taxon sampling. Transcriptomes are useful both for species-tree inference and for uncovering evolutionary complexity within lineages. Through analyses of gene-tree conflict and multiple methods of species-tree inference, we demonstrate that phylogenomic data can provide unparalleled insight into the evolutionary history of Caryophyllales. We also discuss a method for overcoming computational challenges associated with homolog clustering in large data sets. © 2018 The Authors. American Journal of Botany is published by Wiley Periodicals, Inc. on behalf of the Botanical Society of America.

  10. Statistical Searches for Microlensing Events in Large, Non-uniformly Sampled Time-Domain Surveys: A Test Using Palomar Transient Factory Data

    NASA Astrophysics Data System (ADS)

    Price-Whelan, Adrian M.; Agüeros, Marcel A.; Fournier, Amanda P.; Street, Rachel; Ofek, Eran O.; Covey, Kevin R.; Levitan, David; Laher, Russ R.; Sesar, Branimir; Surace, Jason

    2014-01-01

    Many photometric time-domain surveys are driven by specific goals, such as searches for supernovae or transiting exoplanets, which set the cadence with which fields are re-imaged. In the case of the Palomar Transient Factory (PTF), several sub-surveys are conducted in parallel, leading to non-uniform sampling over its ~20,000 deg² footprint. While the median 7.26 deg² PTF field has been imaged ~40 times in the R band, ~2300 deg² have been observed >100 times. We use PTF data to study the trade-off between searching for microlensing events in a survey whose footprint is much larger than that of typical microlensing searches, but with far-from-optimal time sampling. To examine the probability that microlensing events can be recovered in these data, we test statistics used on uniformly sampled data to identify variables and transients. We find that the von Neumann ratio performs best for identifying simulated microlensing events in our data. We develop a selection method using this statistic and apply it to data from fields with >10 R-band observations, 1.1 × 10⁹ light curves, uncovering three candidate microlensing events. We lack simultaneous, multi-color photometry to confirm these as microlensing events. However, their number is consistent with predictions for the event rate in the PTF footprint over the survey's three years of operations, as estimated from near-field microlensing models. This work can help constrain all-sky event rate predictions and tests microlensing signal recovery in large data sets, which will be useful to future time-domain surveys, such as that planned with the Large Synoptic Survey Telescope.
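
    The von Neumann ratio used for candidate selection is straightforward to compute for an unevenly sampled light curve (it uses only successive differences and the overall variance). The sketch below contrasts a flat, noisy light curve with one containing a smooth microlensing-like bump; the magnitudes and bump shape are invented, not PTF data.

```python
import numpy as np

def von_neumann_ratio(mag):
    """von Neumann ratio of a light curve: mean squared successive difference
    divided by the sample variance. White noise gives values near 2; values
    well below 2 indicate smooth, correlated variability such as a
    microlensing excursion."""
    mag = np.asarray(mag, dtype=float)
    diffs = np.diff(mag)
    return np.mean(diffs ** 2) / np.var(mag, ddof=1)

rng = np.random.default_rng(5)
flat = rng.normal(18.0, 0.05, size=60)                                   # constant star + noise
bump = flat - 0.4 * np.exp(-0.5 * ((np.arange(60) - 30) / 5.0) ** 2)     # smooth brightening
print(round(von_neumann_ratio(flat), 2), round(von_neumann_ratio(bump), 2))
```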

  11. Distribution and sources of surfzone bacteria at Huntington Beach before and after disinfection on an ocean outfall-- a frequency-domain analysis.

    PubMed

    Noble, M A; Xu, J P; Robertson, G L; Rosenfeld, L K

    2006-06-01

    Fecal indicator bacteria (FIB) were measured approximately 5 days a week in ankle-depth water at 19 surfzone stations along Huntington Beach and Newport Beach, California, from 1998 to the end of 2003. These sampling periods span the time before and after treated sewage effluent, discharged into the coastal ocean from the local outfall, was disinfected. Bacterial samples were also taken in the vicinity of the outfall during the pre- and post-disinfection periods. Our analysis of the results from both data sets suggests that land-based sources, rather than the local outfall, were the source of the FIB responsible for the frequent closures and postings of local beaches in the summers of 2001 and 2002. Because the annual cycle is the dominant frequency in the fecal and total coliform data sets at most sampling stations, we infer that sources associated with local runoff were responsible for the majority of coliform contamination along wide stretches of the beach. The dominant fortnightly cycle in enterococci at many surfzone sampling stations suggests that the source for these relatively frequent bacteria contamination events in summer is related to the wetting and draining of the land due to the large tidal excursions found during spring tides. Along the most frequently closed section of the beach at stations 3N-15N, the fortnightly cycle is dominant in all FIBs. The strikingly different spatial and spectral patterns found in coliform and in enterococci suggest the presence of different sources, at least for large sections of beach. The presence of a relatively large enterococci fortnightly cycle along the beaches near Newport Harbor indicates that contamination sources similar to those found off Huntington Beach are present, though not at high enough levels to close the Newport beaches.
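
    Frequency-domain analysis of this kind of irregularly sampled (roughly five-days-a-week) monitoring series is commonly done with a Lomb-Scargle periodogram; a hedged sketch on synthetic log-concentration data containing an annual and a spring-neap (about 14.8 day) cycle follows. The series is simulated and is not the Huntington Beach data.

```python
import numpy as np
from scipy.signal import lombscargle

# synthetic log-FIB series sampled on ~750 of ~1095 days over 3 years, with an
# annual cycle, a fortnightly (spring-neap) cycle and noise
rng = np.random.default_rng(6)
t = np.sort(rng.choice(np.arange(3 * 365), size=750, replace=False)).astype(float)
y = (0.8 * np.sin(2 * np.pi * t / 365.25) +
     0.4 * np.sin(2 * np.pi * t / 14.77) +
     rng.normal(scale=0.3, size=t.size))

periods = np.linspace(5, 400, 2000)
power = lombscargle(t, y - y.mean(), 2 * np.pi / periods)   # angular frequencies

print("dominant period (days):", round(periods[np.argmax(power)], 1))
short = periods < 30
print("dominant short period (days):", round(periods[short][np.argmax(power[short])], 1))
```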

  12. A large-scale cryoelectronic system for biological sample banking

    NASA Astrophysics Data System (ADS)

    Shirley, Stephen G.; Durst, Christopher H. P.; Fuchs, Christian C.; Zimmermann, Heiko; Ihmig, Frank R.

    2009-11-01

    We describe a polymorphic electronic infrastructure for managing biological samples stored over liquid nitrogen. As part of this system we have developed new cryocontainers and carrier plates attached to Flash memory chips to have a redundant and portable set of data at each sample. Our experimental investigations show that basic Flash operation and endurance is adequate for the application down to liquid nitrogen temperatures. This identification technology can provide the best sample identification, documentation and tracking that brings added value to each sample. The first application of the system is in a worldwide collaborative research towards the production of an AIDS vaccine. The functionality and versatility of the system can lead to an essential optimization of sample and data exchange for global clinical studies.

  13. Digital robust active control law synthesis for large order flexible structure using parameter optimization

    NASA Technical Reports Server (NTRS)

    Mukhopadhyay, V.

    1988-01-01

    A generic procedure for the parameter optimization of a digital control law for a large-order flexible flight vehicle or large space structure modeled as a sampled-data system is presented. A linear quadratic Gaussian type cost function was minimized, while satisfying a set of constraints on the steady-state rms values of selected design responses, using a constrained optimization technique to meet multiple design requirements. Analytical expressions for the gradients of the cost function and the design constraints on mean square responses with respect to the control law design variables are presented.

  14. Ground-water quality beneath irrigated agriculture in the central High Plains aquifer, 1999-2000

    USGS Publications Warehouse

    Bruce, Breton W.; Becker, Mark F.; Pope, Larry M.; Gurdak, Jason J.

    2003-01-01

    In 1999 and 2000, 30 water-quality monitoring wells were installed in the central High Plains aquifer to evaluate the quality of recently recharged ground water in areas of irrigated agriculture and to identify the factors affecting ground-water quality. Wells were installed adjacent to irrigated agricultural fields with 10- or 20-foot screened intervals placed near the water table. Each well was sampled once for about 100 water-quality constituents associated with agricultural practices. Water samples from 70 percent of the wells (21 of 30 sites) contained nitrate concentrations larger than expected background concentrations (about 3 mg/L as N) and detectable pesticides. Atrazine or its metabolite, deethylatrazine, were detected with greater frequency than other pesticides and were present in all 21 samples where pesticides were detected. The 21 samples with detectable pesticides also contained tritium concentrations large enough to indicate that at least some part of the water sample had been recharged within about the last 50 years. These 21 ground-water samples are considered to show water-quality effects related to irrigated agriculture. The remaining 9 ground-water samples contained no pesticides, small tritium concentrations, and nitrate concentrations less than 3.45 milligrams per liter as nitrogen. These samples are considered unaffected by the irrigated agricultural land-use setting. Nitrogen isotope ratios indicate that commercial fertilizer was the dominant source of nitrate in 13 of the 21 samples affected by irrigated agriculture. Nitrogen isotope ratios for 4 of these 21 samples were indicative of an animal waste source. Dissolved-solids concentrations were larger in samples affected by irrigated agriculture, with large sulfate concentrations having strong correlation with large dissolved-solids concentrations in these samples. A strong statistical correlation is shown between samples affected by irrigated agriculture and sites with large rates of pesticide and nitrogen applications and shallow depths to ground water.

  15. Adaptive Landscape Flattening Accelerates Sampling of Alchemical Space in Multisite λ Dynamics.

    PubMed

    Hayes, Ryan L; Armacost, Kira A; Vilseck, Jonah Z; Brooks, Charles L

    2017-04-20

    Multisite λ dynamics (MSλD) is a powerful emerging method in free energy calculation that allows prediction of relative free energies for a large set of compounds from very few simulations. Calculating free energy differences between substituents that constitute large volume or flexibility jumps in chemical space is difficult for free energy methods in general, and for MSλD in particular, due to large free energy barriers in alchemical space. This study demonstrates that a simple biasing potential can flatten these barriers and introduces an algorithm that determines system specific biasing potential coefficients. Two sources of error, deep traps at the end points and solvent disruption by hard-core potentials, are identified. Both scale with the size of the perturbed substituent and are removed by sharp biasing potentials and a new soft-core implementation, respectively. MSλD with landscape flattening is demonstrated on two sets of molecules: derivatives of the heat shock protein 90 inhibitor geldanamycin and derivatives of benzoquinone. In the benzoquinone system, landscape flattening leads to 2 orders of magnitude improvement in transition rates between substituents and robust solvation free energies. Landscape flattening opens up new applications for MSλD by enabling larger chemical perturbations to be sampled with improved precision and accuracy.

  16. Estimating Divergence Parameters With Small Samples From a Large Number of Loci

    PubMed Central

    Wang, Yong; Hey, Jody

    2010-01-01

    Most methods for studying divergence with gene flow rely upon data from many individuals at few loci. Such data can be useful for inferring recent population history but they are unlikely to contain sufficient information about older events. However, the growing availability of genome sequences suggests a different kind of sampling scheme, one that may be more suited to studying relatively ancient divergence. Data sets extracted from whole-genome alignments may represent very few individuals but contain a very large number of loci. To take advantage of such data we developed a new maximum-likelihood method for genomic data under the isolation-with-migration model. Unlike many coalescent-based likelihood methods, our method does not rely on Monte Carlo sampling of genealogies, but rather provides a precise calculation of the likelihood by numerical integration over all genealogies. We demonstrate that the method works well on simulated data sets. We also consider two models for accommodating mutation rate variation among loci and find that the model that treats mutation rates as random variables leads to better estimates. We applied the method to the divergence of Drosophila melanogaster and D. simulans and detected a low, but statistically significant, signal of gene flow from D. simulans to D. melanogaster. PMID:19917765

  17. Toward accelerating landslide mapping with interactive machine learning techniques

    NASA Astrophysics Data System (ADS)

    Stumpf, André; Lachiche, Nicolas; Malet, Jean-Philippe; Kerle, Norman; Puissant, Anne

    2013-04-01

    Despite important advances in the development of more automated methods for landslide mapping from optical remote sensing images, the elaboration of inventory maps after major triggering events still remains a tedious task. Image classification with expert-defined rules typically still requires significant manual labour for the elaboration and adaptation of rule sets for each particular case. Machine learning algorithms, by contrast, have the ability to learn and identify complex image patterns from labelled examples but may require relatively large amounts of training data. In order to reduce the amount of required training data, active learning has evolved as a key concept to guide the sampling for applications such as document classification, genetics and remote sensing. The general underlying idea of most active learning approaches is to initialize a machine learning model with a small training set, and to subsequently exploit the model state and/or the data structure to iteratively select the most valuable samples that should be labelled by the user and added to the training set. With relatively few queries and labelled samples, an active learning strategy should ideally yield at least the same accuracy as an equivalent classifier trained with many randomly selected samples. Our study was dedicated to the development of an active learning approach for landslide mapping from VHR remote sensing images with special consideration of the spatial distribution of the samples. The developed approach is a region-based query heuristic that guides the user's attention towards a few compact spatial batches rather than distributed points, resulting in time savings of 50% and more compared to standard active learning techniques. The approach was tested with multi-temporal and multi-sensor satellite images capturing recent large-scale triggering events in Brazil and China and demonstrated balanced user's and producer's accuracies between 74% and 80%. The assessment also included an experimental evaluation of the uncertainties of manual mappings from multiple experts and demonstrated strong relationships between the uncertainty of the experts and the machine learning model.
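
    The developed approach is region-based and is not reproduced here; as a simpler point-wise illustration of the underlying active-learning loop, the sketch below uses least-confidence uncertainty sampling with a random forest on synthetic data. The classifier choice, batch size and number of query rounds are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# pool-based active learning with least-confidence (uncertainty) sampling
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9, 0.1], random_state=0)
labeled = list(np.random.default_rng(7).choice(len(X), size=20, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(10):                               # 10 query rounds, 20 labels each
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[labeled], y[labeled])
    conf = clf.predict_proba(X[pool]).max(axis=1) # model confidence on the unlabelled pool
    query = np.argsort(conf)[:20]                 # least confident samples -> ask the expert
    for q in sorted(query, reverse=True):         # pop from the end so indices stay valid
        labeled.append(pool.pop(q))

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[labeled], y[labeled])
print("labels used:", len(labeled),
      "accuracy on remaining pool:", round(clf.score(X[pool], y[pool]), 3))
```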

  18. Translational Genomics Research Institute (TGen): Quantified Cancer Cell Line Encyclopedia (CCLE) RNA-seq Data | Office of Cancer Genomics

    Cancer.gov

    Many applications analyze quantified transcript-level abundances to make inferences.  Having completed this computation across the large sample set, the CTD2 Center at the Translational Genomics Research Institute presents the quantified data in a straightforward, consolidated form for these types of analyses.

  19. Seeking the Cause of Correlations among Mental Abilities: Large Twin Analysis in a National Testing Program.

    ERIC Educational Resources Information Center

    Page, Ellis B.; Jarjoura, David

    1979-01-01

    A computer scan of ACT Assessment records identified 3,427 sets of twins. The Hardy-Weinberg rule was used to estimate the proportion of monozygotic twins in the sample. Matrices of genetic and environmental influences were produced. The heaviest loadings were clearly in the genetic matrix. (SJL)
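
    The classical way to estimate the monozygotic proportion from same-sex versus opposite-sex pair counts is Weinberg's difference method (the abstract refers to this as the Hardy-Weinberg rule); a short sketch with invented counts is shown below.

```python
# Weinberg's difference method: dizygotic pairs are (roughly) equally likely to be
# same-sex or opposite-sex, so the monozygotic count can be estimated as the excess
# of same-sex pairs over opposite-sex pairs. The counts below are illustrative,
# not the ACT twin sample.
same_sex, opposite_sex = 2300, 1127
total = same_sex + opposite_sex
mz_estimate = same_sex - opposite_sex
print(f"estimated proportion monozygotic: {mz_estimate / total:.2f}")
```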

  20. Height as a Measure of Success in Academe.

    ERIC Educational Resources Information Center

    Hensley, Wayne E.

    This paper presents the results of two studies at a large mid-Atlantic university that examined the height/success paradigm within the context of university settings. Specifically, are the trends observed among taller persons in police and sales work equally valid for university professors? A random sample of faculty (N=90) revealed that…

  1. Bi-Factor MIRT Observed-Score Equating for Mixed-Format Tests

    ERIC Educational Resources Information Center

    Lee, Guemin; Lee, Won-Chan

    2016-01-01

    The main purposes of this study were to develop bi-factor multidimensional item response theory (BF-MIRT) observed-score equating procedures for mixed-format tests and to investigate relative appropriateness of the proposed procedures. Using data from a large-scale testing program, three types of pseudo data sets were formulated: matched samples,…

  2. Application of the National Assessment of Educational Progress Philosophy in San Bernardino City Unified School District.

    ERIC Educational Resources Information Center

    Bonney, Lewis A.

    The steps taken by a large urban school district to develop and implement an objectives-based curriculum with criterion-referenced assessment of student progress are described. These steps include: goal setting, development of curriculum objectives, construction of assessment exercises, matrix sampling in test administration, and reporting of…

  3. Longitudinal Study on Fluency among Novice Learners of Japanese

    ERIC Educational Resources Information Center

    Hirotani, Maki; Matsumoto, Kazumi; Fukada, Atsusi

    2012-01-01

    The present study examined various aspects of the development of learners' fluency in Japanese using a large set of speech samples collected over a long period, using an online speaking practice/assessment system called "Speak Everywhere." The purpose of the present study was to examine: (1) how the fluency related measures changed over…

  4. School Composition and Peer Effects in Distinctive Organizational Settings

    ERIC Educational Resources Information Center

    Marks, Helen M.

    2002-01-01

    This chapter reviews the research on school composition and peer effects from three comparative perspectives--Catholic and public schools, single-sex and coeducational schools, and small and large schools. Most of the research is sociological, focuses on high schools, and draws on national samples. The chapter seeks to discern cumulative trends in…

  5. MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive.

    PubMed

    Bernstein, Matthew N; Doan, AnHai; Dewey, Colin N

    2017-09-15

    The NCBI's Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms that should be used to describe the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA. We present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline. The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline. cdewey@biostat.wisc.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  6. Fitting parametric random effects models in very large data sets with application to VHA national data

    PubMed Central

    2012-01-01

    Background With the current focus on personalized medicine, patient/subject level inference is often of key interest in translational research. As a result, random effects models (REM) are becoming popular for patient level inference. However, for very large data sets that are characterized by large sample size, it can be difficult to fit REM using commonly available statistical software such as SAS since they require inordinate amounts of computer time and memory allocations beyond what are available preventing model convergence. For example, in a retrospective cohort study of over 800,000 Veterans with type 2 diabetes with longitudinal data over 5 years, fitting REM via generalized linear mixed modeling using currently available standard procedures in SAS (e.g. PROC GLIMMIX) was very difficult and same problems exist in Stata’s gllamm or R’s lme packages. Thus, this study proposes and assesses the performance of a meta regression approach and makes comparison with methods based on sampling of the full data. Data We use both simulated and real data from a national cohort of Veterans with type 2 diabetes (n=890,394) which was created by linking multiple patient and administrative files resulting in a cohort with longitudinal data collected over 5 years. Methods and results The outcome of interest was mean annual HbA1c measured over a 5 years period. Using this outcome, we compared parameter estimates from the proposed random effects meta regression (REMR) with estimates based on simple random sampling and VISN (Veterans Integrated Service Networks) based stratified sampling of the full data. Our results indicate that REMR provides parameter estimates that are less likely to be biased with tighter confidence intervals when the VISN level estimates are homogenous. Conclusion When the interest is to fit REM in repeated measures data with very large sample size, REMR can be used as a good alternative. It leads to reasonable inference for both Gaussian and non-Gaussian responses if parameter estimates are homogeneous across VISNs. PMID:23095325
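
    A minimal sketch of the random effects meta regression (REMR) idea, fitting a random-intercept model within each stratum and pooling the stratum-specific estimates by inverse-variance weighting, is shown below using statsmodels on a small synthetic cohort. The formula, stratum variable and data are illustrative assumptions, not the VHA analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def remr_slope(df):
    """Fit a patient-level random-intercept model within each stratum ('visn'),
    then pool the stratum-specific time slopes by inverse-variance weighting."""
    estimates, variances = [], []
    for _, d in df.groupby("visn"):
        fit = smf.mixedlm("hba1c ~ year", data=d, groups=d["patient"]).fit()
        estimates.append(fit.params["year"])
        variances.append(fit.bse["year"] ** 2)
    w = 1.0 / np.asarray(variances)
    pooled = np.sum(w * estimates) / np.sum(w)
    return pooled, np.sqrt(1.0 / np.sum(w))

# small synthetic cohort: 3 strata x 100 patients x 5 annual HbA1c values
rng = np.random.default_rng(8)
rows = []
for v in range(3):
    for p in range(100):
        base = rng.normal(7.5, 0.8)                     # patient-specific intercept
        for yr in range(5):
            rows.append({"visn": v, "patient": f"{v}-{p}",
                         "year": yr, "hba1c": base + 0.05 * yr + rng.normal(0, 0.3)})
print(remr_slope(pd.DataFrame(rows)))
```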

  7. Electrodynamics of the middle atmosphere: Superpressure balloon program

    NASA Technical Reports Server (NTRS)

    Holzworth, Robert H.

    1987-01-01

    In this experiment a comprehensive set of electrical parameters was measured during eight long-duration flights in the southern hemisphere stratosphere. These flights resulted in the largest data set ever collected from the stratosphere, which had never before been sampled electrodynamically in such a systematic manner. New discoveries include short term variability in the planetary scale electric current system, the unexpected observation of stratospheric conductivity variations over thunderstorms and the observation of direct stratospheric conductivity variations following a relatively small solar flare. Major statistical studies were conducted of the large scale current systems, the stratospheric conductivity and the neutral gravity waves (from pressure and temperature data) using the entire data set.

  8. Beyond Linear Sequence Comparisons: The use of genome-levelcharacters for phylogenetic reconstruction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Boore, Jeffrey L.

    2004-11-27

    Although the phylogenetic relationships of many organisms have been convincingly resolved by the comparisons of nucleotide or amino acid sequences, others have remained equivocal despite great effort. Now that large-scale genome sequencing projects are sampling many lineages, it is becoming feasible to compare large data sets of genome-level features and to develop this as a tool for phylogenetic reconstruction that has advantages over conventional sequence comparisons. Although it is unlikely that these will address a large number of evolutionary branch points across the broad tree of life due to the infeasibility of such sampling, they have great potential for convincingly resolving many critical, contested relationships for which no other data seems promising. However, it is important that we recognize potential pitfalls, establish reasonable standards for acceptance, and employ rigorous methodology to guard against a return to earlier days of scenario-driven evolutionary reconstructions.

  9. A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets.

    PubMed

    Carrig, Madeline M; Manrique-Vallier, Daniel; Ranby, Krista W; Reiter, Jerome P; Hoyle, Rick H

    2015-01-01

    Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches.

  10. A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets

    PubMed Central

    Carrig, Madeline M.; Manrique-Vallier, Daniel; Ranby, Krista W.; Reiter, Jerome P.; Hoyle, Rick H.

    2015-01-01

    Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches. PMID:26257437

  11. Copy number variation signature to predict human ancestry

    PubMed Central

    2012-01-01

    Background Copy number variations (CNVs) are genomic structural variants that are found in healthy populations and have been observed to be associated with disease susceptibility. Existing methods for CNV detection are often performed on a sample-by-sample basis, which is not ideal for large datasets where common CNVs must be estimated by comparing the frequency of CNVs in the individual samples. Here we describe a simple and novel approach to locate genome-wide CNVs common to a specific population, using human ancestry as the phenotype. Results We utilized our previously published Genome Alteration Detection Analysis (GADA) algorithm to identify common ancestry CNVs (caCNVs) and built a caCNV model to predict population structure. We identified a 73 caCNV signature using a training set of 225 healthy individuals from European, Asian, and African ancestry. The signature was validated on an independent test set of 300 individuals with similar ancestral background. The error rate in predicting ancestry in this test set was 2% using the 73 caCNV signature. Among the caCNVs identified, several were previously confirmed experimentally to vary by ancestry. Our signature also contains a caCNV region with a single microRNA (MIR270), which represents the first reported variation of microRNA by ancestry. Conclusions We developed a new methodology to identify common CNVs and demonstrated its performance by building a caCNV signature to predict human ancestry with high accuracy. The utility of our approach could be extended to large case–control studies to identify CNV signatures for other phenotypes such as disease susceptibility and drug response. PMID:23270563

  12. The Impact of the Tree Prior on Molecular Dating of Data Sets Containing a Mixture of Inter- and Intraspecies Sampling.

    PubMed

    Ritchie, Andrew M; Lo, Nathan; Ho, Simon Y W

    2017-05-01

    In Bayesian phylogenetic analyses of genetic data, prior probability distributions need to be specified for the model parameters, including the tree. When Bayesian methods are used for molecular dating, available tree priors include those designed for species-level data, such as the pure-birth and birth-death priors, and coalescent-based priors designed for population-level data. However, molecular dating methods are frequently applied to data sets that include multiple individuals across multiple species. Such data sets violate the assumptions of both the speciation and coalescent-based tree priors, making it unclear which should be chosen and whether this choice can affect the estimation of node times. To investigate this problem, we used a simulation approach to produce data sets with different proportions of within- and between-species sampling under the multispecies coalescent model. These data sets were then analyzed under pure-birth, birth-death, constant-size coalescent, and skyline coalescent tree priors. We also explored the ability of Bayesian model testing to select the best-performing priors. We confirmed the applicability of our results to empirical data sets from cetaceans, phocids, and coregonid whitefish. Estimates of node times were generally robust to the choice of tree prior, but some combinations of tree priors and sampling schemes led to large differences in the age estimates. In particular, the pure-birth tree prior frequently led to inaccurate estimates for data sets containing a mixture of inter- and intraspecific sampling, whereas the birth-death and skyline coalescent priors produced stable results across all scenarios. Model testing provided an adequate means of rejecting inappropriate tree priors. Our results suggest that tree priors do not strongly affect Bayesian molecular dating results in most cases, even when severely misspecified. However, the choice of tree prior can be significant for the accuracy of dating results in the case of data sets with mixed inter- and intraspecies sampling. [Bayesian phylogenetic methods; model testing; molecular dating; node time; tree prior.]. © The authors 2016. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please e-mail: journals.permission@oup.com.

  13. GRAAL - Griggs-type Apparatus equipped with Acoustics in the Laboratory: a new instrument to explore the rheology of rocks at high pressure

    NASA Astrophysics Data System (ADS)

    Schubnel, A.; Champallier, R.; Precigout, J.; Pinquier, Y.; Ferrand, T. P.; Incel, S.; Hilairet, N.; Labrousse, L.; Renner, J.; Green, H. W., II; Stunitz, H.; Jolivet, L.

    2015-12-01

    Two new-generation solid-medium Griggs-type apparatus have been set up at the Laboratoire de Géologie of ENS Paris and the Institut des Sciences de la Terre d'Orléans (ISTO). These new set-ups allow controlled rock deformation experiments on large-volume samples at up to 5 GPa and 1300°C. Careful pressure-stress calibration will be performed (using D-DIA and/or Paterson-type experiments as standards), and strain, stress, and pressure will be measured using modern techniques and state-of-the-art salt assemblies. Focusing on rheology, the pressure vessel at ISTO has been designed to deform samples of large diameter (8 mm) at confining pressures of up to 3 GPa. Thanks to this large sample size, the new vessel will make it possible to explore the microstructures related to deformation processes occurring at pressures of the deep lithosphere and in subduction zones. This new apparatus moreover includes a room below the pressure vessel in order to develop a basal load cell as close as possible to the sample. This new design, in progress, aims to significantly improve the accuracy of stress measurements in the Griggs-type apparatus. The ultimate goal is to establish a technique able to routinely quantify the rheology of natural rocks between 0.5 and 5 GPa. Although fundamental for documenting the rheology of the lithosphere, such a technique is still missing in rock mechanics. Focusing on the evolution of physical and mechanical properties during mineral phase transformations, the vessel at ENS is equipped with continuous acoustic emission (AE) multi-sensor monitoring in order to "listen" to the sample during deformation. These continuous recordings make it possible to detect regular AE-like signals during dynamic crack propagation, as well as non-impulsive signals, which may help identify laboratory analogs of non-volcanic tremor and low-frequency earthquake signals. P and S elastic wave velocities will also be measured during deformation, since elastic wave velocities may be a good non-destructive proxy for tracking mineral reaction extent under in-situ conditions. Attempts will also be made to develop a tool to measure P and S wave anisotropy, at least along certain directions. Both data sets may prove crucial for interpreting the latest generation of tomographic imaging.

  14. Meta-analysis of gene expression profiles associated with histological classification and survival in 829 ovarian cancer samples.

    PubMed

    Fekete, Tibor; Rásó, Erzsébet; Pete, Imre; Tegze, Bálint; Liko, István; Munkácsy, Gyöngyi; Sipos, Norbert; Rigó, János; Györffy, Balázs

    2012-07-01

    Transcriptomic analysis of global gene expression in ovarian carcinoma can identify dysregulated genes that can serve as molecular markers for histology subtypes and survival. The aim of our study was to validate previous candidate signatures in an independent setting and to identify single genes that can serve as biomarkers for ovarian cancer progression. As several datasets are available in the GEO today, we were able to perform a true meta-analysis. First, 829 samples (11 datasets) were downloaded, and the predictive power of 16 previously published gene sets was assessed. Of these, eight were able to discriminate histology subtypes, and none was able to predict survival. To overcome the differences in previous studies, we used the 829 samples to identify new predictors. Then, we collected 64 ovarian cancer samples (median relapse-free survival 24.5 months) and performed TaqMan Real-Time Polymerase Chain Reaction (RT-PCR) analysis for the best 40 genes associated with histology subtypes and survival. Over 90% of subtype-associated genes were confirmed. Overall survival was effectively predicted by hormone receptors (PGR and ESR2) and by TSPAN8. Relapse-free survival was predicted by MAPT and SNCG. In summary, we successfully validated several gene sets in a meta-analysis of large datasets of ovarian samples. Additionally, several individual genes identified were validated in a clinical cohort. Copyright © 2011 UICC.

  15. Inference and quantification of peptidoforms in large sample cohorts by SWATH-MS

    PubMed Central

    Röst, Hannes L; Ludwig, Christina; Buil, Alfonso; Bensimon, Ariel; Soste, Martin; Spector, Tim D; Dermitzakis, Emmanouil T; Collins, Ben C; Malmström, Lars; Aebersold, Ruedi

    2017-01-01

    The consistent detection and quantification of protein post-translational modifications (PTMs) across sample cohorts is an essential prerequisite for the functional analysis of biological processes. Data-independent acquisition (DIA), a bottom-up mass spectrometry based proteomic strategy, exemplified by SWATH-MS, provides complete precursor and fragment ion information of a sample and thus, in principle, the information to identify peptidoforms, the modified variants of a peptide. However, due to the convoluted structure of DIA data sets, the confident and systematic identification and quantification of peptidoforms has remained challenging. Here we present IPF (Inference of PeptidoForms), a fully automated algorithm that uses spectral libraries to query, validate and quantify peptidoforms in DIA data sets. The method was developed on data acquired by SWATH-MS and benchmarked using a synthetic phosphopeptide reference data set and phosphopeptide-enriched samples. The data indicate that IPF reduced false site-localization by more than 7-fold in comparison to previous approaches, while recovering 85.4% of the true signals. IPF was applied to detect and quantify peptidoforms carrying ten different types of PTMs in DIA data acquired from more than 200 samples of undepleted blood plasma of a human twin cohort. The data apportioned, for the first time, the contribution of heritable, environmental and longitudinal effects on the observed quantitative variability of specific modifications in blood plasma of a human population. PMID:28604659

  16. Tail-scope: Using friends to estimate heavy tails of degree distributions in large-scale complex networks

    NASA Astrophysics Data System (ADS)

    Eom, Young-Ho; Jo, Hang-Hyun

    2015-05-01

    Many complex networks in natural and social phenomena have often been characterized by heavy-tailed degree distributions. However, due to the rapidly growing size of network data and privacy concerns about using these data, it becomes more difficult to analyze complete data sets. Thus, it is crucial to devise effective and efficient estimation methods for heavy tails of degree distributions in large-scale networks using only local information from a small fraction of sampled nodes. Here we propose a tail-scope method based on the local observational bias of the friendship paradox. We show that the tail-scope method outperforms uniform node sampling for estimating heavy tails of degree distributions, while the opposite tendency is observed in the range of small degrees. In order to take advantage of both sampling methods, we devise a hybrid method that successfully recovers the whole range of degree distributions. Our tail-scope method shows how structural heterogeneities of large-scale complex networks can be used to effectively reveal the network structure with only limited local information.
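
    The core mechanism described above can be illustrated with a short sketch: sampling a random neighbor of a uniformly sampled node over-represents high-degree nodes (the friendship paradox), so neighbor sampling probes the heavy tail with only local information. The graph model, sample sizes and comparison below are hypothetical placeholders, not the authors' exact tail-scope estimator.

        # Friendship-paradox illustration: a sampled node's random neighbor tends to
        # have higher degree, so neighbor sampling reaches deeper into the degree tail
        # using only local information. (Toy network and sizes, not the published estimator.)
        import random
        import networkx as nx

        G = nx.barabasi_albert_graph(100_000, 3)   # heavy-tailed synthetic network
        nodes = list(G.nodes())
        sample = random.sample(nodes, 1_000)       # small fraction of sampled nodes

        uniform_degrees  = [G.degree(v) for v in sample]
        neighbor_degrees = [G.degree(random.choice(list(G.neighbors(v)))) for v in sample]

        print("max degree seen, uniform sampling :", max(uniform_degrees))
        print("max degree seen, neighbor sampling:", max(neighbor_degrees))
        # Neighbor sampling typically reaches much further into the tail, while uniform
        # sampling represents the low-degree range better, motivating the hybrid method.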

  17. Optimal number of features as a function of sample size for various classification rules.

    PubMed

    Hua, Jianping; Xiong, Zixiang; Lowey, James; Suh, Edward; Dougherty, Edward R

    2005-04-15

    Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features. Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study are considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there are a large number of error surfaces for the many cases. These are provided in full on a companion website, which is meant to serve as a resource for those working with small-sample classification. The companion website is available at http://public.tgen.org/tamu/ofs/. Contact: e-dougherty@ee.tamu.edu.
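
    The peaking behavior described above (error falling and then rising as features are added for a fixed sample size) can be sketched with a small simulation. The Gaussian model, classifier choice and parameters below are hypothetical stand-ins for the study's much larger, massively parallel sweep.

        # Sketch of the peaking phenomenon: for a fixed sample size, the test error of a
        # designed classifier first falls and then rises as features are added.
        # Two-class Gaussian model with hypothetical parameters; LDA only, for brevity.
        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        rng = np.random.default_rng(0)
        n_train, n_test = 30, 2000

        def simulate(d, reps=50):
            errs = []
            for _ in range(reps):
                mean = np.zeros(d); mean[:5] = 0.8          # only a few informative features
                X0 = rng.normal(0.0,  1.0, (n_train // 2 + n_test // 2, d))
                X1 = rng.normal(mean, 1.0, (n_train // 2 + n_test // 2, d))
                Xtr = np.vstack([X0[:n_train // 2], X1[:n_train // 2]])
                ytr = np.r_[np.zeros(n_train // 2), np.ones(n_train // 2)]
                Xte = np.vstack([X0[n_train // 2:], X1[n_train // 2:]])
                yte = np.r_[np.zeros(n_test // 2), np.ones(n_test // 2)]
                clf = LinearDiscriminantAnalysis().fit(Xtr, ytr)
                errs.append(1 - clf.score(Xte, yte))
            return np.mean(errs)

        for d in (2, 5, 10, 20, 40):
            print(f"{d:2d} features -> estimated error {simulate(d):.3f}")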

  18. Air pollution source identification

    NASA Technical Reports Server (NTRS)

    Fordyce, J. S.

    1975-01-01

    The techniques available for source identification are reviewed: remote sensing, injected tracers, and pollutants themselves as tracers. The use of the large number of trace elements in the ambient airborne particulate matter as a practical means of identifying sources is discussed. Trace constituents are determined by sensitive, inexpensive, nondestructive, multielement analytical methods such as instrumental neutron activation and charged particle X-ray fluorescence. The application to a large data set of pairwise correlation, the more advanced pattern recognition-cluster analysis approach with and without training sets, enrichment factors, and pollutant concentration rose displays for each element is described. It is shown that elemental constituents are related to specific source types: earth crustal, automotive, metallurgical, and more specific industries. A field-ready source identification system based on time and wind direction resolved sampling is described.

  19. Rapid detection of frozen-then-thawed minced beef using multispectral imaging and Fourier transform infrared spectroscopy.

    PubMed

    Ropodi, Athina I; Panagou, Efstathios Z; Nychas, George-John E

    2018-01-01

    In recent years, fraud detection has become a major priority for food authorities, as fraudulent practices can have various economic and safety consequences. This work explores ways of identifying frozen-then-thawed minced beef labeled as fresh in a rapid, large-scale and cost-effective way. For this reason, freshly-ground beef was purchased from seven separate shops at different times, divided into fifteen portions and placed in Petri dishes. Multispectral images and FTIR spectra of the first five were immediately acquired, while the remaining portions were frozen (-20°C) and stored for 7 and 32 days (5 samples for each time interval). Samples were thawed and subsequently subjected to similar data acquisition. In total, 105 multispectral images and FTIR spectra were collected, which were further analyzed using partial least-squares discriminant analysis and support vector machines. Two meat batches (30 samples) were reserved for independent validation and the remaining five batches were divided into training and test sets (75 samples). Results showed 100% overall correct classification for test and external validation MSI data, while FTIR data yielded 93.3 and 96.7% overall correct classification for the FTIR test set and external validation set, respectively. Copyright © 2017 Elsevier Ltd. All rights reserved.
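
    A minimal sketch of the analysis pipeline described above, assuming the spectra, labels and batch identifiers are available as arrays: a PLS-DA model and an SVM are trained on a split of five batches and evaluated on batches held out for external validation. File names, component counts and kernel settings are placeholders.

        # PLS-DA and SVM on spectral features with an external validation batch.
        # Arrays, file names and batch numbering are hypothetical placeholders.
        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.svm import SVC
        from sklearn.model_selection import train_test_split

        X = np.load("spectra.npy")      # (n_samples, n_wavelengths)
        y = np.load("labels.npy")       # 0 = fresh, 1 = frozen-then-thawed
        batch = np.load("batch.npy")    # meat batch id per sample

        ext = np.isin(batch, [6, 7])                 # two batches reserved for validation
        X_ext, y_ext = X[ext], y[ext]
        X_tr, X_te, y_tr, y_te = train_test_split(X[~ext], y[~ext], test_size=0.3,
                                                  stratify=y[~ext], random_state=0)

        # PLS-DA: regress the binary label on the spectra and threshold the score at 0.5
        pls = PLSRegression(n_components=5).fit(X_tr, y_tr)
        plsda_acc = np.mean((pls.predict(X_ext).ravel() > 0.5) == y_ext)

        svm_acc = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr).score(X_ext, y_ext)
        print(f"external validation accuracy  PLS-DA: {plsda_acc:.2f}  SVM: {svm_acc:.2f}")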

  20. Validation sampling can reduce bias in healthcare database studies: an illustration using influenza vaccination effectiveness

    PubMed Central

    Nelson, Jennifer C.; Marsh, Tracey; Lumley, Thomas; Larson, Eric B.; Jackson, Lisa A.; Jackson, Michael

    2014-01-01

    Objective Estimates of treatment effectiveness in epidemiologic studies using large observational health care databases may be biased due to inaccurate or incomplete information on important confounders. Study methods that collect and incorporate more comprehensive confounder data on a validation cohort may reduce confounding bias. Study Design and Setting We applied two such methods, imputation and reweighting, to Group Health administrative data (full sample) supplemented by more detailed confounder data from the Adult Changes in Thought study (validation sample). We used influenza vaccination effectiveness (with an unexposed comparator group) as an example and evaluated each method’s ability to reduce bias using the control time period prior to influenza circulation. Results Both methods reduced, but did not completely eliminate, the bias compared with traditional effectiveness estimates that do not utilize the validation sample confounders. Conclusion Although these results support the use of validation sampling methods to improve the accuracy of comparative effectiveness findings from healthcare database studies, they also illustrate that the success of such methods depends on many factors, including the ability to measure important confounders in a representative and large enough validation sample, the comparability of the full sample and validation sample, and the accuracy with which data can be imputed or reweighted using the additional validation sample information. PMID:23849144

  1. Estimating population trends with a linear model

    USGS Publications Warehouse

    Bart, Jonathan; Collins, Brian D.; Morrison, R.I.G.

    2003-01-01

    We describe a simple and robust method for estimating trends in population size. The method may be used with Breeding Bird Survey data, aerial surveys, point counts, or any other program of repeated surveys at permanent locations. Surveys need not be made at each location during each survey period. The method differs from most existing methods in being design based, rather than model based. The only assumptions are that the nominal sampling plan is followed and that sample size is large enough for use of the t-distribution. Simulations based on two bird data sets from natural populations showed that the point estimate produced by the linear model was essentially unbiased even when counts varied substantially and 25% of the complete data set was missing. The estimating-equation approach, often used to analyze Breeding Bird Survey data, performed similarly on one data set but had substantial bias on the second data set, in which counts were highly variable. The advantages of the linear model are its simplicity, flexibility, and that it is self-weighting. A user-friendly computer program to carry out the calculations is available from the senior author.
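
    As an illustration of the kind of estimate involved, the sketch below fits a straight line to counts from repeated surveys and uses the t-distribution for a confidence interval, which is the only distributional assumption mentioned above. It is a generic sketch with made-up counts, not the authors' design-based, self-weighting estimator.

        # Generic linear trend estimate from repeated surveys: regress counts on year
        # and form a t-based confidence interval for the slope. Counts are made up.
        import numpy as np
        from scipy import stats

        years  = np.array([2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003], float)
        counts = np.array([  52,   47,   49,   45,   44,   46,   41,   40], float)

        res = stats.linregress(years, counts)        # slope = estimated change per year
        df = len(years) - 2
        ci = stats.t.ppf(0.975, df) * res.stderr
        print(f"trend: {res.slope:.2f} birds/year "
              f"(95% CI {res.slope - ci:.2f} to {res.slope + ci:.2f})")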

  2. Cancer classification through filtering progressive transductive support vector machine based on gene expression data

    NASA Astrophysics Data System (ADS)

    Lu, Xinguo; Chen, Dan

    2017-08-01

    Traditional supervised classifiers work only with labeled data and neglect the large amount of data that lack sufficient follow-up information. Consequently, the small sample size limits the design of an appropriate classifier. In this paper, a transductive learning method is presented that combines a filtering strategy within the transductive framework with a progressive labeling strategy. The progressive labeling strategy does not need to consider the distribution of labeled samples in order to evaluate the distribution of unlabeled samples, and can effectively solve the problem of estimating the proportion of positive and negative samples in the work set. Our experimental results demonstrate that the proposed technique has great potential for cancer prediction based on gene expression.
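
    The progressive labeling idea can be sketched as a self-training loop in which an SVM is refit while unlabeled samples that pass a confidence filter are added to the work set. The filtering threshold, kernel and stopping rule below are hypothetical simplifications, not the paper's exact filtering progressive transductive SVM.

        # Self-training sketch in the spirit of progressive transductive labeling:
        # unlabeled samples whose decision values pass a filter are progressively
        # added to the work set with their predicted labels (labels assumed 0/1).
        import numpy as np
        from sklearn.svm import SVC

        def progressive_labeling(X_lab, y_lab, X_unlab, threshold=1.0, max_iter=10):
            X_work, y_work = X_lab.copy(), y_lab.copy()
            remaining = X_unlab.copy()
            for _ in range(max_iter):
                if len(remaining) == 0:
                    break
                clf = SVC(kernel="linear").fit(X_work, y_work)
                scores = clf.decision_function(remaining)
                confident = np.abs(scores) >= threshold          # filtering step
                if not confident.any():
                    break
                X_work = np.vstack([X_work, remaining[confident]])
                y_work = np.concatenate([y_work, (scores[confident] > 0).astype(int)])
                remaining = remaining[~confident]
            return SVC(kernel="linear").fit(X_work, y_work)      # final classifier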

  3. How important are autonomy and work setting to nurse practitioners' job satisfaction?

    PubMed

    Athey, Erin K; Leslie, Mayri Sagady; Briggs, Linda A; Park, Jeongyoung; Falk, Nancy L; Pericak, Arlene; El-Banna, Majeda M; Greene, Jessica

    2016-06-01

    Nurse practitioners (NPs) have reported aspects of their jobs that they are more and less satisfied with. However, few studies have examined the factors that predict overall job satisfaction. This study uses a large national sample to examine the extent to which autonomy and work setting predict job satisfaction. The 2012 National Sample Survey of Nurse Practitioners (n = 8311) was used to examine bivariate and multivariate relationships between work setting and three autonomy variables (independent billing practices, having one's NP skills fully utilized, and relationship with physician), and job satisfaction. NPs working in primary care reported the highest levels of autonomy across all three autonomy measures, while those working in hospital surgical settings reported the lowest levels. Autonomy, specifically feeling one's NP skills were fully utilized, was the factor most predictive of satisfaction. In multivariate analyses, those who strongly agreed their skills were being fully utilized had satisfaction scores almost one point higher than those who strongly disagreed. Work setting was only marginally related to job satisfaction. In order to attract and retain NPs in the future, healthcare organizations should ensure that NPs' skills are being fully utilized. ©2015 American Association of Nurse Practitioners.

  4. OpenMSI Arrayed Analysis Toolkit: Analyzing Spatially Defined Samples Using Mass Spectrometry Imaging

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    de Raad, Markus; de Rond, Tristan; Rübel, Oliver

    Mass spectrometry imaging (MSI) has primarily been applied in localizing biomolecules within biological matrices. Although well-suited, the application of MSI for comparing thousands of spatially defined spotted samples has been limited. One reason for this is a lack of suitable and accessible data processing tools for the analysis of large arrayed MSI sample sets. In this paper, we present the OpenMSI Arrayed Analysis Toolkit (OMAAT), a software package that addresses the challenges of analyzing spatially defined samples in MSI data sets. OMAAT is written in Python and is integrated with OpenMSI (http://openmsi.nersc.gov), a platform for storing, sharing, and analyzing MSI data. By using a web-based Python notebook (Jupyter), OMAAT is accessible to anyone without programming experience, yet allows experienced users to leverage all features. OMAAT was evaluated by analyzing an MSI data set of a high-throughput glycoside hydrolase activity screen comprising 384 samples arrayed onto a NIMS surface at a 450 μm spacing, decreasing analysis time >100-fold while maintaining robust spot-finding. The utility of OMAAT was demonstrated for screening the metabolic activities of different-sized soil particles, including hydrolysis of sugars, revealing a pattern of size-dependent activities. These results establish OMAAT as an effective toolkit for analyzing spatially defined samples in MSI. OMAAT runs on all major operating systems, and the source code can be obtained from the following GitHub repository: https://github.com/biorack/omaat.

  5. Toward a Principled Sampling Theory for Quasi-Orders

    PubMed Central

    Ünlü, Ali; Schrepp, Martin

    2016-01-01

    Quasi-orders, that is, reflexive and transitive binary relations, have numerous applications. In educational theories, the dependencies of mastery among the problems of a test can be modeled by quasi-orders. Methods such as item tree or Boolean analysis that mine for quasi-orders in empirical data are sensitive to the underlying quasi-order structure. These data mining techniques have to be compared based on extensive simulation studies, with unbiased samples of randomly generated quasi-orders at their basis. In this paper, we develop techniques that can provide the required quasi-order samples. We introduce a discrete doubly inductive procedure for incrementally constructing the set of all quasi-orders on a finite item set. A randomization of this deterministic procedure allows us to generate representative samples of random quasi-orders. With an outer level inductive algorithm, we consider the uniform random extensions of the trace quasi-orders to higher dimension. This is combined with an inner level inductive algorithm to correct the extensions that violate the transitivity property. The inner level correction step entails sampling biases. We propose three algorithms for bias correction and investigate them in simulation. It is evident that, on even up to 50 items, the new algorithms create close to representative quasi-order samples within acceptable computing time. Hence, the principled approach is a significant improvement to existing methods that are used to draw quasi-orders uniformly at random but cannot cope with reasonably large item sets. PMID:27965601
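
    For readers unfamiliar with the object being sampled, the helper below checks the two defining properties of a quasi-order (reflexivity and transitivity) for a relation stored as a boolean matrix. It only verifies the property that the sampling procedures must preserve; it is not the authors' doubly inductive construction or bias-correction algorithm.

        # Check whether a binary relation on a finite item set, given as a boolean
        # matrix R (R[i, j] means item i is related to item j), is a quasi-order.
        import numpy as np

        def is_quasi_order(R: np.ndarray) -> bool:
            R = R.astype(bool)
            reflexive = bool(np.all(np.diag(R)))
            # Transitivity: whenever R[i, j] and R[j, k] hold, R[i, k] must hold.
            # The boolean matrix product marks every pair reachable in two steps.
            two_step = (R.astype(int) @ R.astype(int)) > 0
            transitive = bool(np.all(~two_step | R))
            return reflexive and transitive

        # Example: a chain a <= b <= c on 3 items (upper-triangular matrix including
        # the diagonal) is reflexive and transitive, hence a quasi-order.
        chain = np.triu(np.ones((3, 3), dtype=bool))
        print(is_quasi_order(chain))   # True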

  6. Toward a Principled Sampling Theory for Quasi-Orders.

    PubMed

    Ünlü, Ali; Schrepp, Martin

    2016-01-01

    Quasi-orders, that is, reflexive and transitive binary relations, have numerous applications. In educational theories, the dependencies of mastery among the problems of a test can be modeled by quasi-orders. Methods such as item tree or Boolean analysis that mine for quasi-orders in empirical data are sensitive to the underlying quasi-order structure. These data mining techniques have to be compared based on extensive simulation studies, with unbiased samples of randomly generated quasi-orders at their basis. In this paper, we develop techniques that can provide the required quasi-order samples. We introduce a discrete doubly inductive procedure for incrementally constructing the set of all quasi-orders on a finite item set. A randomization of this deterministic procedure allows us to generate representative samples of random quasi-orders. With an outer level inductive algorithm, we consider the uniform random extensions of the trace quasi-orders to higher dimension. This is combined with an inner level inductive algorithm to correct the extensions that violate the transitivity property. The inner level correction step entails sampling biases. We propose three algorithms for bias correction and investigate them in simulation. It is evident that, on even up to 50 items, the new algorithms create close to representative quasi-order samples within acceptable computing time. Hence, the principled approach is a significant improvement to existing methods that are used to draw quasi-orders uniformly at random but cannot cope with reasonably large item sets.

  7. Geologic setting and petrology of Apollo 15 anorthosite /15415/.

    NASA Technical Reports Server (NTRS)

    Wilshire, H. G.; Schaber, G. G.; Jackson, E. D.; Silver, L. T.; Phinney, W. C.

    1972-01-01

    The geological setting, petrography and history of this Apollo 15 lunar rock sample are discussed, characterizing the sample as coarse-grained anorthosite composed largely of calcic plagioclase with small amounts of three pyroxene phases. The presence of shattered and granulated minerals in the texture of the rock is traced to two or more fragmentation events, and the presence of irregular bands of coarsely recrystallized plagioclase and minor pyroxene crossing larger plagioclase grains is traced to an earlier thermal metamorphic event. It is pointed out that any of these events may have affected apparent radiometric ages of elements in this rock. A comparative summarization of data suggests that this rock is the least-deformed member of a suite of similar rocks ejected from beneath the regolith at Spur crater.

  8. The Inverse Bagging Algorithm: Anomaly Detection by Inverse Bootstrap Aggregating

    NASA Astrophysics Data System (ADS)

    Vischia, Pietro; Dorigo, Tommaso

    2017-03-01

    For data sets populated by a very well modeled process and by another process of unknown probability density function (PDF), a desirable feature when manipulating the fraction of the unknown process (either enhancing or suppressing it) is to avoid modifying the kinematic distributions of the well-modeled one. A bootstrap technique is used to identify sub-samples rich in the well-modeled process, and each event is classified according to the frequency with which it belongs to such sub-samples. Comparisons with general MVA algorithms will be shown, as well as a study of the asymptotic properties of the method, making use of a public domain data set that models a typical search for new physics as performed at hadronic colliders such as the Large Hadron Collider (LHC).
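
    A simplified sketch of the idea, assuming the well-modeled process has a known cumulative distribution: bootstrap sub-samples are ranked by a goodness-of-fit statistic against that distribution, and each event is scored by how often it appears in the most background-like sub-samples. The KS-based ranking, sub-sample fraction and cut are hypothetical simplifications of the published algorithm.

        # Simplified inverse-bagging-style scoring on a one-dimensional observable.
        import numpy as np
        from scipy import stats

        def inverse_bagging_scores(x, background_cdf, n_boot=500, frac=0.5, keep=0.2):
            n = len(x)
            membership = np.zeros(n)
            compat, idx_sets = [], []
            for _ in range(n_boot):
                idx = np.random.randint(0, n, size=int(frac * n))   # bootstrap sub-sample
                stat, _ = stats.kstest(x[idx], background_cdf)      # compatibility with background
                compat.append(stat)
                idx_sets.append(idx)
            cutoff = np.quantile(compat, keep)                      # most background-like sub-samples
            for stat, idx in zip(compat, idx_sets):
                if stat <= cutoff:
                    membership[np.unique(idx)] += 1
            return membership / membership.max()   # low score -> more signal-like event

        # Toy usage: a Gaussian background with a small shifted "signal" admixture.
        x = np.concatenate([np.random.normal(0, 1, 900), np.random.normal(3, 0.3, 100)])
        scores = inverse_bagging_scores(x, stats.norm(0, 1).cdf)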

  9. High-speed adaptive contact-mode atomic force microscopy imaging with near-minimum-force

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ren, Juan; Zou, Qingze, E-mail: qzzou@rci.rutgers.edu

    In this paper, an adaptive contact-mode imaging approach is proposed to replace the traditional contact-mode imaging by addressing the major concerns in both the speed and the force exerted on the sample. The speed of the traditional contact-mode imaging is largely limited by the need to maintain precision tracking of the sample topography over the entire imaged sample surface, while large image distortion and excessive probe-sample interaction force occur during high-speed imaging. In this work, first, the image distortion caused by the topography tracking error is accounted for in the topography quantification. Second, the quantified sample topography is utilized in a gradient-based optimization method to adjust the cantilever deflection set-point for each scanline closely around the minimal level needed for maintaining stable probe-sample contact, and a data-driven iterative feedforward control that utilizes a prediction of the next-line topography is integrated into the topography feedback loop to enhance the sample topography tracking. The proposed approach is demonstrated and evaluated through imaging a calibration sample of square pitches at both high speeds (e.g., scan rates of 75 Hz and 130 Hz) and large sizes (e.g., scan sizes of 30 μm and 80 μm). The experimental results show that compared to the traditional constant-force contact-mode imaging, the imaging speed can be increased more than 30-fold (with the scanning speed at 13 mm/s), and the probe-sample interaction force can be reduced by more than 15% while maintaining the same image quality.

  10. High-speed adaptive contact-mode atomic force microscopy imaging with near-minimum-force.

    PubMed

    Ren, Juan; Zou, Qingze

    2014-07-01

    In this paper, an adaptive contact-mode imaging approach is proposed to replace the traditional contact-mode imaging by addressing the major concerns in both the speed and the force exerted on the sample. The speed of the traditional contact-mode imaging is largely limited by the need to maintain precision tracking of the sample topography over the entire imaged sample surface, while large image distortion and excessive probe-sample interaction force occur during high-speed imaging. In this work, first, the image distortion caused by the topography tracking error is accounted for in the topography quantification. Second, the quantified sample topography is utilized in a gradient-based optimization method to adjust the cantilever deflection set-point for each scanline closely around the minimal level needed for maintaining stable probe-sample contact, and a data-driven iterative feedforward control that utilizes a prediction of the next-line topography is integrated into the topography feedback loop to enhance the sample topography tracking. The proposed approach is demonstrated and evaluated through imaging a calibration sample of square pitches at both high speeds (e.g., scan rates of 75 Hz and 130 Hz) and large sizes (e.g., scan sizes of 30 μm and 80 μm). The experimental results show that compared to the traditional constant-force contact-mode imaging, the imaging speed can be increased more than 30-fold (with the scanning speed at 13 mm/s), and the probe-sample interaction force can be reduced by more than 15% while maintaining the same image quality.

  11. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pritychenko, B.

    The precision of double-beta (ββ) decay experimental half-lives and their uncertainties is reanalyzed. The method of Benford's distributions has been applied to nuclear reaction, structure and decay data sets. The first-digit distribution trend for two-neutrino ββ-decay half-lives T1/2(2ν) is consistent with large nuclear reaction and structure data sets and provides validation of the experimental half-lives. A complementary analysis of the decay uncertainties indicates deficiencies due to the small size of statistical samples and incomplete collection of experimental information. Further experimental and theoretical efforts would lead toward more precise values of ββ-decay half-lives and nuclear matrix elements.
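
    The first-digit test referred to above is straightforward to reproduce: leading-digit frequencies of a set of measured half-lives are compared with the Benford expectation log10(1 + 1/d). The half-life values in the sketch are placeholders, not the evaluated data.

        # First-digit (Benford) check for a set of half-life values.
        import numpy as np
        from collections import Counter

        values = np.array([2.165e21, 1.926e19, 9.2e18, 1.1e20, 7.1e20, 2.3e19])  # placeholder half-lives (years)
        digits = [int(f"{v:e}"[0]) for v in values]    # leading digit from scientific notation

        obs = Counter(digits)
        for d in range(1, 10):
            benford = np.log10(1 + 1 / d)              # expected Benford frequency
            print(f"digit {d}: observed {obs.get(d, 0) / len(values):.2f}  Benford {benford:.2f}")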

  12. A combinatory approach for analysis of protein sets in barley sieve-tube samples using EDTA-facilitated exudation and aphid stylectomy.

    PubMed

    Gaupels, Frank; Knauer, Torsten; van Bel, Aart J E

    2008-01-01

    This study investigated advantages and drawbacks of two sieve-tube sap sampling methods for comparison of phloem proteins in powdery mildew-infested vs. non-infested Hordeum vulgare plants. In one approach, sieve tube sap was collected by stylectomy. Aphid stylets were cut and immediately covered with silicone oil to prevent any contamination or modification of exudates. In this way, a maximum of 1 μL of pure phloem sap could be obtained per hour. Interestingly, after pathogen infection, exudation from microcauterized stylets was reduced to less than 40% of control plants, suggesting that powdery mildew induced sieve-tube occlusion mechanisms. In contrast to the laborious stylectomy, facilitated exudation using EDTA to prevent calcium-mediated callose formation is quick and easy, with a large volume yield. After two-dimensional (2D) electrophoresis, a digital overlay of the protein sets extracted from EDTA solutions and stylet exudates showed that some major spots were the same with both sampling techniques. However, EDTA exudates also contained large amounts of contaminative proteins of unknown origin. A combinatory approach may be most favourable for studies in which the protein composition of phloem sap is compared between control and pathogen-infected plants. Facilitated exudation may be applied for subtractive identification of differentially expressed proteins by 2D/mass spectrometry, which requires large amounts of protein. A reference gel loaded with pure phloem sap from stylectomy may be useful for confirmation of phloem origin of candidate spots by digital overlay. The method provides a novel opportunity to study differential expression of phloem proteins in monocotyledonous plant species.

  13. A sampling approach for predicting the eating quality of apples using visible-near infrared spectroscopy.

    PubMed

    Martínez Vega, Mabel V; Sharifzadeh, Sara; Wulfsohn, Dvoralai; Skov, Thomas; Clemmensen, Line Harder; Toldam-Andersen, Torben B

    2013-12-01

    Visible-near infrared spectroscopy remains a method of increasing interest as a fast alternative for the evaluation of fruit quality. The success of the method is assumed to be achieved by using large sets of samples to produce robust calibration models. In this study we used representative samples of an early and a late season apple cultivar to evaluate model robustness (in terms of prediction ability and error) for predicting soluble solids content (SSC) and acidity in the wavelength range 400-1100 nm. A total of 196 middle-early season 'Aroma' and 219 late season 'Holsteiner Cox' apples (Malus domestica Borkh.) were used to construct spectral models for SSC and acidity. Partial least squares (PLS), ridge regression (RR) and elastic net (EN) models were used to build prediction models. Furthermore, we compared three sub-sample arrangements for forming training and test sets ('smooth fractionator', by date of measurement after harvest, and random). Using the 'smooth fractionator' sampling method, fewer spectral bands (26) and elastic net resulted in improved performance for SSC models of 'Aroma' apples, with a coefficient of variation for SSC of 13%. The model showed consistently low errors and bias (PLS/EN: R²cal = 0.60/0.60; SEC = 0.88/0.88 °Brix; Biascal = 0.00/0.00; R²val = 0.33/0.44; SEP = 1.14/1.03; Biasval = 0.04/0.03). However, the prediction of acidity and of SSC (CV = 5%) for the late cultivar 'Holsteiner Cox' produced inferior results compared with 'Aroma'. It was possible to construct local SSC and acidity calibration models for early season apple cultivars with CVs of SSC and acidity around 10%. The overall model performance of these data sets also depends on the proper selection of training and test sets. The 'smooth fractionator' protocol provided an objective method for obtaining training and test sets that capture the existing variability of the fruit samples for construction of visible-NIR prediction models. The implication is that by using such 'efficient' sampling methods for obtaining an initial sample of fruit that represents the variability of the population and for sub-sampling to form training and test sets, it should be possible to use relatively small sample sizes to develop spectral predictions of fruit quality. Using feature selection and elastic net appears to improve the SSC model performance in terms of R², RMSECV and RMSEP for 'Aroma' apples. © 2013 Society of Chemical Industry.
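
    A hedged sketch of an elastic net calibration of SSC from Vis-NIR spectra is given below; it uses a plain random split rather than the 'smooth fractionator' and placeholder file names, shapes and hyperparameters, so it illustrates the modeling step only.

        # Elastic net calibration of soluble solids content from spectra; the number of
        # non-zero coefficients indicates how many wavelength bands are retained.
        import numpy as np
        from sklearn.linear_model import ElasticNetCV
        from sklearn.model_selection import train_test_split

        X = np.load("apple_spectra.npy")   # (n_fruit, n_wavelengths) reflectance, 400-1100 nm
        y = np.load("apple_ssc.npy")       # soluble solids content (degrees Brix)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X_tr, y_tr)

        pred = model.predict(X_te)
        sep = np.sqrt(np.mean((pred - y_te) ** 2))     # standard error of prediction
        print(f"SEP = {sep:.2f} Brix, bands retained = {np.sum(model.coef_ != 0)}")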

  14. Application of Deep Learning in GLOBELAND30-2010 Product Refinement

    NASA Astrophysics Data System (ADS)

    Liu, T.; Chen, X.

    2018-04-01

    GlobeLand30, as one of the best Global Land Cover (GLC) products at 30-m resolution, has been widely used in many research fields. Due to the significant spectral confusion among different land cover types and the limited textural information of Landsat data, the overall accuracy of GlobeLand30 is about 80%. Although this accuracy is much higher than that of most other global land cover products, it cannot satisfy various applications. There is still a great need for an effective method to improve the quality of GlobeLand30. The explosive growth of high-resolution satellite imagery and the remarkable performance of deep learning on image classification provide a new opportunity to refine GlobeLand30. However, the performance of deep learning depends on the quality and quantity of training samples as well as the model training strategy. Therefore, this paper 1) proposes an automatic training sample generation method via Google Earth to build a large training sample set; and 2) explores the best training strategy for land cover classification using GoogleNet (Inception V3), one of the most widely used deep learning networks. The results show that fine-tuning from the first layer of Inception V3 using the rough large sample set is the best strategy. The retrained network was then applied to one selected area of Xi'an city as a case study of GlobeLand30 refinement. The experimental results indicate that the proposed approach combining deep learning and Google Earth imagery is a promising solution for further improving the accuracy of GlobeLand30.
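
    The winning strategy reported above, fine-tuning Inception V3 from the first layer, can be sketched with tf.keras as follows. The class count, input size and optimizer settings are hypothetical, and the training data loading is omitted.

        # Fine-tuning Inception V3 from the first layer for land-cover classification.
        import tensorflow as tf

        n_classes = 10   # placeholder number of land-cover classes

        base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                                 input_shape=(299, 299, 3), pooling="avg")
        base.trainable = True   # fine-tune every layer, starting from the first one

        model = tf.keras.Sequential([
            base,
            tf.keras.layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                      loss="sparse_categorical_crossentropy", metrics=["accuracy"])
        # model.fit(train_patches, train_labels, validation_data=..., epochs=...)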

  15. A surrogate-based metaheuristic global search method for beam angle selection in radiation treatment planning.

    PubMed

    Zhang, H H; Gao, S; Chen, W; Shi, L; D'Souza, W D; Meyer, R R

    2013-03-21

    An important element of radiation treatment planning for cancer therapy is the selection of beam angles (out of all possible coplanar and non-coplanar angles in relation to the patient) in order to maximize the delivery of radiation to the tumor site and minimize radiation damage to nearby organs-at-risk. This category of combinatorial optimization problem is particularly difficult because direct evaluation of the quality of treatment corresponding to any proposed selection of beams requires the solution of a large-scale dose optimization problem involving many thousands of variables that represent doses delivered to volume elements (voxels) in the patient. However, if the quality of angle sets can be accurately estimated without expensive computation, a large number of angle sets can be considered, increasing the likelihood of identifying a very high quality set. Using a computationally efficient surrogate beam set evaluation procedure based on single-beam data extracted from plans employing equally spaced beams (eplans), we have developed a global search metaheuristic process based on the nested partitions framework for this combinatorial optimization problem. The surrogate scoring mechanism allows us to assess thousands of beam set samples within a clinically acceptable time frame. Tests on difficult clinical cases demonstrate that the beam sets obtained via our method are of superior quality.

  16. A surrogate-based metaheuristic global search method for beam angle selection in radiation treatment planning

    PubMed Central

    Zhang, H H; Gao, S; Chen, W; Shi, L; D’Souza, W D; Meyer, R R

    2013-01-01

    An important element of radiation treatment planning for cancer therapy is the selection of beam angles (out of all possible coplanar and non-coplanar angles in relation to the patient) in order to maximize the delivery of radiation to the tumor site and minimize radiation damage to nearby organs-at-risk. This category of combinatorial optimization problem is particularly difficult because direct evaluation of the quality of treatment corresponding to any proposed selection of beams requires the solution of a large-scale dose optimization problem involving many thousands of variables that represent doses delivered to volume elements (voxels) in the patient. However, if the quality of angle sets can be accurately estimated without expensive computation, a large number of angle sets can be considered, increasing the likelihood of identifying a very high quality set. Using a computationally efficient surrogate beam set evaluation procedure based on single-beam data extracted from plans employing equally-spaced beams (eplans), we have developed a global search metaheuristic process based on the Nested Partitions framework for this combinatorial optimization problem. The surrogate scoring mechanism allows us to assess thousands of beam set samples within a clinically acceptable time frame. Tests on difficult clinical cases demonstrate that the beam sets obtained via our method are of superior quality. PMID:23459411

  17. RBoost: Label Noise-Robust Boosting Algorithm Based on a Nonconvex Loss Function and the Numerically Stable Base Learners.

    PubMed

    Miao, Qiguang; Cao, Ying; Xia, Ge; Gong, Maoguo; Liu, Jiachen; Song, Jianfeng

    2016-11-01

    AdaBoost has attracted much attention in the machine learning community because of its excellent performance in combining weak classifiers into strong classifiers. However, AdaBoost tends to overfit noisy data in many applications. Accordingly, improving the antinoise ability of AdaBoost plays an important role in many applications. The sensitivity of AdaBoost to noisy data stems from the exponential loss function, which puts unrestricted penalties on the misclassified samples with very large margins. In this paper, we propose two boosting algorithms, referred to as RBoost1 and RBoost2, which are more robust to noisy data than AdaBoost. RBoost1 and RBoost2 optimize a nonconvex loss function of the classification margin. Because the penalties on the misclassified samples are restricted to an amount less than one, RBoost1 and RBoost2 do not overfocus on the samples that are always misclassified by the previous base learners. Besides the loss function, at each boosting iteration, RBoost1 and RBoost2 use numerically stable ways to compute the base learners. These two improvements contribute to the robustness of the proposed algorithms to noisy training and testing samples. Experimental results on the synthetic Gaussian data set, the UCI data sets, and a real malware behavior data set illustrate that the proposed RBoost1 and RBoost2 algorithms perform better when the training data sets contain noisy data.

  18. Improved high-dimensional prediction with Random Forests by the use of co-data.

    PubMed

    Te Beest, Dennis E; Mes, Steven W; Wilting, Saskia M; Brakenhoff, Ruud H; van de Wiel, Mark A

    2017-12-28

    Prediction in high dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting. Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables with co-data moderated sampling probabilities. Co-data are defined here as any type of information that is available on the variables of the primary data but does not use its response labels. These moderated sampling probabilities are, inspired by empirical Bayes, learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
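
    The co-data moderated sampling idea can be illustrated with a simplified sketch in which each tree draws its candidate variables with probabilities proportional to co-data weights rather than uniformly. Sampling per tree (rather than per node) and all parameters below are simplifications for illustration, not the published CoRF implementation.

        # Forest of trees whose candidate variables are drawn with co-data moderated
        # probabilities (e.g., weights derived from external p-values). Labels assumed 0/1.
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def codata_forest(X, y, codata_weights, n_trees=200, mtry=None, random_state=0):
            rng = np.random.default_rng(random_state)
            n, n_feat = X.shape
            mtry = mtry or max(1, int(np.sqrt(n_feat)))
            probs = codata_weights / codata_weights.sum()    # moderated sampling probabilities
            forest = []
            for _ in range(n_trees):
                boot = rng.integers(0, n, n)                                  # bootstrap rows
                feats = rng.choice(n_feat, size=mtry, replace=False, p=probs) # weighted feature draw
                tree = DecisionTreeClassifier().fit(X[boot][:, feats], y[boot])
                forest.append((feats, tree))
            return forest

        def forest_predict_proba(forest, X):
            # fraction of trees voting for class 1
            return np.mean([tree.predict(X[:, feats]) for feats, tree in forest], axis=0)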

  19. Benchmarking contactless acquisition sensor reproducibility for latent fingerprint trace evidence

    NASA Astrophysics Data System (ADS)

    Hildebrandt, Mario; Dittmann, Jana

    2015-03-01

    Optical, nano-meter range, contactless, non-destructive sensor devices are promising acquisition techniques in crime scene trace forensics, e.g. for digitizing latent fingerprint traces. Before new approaches are introduced in crime investigations, innovations need to be positively tested and their quality ensured. In this paper we investigate sensor reproducibility by studying different scans from four sensors: two chromatic white light sensors (CWL600/CWL1mm), one confocal laser scanning microscope, and one NIR/VIS/UV reflection spectrometer. Firstly, we perform intra-sensor reproducibility testing for the CWL600 with a privacy-conformant test set of artificial-sweat-printed, computer-generated fingerprints. We use 24 different fingerprint patterns as original samples (printing samples/templates) for printing with artificial sweat (physical trace samples) and their acquisition with the contactless sensors, resulting in 96 sensor images, called scan or acquired samples. The second test set for inter-sensor reproducibility assessment consists of the first three patterns from the first test set, acquired in two consecutive scans using each device. We suggest using a simple feature set in the spatial and frequency domains known from signal processing and test its suitability with six different classifiers that classify scan data into small differences (reproducible) and large differences (non-reproducible). Furthermore, we suggest comparing the classification results with biometric verification scores (calculated with NBIS, with a threshold of 40) as a biometric reproducibility score. The Bagging classifier is in nearly all cases the most reliable classifier in our experiments, and the results are also confirmed by the biometric matching rates.

  20. Protein and glycomic plasma markers for early detection of adenoma and colon cancer.

    PubMed

    Rho, Jung-Hyun; Ladd, Jon J; Li, Christopher I; Potter, John D; Zhang, Yuzheng; Shelley, David; Shibata, David; Coppola, Domenico; Yamada, Hiroyuki; Toyoda, Hidenori; Tada, Toshifumi; Kumada, Takashi; Brenner, Dean E; Hanash, Samir M; Lampe, Paul D

    2018-03-01

    To discover and confirm blood-based colon cancer early-detection markers. We created a high-density antibody microarray to detect differences in protein levels in plasma from individuals diagnosed with colon cancer <3 years after blood was drawn (ie, prediagnostic) and cancer-free, matched controls. Potential markers were tested on plasma samples from people diagnosed with adenoma or cancer, compared with controls. Components of an optimal 5-marker panel were tested via immunoblotting using a third sample set, Luminex assay in a large fourth sample set and immunohistochemistry (IHC) on tissue microarrays. In the prediagnostic samples, we found 78 significantly (t-test) increased proteins, 32 of which were confirmed in the diagnostic samples. From these 32, optimal 4-marker panels of BAG family molecular chaperone regulator 4 (BAG4), interleukin-6 receptor subunit beta (IL6ST), von Willebrand factor (VWF) and CD44 or epidermal growth factor receptor (EGFR) were established. Each panel member and the panels also showed increases in the diagnostic adenoma and cancer samples in independent third and fourth sample sets via immunoblot and Luminex, respectively. IHC results showed increased levels of BAG4, IL6ST and CD44 in adenoma and cancer tissues. Inclusion of EGFR and CD44 sialyl Lewis-A and Lewis-X content increased the panel performance. The protein/glycoprotein panel was statistically significantly higher in colon cancer samples, characterised by a range of area under the curves from 0.90 (95% CI 0.82 to 0.98) to 0.86 (95% CI 0.83 to 0.88), for the larger second and fourth sets, respectively. A panel including BAG4, IL6ST, VWF, EGFR and CD44 protein/glycomics performed well for detection of early stages of colon cancer and should be further examined in larger studies. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/.

  1. Evaluating Gene Set Enrichment Analysis Via a Hybrid Data Model

    PubMed Central

    Hua, Jianping; Bittner, Michael L.; Dougherty, Edward R.

    2014-01-01

    Gene set enrichment analysis (GSA) methods have been widely adopted by biological labs to analyze data and generate hypotheses for validation. Most of the existing comparison studies focus on whether the existing GSA methods can produce accurate P-values; however, practitioners are often more concerned with the correct gene-set ranking generated by the methods. The ranking performance is closely related to two critical goals associated with GSA methods: the ability to reveal biological themes and ensuring reproducibility, especially for small-sample studies. We have conducted a comprehensive simulation study focusing on the ranking performance of seven representative GSA methods. We overcome the limitation on the availability of real data sets by creating hybrid data models from existing large data sets. To build the data model, we pick a master gene from the data set to form the ground truth and artificially generate the phenotype labels. Multiple hybrid data models can be constructed from one data set and multiple data sets of smaller sizes can be generated by resampling the original data set. This approach enables us to generate a large batch of data sets to check the ranking performance of GSA methods. Our simulation study reveals that for the proposed data model, the Q2 type GSA methods have in general better performance than other GSA methods and the global test has the most robust results. The properties of a data set play a critical role in the performance. For the data sets with highly connected genes, all GSA methods suffer significantly in performance. PMID:24558298

  2. Translational Genomics Research Institute: Quantified Cancer Cell Line Encyclopedia (CCLE) RNA-seq Data | Office of Cancer Genomics

    Cancer.gov

    Many applications analyze quantified transcript-level abundances to make inferences. Having completed this computation across the large sample set, the CTD2 Center at the Translational Genomics Research Institute presents the quantified data in a straightforward, consolidated form for these types of analyses.

  3. Balancing Healthy Meals and Busy Lives: Associations between Work, School, and Family Responsibilities and Perceived Time Constraints among Young Adults

    ERIC Educational Resources Information Center

    Pelletier, Jennifer E.; Laska, Melissa N.

    2012-01-01

    Objective: To characterize associations between perceived time constraints for healthy eating and work, school, and family responsibilities among young adults. Design: Cross-sectional survey. Setting: A large, Midwestern metropolitan region. Participants: A diverse sample of community college (n = 598) and public university (n = 603) students.…

  4. Brief Report: The Go/No-Go Task Online: Inhibitory Control Deficits in Autism in a Large Sample

    ERIC Educational Resources Information Center

    Uzefovsky, F.; Allison, C.; Smith, P.; Baron-Cohen, S.

    2016-01-01

    Autism Spectrum Conditions (ASC, also referred to as Autism Spectrum Disorders) entail difficulties with inhibition: inhibiting action, inhibiting one's own point of view, and inhibiting distractions that may interfere with a response set. However, the association between inhibitory control (IC) and ASC, especially in adulthood, is unclear. The…

  5. The Importance of Institutional Image to Student Satisfaction and Loyalty within Higher Education

    ERIC Educational Resources Information Center

    Brown, Robert M.; Mazzarol, Timothy William

    2009-01-01

    This paper outlines the findings of a study employing a partial least squares (PLS) structural equation methodology to test a customer satisfaction model of the drivers of student satisfaction and loyalty in higher education settings. Drawing upon a moderately large sample of students enrolled in four "types" of Australian universities,…

  6. Binge Drinking during the First Semester of College: Continuation and Desistance from High School Patterns

    ERIC Educational Resources Information Center

    Reifman, Alan; Watson, Wendy K.

    2003-01-01

    Students' first semester on campus may set the stage for their alcohol use/misuse throughout college. The authors surveyed 274 randomly sampled first-semester freshmen at a large southwestern university on their past 2 weeks' binge drinking, their high school binge drinking, and psychosocial factors possibly associated with drinking. They…

  7. Testing the Predictors of Boredom at School: Development and Validation of the Precursors to Boredom Scales

    ERIC Educational Resources Information Center

    Daschmann, Elena C.; Goetz, Thomas; Stupnisky, Robert H.

    2011-01-01

    Background: Boredom has been found to be an important emotion for students' learning processes and achievement outcomes; however, the precursors of this emotion remain largely unexplored. Aim: In the current study, scales assessing the precursors to boredom in academic achievement settings were developed and tested. Sample: Participants were 1,380…

  8. School Correlates of Academic Behaviors and Performance among McKinney-Vento Identified Youth

    ERIC Educational Resources Information Center

    Stone, Susan; Uretsky, Mathew

    2016-01-01

    We utilized a pooled sample of elementary, middle, and high school-aged children identified as homeless via definitions set forth by McKinney-Vento legislation in a large urban district in California to estimate the extent to which school factors contributed to student attendance, suspensions, test-taking behaviors, and performance on state…

  9. Social epidemiology of a large outbreak of chickenpox in the Colombian sugar cane producer region: a set theory-based analysis.

    PubMed

    Idrovo, Alvaro J; Albavera-Hernández, Cidronio; Rodríguez-Hernández, Jorge Martín

    2011-07-01

    There are few social epidemiologic studies on chickenpox outbreaks, although previous findings suggest the important role of social determinants. This study describes the context of a large outbreak of chickenpox in the Cauca Valley region, Colombia (2003 to 2007), with an emphasis on macro-determinants. We explored the temporal trends in chickenpox incidence in 42 municipalities to identify the places with the highest occurrence. We analyzed municipal characteristics (education quality, vaccination coverage, performance of health care services, violence-related immigration, and area planted with sugar cane) through analyses based on set theory. Edwards-Venn diagrams were used to present the main findings. The results indicated that three municipalities had higher incidences and that poor education quality was the attribute most strongly associated with higher incidence. The potential use of set theory for exploratory outbreak analyses is discussed. It is a tool potentially useful for contrasting units when only small sample sizes are available.

  10. Probing the Physics of Active Galactic Nuclei

    NASA Technical Reports Server (NTRS)

    Peterson, Bradley M.

    2004-01-01

    As a result of a number of large multiwavelength monitoring campaigns that have taken place since the late 1980s, there are now several very large data sets on bright variable active galactic nuclei (AGNs) that are well-sampled in time and can be used to probe the physics of the AGN continuum source and the broad-line emitting region. Most of these data sets have been underutilized, as the emphasis thus far has been primarily on reverberation-mapping issues alone. Broader attempts at analysis have been made on some of the earlier IUE data sets (e.g., data from the 1989 campaign on NGC 5548), but much of this analysis needs to be revisited now that improved versions of the data are available from final archive processing. We propose to use the multiwavelength monitoring data that have been accumulated to undertake more thorough investigations of the AGN continuum and broad emission lines, including a more detailed study of line-profile variability, making use of constraints imposed by the reverberation results.

  11. The VLT-FLAMES Tarantula Survey

    NASA Astrophysics Data System (ADS)

    Vink, Jorick S.; Evans, C. J.; Bestenlehner, J.; McEvoy, C.; Ramírez-Agudelo, O.; Sana, H.; Schneider, F.; VFTS Collaboration

    2017-11-01

    We present a number of notable results from the VLT-FLAMES Tarantula Survey (VFTS), an ESO Large Program during which we obtained multi-epoch medium-resolution optical spectroscopy of a very large sample of over 800 massive stars in the 30 Doradus region of the Large Magellanic Cloud (LMC). This unprecedented data-set has enabled us to address some key questions regarding atmospheres and winds, as well as the evolution of (very) massive stars. Here we focus on O-type runaways, the width of the main sequence, and the mass-loss rates for (very) massive stars. We also provide indications for the presence of a top-heavy initial mass function (IMF) in 30 Dor.

  12. Discovering novel pharmacogenomic biomarkers by imputing drug response in cancer patients from large genomics studies.

    PubMed

    Geeleher, Paul; Zhang, Zhenyu; Wang, Fan; Gruener, Robert F; Nath, Aritro; Morrison, Gladys; Bhutra, Steven; Grossman, Robert L; Huang, R Stephanie

    2017-10-01

    Obtaining accurate drug response data in large cohorts of cancer patients is very challenging; thus, most cancer pharmacogenomics discovery is conducted in preclinical studies, typically using cell lines and mouse models. However, these platforms suffer from serious limitations, including small sample sizes. Here, we have developed a novel computational method that allows us to impute drug response in very large clinical cancer genomics data sets, such as The Cancer Genome Atlas (TCGA). The approach works by creating statistical models relating gene expression to drug response in large panels of cancer cell lines and applying these models to tumor gene expression data in the clinical data sets (e.g., TCGA). This yields an imputed drug response for every drug in each patient. These imputed drug response data are then associated with somatic genetic variants measured in the clinical cohort, such as copy number changes or mutations in protein coding genes. These analyses recapitulated drug associations for known clinically actionable somatic genetic alterations and identified new predictive biomarkers for existing drugs. © 2017 Geeleher et al.; Published by Cold Spring Harbor Laboratory Press.
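
    A minimal sketch of the imputation step, assuming cell-line expression and drug response plus tumor expression are available as arrays: a regularized expression-to-response model is fit on the cell lines, applied to the tumors, and the imputed responses are then tested against a somatic alteration. The ridge model, random placeholder data and t-test below stand in for the published pipeline.

        # Impute drug response in a clinical cohort from a cell-line-trained model,
        # then associate the imputed response with a somatic alteration.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.preprocessing import StandardScaler
        from scipy import stats

        rng = np.random.default_rng(0)
        cell_expr  = rng.normal(size=(500, 2000))   # cell-line expression (placeholder for CCLE/GDSC data)
        cell_ic50  = rng.normal(size=500)           # measured drug response in cell lines
        tumor_expr = rng.normal(size=(300, 2000))   # tumor expression (placeholder for TCGA data)
        mutation   = rng.integers(0, 2, size=300)   # somatic mutation status of one gene

        scaler = StandardScaler().fit(cell_expr)
        model = Ridge(alpha=10.0).fit(scaler.transform(cell_expr), cell_ic50)
        imputed = model.predict(scaler.transform(tumor_expr))   # imputed response per patient

        t, pval = stats.ttest_ind(imputed[mutation == 1], imputed[mutation == 0])
        print(f"mutant vs wild-type imputed response: t = {t:.2f}, p = {pval:.3g}")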

  13. Computationally efficient algorithm for Gaussian Process regression in case of structured samples

    NASA Astrophysics Data System (ADS)

    Belyaev, M.; Burnaev, E.; Kapushev, Y.

    2016-04-01

Surrogate modeling is widely used in many engineering problems. Data sets often have a Cartesian product structure (for instance, a factorial design of experiments with missing points). In such cases the data set can be very large, so one of the most popular approximation algorithms, Gaussian Process regression, can hardly be applied due to its computational complexity. In this paper a computationally efficient approach for constructing Gaussian Process regression for data sets with Cartesian product structure is presented. Efficiency is achieved by exploiting the special structure of the data set and using operations with tensors. The proposed algorithm has lower computational and memory complexity than existing algorithms. We also introduce a regularization procedure that accounts for anisotropy of the data set and avoids degeneracy of the regression model.
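    To illustrate why Cartesian product structure helps, the sketch below (a simplified setting with a full grid and a separable kernel, not the paper's exact algorithm) solves the Gaussian Process linear system through per-factor eigendecompositions instead of factorizing the full kernel matrix; the grid sizes and kernel length scale are arbitrary assumptions.

      # A simplified sketch: with a separable kernel on a full grid, the training covariance is
      # K = kron(K1, K2), and (K + s2*I)^{-1} y can be computed from the small per-factor
      # eigendecompositions instead of factorizing the full matrix. Not the paper's algorithm.
      import numpy as np

      def rbf(x, length=0.3):
          d = x[:, None] - x[None, :]
          return np.exp(-0.5 * (d / length) ** 2)

      rng = np.random.default_rng(1)
      x1, x2 = np.linspace(0, 1, 40), np.linspace(0, 1, 30)   # factor grids (40 x 30 = 1200 points)
      K1, K2, s2 = rbf(x1), rbf(x2), 1e-2
      Y = rng.normal(size=(x1.size, x2.size))                 # responses arranged on the grid

      # Eigendecompose the 40x40 and 30x30 factors instead of the 1200x1200 kernel.
      lam1, Q1 = np.linalg.eigh(K1)
      lam2, Q2 = np.linalg.eigh(K2)

      # alpha = (kron(K1, K2) + s2*I)^{-1} vec(Y), using kron(A, B) vec(Y) = vec(A Y B^T)
      # for row-major flattening.
      denom = np.outer(lam1, lam2) + s2
      alpha = Q1 @ ((Q1.T @ Y @ Q2) / denom) @ Q2.T

      # Check against the direct dense solve (affordable only because this demo is tiny).
      dense = np.linalg.solve(np.kron(K1, K2) + s2 * np.eye(Y.size), Y.reshape(-1))
      print(np.allclose(alpha.reshape(-1), dense))            # True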

  14. Development of a universal metabolome-standard method for long-term LC-MS metabolome profiling and its application for bladder cancer urine-metabolite-biomarker discovery.

    PubMed

    Peng, Jun; Chen, Yi-Ting; Chen, Chien-Lun; Li, Liang

    2014-07-01

A large-scale metabolomics study requires a quantitative method that can generate metabolome data over an extended period with high technical reproducibility. We report a universal metabolome-standard (UMS) method, in conjunction with chemical isotope labeling liquid chromatography-mass spectrometry (LC-MS), to provide long-term analytical reproducibility and facilitate metabolome comparison among different data sets. In this method, a UMS of a specific type of sample, labeled with one form of an isotope reagent, is prepared a priori. The UMS is spiked into the individual samples of a metabolomics study, which are labeled with another form of the isotope reagent. The resultant mixture is analyzed by LC-MS to provide relative quantification of the individual sample metabolome against the UMS. The UMS is independent of the study undertaken and of the time of analysis, and is useful for profiling the same type of sample in multiple studies. In this work, the UMS method was developed and applied to a urine metabolomics study of bladder cancer. A UMS of human urine was prepared by (13)C2-dansyl labeling of a pooled sample from 20 healthy individuals. This method was first used to profile the discovery samples to generate a list of putative biomarkers potentially useful for bladder cancer detection and then used to analyze the verification samples about one year later. Within the discovery sample set, three-month technical reproducibility was examined using a quality control sample, which showed a mean CV of 13.9% and a median CV of 9.4% for all the quantified metabolites. Statistical analysis of the urine metabolome data showed a clear separation between the bladder cancer group and the control group in the discovery samples, which was confirmed by the verification samples. A receiver operating characteristic (ROC) analysis showed that the area under the curve (AUC) was 0.956 in the discovery data set and 0.935 in the verification data set. These results demonstrate the utility of the UMS method for long-term metabolomics and for discovering potential metabolite biomarkers for the diagnosis of bladder cancer.

  15. Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data

    PubMed Central

    Serang, Oliver; MacCoss, Michael J.; Noble, William Stafford

    2010-01-01

The problem of identifying proteins from a shotgun proteomics experiment has not been definitively solved. Identifying the proteins in a sample requires ranking them, ideally with interpretable scores. In particular, “degenerate” peptides, which map to multiple proteins, have made such a ranking difficult to compute. The problem of computing posterior probabilities for the proteins, which can be interpreted as confidence in a protein’s presence, has been especially daunting. Previous approaches have either ignored the peptide degeneracy problem completely, addressed it by computing a heuristic set of proteins or heuristic posterior probabilities, or estimated the posterior probabilities with sampling methods. We present a probabilistic model for protein identification in tandem mass spectrometry that recognizes peptide degeneracy. We then introduce graph-transforming algorithms that facilitate efficient computation of protein probabilities, even for large data sets. We evaluate our identification procedure on five different well-characterized data sets and demonstrate our ability to efficiently compute high-quality protein posteriors. PMID:20712337

  16. Fast, Safe, Propellant-Efficient Spacecraft Motion Planning Under Clohessy-Wiltshire-Hill Dynamics

    NASA Technical Reports Server (NTRS)

    Starek, Joseph A.; Schmerling, Edward; Maher, Gabriel D.; Barbee, Brent W.; Pavone, Marco

    2016-01-01

This paper presents a sampling-based motion planning algorithm for real-time and propellant-optimized autonomous spacecraft trajectory generation in near-circular orbits. Specifically, this paper applies recent algorithmic advances in the field of robot motion planning to the problem of impulsively actuated, propellant-optimized rendezvous and proximity operations under the Clohessy-Wiltshire-Hill dynamics model. The approach calls upon a modified version of the FMT* algorithm to grow a set of feasible trajectories over a deterministic, low-dispersion set of sample points covering the free state space. To enforce safety, the tree is only grown over the subset of actively safe samples, from which there exists a feasible one-burn collision-avoidance maneuver that can safely circularize the spacecraft orbit along its coasting arc under a given set of potential thruster failures. Key features of the proposed algorithm include 1) theoretical guarantees in terms of trajectory safety and performance, 2) amenability to real-time implementation, and 3) generality, in the sense that a large class of constraints can be handled directly. As a result, the proposed algorithm offers the potential for widespread application, ranging from on-orbit satellite servicing to orbital debris removal and autonomous inspection missions.
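    For reference, the sketch below implements only the standard closed-form Clohessy-Wiltshire-Hill state propagation that such planners use between impulses; the FMT* tree construction, safety certification, and cost optimization described above are omitted, and the orbit period and initial state are arbitrary example values.

      # Standard Clohessy-Wiltshire-Hill state transition matrix (x radial, y along-track,
      # z cross-track, n the mean motion of the reference circular orbit). Planner logic omitted.
      import numpy as np

      def cwh_stm(n, t):
          """State transition matrix of the CWH equations over a coast of duration t."""
          s, c = np.sin(n * t), np.cos(n * t)
          return np.array([
              [4 - 3 * c,        0, 0,       s / n,            2 * (1 - c) / n,         0],
              [6 * (s - n * t),  1, 0,      -2 * (1 - c) / n,  (4 * s - 3 * n * t) / n, 0],
              [0,                0, c,       0,                0,                       s / n],
              [3 * n * s,        0, 0,       c,                2 * s,                   0],
              [-6 * n * (1 - c), 0, 0,      -2 * s,            4 * c - 3,               0],
              [0,                0, -n * s,  0,                0,                       c],
          ])

      # Example (arbitrary numbers): coast a 100 m radial offset for a quarter of a ~90-minute orbit.
      n = 2 * np.pi / 5400.0                     # mean motion [rad/s]
      x0 = np.array([100.0, 0, 0, 0, 0, 0])      # [m, m, m, m/s, m/s, m/s]
      print(cwh_stm(n, 1350.0) @ x0)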

  17. Fast structure similarity searches among protein models: efficient clustering of protein fragments

    PubMed Central

    2012-01-01

Background: For many predictive applications a large number of models is generated and later clustered in subsets based on structure similarity. In most clustering algorithms an all-vs-all root mean square deviation (RMSD) comparison is performed, and most of the time is typically spent on comparison of non-similar structures. For sets with more than, say, 10,000 models this procedure is very time-consuming, and alternative, faster algorithms that restrict comparisons to the most similar structures would be useful. Results: We exploit the inverse triangle inequality on the RMSD between two structures given their RMSDs with a third structure. The resulting lower bound on RMSD may be used, when restricting the search for similarity to a reasonably low RMSD threshold value, to speed up similarity searches significantly. Tests are performed on large sets of decoys, which are widely used as test cases for predictive methods, with a speed-up of up to 100 times with respect to all-vs-all comparison depending on the set and parameters used. Sample applications are shown. Conclusions: The algorithm presented here allows fast comparison of large data sets of structures with limited memory requirements. As an example of application we present clustering of more than 100,000 fragments of length 5 from the top500H dataset into a few hundred representative fragments. A more realistic scenario is provided by the search for similarity within the very large decoy sets used for the tests. Other applications include filtering nearly identical conformations in selected CASP9 data sets and clustering molecular dynamics snapshots. Availability: A Linux executable and a Perl script with examples are given in the supplementary material (Additional file 1). The source code is available upon request from the authors. PMID:22642815
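    The lower-bound idea can be sketched in a few lines. The toy example below (random perturbed coordinate sets, a plain coordinate RMSD without superposition, and an arbitrary threshold) only illustrates the pruning logic: distances to a common reference give |d(A,R) - d(B,R)| as a lower bound on d(A,B), so many expensive pairwise comparisons can be skipped.

      # Toy illustration of triangle-inequality pruning. rmsd() is a plain coordinate RMSD
      # stand-in; a real implementation would optimally superpose the structures first.
      import numpy as np

      def rmsd(a, b):
          return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

      rng = np.random.default_rng(2)
      base = rng.normal(size=(50, 3))
      models = [base + rng.normal(scale=rng.uniform(0.1, 3.0), size=(50, 3)) for _ in range(200)]

      reference = models[0]
      d_ref = np.array([rmsd(m, reference) for m in models])  # one cheap pass vs. the reference

      threshold, computed, skipped, neighbors = 2.0, 0, 0, []
      for i in range(len(models)):
          for j in range(i + 1, len(models)):
              if abs(d_ref[i] - d_ref[j]) > threshold:        # lower bound already too large
                  skipped += 1
                  continue
              computed += 1
              if rmsd(models[i], models[j]) <= threshold:
                  neighbors.append((i, j))

      print(f"skipped {skipped} pairs, computed {computed}, found {len(neighbors)} neighbors")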

  18. The use of spatio-temporal correlation to forecast critical transitions

    NASA Astrophysics Data System (ADS)

    Karssenberg, Derek; Bierkens, Marc F. P.

    2010-05-01

Complex dynamical systems may have critical thresholds at which the system shifts abruptly from one state to another. Such critical transitions have been observed in systems ranging from the human body to financial markets and the Earth system. Forecasting the timing of critical transitions before they are reached is of paramount importance because critical transitions are associated with a large shift in the dynamical regime of the system under consideration. However, it is hard to forecast critical transitions, because the state of the system shows relatively little change before the threshold is reached. Recently, it was shown that increased spatio-temporal autocorrelation and variance can serve as alternative early warning signals for critical transitions. However, thus far these second-order statistics have not been used for forecasting in a data assimilation framework. Here we show that the use of spatio-temporal autocorrelation and variance in the state of the system reduces the uncertainty in the predicted timing of critical transitions compared to classical approaches that use the value of the system state only. This is shown by assimilating observed spatio-temporal autocorrelation and variance into a dynamical system model using a Particle Filter. We adapt a well-studied distributed model of a logistically growing resource with a fixed grazing rate. The model describes the transition from an underexploited system with high resource biomass to overexploitation as grazing pressure crosses the critical threshold, which is a fold bifurcation. To represent limited prior information, we use a large variance in the prior probability distributions of the model parameters and the system driver (grazing rate). First, we show that the rate of increase in spatio-temporal autocorrelation and variance prior to reaching the critical threshold is relatively consistent across the uncertainty range of the driver and parameter values used. This indicates that increases in spatio-temporal autocorrelation and variance are consistent predictors of a critical transition, even under the condition of a poorly defined system. Second, we perform data assimilation experiments using an artificial exhaustive data set generated by one realization of the model. To mimic real-world sampling, an observational data set is created from this exhaustive data set by sampling on a regular spatio-temporal grid, supplemented by sampling locations at a short distance. Spatial and temporal autocorrelation in this observational data set is calculated for different spatial and temporal separation (lag) distances. To assign appropriate weights to the observations (here, autocorrelation values and variance) in the Particle Filter, the covariance matrix of the error in these observations is required. This covariance matrix is estimated using Monte Carlo sampling, selecting a different random position of the sampling network relative to the exhaustive data set for each realization. At each update moment in the Particle Filter, observed autocorrelation values are assimilated into the model and the state of the model is updated. Using this approach, it is shown that the use of autocorrelation reduces the uncertainty in the forecasted timing of a critical transition compared to runs without data assimilation. The relative performance of spatial versus temporal autocorrelation depends on the timing and number of observations. This study is restricted to a single model; however, it is becoming increasingly clear that spatio-temporal autocorrelation and variance can be used as early warning signals for a large number of systems. Thus, spatio-temporal autocorrelation and variance are expected to be valuable in data assimilation frameworks for a large number of dynamical systems.
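    As a rough illustration of the assimilation step, the sketch below implements a generic bootstrap particle filter in which particles carrying uncertain driver values are weighted by the likelihood of an observed summary statistic and then resampled; the forward model, noise levels, and driver values are invented stand-ins, not the authors' grazing model.

      # Generic bootstrap particle filter sketch (invented forward model, not the grazing model).
      import numpy as np

      rng = np.random.default_rng(3)
      n_particles, true_driver, obs_sd = 1000, 1.8, 0.05

      def predicted_statistic(driver, t):
          # Hypothetical forward model: the early-warning statistic rises as the driver grows.
          return 0.1 + 0.05 * driver * t

      particles = rng.uniform(0.5, 3.0, size=n_particles)     # wide prior on the uncertain driver

      for t in range(1, 11):
          observed = predicted_statistic(true_driver, t) + rng.normal(scale=obs_sd)
          # Weight each particle by the Gaussian likelihood of the observed statistic.
          w = np.exp(-0.5 * ((predicted_statistic(particles, t) - observed) / obs_sd) ** 2)
          w /= w.sum()
          # Systematic resampling concentrates the ensemble on plausible driver values.
          u = (rng.random() + np.arange(n_particles)) / n_particles
          idx = np.minimum(np.searchsorted(np.cumsum(w), u), n_particles - 1)
          particles = particles[idx] + rng.normal(scale=0.01, size=n_particles)   # small jitter

      print(f"posterior driver estimate: {particles.mean():.2f} +/- {particles.std():.2f}")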

  19. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining.

    PubMed

    Hero, Alfred O; Rajaratnam, Bala

    2016-01-01

When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the data set is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity, however, has received relatively less attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the latter regime applies to exascale data dimensions. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.

  20. Noise-enhanced convolutional neural networks.

    PubMed

    Audhkhasi, Kartik; Osoba, Osonde; Kosko, Bart

    2016-06-01

    Injecting carefully chosen noise can speed convergence in the backpropagation training of a convolutional neural network (CNN). The Noisy CNN algorithm speeds training on average because the backpropagation algorithm is a special case of the generalized expectation-maximization (EM) algorithm and because such carefully chosen noise always speeds up the EM algorithm on average. The CNN framework gives a practical way to learn and recognize images because backpropagation scales with training data. It has only linear time complexity in the number of training samples. The Noisy CNN algorithm finds a special separating hyperplane in the network's noise space. The hyperplane arises from the likelihood-based positivity condition that noise-boosts the EM algorithm. The hyperplane cuts through a uniform-noise hypercube or Gaussian ball in the noise space depending on the type of noise used. Noise chosen from above the hyperplane speeds training on average. Noise chosen from below slows it on average. The algorithm can inject noise anywhere in the multilayered network. Adding noise to the output neurons reduced the average per-iteration training-set cross entropy by 39% on a standard MNIST image test set of handwritten digits. It also reduced the average per-iteration training-set classification error by 47%. Adding noise to the hidden layers can also reduce these performance measures. The noise benefit is most pronounced for smaller data sets because the largest EM hill-climbing gains tend to occur in the first few iterations. This noise effect can assist random sampling from large data sets because it allows a smaller random sample to give the same or better performance than a noiseless sample gives. Copyright © 2015 Elsevier Ltd. All rights reserved.

  1. Experimental layout, data analysis, and thresholds in ELISA testing of maize for aphid-borne viruses.

    PubMed

    Caciagli, P; Verderio, A

    2003-06-30

Several aspects of enzyme-linked immunosorbent assay (ELISA) procedures and data analysis have been examined in an attempt to find a rapid and reliable method for discriminating between 'positive' and 'negative' results when testing a large number of samples. A layout of ELISA plates was designed to reduce uncontrolled variation and to optimize the number of negative and positive controls. A transformation using the fourth root (A(1/4)) of the optical density readings corrected for the blank (A) stabilized the variance of most ELISA data examined. Transformed A values were used to calculate the true limits, at a set protection level, for false positives (C) and false negatives (D). Methods are discussed to reduce the number of undifferentiated samples, i.e. the samples with responses falling between C and D. The whole procedure was set up for use with an electronic spreadsheet. With the addition of a few instructions of the type 'if … then … else' in the spreadsheet, the ELISA results were obtained in the simple trichotomous form 'negative/undefined/positive'. This allowed rapid analysis of more than 1100 maize samples tested for the presence of seven aphid-borne viruses (in fact, almost 8000 ELISA samples).
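    A minimal sketch of the trichotomous call is given below; the fourth-root transform follows the description above, but the control-based limits (the k = 3 multiplier) and all absorbance values are illustrative assumptions rather than the paper's exact protocol.

      # Illustrative trichotomous ELISA call; k and all absorbance values are made up.
      import numpy as np

      def classify_plate(samples, neg_controls, pos_controls, k=3.0):
          """All inputs are blank-corrected absorbance (A) values from one plate."""
          t = lambda a: np.maximum(np.asarray(a, dtype=float), 0) ** 0.25   # variance-stabilising A^(1/4)
          ts, tn, tp = t(samples), t(neg_controls), t(pos_controls)
          C = tn.mean() + k * tn.std(ddof=1)      # limit at or below which a sample is called negative
          D = tp.mean() - k * tp.std(ddof=1)      # limit at or above which a sample is called positive
          calls = np.where(ts <= C, "negative", np.where(ts >= D, "positive", "undefined"))
          return calls, C, D

      calls, C, D = classify_plate(
          samples=[0.02, 0.05, 0.30, 1.10, 0.15],
          neg_controls=[0.01, 0.02, 0.03, 0.02],
          pos_controls=[0.90, 1.05, 0.95, 1.00],
      )
      print(calls, round(C, 3), round(D, 3))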

  2. Combined target factor analysis and Bayesian soft-classification of interference-contaminated samples: forensic fire debris analysis.

    PubMed

    Williams, Mary R; Sigman, Michael E; Lewis, Jennifer; Pitan, Kelly McHugh

    2012-10-10

A Bayesian soft classification method combined with target factor analysis (TFA) is described and tested for the analysis of fire debris data. The method relies on analysis of the average mass spectrum across the chromatographic profile (i.e., the total ion spectrum, TIS) from multiple samples taken from a single fire scene. A library of TIS from reference ignitable liquids with assigned ASTM classifications is used as the target factors in TFA. The class-conditional distributions of correlations between the target and predicted factors for each ASTM class are represented by kernel functions and analyzed by Bayesian decision theory. The soft classification approach assists in assessing the probability that ignitable liquid residue from a specific ASTM E1618 class is present in a set of samples from a single fire scene, even in the presence of unspecified background contributions from pyrolysis products. The method is demonstrated with sample data sets and then tested on laboratory-scale burn data and large-scale field test burns. The overall performance achieved in laboratory and field tests of the method is approximately 80% correct classification of fire debris samples. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
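    The decision step can be sketched as follows, with made-up correlation distributions standing in for the reference library: class-conditional kernel density estimates of target-factor correlations are combined by Bayes' rule into soft class-membership probabilities.

      # Made-up class-conditional correlation distributions standing in for the reference library.
      import numpy as np
      from scipy.stats import gaussian_kde

      rng = np.random.default_rng(4)
      train = {
          "gasoline":         np.clip(rng.normal(0.90, 0.05, 300), -1, 1),
          "medium_petroleum": np.clip(rng.normal(0.75, 0.10, 300), -1, 1),
          "substrate_only":   np.clip(rng.normal(0.40, 0.15, 300), -1, 1),
      }
      kdes = {cls: gaussian_kde(vals) for cls, vals in train.items()}     # kernel densities per class
      prior = {cls: 1.0 / len(train) for cls in train}                    # uniform prior over classes

      def posterior(correlation):
          # Bayes' rule: posterior is proportional to class-conditional density times prior.
          like = {cls: float(kdes[cls](correlation)[0]) * prior[cls] for cls in kdes}
          z = sum(like.values())
          return {cls: v / z for cls, v in like.items()}

      print(posterior(0.85))   # soft class membership for one fire-debris correlation value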

  3. Developing an Apicomplexan DNA Barcoding System to Detect Blood Parasites of Small Coral Reef Fishes.

    PubMed

    Renoux, Lance P; Dolan, Maureen C; Cook, Courtney A; Smit, Nico J; Sikkel, Paul C

    2017-08-01

Apicomplexan parasites are obligate parasites of many species of vertebrates. To date, there is very limited understanding of these parasites in the most diverse group of vertebrates, the actinopterygian fishes. While DNA barcoding targeting the eukaryotic 18S small subunit rRNA gene sequence has been useful in identifying apicomplexans in tetrapods, identification of apicomplexans infecting fishes has relied solely on morphological identification by microscopy. In this study, a DNA barcoding method targeting the 18S rRNA gene was developed, with primers for identifying apicomplexans parasitizing certain actinopterygian fishes. A lead primer set was selected that showed no cross-reactivity to the overwhelmingly abundant host DNA and successfully confirmed 37 of the 41 (90.2%) microscopically verified parasitized fish blood samples analyzed in this study. Furthermore, this DNA barcoding method identified 4 additional samples that had screened negative for parasitemia, suggesting that this molecular method may provide improved sensitivity over morphological characterization by microscopy. In addition, this PCR screening method for fish apicomplexans, using Whatman FTA-preserved DNA, was tested in an effort to simplify field collection, transport, and sample storage, as well as to streamline the sample processing important for DNA barcoding of large sample sets.

  4. Estimating the Size of a Large Network and its Communities from a Random Sample

    PubMed Central

    Chen, Lin; Karbasi, Amin; Crawford, Forrest W.

    2017-01-01

    Most real-world networks are too large to be measured or studied directly and there is substantial interest in estimating global network properties from smaller sub-samples. One of the most important global properties is the number of vertices/nodes in the network. Estimating the number of vertices in a large network is a major challenge in computer science, epidemiology, demography, and intelligence analysis. In this paper we consider a population random graph G = (V, E) from the stochastic block model (SBM) with K communities/blocks. A sample is obtained by randomly choosing a subset W ⊆ V and letting G(W) be the induced subgraph in G of the vertices in W. In addition to G(W), we observe the total degree of each sampled vertex and its block membership. Given this partial information, we propose an efficient PopULation Size Estimation algorithm, called PULSE, that accurately estimates the size of the whole population as well as the size of each community. To support our theoretical analysis, we perform an exhaustive set of experiments to study the effects of sample size, K, and SBM model parameters on the accuracy of the estimates. The experimental results also demonstrate that PULSE significantly outperforms a widely-used method called the network scale-up estimator in a wide variety of scenarios. PMID:28867924

  5. Estimating the Size of a Large Network and its Communities from a Random Sample.

    PubMed

    Chen, Lin; Karbasi, Amin; Crawford, Forrest W

    2016-01-01

Most real-world networks are too large to be measured or studied directly and there is substantial interest in estimating global network properties from smaller sub-samples. One of the most important global properties is the number of vertices/nodes in the network. Estimating the number of vertices in a large network is a major challenge in computer science, epidemiology, demography, and intelligence analysis. In this paper we consider a population random graph G = (V, E) from the stochastic block model (SBM) with K communities/blocks. A sample is obtained by randomly choosing a subset W ⊆ V and letting G(W) be the induced subgraph in G of the vertices in W. In addition to G(W), we observe the total degree of each sampled vertex and its block membership. Given this partial information, we propose an efficient PopULation Size Estimation algorithm, called PULSE, that accurately estimates the size of the whole population as well as the size of each community. To support our theoretical analysis, we perform an exhaustive set of experiments to study the effects of sample size, K, and SBM model parameters on the accuracy of the estimates. The experimental results also demonstrate that PULSE significantly outperforms a widely-used method called the network scale-up estimator in a wide variety of scenarios.

  6. Porous extraction paddle: a solid phase extraction technique for studying the urine metabolome

    PubMed Central

    Shao, Gang; MacNeil, Michael; Yao, Yuanyuan; Giese, Roger W.

    2016-01-01

RATIONALE: A method was needed to accomplish solid phase extraction of a large urine volume in a convenient way where resources are limited, towards a goal of metabolome and xenobiotic exposome analysis at another, distant location. METHODS: A porous extraction paddle (PEP) was set up, comprising a porous nylon bag containing extraction particles that is flattened and immobilized between two stainless steel meshes. Stirring the PEP after attachment to a shaft of a motor mounted on the lid of the jar containing the urine accomplishes extraction. The bag contained a mixture of nonpolar and partly nonpolar particles to extract a diversity of corresponding compounds. RESULTS: Elution of a urine-exposed, water-washed PEP with aqueous methanol containing triethylammonium acetate (conditions intended to give a complete elution), followed by MALDI-TOF/TOF-MS, demonstrated that a diversity of compounds had been extracted, ranging from uric acid to peptides. CONCLUSION: The PEP allows the user to extract a large liquid sample in a jar simply by turning on a motor. The technique will be helpful in conducting metabolomics and xenobiotic exposome studies of urine, encouraging the extraction of large volumes to set up a convenient repository sample (e.g. 2 g of exposed adsorbent in a cryovial) for shipment and re-analysis in various ways in the future, including scaled-up isolation of unknown chemicals for identification. PMID:27624170

  7. Large scale aggregate microarray analysis reveals three distinct molecular subclasses of human preeclampsia.

    PubMed

    Leavey, Katherine; Bainbridge, Shannon A; Cox, Brian J

    2015-01-01

    Preeclampsia (PE) is a life-threatening hypertensive pathology of pregnancy affecting 3-5% of all pregnancies. To date, PE has no cure, early detection markers, or effective treatments short of the removal of what is thought to be the causative organ, the placenta, which may necessitate a preterm delivery. Additionally, numerous small placental microarray studies attempting to identify "PE-specific" genes have yielded inconsistent results. We therefore hypothesize that preeclampsia is a multifactorial disease encompassing several pathology subclasses, and that large cohort placental gene expression analysis will reveal these groups. To address our hypothesis, we utilized known bioinformatic methods to aggregate 7 microarray data sets across multiple platforms in order to generate a large data set of 173 patient samples, including 77 with preeclampsia. Unsupervised clustering of these patient samples revealed three distinct molecular subclasses of PE. This included a "canonical" PE subclass demonstrating elevated expression of known PE markers and genes associated with poor oxygenation and increased secretion, as well as two other subclasses potentially representing a poor maternal response to pregnancy and an immunological presentation of preeclampsia. Our analysis sheds new light on the heterogeneity of PE patients, and offers up additional avenues for future investigation. Hopefully, our subclassification of preeclampsia based on molecular diversity will finally lead to the development of robust diagnostics and patient-based treatments for this disorder.

  8. Porous extraction paddle: a solid phase extraction technique for studying the urine metabolome.

    PubMed

    Shao, Gang; MacNeil, Michael; Yao, Yuanyuan; Giese, Roger W

    2016-09-14

    A method was needed to accomplish solid phase extraction of a large urine volume in a convenient way where resources are limited, towards a goal of metabolome and xenobiotic exposome analysis at another, distant location. A porous extraction paddle (PEP) was set up, comprising a porous nylon bag containing extraction particles that is flattened and immobilized between two stainless steel meshes. Stirring the PEP after attachment to a shaft of a motor mounted on the lid of the jar containing the urine accomplishes extraction. The bag contained a mixture of nonpolar and partly nonpolar particles to extract a diversity of corresponding compounds. Elution of a urine-exposed, water-washed PEP with aqueous methanol containing triethylammonium acetate (conditions intended to give a complete elution), followed by MALDI-TOF/TOF-MS, demonstrated that a diversity of compounds had been extracted ranging from uric acid to peptides. The PEP allows the user to extract a large liquid sample in a jar simply by turning on a motor. The technique will be helpful in conducting metabolomics and xenobiotic exposome studies of urine, encouraging the extraction of large volumes to set up a convenient repository sample (e.g. 2 g of exposed adsorbent in a cryovial) for shipment and re-analysis in various ways in the future, including scaled-up isolation of unknown chemicals for identification. This article is protected by copyright. All rights reserved.

  9. Common genetic variants associated with cognitive performance identified using the proxy-phenotype method

    PubMed Central

    Rietveld, Cornelius A.; Esko, Tõnu; Davies, Gail; Pers, Tune H.; Turley, Patrick; Benyamin, Beben; Chabris, Christopher F.; Emilsson, Valur; Johnson, Andrew D.; Lee, James J.; de Leeuw, Christiaan; Marioni, Riccardo E.; Medland, Sarah E.; Miller, Michael B.; Rostapshova, Olga; van der Lee, Sven J.; Vinkhuyzen, Anna A. E.; Amin, Najaf; Conley, Dalton; Derringer, Jaime; van Duijn, Cornelia M.; Fehrmann, Rudolf; Franke, Lude; Glaeser, Edward L.; Hansell, Narelle K.; Hayward, Caroline; Iacono, William G.; Ibrahim-Verbaas, Carla; Jaddoe, Vincent; Karjalainen, Juha; Laibson, David; Lichtenstein, Paul; Liewald, David C.; Magnusson, Patrik K. E.; Martin, Nicholas G.; McGue, Matt; McMahon, George; Pedersen, Nancy L.; Pinker, Steven; Porteous, David J.; Posthuma, Danielle; Rivadeneira, Fernando; Smith, Blair H.; Starr, John M.; Tiemeier, Henning; Timpson, Nicholas J.; Trzaskowski, Maciej; Uitterlinden, André G.; Verhulst, Frank C.; Ward, Mary E.; Wright, Margaret J.; Davey Smith, George; Deary, Ian J.; Johannesson, Magnus; Plomin, Robert; Visscher, Peter M.; Benjamin, Daniel J.; Koellinger, Philipp D.

    2014-01-01

    We identify common genetic variants associated with cognitive performance using a two-stage approach, which we call the proxy-phenotype method. First, we conduct a genome-wide association study of educational attainment in a large sample (n = 106,736), which produces a set of 69 education-associated SNPs. Second, using independent samples (n = 24,189), we measure the association of these education-associated SNPs with cognitive performance. Three SNPs (rs1487441, rs7923609, and rs2721173) are significantly associated with cognitive performance after correction for multiple hypothesis testing. In an independent sample of older Americans (n = 8,652), we also show that a polygenic score derived from the education-associated SNPs is associated with memory and absence of dementia. Convergent evidence from a set of bioinformatics analyses implicates four specific genes (KNCMA1, NRXN1, POU2F3, and SCRT). All of these genes are associated with a particular neurotransmitter pathway involved in synaptic plasticity, the main cellular mechanism for learning and memory. PMID:25201988
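    As a generic illustration of the polygenic-score step (toy genotypes and effect sizes, not the study's SNPs or weights), the score is simply a weighted sum of allele dosages using the discovery-stage GWAS coefficients:

      # Toy genotypes and effect sizes; a real score would use the study's SNPs and GWAS betas.
      import numpy as np

      rng = np.random.default_rng(5)
      n_people, n_snps = 1000, 69                                            # 69 education-associated SNPs
      dosages = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)    # 0/1/2 effect-allele copies
      effect_sizes = rng.normal(0, 0.02, size=n_snps)                        # stand-in discovery betas

      polygenic_score = dosages @ effect_sizes        # weighted allele-dosage sum, one score per person
      print(polygenic_score[:5])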

  10. Evaluation of two outlier-detection-based methods for detecting tissue-selective genes from microarray data.

    PubMed

    Kadota, Koji; Konishi, Tomokazu; Shimizu, Kentaro

    2007-05-01

Large-scale expression profiling using DNA microarrays enables identification of tissue-selective genes for which expression is considerably higher and/or lower in some tissues than in others. Among numerous possible methods, only two outlier-detection-based methods (an AIC-based method and Sprent's non-parametric method) can treat the various types of selective patterns equally, but they produce substantially different results. We investigated the performance of these two methods for different parameter settings and for a reduced number of samples, focusing on their ability to detect selective expression patterns robustly. We applied them to public microarray data collected from 36 normal human tissue samples and analyzed the effects of both changing the parameter settings and reducing the number of samples. The AIC-based method was more robust in both cases. The findings confirm that the use of the AIC-based method in the recently proposed ROKU method for detecting tissue-selective expression patterns is correct and that Sprent's method is not suitable for ROKU.

  11. Laser ablation-laser induced breakdown spectroscopy for the measurement of total elemental concentration in soils.

    PubMed

    Pareja, Jhon; López, Sebastian; Jaramillo, Daniel; Hahn, David W; Molina, Alejandro

    2013-04-10

The performance of traditional laser-induced breakdown spectroscopy (LIBS) and laser ablation-LIBS (LA-LIBS) was compared by quantifying the total elemental concentration of potassium in highly heterogeneous solid samples, namely soils. Calibration curves for a set of fifteen samples with a wide range of potassium concentrations were generated. The LA-LIBS approach produced a more linear response than the traditional LIBS scheme. The analytical response of LA-LIBS was tested with a large set of different soil samples for the quantification of the total concentration of Fe, Mn, Mg, Ca, Na, and K. Results showed an acceptable linear response for Ca, Fe, Mg, and K, while poor signal responses were found for Na and Mn. Signs of remaining matrix effects for the LA-LIBS approach in the case of soil analysis were found and discussed. Finally, some improvements and possibilities for future studies toward quantitative soil analysis with the LA-LIBS technique are suggested.

  12. Optical properties (bidirectional reflectance distribution function) of shot fabric.

    PubMed

    Lu, R; Koenderink, J J; Kappers, A M

    2000-11-01

    To study the optical properties of materials, one needs a complete set of the angular distribution functions of surface scattering from the materials. Here we present a convenient method for collecting a large set of bidirectional reflectance distribution function (BRDF) samples in the hemispherical scattering space. Material samples are wrapped around a right-circular cylinder and irradiated by a parallel light source, and the scattered radiance is collected by a digital camera. We tilted the cylinder around its center to collect the BRDF samples outside the plane of incidence. This method can be used with materials that have isotropic and anisotropic scattering properties. We demonstrate this method in a detailed investigation of shot fabrics. The warps and the fillings of shot fabrics are dyed different colors so that the fabric appears to change color at different viewing angles. These color-changing characteristics are found to be related to the physical and geometrical structure of shot fabric. Our study reveals that the color-changing property of shot fabrics is due mainly to an occlusion effect.

  13. Indicators of quality of antenatal care: a pilot study.

    PubMed

    Vause, S; Maresh, M

    1999-03-01

    To pilot a list of indicators of quality of antenatal care across a range of maternity care settings. For each indicator to determine what is achieved in current clinical practice, to facilitate the setting of audit standards and calculation of appropriate sample sizes for audit. A multicentre retrospective observational study. Nine maternity units in the United Kingdom. 20,771 women with a singleton pregnancy, who were delivered between 1 August 1994 and 31 July 1995. Nine of the eleven suggested indicators were successfully piloted. Two indicators require further development. In seven of the nine hospitals external cephalic version was not commonly performed. There were wide variations in the proportions of women screened for asymptomatic bacteriuria. Screening of women from ethnic minorities for haemoglobinopathy was more likely in hospitals with a large proportion of non-caucasian women. A large number of Rhesus negative women did not have a Rhesus antibody check performed after 28 weeks of gestation and did not receive anti-D immunoglobulin after a potentially sensitising event during pregnancy. As a result of the study appropriate sample sizes for future audit could be calculated. Measuring the extent to which evidence-based interventions are used in routine clinical practice provides a more detailed picture of the strengths and weaknesses in an antenatal service than traditional outcomes such as perinatal mortality rates. Awareness of an appropriate sample size should prevent waste of time and resources on inconclusive audits.

  14. Consistency of ARESE II Cloud Absorption Estimates and Sampling Issues

    NASA Technical Reports Server (NTRS)

    Oreopoulos, L.; Marshak, A.; Cahalan, R. F.; Lau, William K. M. (Technical Monitor)

    2002-01-01

    Data from three cloudy days (March 3, 21, 29, 2000) of the ARM Enhanced Shortwave Experiment II (ARESE II) were analyzed. Grand averages of broadband absorptance among three sets of instruments were compared. Fractional solar absorptances were approx. 0.21-0.22 with the exception of March 3 when two sets of instruments gave values smaller by approx. 0.03-0.04. The robustness of these values was investigated by looking into possible sampling problems with the aid of 500 nm spectral fluxes. Grand averages of 500 nm apparent absorptance cover a wide range of values for these three days, namely from a large positive (approx. 0.011) average for March 3, to a small negative (approximately -0.03) for March 21, to near zero (approx. 0.01) for March 29. We present evidence suggesting that a large part of the discrepancies among the three days is due to the different nature of clouds and their non-uniform sampling. Hence, corrections to the grand average broadband absorptance values may be necessary. However, application of the known correction techniques may be precarious due to the sparsity of collocated flux measurements above and below the clouds. Our analysis leads to the conclusion that only March 29 fulfills all requirements for reliable estimates of cloud absorption, that is, the presence of thick, overcast, homogeneous clouds.

  15. A new approach to untargeted integration of high resolution liquid chromatography-mass spectrometry data.

    PubMed

    van der Kloet, Frans M; Hendriks, Margriet; Hankemeier, Thomas; Reijmers, Theo

    2013-11-01

Because of its high sensitivity and specificity, hyphenated mass spectrometry has become the predominant method to detect and quantify metabolites present in bio-samples relevant for all sorts of life science studies. In contrast to targeted methods that are dedicated to specific features, global profiling acquisition methods allow new, unspecified metabolites to be analyzed. The challenge with these so-called untargeted methods is the proper and automated extraction and integration of features that could be of relevance. We propose a new algorithm that enables untargeted integration of samples that are measured with high resolution liquid chromatography-mass spectrometry (LC-MS). In contrast to other approaches, limited user interaction is needed, allowing less experienced users to integrate their data as well. The large number of single features found within a sample is combined into a smaller list of compound-related, grouped feature sets representative of that sample. These feature sets allow for easier interpretation and identification and, as important, easier matching across samples. We show that the automatically obtained integration results for a set of known target metabolites match those generated with vendor software, but that at least 10 times more feature sets are extracted as well. We demonstrate our approach using high resolution LC-MS data acquired for 128 samples on a lipidomics platform. The data were also processed in a targeted manner (with a combination of automatic and manual integration) using vendor software for a set of 174 targets. Because our untargeted extraction procedure is run per sample and per mass trace, its implementation is scalable. Given the generic approach, we envision that this data extraction method will be used in targeted as well as untargeted analysis of many different kinds of TOF-MS data, even CE- and GC-MS data or MRM. The Matlab package is available for download on request, and efforts are directed toward a user-friendly Windows executable. Copyright © 2013 Elsevier B.V. All rights reserved.

  16. Dynamics of large submarine landslide from analyzing the basal section of mass-transport deposits sampled by IODP Nankai Trough Submarine Landslide History (NanTroSLIDE)

    NASA Astrophysics Data System (ADS)

    Strasser, M.; Dugan, B.; Henry, P.; Jurado, M. J.; Kanagawa, K.; Kanamatsu, T.; Moore, G. F.; Panieri, G.; Pini, G. A.

    2014-12-01

Multibeam swath bathymetry and reflection seismic data image large submarine landslide complexes along ocean margins worldwide. However, slope failure initiation, acceleration of motion, and mass-transport dynamics of submarine landslides, which are all key to assessing their tsunamigenic potential or impact on offshore infrastructure, cannot be conclusively deduced from the geometric expression and acoustic characteristics of geophysical data sets alone; cores and in situ data from the subsurface are needed to complement our understanding of submarine landslide dynamics. Here we present data and results from drilling, logging, and coring thick mass-transport deposits (MTDs) in the Nankai Trough accretionary prism during Integrated Ocean Drilling Program (IODP) Expeditions 333 and 338. We integrate analysis of 3D seismic and Logging While Drilling (LWD) data sets with data from laboratory analysis of core samples (geotechnical shear experiments, X-ray Computed Tomography (X-CT), Scanning Electron Microscopy (SEM) of deformation indicators, and magnetic fabric analysis) to study the nature and mode of deformation and the dynamics of mass transport in this active tectonic setting. In particular, we show that Fe-S filaments commonly observed in X-ray CT data of marine sediments, likely resulting from early diagenesis of worm burrows, are folded in large MTDs and display preferential orientation at their base. The observed lineation has low dip and is interpreted as the consequence of shear along the basal surface, revealing a new proxy for strain in soft sediments that can be applied to cores that reach through the entire depth of MTDs. Shear deformation in the lower part of thick MTDs is also revealed by AMS data, which, in combination with other paleomagnetic data, is used to reconstruct strain and the transport direction of the landslides.

  17. A Simple Sampling Method for Estimating the Accuracy of Large Scale Record Linkage Projects.

    PubMed

    Boyd, James H; Guiver, Tenniel; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Anderson, Phil; Dickinson, Teresa

    2016-05-17

Record linkage techniques allow different data collections to be brought together to provide a wider picture of the health status of individuals. Ensuring high linkage quality is important to guarantee the quality and integrity of research. Current methods for measuring linkage quality typically focus on precision (reflecting the proportion of incorrect links among accepted record-pairs), given the difficulty of measuring the proportion of false negatives. The aim of this work is to introduce and evaluate a sampling-based method to estimate both precision and recall following record linkage. In the sampling-based method, record-pairs from each threshold band (including those below the identified cut-off for acceptance) are sampled and clerically reviewed. These results are then applied to the entire set of record-pairs, providing estimates of false positives and false negatives. This method was evaluated on a synthetically generated dataset, where the true match status (which records belonged to the same person) was known. The sampled estimates of linkage quality were relatively close to the actual linkage quality metrics calculated for the whole synthetic dataset. The precision and recall measures for seven reviewers were very consistent, with little variation in the clerical assessment results (overall agreement using the Fleiss kappa statistic was 0.601). This method presents a possible means of accurately estimating matching quality and refining linkages in population-level linkage studies. The sampling approach is especially important for large project linkages, where the number of record pairs produced may be very large, often running into the millions.
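    The estimation logic can be sketched with made-up pair counts and review outcomes: record-pairs are sampled from each score band, the clerical review results are scaled back to the band totals, and precision and recall of the chosen cut-off follow directly.

      # Made-up band totals and clerical review outcomes. Bands above the cut-off are accepted links.
      bands = [
          {"total": 50000, "sampled": 200, "true_in_sample": 198, "accepted": True},
          {"total": 8000,  "sampled": 200, "true_in_sample": 150, "accepted": True},
          {"total": 12000, "sampled": 200, "true_in_sample": 30,  "accepted": False},
          {"total": 90000, "sampled": 200, "true_in_sample": 2,   "accepted": False},
      ]

      tp = fp = fn = 0.0
      for b in bands:
          est_true = b["total"] * b["true_in_sample"] / b["sampled"]   # scale the sample to the band
          if b["accepted"]:
              tp += est_true                    # estimated true matches among accepted links
              fp += b["total"] - est_true       # estimated false positives among accepted links
          else:
              fn += est_true                    # estimated true matches lost below the cut-off

      print(f"estimated precision: {tp / (tp + fp):.3f}, recall: {tp / (tp + fn):.3f}")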

  18. A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets.

    PubMed

    Savitski, Mikhail M; Wilhelm, Mathias; Hahne, Hannes; Kuster, Bernhard; Bantscheff, Marcus

    2015-09-01

    Calculating the number of confidently identified proteins and estimating false discovery rate (FDR) is a challenge when analyzing very large proteomic data sets such as entire human proteomes. Biological and technical heterogeneity in proteomic experiments further add to the challenge and there are strong differences in opinion regarding the conceptual validity of a protein FDR and no consensus regarding the methodology for protein FDR determination. There are also limitations inherent to the widely used classic target-decoy strategy that particularly show when analyzing very large data sets and that lead to a strong over-representation of decoy identifications. In this study, we investigated the merits of the classic, as well as a novel target-decoy-based protein FDR estimation approach, taking advantage of a heterogeneous data collection comprised of ∼19,000 LC-MS/MS runs deposited in ProteomicsDB (https://www.proteomicsdb.org). The "picked" protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and chooses either the target or the decoy sequence depending on which receives the highest score. We investigated the performance of this approach in combination with q-value based peptide scoring to normalize sample-, instrument-, and search engine-specific differences. The "picked" target-decoy strategy performed best when protein scoring was based on the best peptide q-value for each protein yielding a stable number of true positive protein identifications over a wide range of q-value thresholds. We show that this simple and unbiased strategy eliminates a conceptual issue in the commonly used "classic" protein FDR approach that causes overprediction of false-positive protein identification in large data sets. The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications and is readily implemented in proteomics analysis software. © 2015 by The American Society for Biochemistry and Molecular Biology, Inc.
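    A minimal sketch of the picked target-decoy step with toy scores is shown below: each target protein is paired with its decoy, only the higher-scoring member of the pair is retained, and the FDR at a score threshold is estimated as the ratio of retained decoys to retained targets.

      # Toy best scores per protein accession for target and decoy sequences.
      def picked_protein_fdr(target_scores, decoy_scores, threshold):
          kept = []
          for prot in set(target_scores) | set(decoy_scores):
              t = target_scores.get(prot, float("-inf"))
              d = decoy_scores.get(prot, float("-inf"))
              kept.append(("target", t) if t >= d else ("decoy", d))   # keep the better of the pair
          n_t = sum(1 for kind, s in kept if kind == "target" and s >= threshold)
          n_d = sum(1 for kind, s in kept if kind == "decoy" and s >= threshold)
          return n_d / max(n_t, 1)              # classic decoys/targets estimate on the picked set

      targets = {"P1": 9.1, "P2": 3.0, "P3": 7.5, "P4": 1.2, "P5": 8.0, "P6": 6.2}
      decoys  = {"P1": 2.0, "P2": 5.5, "P3": 0.7, "P4": 0.4, "P5": 1.0, "P6": 0.3}
      print(picked_protein_fdr(targets, decoys, threshold=3.5))        # 0.25 on this toy example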

  19. A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets

    PubMed Central

    Savitski, Mikhail M.; Wilhelm, Mathias; Hahne, Hannes; Kuster, Bernhard; Bantscheff, Marcus

    2015-01-01

    Calculating the number of confidently identified proteins and estimating false discovery rate (FDR) is a challenge when analyzing very large proteomic data sets such as entire human proteomes. Biological and technical heterogeneity in proteomic experiments further add to the challenge and there are strong differences in opinion regarding the conceptual validity of a protein FDR and no consensus regarding the methodology for protein FDR determination. There are also limitations inherent to the widely used classic target–decoy strategy that particularly show when analyzing very large data sets and that lead to a strong over-representation of decoy identifications. In this study, we investigated the merits of the classic, as well as a novel target–decoy-based protein FDR estimation approach, taking advantage of a heterogeneous data collection comprised of ∼19,000 LC-MS/MS runs deposited in ProteomicsDB (https://www.proteomicsdb.org). The “picked” protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and chooses either the target or the decoy sequence depending on which receives the highest score. We investigated the performance of this approach in combination with q-value based peptide scoring to normalize sample-, instrument-, and search engine-specific differences. The “picked” target–decoy strategy performed best when protein scoring was based on the best peptide q-value for each protein yielding a stable number of true positive protein identifications over a wide range of q-value thresholds. We show that this simple and unbiased strategy eliminates a conceptual issue in the commonly used “classic” protein FDR approach that causes overprediction of false-positive protein identification in large data sets. The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications and is readily implemented in proteomics analysis software. PMID:25987413

  20. Cost Prediction Using a Survival Grouping Algorithm: An Application to Incident Prostate Cancer Cases.

    PubMed

    Onukwugha, Eberechukwu; Qi, Ran; Jayasekera, Jinani; Zhou, Shujia

    2016-02-01

    Prognostic classification approaches are commonly used in clinical practice to predict health outcomes. However, there has been limited focus on use of the general approach for predicting costs. We applied a grouping algorithm designed for large-scale data sets and multiple prognostic factors to investigate whether it improves cost prediction among older Medicare beneficiaries diagnosed with prostate cancer. We analysed the linked Surveillance, Epidemiology and End Results (SEER)-Medicare data, which included data from 2000 through 2009 for men diagnosed with incident prostate cancer between 2000 and 2007. We split the survival data into two data sets (D0 and D1) of equal size. We trained the classifier of the Grouping Algorithm for Cancer Data (GACD) on D0 and tested it on D1. The prognostic factors included cancer stage, age, race and performance status proxies. We calculated the average difference between observed D1 costs and predicted D1 costs at 5 years post-diagnosis with and without the GACD. The sample included 110,843 men with prostate cancer. The median age of the sample was 74 years, and 10% were African American. The average difference (mean absolute error [MAE]) per person between the real and predicted total 5-year cost was US$41,525 (MAE US$41,790; 95% confidence interval [CI] US$41,421-42,158) with the GACD and US$43,113 (MAE US$43,639; 95% CI US$43,062-44,217) without the GACD. The 5-year cost prediction without grouping resulted in a sample overestimate of US$79,544,508. The grouping algorithm developed for complex, large-scale data improves the prediction of 5-year costs. The prediction accuracy could be improved by utilization of a richer set of prognostic factors and refinement of categorical specifications.

  1. Simulation Studies as Designed Experiments: The Comparison of Penalized Regression Models in the “Large p, Small n” Setting

    PubMed Central

    Chaibub Neto, Elias; Bare, J. Christopher; Margolin, Adam A.

    2014-01-01

New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed at systematically and objectively evaluating competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Oftentimes, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well-established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in the planning of experiments, we are better able to understand the strengths and weaknesses of competing algorithms, leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large-scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than the sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where “omics” features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting, and our simulations corroborate well-established results concerning the conditions under which each one of these methods is expected to perform best, while providing several novel insights. PMID:25289666
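    A single simulated condition from such a study might look like the sketch below (one arbitrary setting of sample size, dimensionality, and sparsity, not the paper's full designed experiment): ridge, lasso, and elastic net are fit with cross-validated penalties and compared on held-out data.

      # One simulated "large p, small n" condition with a sparse true model (arbitrary settings).
      import numpy as np
      from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
      from sklearn.metrics import r2_score

      rng = np.random.default_rng(6)
      n, p, k = 200, 2000, 20                       # few samples, many features, sparse truth
      X = rng.normal(size=(n, p))
      beta = np.zeros(p)
      beta[:k] = rng.normal(0, 1.0, size=k)
      y = X @ beta + rng.normal(scale=1.0, size=n)
      X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

      models = {
          "ridge": RidgeCV(alphas=np.logspace(-2, 3, 20)),
          "lasso": LassoCV(cv=5, random_state=0),
          "elastic_net": ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0),
      }
      for name, model in models.items():
          model.fit(X_tr, y_tr)
          print(name, round(r2_score(y_te, model.predict(X_te)), 3))   # held-out predictive R^2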

  2. Temperature Control Diagnostics for Sample Environments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Santodonato, Louis J; Walker, Lakeisha MH; Church, Andrew J

    2010-01-01

In a scientific laboratory setting, standard equipment such as cryocoolers is often used as part of a custom sample environment system designed to regulate temperature over a wide range. The end user may be more concerned with precise sample temperature control than with base temperature. But cryogenic systems tend to be specified mainly in terms of cooling capacity and base temperature. Technical staff at scientific user facilities (and perhaps elsewhere) often wonder how to best specify and evaluate temperature control capabilities. Here we describe test methods and give results obtained at a user facility that operates a large sample environment inventory. Although this inventory includes a wide variety of temperature, pressure, and magnetic field devices, the present work focuses on cryocooler-based systems.

  3. A statistical model and national data set for partitioning fish-tissue mercury concentration variation between spatiotemporal and sample characteristic effects

    USGS Publications Warehouse

    Wente, Stephen P.

    2004-01-01

Many Federal, Tribal, State, and local agencies monitor mercury in fish-tissue samples to identify sites with elevated fish-tissue mercury (fish-mercury) concentrations, track changes in fish-mercury concentrations over time, and produce fish-consumption advisories. Interpretation of such monitoring data commonly is impeded by difficulties in separating the effects of sample characteristics (species, tissues sampled, and sizes of fish) from the effects of spatial and temporal trends on fish-mercury concentrations. Without such a separation, variation in fish-mercury concentrations due to differences in the characteristics of samples collected over time or across space can be misattributed to temporal or spatial trends, and/or actual trends in fish-mercury concentration can be misattributed to differences in sample characteristics. This report describes a statistical model that can separate spatiotemporal and sample-characteristic effects in fish-mercury concentration data, and a national data set (31,813 samples) for calibrating that model. This model could be useful for evaluating spatial and temporal trends in fish-mercury concentrations and developing fish-consumption advisories. The observed fish-mercury concentration data and model predictions can be accessed, displayed geospatially, and downloaded via the World Wide Web (http://emmma.usgs.gov). This report and the associated web site may assist in the interpretation of large amounts of data from widespread fish-mercury monitoring efforts.
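    The partitioning idea can be sketched with simulated data (the covariates, effect sizes, and model form below are illustrative assumptions, not the actual USGS model): regressing log concentration on sample characteristics together with site and year terms lets spatiotemporal effects be read off with the sample characteristics held constant.

      # Simulated fish-mercury data; covariates, effect sizes, and model form are illustrative only.
      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(7)
      n = 500
      df = pd.DataFrame({
          "species": rng.choice(["walleye", "bass", "catfish"], size=n),
          "length_cm": rng.uniform(20, 70, size=n),
          "site": rng.choice([f"site{i}" for i in range(10)], size=n),
          "year": rng.integers(2000, 2005, size=n).astype(str),
      })
      site_effect = {s: rng.normal(0, 0.4) for s in df["site"].unique()}
      df["log_hg"] = (0.02 * df["length_cm"]
                      + df["species"].map({"walleye": 0.3, "bass": 0.1, "catfish": -0.2})
                      + df["site"].map(site_effect)
                      + rng.normal(0, 0.2, size=n))

      # Sample characteristics (species, length) and spatiotemporal terms (site, year) in one model,
      # so the site/year coefficients estimate trends net of the sample characteristics.
      fit = smf.ols("log_hg ~ species + length_cm + C(site) + C(year)", data=df).fit()
      print(fit.params.filter(like="site").head())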

  4. Automation practices in large molecule bioanalysis: recommendations from group L5 of the global bioanalytical consortium.

    PubMed

    Ahene, Ago; Calonder, Claudio; Davis, Scott; Kowalchick, Joseph; Nakamura, Takahiro; Nouri, Parya; Vostiar, Igor; Wang, Yang; Wang, Jin

    2014-01-01

    In recent years, the use of automated sample handling instrumentation has come to the forefront of bioanalytical analysis in order to ensure greater assay consistency and throughput. Since robotic systems are becoming part of everyday analytical procedures, the need for consistent guidance across the pharmaceutical industry has become increasingly important. Pre-existing regulations do not go into sufficient detail in regard to the use of robotic systems with analytical methods, especially large molecule bioanalysis. As a result, Global Bioanalytical Consortium (GBC) Group L5 has put forth specific recommendations for the validation, qualification, and use of robotic systems as part of large molecule bioanalytical analyses in the present white paper. The guidelines presented can be followed to establish a consistent, transparent methodology that ensures robotic systems can be effectively used and documented in a regulated bioanalytical laboratory setting. This will allow for consistent use of robotic sample handling instrumentation as part of large molecule bioanalysis across the globe.

  5. A multicomponent matched filter cluster confirmation tool for eROSITA: initial application to the RASS and DES-SV data sets

    DOE PAGES

    Klein, M.; Mohr, J. J.; Desai, S.; ...

    2017-11-14

    We describe a multi-component matched filter cluster confirmation tool (MCMF) designed for the study of large X-ray source catalogs produced by the upcoming X-ray all-sky survey mission eROSITA. We apply the method to confirm a sample of 88 clusters with redshifts $0.05

  6. Does the public notice visual resource problems on the federal estate?

    Treesearch

    John D. Peine

    1979-01-01

    Results of the 1977 Federal estate survey are highlighted. The survey of recreation on the Federal estate represents a unique data set which was uniformly collected across all Federal land managing agencies and sections of the country. The on-site sampling procedures utilized in this survey process have never before been applied on such a large scale. Procedures followed and...

  7. Network-scale spatial and temporal variation in Chinook salmon (Oncorhynchus tshawytscha) redd distributions: patterns inferred from spatially continuous replicate surveys

    Treesearch

    Daniel J. Isaak; Russell F. Thurow

    2006-01-01

    Spatially continuous sampling designs, when temporally replicated, provide analytical flexibility and are unmatched in their ability to provide a dynamic system view. We have compiled such a data set by georeferencing the network-scale distribution of Chinook salmon (Oncorhynchus tshawytscha) redds across a large wilderness basin (7330 km2) in...

  8. Goals Set in the Land of the Living/Dying: A Longitudinal Study of Patients Living with Advanced Cancer

    ERIC Educational Resources Information Center

    Nissim, Rinat; Rennie, David; Fleming, Stephen; Hales, Sarah; Gagliese, Lucia; Rodin, Gary

    2012-01-01

    A longitudinal qualitative research study was undertaken to provide an understanding of a prolonged experience of advanced cancer, as seen through the eyes of dying individuals. Using a variant of the grounded theory method, the authors theoretically sampled, from outpatient clinics in a large comprehensive cancer treatment center, 27 patients…

  9. The Teaching of Literature in a Singapore Secondary School: Disciplinarity, Curriculum Coverage and the Opportunity Costs Involved

    ERIC Educational Resources Information Center

    Towndrow, Phillip; Kwek, Dennis Beng Kiat

    2017-01-01

    Set against the backdrop of reinvigorating the study of literature and concerns about the adequate preparation of students for the world of work, this paper explores how a Singapore teacher presented a literary text in the classroom. Drawing on data from a large-scale representative sample of Singapore schools in instruction and assessment…

  10. They're Not All at Home: Residential Placements of Early Adolescents in Special Education

    ERIC Educational Resources Information Center

    Chen, Chin-Chih; Culhane, Dennis P.; Metraux, Stephen; Park, Jung Min; Venable, Jessica C.; Burnett, T. C.

    2016-01-01

    Using an integrated administrative data set, out-of-home residential placements (i.e., child welfare, juvenile justice, mental health) were examined in a sample of early adolescents in a large urban school district. Out-of-home placements were tracked across Grades 7 to 9 in a population of 58,000 youth. This included 10,911 students identified…

  11. A multicomponent matched filter cluster confirmation tool for eROSITA: initial application to the RASS and DES-SV data sets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Klein, M.; Mohr, J. J.; Desai, S.

    We describe a multi-component matched filter cluster confirmation tool (MCMF) designed for the study of large X-ray source catalogs produced by the upcoming X-ray all-sky survey mission eROSITA. We apply the method to confirm a sample of 88 clusters with redshifts $0.05

  12. Antipoaching standards in onshore hydrocarbon concessions drawn from a Central African case study.

    PubMed

    Vanthomme, Hadrien P A; Tobi, Elie; Todd, Angelique F; Korte, Lisa; Alonso, Alfonso

    2017-06-01

    Unsustainable hunting outside protected areas is threatening tropical biodiversity worldwide and requires conservationists to engage increasingly in antipoaching activities. Following the example of ecocertified logging companies, we argue that other extractive industries managing large concessions should engage in antipoaching activities as part of their environmental management plans. Onshore hydrocarbon concessions should also adopt antipoaching protocols as a standard because they represent a biodiversity threat comparable to logging. We examined the spatiotemporal patterns of small- and large-mammal poaching in an onshore oil concession in Gabon, Central Africa, with a Bayesian occupancy model based on signs of poaching collected from 2010 to 2015 on antipoaching patrols. Patrol locations were initially determined based on local intelligence and past patrol successes (adaptive management) and subsequently with a systematic sampling of the concession. We generated maps of poaching probability in the concession and determined the temporal trends of this threat over 5 years. The spatiotemporal patterns of large- and small-mammal poaching differed throughout the concession, and likely these groups will need different management strategies. By elucidating the relationship between site-specific sampling effort and detection probability, the Bayesian method allowed us to set goals for future antipoaching patrols. Our results indicate that a combination of systematic sampling and adaptive management data is necessary to infer spatiotemporal patterns with the statistical method we used. On the basis of our case study, we recommend hydrocarbon companies interested in implementing efficient antipoaching activities in their onshore concessions to lay the foundation of long-needed industry standards by: adequately measuring antipoaching effort; mixing adaptive management and balanced sampling; setting goals for antipoaching effort; pairing patrols with large-mammal monitoring; supporting antipoaching patrols across the landscape; restricting access to their concessions; performing random searches for bushmeat and mammal products at points of entry; controlling urban and agricultural expansion; supporting bushmeat alternatives; and supporting land-use planning. Published 2016. This article is a U.S. Government work and is in the public domain in the USA. Conservation Biology published by Wiley Periodicals, Inc. on behalf of Society for Conservation Biology.

  13. Uncovering the hidden risk architecture of the schizophrenias: confirmation in three independent genome-wide association studies.

    PubMed

    Arnedo, Javier; Svrakic, Dragan M; Del Val, Coral; Romero-Zaliz, Rocío; Hernández-Cuervo, Helena; Fanous, Ayman H; Pato, Michele T; Pato, Carlos N; de Erausquin, Gabriel A; Cloninger, C Robert; Zwir, Igor

    2015-02-01

    The authors sought to demonstrate that schizophrenia is a heterogeneous group of heritable disorders caused by different genotypic networks that cause distinct clinical syndromes. In a large genome-wide association study of cases with schizophrenia and controls, the authors first identified sets of interacting single-nucleotide polymorphisms (SNPs) that cluster within particular individuals (SNP sets) regardless of clinical status. Second, they examined the risk of schizophrenia for each SNP set and tested replicability in two independent samples. Third, they identified genotypic networks composed of SNP sets sharing SNPs or subjects. Fourth, they identified sets of distinct clinical features that cluster in particular cases (phenotypic sets or clinical syndromes) without regard for their genetic background. Fifth, they tested whether SNP sets were associated with distinct phenotypic sets in a replicable manner across the three studies. The authors identified 42 SNP sets associated with a 70% or greater risk of schizophrenia, and confirmed 34 (81%) or more with similar high risk of schizophrenia in two independent samples. Seventeen networks of SNP sets did not share any SNP or subject. These disjoint genotypic networks were associated with distinct gene products and clinical syndromes (i.e., the schizophrenias) varying in symptoms and severity. Associations between genotypic networks and clinical syndromes were complex, showing multifinality and equifinality. The interactive networks explained the risk of schizophrenia more than the average effects of all SNPs (24%). Schizophrenia is a group of heritable disorders caused by a moderate number of separate genotypic networks associated with several distinct clinical syndromes.

  14. Profiling cellular protein complexes by proximity ligation with dual tag microarray readout.

    PubMed

    Hammond, Maria; Nong, Rachel Yuan; Ericsson, Olle; Pardali, Katerina; Landegren, Ulf

    2012-01-01

    Patterns of protein interactions provide important insights into basic biology, and their analysis plays an increasing role in drug development and diagnostics of disease. We have established a scalable technique to compare two biological samples for the levels of all pairwise interactions among a set of targeted protein molecules. The technique is a combination of the proximity ligation assay with readout via dual tag microarrays. In the proximity ligation assay, protein identities are encoded as DNA sequences by attaching DNA oligonucleotides to antibodies directed against the proteins of interest. Upon binding by pairs of antibodies to proteins present in the same molecular complexes, ligation reactions give rise to reporter DNA molecules that contain the combined sequence information from the two DNA strands. The ligation reactions also serve to incorporate a sample barcode in the reporter molecules to allow for direct comparison between pairs of samples. The samples are evaluated using a dual tag microarray where information is decoded, revealing which pairs of tags have become joined. As a proof of concept, we demonstrate that this approach can be used to detect a set of five proteins and their pairwise interactions both in cellular lysates and in fixed tissue culture cells. This paper provides a general strategy to analyze the extent of any pairwise interactions in large sets of molecules by decoding reporter DNA strands that identify the interacting molecules.

  15. Sea-level rise and archaeological site destruction: An example from the southeastern United States using DINAA (Digital Index of North American Archaeology).

    PubMed

    Anderson, David G; Bissett, Thaddeus G; Yerka, Stephen J; Wells, Joshua J; Kansa, Eric C; Kansa, Sarah W; Myers, Kelsey Noack; DeMuth, R Carl; White, Devin A

    2017-01-01

    The impact of changing climate on terrestrial and underwater archaeological sites, historic buildings, and cultural landscapes can be examined through quantitatively-based analyses encompassing large data samples and broad geographic and temporal scales. The Digital Index of North American Archaeology (DINAA) is a multi-institutional collaboration that allows researchers online access to linked heritage data from multiple sources and data sets. The effects of sea-level rise and concomitant human population relocation are examined using a sample from nine states encompassing much of the Gulf and Atlantic coasts of the southeastern United States. A 1 m rise in sea-level will result in the loss of >13,000 recorded historic and prehistoric archaeological sites, as well as over 1000 locations currently eligible for inclusion on the National Register of Historic Places (NRHP), encompassing archaeological sites, standing structures, and other cultural properties. These numbers increase substantially with each additional 1 m rise in sea level, with >32,000 archaeological sites and >2400 NRHP properties lost should a 5 m rise occur. Many more unrecorded archaeological and historic sites will also be lost as large areas of the landscape are flooded. The displacement of millions of people due to rising seas will cause additional impacts where these populations resettle. Sea level rise will thus result in the loss of much of the record of human habitation of the coastal margin in the Southeast within the next one to two centuries, and the numbers indicate the magnitude of the impact on the archaeological record globally. Construction of large linked data sets is essential to developing procedures for sampling, triage, and mitigation of these impacts.

  16. Sea-level rise and archaeological site destruction: An example from the southeastern United States using DINAA (Digital Index of North American Archaeology)

    PubMed Central

    Wells, Joshua J.; Kansa, Eric C.; Kansa, Sarah W.; Myers, Kelsey Noack; DeMuth, R. Carl; White, Devin A.

    2017-01-01

    The impact of changing climate on terrestrial and underwater archaeological sites, historic buildings, and cultural landscapes can be examined through quantitatively-based analyses encompassing large data samples and broad geographic and temporal scales. The Digital Index of North American Archaeology (DINAA) is a multi-institutional collaboration that allows researchers online access to linked heritage data from multiple sources and data sets. The effects of sea-level rise and concomitant human population relocation are examined using a sample from nine states encompassing much of the Gulf and Atlantic coasts of the southeastern United States. A 1 m rise in sea-level will result in the loss of >13,000 recorded historic and prehistoric archaeological sites, as well as over 1000 locations currently eligible for inclusion on the National Register of Historic Places (NRHP), encompassing archaeological sites, standing structures, and other cultural properties. These numbers increase substantially with each additional 1 m rise in sea level, with >32,000 archaeological sites and >2400 NRHP properties lost should a 5 m rise occur. Many more unrecorded archaeological and historic sites will also be lost as large areas of the landscape are flooded. The displacement of millions of people due to rising seas will cause additional impacts where these populations resettle. Sea level rise will thus result in the loss of much of the record of human habitation of the coastal margin in the Southeast within the next one to two centuries, and the numbers indicate the magnitude of the impact on the archaeological record globally. Construction of large linked data sets is essential to developing procedures for sampling, triage, and mitigation of these impacts. PMID:29186200

  17. Cryptic diversity and discordance in single-locus species delimitation methods within horned lizards (Phrynosomatidae: Phrynosoma).

    PubMed

    Blair, Christopher; Bryson, Robert W

    2017-11-01

    Biodiversity reduction and loss continues to progress at an alarming rate, and thus, there is widespread interest in utilizing rapid and efficient methods for quantifying and delimiting taxonomic diversity. Single-locus species delimitation methods have become popular, in part due to the adoption of the DNA barcoding paradigm. These techniques can be broadly classified into tree-based and distance-based methods depending on whether species are delimited based on a constructed genealogy. Although the relative performance of these methods has been tested repeatedly with simulations, additional studies are needed to assess congruence with empirical data. We compiled a large data set of mitochondrial ND4 sequences from horned lizards (Phrynosoma) to elucidate congruence using four tree-based (single-threshold GMYC, multiple-threshold GMYC, bPTP, mPTP) and one distance-based (ABGD) species delimitation models. We were particularly interested in cases with highly uneven sampling and/or large differences in intraspecific diversity. Results showed a high degree of discordance among methods, with multiple-threshold GMYC and bPTP suggesting an unrealistically high number of species (29 and 26 species within the P. douglasii complex alone). The single-threshold GMYC model was the most conservative, likely a result of difficulty in locating the inflection point in the genealogies. mPTP and ABGD appeared to be the most stable across sampling regimes and suggested the presence of additional cryptic species that warrant further investigation. These results suggest that the mPTP model may be preferable in empirical data sets with highly uneven sampling or large differences in effective population sizes of species. © 2017 John Wiley & Sons Ltd.

  18. Implications of High Molecular Divergence of Nuclear rRNA and Phylogenetic Structure for the Dinoflagellate Prorocentrum (Dinophyceae, Prorocentrales).

    PubMed

    Boopathi, Thangavelu; Faria, Daphne Georgina; Cheon, Ju-Yong; Youn, Seok Hyun; Ki, Jang-Seu

    2015-01-01

    The small and large nuclear subunit molecular phylogeny of the genus Prorocentrum demonstrated that the species are dichotomized into two clades. These two clades were significantly different (one-factor ANOVA, p < 0.01), with patterns compatible for both small and large subunit Bayesian phylogenetic trees, and for a larger taxon-sampled dinoflagellate phylogeny. Evaluation of the molecular divergence levels showed that intraspecies genetic variations were significantly lower (t-test, p < 0.05) than interspecies variations (> 2.9% and > 26.8% dissimilarity in the small and large subunit [D1/D2], respectively). Based on the calculated molecular divergence, the genus comprises two genetically distinct groups that should be considered as two separate genera, thereby setting the pace for major systematic changes for the genus Prorocentrum sensu Dodge. Moreover, the information presented in this study would be useful for improving species identification and the detection of novel clades from environmental samples. © 2015 The Author(s) Journal of Eukaryotic Microbiology © 2015 International Society of Protistologists.

  19. Statistical Analysis of a Large Sample Size Pyroshock Test Data Set Including Post Flight Data Assessment. Revision 1

    NASA Technical Reports Server (NTRS)

    Hughes, William O.; McNelis, Anne M.

    2010-01-01

    The Earth Observing System (EOS) Terra spacecraft was launched on an Atlas IIAS launch vehicle on its mission to observe planet Earth in late 1999. Prior to launch, the new design of the spacecraft's pyroshock separation system was characterized by a series of 13 separation ground tests. The analysis methods used to evaluate this unusually large amount of shock data will be discussed in this paper, with particular emphasis on population distributions and finding statistically significant families of data, leading to an overall shock separation interface level. The wealth of ground test data also allowed a derivation of a Mission Assurance level for the flight. All of the flight shock measurements were below the EOS Terra Mission Assurance level thus contributing to the overall success of the EOS Terra mission. The effectiveness of the statistical methodology for characterizing the shock interface level and for developing a flight Mission Assurance level from a large sample size of shock data is demonstrated in this paper.

  20. State-space reduction and equivalence class sampling for a molecular self-assembly model.

    PubMed

    Packwood, Daniel M; Han, Patrick; Hitosugi, Taro

    2016-07-01

    Direct simulation of a model with a large state space will generate enormous volumes of data, much of which is not relevant to the questions under study. In this paper, we consider a molecular self-assembly model as a typical example of a large state-space model, and present a method for selectively retrieving 'target information' from this model. This method partitions the state space into equivalence classes, as identified by an appropriate equivalence relation. The set of equivalence classes H, which serves as a reduced state space, contains none of the superfluous information of the original model. After construction and characterization of a Markov chain with state space H, the target information is efficiently retrieved via Markov chain Monte Carlo sampling. This approach represents a new breed of simulation techniques which are highly optimized for studying molecular self-assembly and, moreover, serves as a valuable guideline for analysis of other large state-space models.
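
    The reduced-state-space idea can be illustrated with a generic Metropolis sampler over a handful of equivalence classes; the class energies in the Python sketch below are arbitrary placeholders, not the self-assembly model of the paper.

        # Generic Metropolis sampling on a reduced state space of equivalence classes.
        # Class energies are arbitrary placeholders for illustration only.
        import numpy as np

        rng = np.random.default_rng(1)
        n_classes = 10
        energy = rng.uniform(0.0, 2.0, n_classes)   # assumed energy per class
        beta = 1.0                                  # inverse temperature

        state = 0
        counts = np.zeros(n_classes, dtype=int)
        for _ in range(50_000):
            proposal = rng.integers(n_classes)      # symmetric uniform proposal
            if rng.random() < np.exp(-beta * (energy[proposal] - energy[state])):
                state = proposal
            counts[state] += 1

        print(counts / counts.sum())                # estimated class occupancies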

  1. A new tool called DISSECT for analysing large genomic data sets using a Big Data approach

    PubMed Central

    Canela-Xandri, Oriol; Law, Andy; Gray, Alan; Woolliams, John A.; Tenesa, Albert

    2015-01-01

    Large-scale genetic and genomic data are increasingly available, and the major bottleneck in their analysis is a lack of sufficiently scalable computational tools. To address this problem in the context of complex traits analysis, we present DISSECT. DISSECT is a new and freely available software tool that is able to exploit the distributed-memory parallel computational architectures of compute clusters to perform a wide range of genomic and epidemiologic analyses, which currently can only be carried out on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of our new tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ∼4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum could be achieved with large sample sizes. PMID:26657010

  2. Partitioning heritability by functional annotation using genome-wide association summary statistics.

    PubMed

    Finucane, Hilary K; Bulik-Sullivan, Brendan; Gusev, Alexander; Trynka, Gosia; Reshef, Yakir; Loh, Po-Ru; Anttila, Verneri; Xu, Han; Zang, Chongzhi; Farh, Kyle; Ripke, Stephan; Day, Felix R; Purcell, Shaun; Stahl, Eli; Lindstrom, Sara; Perry, John R B; Okada, Yukinori; Raychaudhuri, Soumya; Daly, Mark J; Patterson, Nick; Neale, Benjamin M; Price, Alkes L

    2015-11-01

    Recent work has demonstrated that some functional categories of the genome contribute disproportionately to the heritability of complex diseases. Here we analyze a broad set of functional elements, including cell type-specific elements, to estimate their polygenic contributions to heritability in genome-wide association studies (GWAS) of 17 complex diseases and traits with an average sample size of 73,599. To enable this analysis, we introduce a new method, stratified LD score regression, for partitioning heritability from GWAS summary statistics while accounting for linked markers. This new method is computationally tractable at very large sample sizes and leverages genome-wide information. Our findings include a large enrichment of heritability in conserved regions across many traits, a very large immunological disease-specific enrichment of heritability in FANTOM5 enhancers and many cell type-specific enrichments, including significant enrichment of central nervous system cell types in the heritability of body mass index, age at menarche, educational attainment and smoking behavior.

  3. An Independent Filter for Gene Set Testing Based on Spectral Enrichment.

    PubMed

    Frost, H Robert; Li, Zhigang; Asselbergs, Folkert W; Moore, Jason H

    2015-01-01

    Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
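
    As a heavily simplified sketch of the filtering idea (not the SGSF statistic itself), the snippet below, assuming NumPy, SciPy and scikit-learn and using random data with hypothetical gene-set definitions, scores each gene set by the p-value of the correlation between its mean expression and the first sample principal component, and keeps only sets passing the filter.

        # Simplified illustration of PC-based gene set filtering (not the full SGSF method).
        # Random data and the gene-set definitions are hypothetical.
        import numpy as np
        from scipy.stats import pearsonr
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        expr = rng.standard_normal((50, 2000))        # 50 samples x 2000 genes
        gene_sets = [rng.choice(2000, 30, replace=False) for _ in range(100)]

        pc1 = PCA(n_components=1).fit_transform(expr).ravel()   # first sample PC

        kept = []
        for idx, genes in enumerate(gene_sets):
            set_score = expr[:, genes].mean(axis=1)   # mean expression per sample
            _, p = pearsonr(set_score, pc1)
            if p < 0.05:                              # filter statistic threshold
                kept.append(idx)

        print(len(kept), "of", len(gene_sets), "gene sets pass the filter")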

  4. Blood Gene Expression Predicts Bronchiolitis Obliterans Syndrome

    PubMed Central

    Danger, Richard; Royer, Pierre-Joseph; Reboulleau, Damien; Durand, Eugénie; Loy, Jennifer; Tissot, Adrien; Lacoste, Philippe; Roux, Antoine; Reynaud-Gaubert, Martine; Gomez, Carine; Kessler, Romain; Mussot, Sacha; Dromer, Claire; Brugière, Olivier; Mornex, Jean-François; Guillemain, Romain; Dahan, Marcel; Knoop, Christiane; Botturi, Karine; Foureau, Aurore; Pison, Christophe; Koutsokera, Angela; Nicod, Laurent P.; Brouard, Sophie; Magnan, Antoine; Jougon, J.

    2018-01-01

    Bronchiolitis obliterans syndrome (BOS), the main manifestation of chronic lung allograft dysfunction, leads to poor long-term survival after lung transplantation. Identifying predictors of BOS is essential to prevent the progression of dysfunction before irreversible damage occurs. By using a large set of 107 samples from lung recipients, we performed microarray gene expression profiling of whole blood to identify early biomarkers of BOS, including samples from 49 patients with stable function for at least 3 years, 32 samples collected at least 6 months before BOS diagnosis (prediction group), and 26 samples at or after BOS diagnosis (diagnosis group). An independent set from 25 lung recipients was used for validation by quantitative PCR (13 stables, 11 in the prediction group, and 8 in the diagnosis group). We identified 50 transcripts differentially expressed between stable and BOS recipients. Three genes, namely POU class 2 associating factor 1 (POU2AF1), T-cell leukemia/lymphoma protein 1A (TCL1A), and B cell lymphocyte kinase, were validated as predictive biomarkers of BOS more than 6 months before diagnosis, with areas under the curve of 0.83, 0.77, and 0.78 respectively. These genes allow stratification based on BOS risk (log-rank test p < 0.01) and are not associated with time posttransplantation. This is the first published large-scale gene expression analysis of blood after lung transplantation. The three-gene blood signature could provide clinicians with new tools to improve follow-up and adapt treatment of patients likely to develop BOS. PMID:29375549

  5. Distribution and sources of surfzone bacteria at Huntington Beach before and after disinfection on an ocean outfall - A frequency-domain analysis

    USGS Publications Warehouse

    Noble, M.A.; Xu, J. P.; Robertson, G.L.; Rosenfeld, L.K.

    2006-01-01

    Fecal indicator bacteria (FIB) were measured approximately 5 days a week in ankle-depth water at 19 surfzone stations along Huntington Beach and Newport Beach, California, from 1998 to the end of 2003. These sampling periods span the time before and after treated sewage effluent, discharged into the coastal ocean from the local outfall, was disinfected. Bacterial samples were also taken in the vicinity of the outfall during the pre- and post-disinfection periods. Our analysis of the results from both data sets suggests that land-based sources, rather than the local outfall, were the source of the FIB responsible for the frequent closures and postings of local beaches in the summers of 2001 and 2002. Because the annual cycle is the dominant frequency in the fecal and total coliform data sets at most sampling stations, we infer that sources associated with local runoff were responsible for the majority of coliform contamination along wide stretches of the beach. The dominant fortnightly cycle in enterococci at many surfzone sampling stations suggests that the source for these relatively frequent bacterial contamination events in summer is related to the wetting and draining of the land due to the large tidal excursions found during spring tides. Along the most frequently closed section of the beach at stations 3N-15N, the fortnightly cycle is dominant in all FIBs. The strikingly different spatial and spectral patterns found in coliform and in enterococci suggest the presence of different sources, at least for large sections of beach. The presence of a relatively large enterococci fortnightly cycle along the beaches near Newport Harbor indicates that contamination sources similar to those found off Huntington Beach are present, though not at high enough levels to close the Newport beaches. © 2006 Elsevier Ltd. All rights reserved.

  6. Rotating magnetic field experiments in a pure superconducting Pb sphere

    NASA Astrophysics Data System (ADS)

    Vélez, Saül; García-Santiago, Antoni; Hernandez, Joan Manel; Tejada, Javier

    2009-10-01

    The magnetic properties of a sphere of pure type-I superconducting lead (Pb) under rotating magnetic fields have been investigated under different experimental conditions by measuring the voltage generated in a set of detection coils by the response of the sample to the time variation in the magnetic field. The influence of the frequency of rotation of the magnet, the time it takes to record each data point and the temperature of the sample during the measuring process is explored. A strong reduction in the thermodynamic critical field and the onset of hysteretic effects in the magnetic field dependence of the amplitude of the magnetic susceptibility are observed for large frequencies and large values of the recording time. Heating of the sample during the motion of normal zones in the intermediate state and the dominance of a resistive term in the contribution of Lenz's law to the magnetic susceptibility in the normal state under time-varying magnetic fields are suggested as possible explanations for these effects.

  7. Extreme-value dependence: An application to exchange rate markets

    NASA Astrophysics Data System (ADS)

    Fernandez, Viviana

    2007-04-01

    Extreme value theory (EVT) focuses on modeling the tail behavior of a loss distribution using only extreme values rather than the whole data set. For a sample of 10 countries with dirty/free float regimes, we investigate whether paired currencies exhibit a pattern of asymptotic dependence. That is, whether an extremely large appreciation or depreciation in the nominal exchange rate of one country might transmit to another. In general, after controlling for volatility clustering and inertia in returns, we do not find evidence of extreme-value dependence between paired exchange rates. However, for asymptotic-independent paired returns, we find that tail dependency of exchange rates is stronger under large appreciations than under large depreciations.

  8. Data splitting for artificial neural networks using SOM-based stratified sampling.

    PubMed

    May, R J; Maier, H R; Dandy, G C

    2010-03-01

    Data splitting is an important consideration during artificial neural network (ANN) development where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between data sets. Of these approaches, DUPLEX is found to provide benchmark performance, yielding good model performance with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets. Copyright 2009 Elsevier Ltd. All rights reserved.
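
    A minimal sketch of SOM-based stratified splitting is given below; it assumes the third-party MiniSom package and uses simple proportional allocation per SOM cell rather than the Neyman allocation developed in the paper, with all sizes chosen arbitrarily.

        # Sketch of SOM-based stratified data splitting (assumes the MiniSom package).
        # Proportional allocation per cell is a simplification of Neyman allocation.
        import numpy as np
        from minisom import MiniSom

        rng = np.random.default_rng(0)
        data = rng.standard_normal((500, 4))          # toy data, 4 input variables

        som = MiniSom(5, 5, 4, sigma=1.0, learning_rate=0.5, random_seed=0)
        som.train_random(data, 1000)

        strata = {}                                   # group samples by winning SOM cell
        for idx, x in enumerate(data):
            strata.setdefault(som.winner(x), []).append(idx)

        test_idx = []                                 # draw ~20% from each stratum
        for members in strata.values():
            k = max(1, round(0.2 * len(members)))
            test_idx.extend(rng.choice(members, size=k, replace=False))

        train_idx = np.setdiff1d(np.arange(len(data)), test_idx)
        print(len(train_idx), "training /", len(test_idx), "testing samples")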

  9. Constructing a Reward-Related Quality of Life Statistic in Daily Life-a Proof of Concept Study Using Positive Affect.

    PubMed

    Verhagen, Simone J W; Simons, Claudia J P; van Zelst, Catherine; Delespaul, Philippe A E G

    2017-01-01

    Background: Mental healthcare needs person-tailored interventions. Experience Sampling Method (ESM) can provide daily life monitoring of personal experiences. This study aims to operationalize and test a measure of momentary reward-related Quality of Life (rQoL). Intuitively, quality of life improves by spending more time on rewarding experiences. ESM clinical interventions can use this information to coach patients to find a realistic, optimal balance of positive experiences (maximize reward) in daily life. rQoL combines the frequency of engaging in a relevant context (a 'behavior setting') with concurrent (positive) affect. High rQoL occurs when the most frequent behavior settings are combined with positive affect or infrequent behavior settings co-occur with low positive affect. Methods: Resampling procedures (Monte Carlo experiments) were applied to assess the reliability of rQoL using various behavior setting definitions under different sampling circumstances, for real or virtual subjects with low-, average- and high contextual variability. Furthermore, resampling was used to assess whether rQoL is a distinct concept from positive affect. Virtual ESM beep datasets were extracted from 1,058 valid ESM observations for virtual and real subjects. Results: Behavior settings defined by Who-What contextual information were most informative. Simulations of at least 100 ESM observations are needed for reliable assessment. Virtual ESM beep datasets of a real subject can be defined by Who-What-Where behavior setting combinations. Large sample sizes are necessary for reliable rQoL assessments, except for subjects with low contextual variability. rQoL is distinct from positive affect. Conclusion: rQoL is a feasible concept. Monte Carlo experiments should be used to assess the reliable implementation of an ESM statistic. Future research in ESM should assess the behavior of summary statistics under different sampling situations. This exploration is especially relevant in clinical implementation, where often only small datasets are available.

  10. Evolutionary fuzzy ARTMAP neural networks for classification of semiconductor defects.

    PubMed

    Tan, Shing Chiang; Watada, Junzo; Ibrahim, Zuwairie; Khalid, Marzuki

    2015-05-01

    Wafer defect detection using an intelligent system is an approach to quality improvement in semiconductor manufacturing that aims to enhance process stability, increase production capacity, and improve yields. Occasionally, only a few records that indicate defective units are available, and they are classified as a minority group in a large database. Such a situation leads to an imbalanced data set problem, which poses a great challenge for machine-learning techniques to obtain an effective solution. In addition, the database may comprise overlapping samples of different classes. This paper introduces two models of evolutionary fuzzy ARTMAP (FAM) neural networks to deal with imbalanced data set problems in semiconductor manufacturing operations. In particular, both the FAM models and hybrid genetic algorithms are integrated in the proposed evolutionary artificial neural networks (EANNs) to classify an imbalanced data set. In addition, one of the proposed EANNs incorporates a facility to learn overlapping samples of different classes from the imbalanced data environment. The classification results of the proposed evolutionary FAM neural networks are presented, compared, and analyzed using several classification metrics. The outcomes positively indicate the effectiveness of the proposed networks in handling classification problems with imbalanced data sets.

  11. The Mira-Titan Universe. II. Matter Power Spectrum Emulation

    NASA Astrophysics Data System (ADS)

    Lawrence, Earl; Heitmann, Katrin; Kwan, Juliana; Upadhye, Amol; Bingham, Derek; Habib, Salman; Higdon, David; Pope, Adrian; Finkel, Hal; Frontiere, Nicholas

    2017-09-01

    We introduce a new cosmic emulator for the matter power spectrum covering eight cosmological parameters. Targeted at optical surveys, the emulator provides accurate predictions out to a wavenumber k ∼ 5 Mpc^-1 and redshift z ≤ 2. In addition to covering the standard set of ΛCDM parameters, massive neutrinos and a dynamical dark energy equation of state are included. The emulator is built on a sample set of 36 cosmological models, carefully chosen to provide accurate predictions over the wide and large parameter space. For each model, we have performed a high-resolution simulation, augmented with 16 medium-resolution simulations and TimeRG perturbation theory results to provide accurate coverage over a wide k-range; the data set generated as part of this project is more than 1.2 Pbytes. With the current set of simulated models, we achieve an accuracy of approximately 4%. Because the sampling approach used here has established convergence and error-control properties, follow-up results with more than a hundred cosmological models will soon achieve ∼1% accuracy. We compare our approach with other prediction schemes that are based on halo model ideas and remapping approaches. The new emulator code is publicly available.
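
    The emulation idea can be illustrated schematically with a Gaussian-process surrogate fitted to a small design over a normalized parameter space (assuming scikit-learn); the toy "simulation output" below stands in for the far more elaborate Mira-Titan construction and is not a power spectrum.

        # Toy Gaussian-process emulator over a 2-D normalized parameter space.
        # The "simulation output" is a stand-in function, not a power spectrum.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        rng = np.random.default_rng(0)
        theta = rng.uniform(0.0, 1.0, size=(36, 2))   # 36 sampled design points
        y = np.sin(3.0 * theta[:, 0]) + 0.5 * theta[:, 1] ** 2

        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), normalize_y=True)
        gp.fit(theta, y)

        theta_new = np.array([[0.4, 0.7]])            # a new "cosmology" to predict
        mean, std = gp.predict(theta_new, return_std=True)
        print(mean[0], std[0])                        # emulator prediction and uncertainty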

  12. The Mira-Titan Universe. II. Matter Power Spectrum Emulation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lawrence, Earl; Heitmann, Katrin; Kwan, Juliana

    We introduce a new cosmic emulator for the matter power spectrum covering eight cosmological parameters. Targeted at optical surveys, the emulator provides accurate predictions out to a wavenumber k ∼ 5 Mpc(-1) and redshift z <= 2. In addition to covering the standard set of Lambda CDM parameters, massive neutrinos and a dynamical dark energy equation of state are included. The emulator is built on a sample set of 36 cosmological models, carefully chosen to provide accurate predictions over the wide and large parameter space. For each model, we have performed a high-resolution simulation, augmented with 16 medium-resolution simulations and TimeRG perturbation theory results to provide accurate coverage over a wide k-range; the data set generated as part of this project is more than 1.2 Pbytes. With the current set of simulated models, we achieve an accuracy of approximately 4%. Because the sampling approach used here has established convergence and error-control properties, follow-up results with more than a hundred cosmological models will soon achieve ∼1% accuracy. We compare our approach with other prediction schemes that are based on halo model ideas and remapping approaches.

  13. Can multi-subpopulation reference sets improve the genomic predictive ability for pigs?

    PubMed

    Fangmann, A; Bergfelder-Drüing, S; Tholen, E; Simianer, H; Erbe, M

    2015-12-01

    In most countries and for most livestock species, genomic evaluations are obtained from within-breed analyses. To achieve reliable breeding values, however, a sufficient reference sample size is essential. To increase this size, the use of multibreed reference populations for small populations is considered a suitable option in other species. Over decades, the separate breeding work of different pig breeding organizations in Germany has led to stratified subpopulations in the breed German Large White. Due to this fact and the limited number of Large White animals available in each organization, there was a pressing need for ascertaining if multi-subpopulation genomic prediction is superior to within-subpopulation prediction in pigs. Direct genomic breeding values were estimated with genomic BLUP for the trait "number of piglets born alive" using genotype data (Illumina Porcine 60K SNP BeadChip) from 2,053 German Large White animals from five different commercial pig breeding companies. To assess the prediction accuracy of within- and multi-subpopulation reference sets, a random 5-fold cross-validation with 20 replications was performed. The five subpopulations considered were only slightly differentiated from each other. However, the prediction accuracy of the multi-subpopulation approach was not better than that of the within-subpopulation evaluation, for which the predictive ability was already high. Reference sets composed of closely related multi-subpopulation sets performed better than sets of distantly related subpopulations but not better than the within-subpopulation approach. Despite the low differentiation of the five subpopulations, the genetic connectedness between these different subpopulations seems to be too small to improve the prediction accuracy by applying multi-subpopulation reference sets. Consequently, resources should be used for enlarging the reference population within subpopulation, for example, by adding genotyped females.

  14. HLA imputation in an admixed population: An assessment of the 1000 Genomes data as a training set.

    PubMed

    Nunes, Kelly; Zheng, Xiuwen; Torres, Margareth; Moraes, Maria Elisa; Piovezan, Bruno Z; Pontes, Gerlandia N; Kimura, Lilian; Carnavalli, Juliana E P; Mingroni Netto, Regina C; Meyer, Diogo

    2016-03-01

    Methods to impute HLA alleles based on dense single nucleotide polymorphism (SNP) data provide a valuable resource to association studies and evolutionary investigation of the MHC region. The availability of appropriate training sets is critical to the accuracy of HLA imputation, and the inclusion of samples with various ancestries is an important pre-requisite in studies of admixed populations. We assess the accuracy of HLA imputation using 1000 Genomes Project data as a training set, applying it to a highly admixed Brazilian population, the Quilombos from the state of São Paulo. To assess accuracy, we compared imputed and experimentally determined genotypes for 146 samples at 4 HLA classical loci. We found imputation accuracies of 82.9%, 81.8%, 94.8% and 86.6% for HLA-A, -B, -C and -DRB1 respectively (two-field resolution). Accuracies were improved when we included a subset of Quilombo individuals in the training set. We conclude that the 1000 Genomes data is a valuable resource for construction of training sets due to the diversity of ancestries and the potential for a large overlap of SNPs with the target population. We also show that tailoring training sets to features of the target population substantially enhances imputation accuracy. Copyright © 2016 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.

  15. Conceptual data sampling for breast cancer histology image classification.

    PubMed

    Rezk, Eman; Awan, Zainab; Islam, Fahad; Jaoua, Ali; Al Maadeed, Somaya; Zhang, Nan; Das, Gautam; Rajpoot, Nasir

    2017-10-01

    Data analytics have become increasingly complicated as the amount of data has increased. One technique that is used to enable data analytics in large datasets is data sampling, in which a portion of the data is selected to preserve the data characteristics for use in data analytics. In this paper, we introduce a novel data sampling technique that is rooted in formal concept analysis theory. This technique is used to create samples based on the data distribution across a set of binary patterns. The proposed sampling technique is applied in classifying the regions of breast cancer histology images as malignant or benign. The performance of our method is compared to other classical sampling methods. The results indicate that our method is efficient and generates an illustrative sample of small size. It is also competitive with other sampling methods in terms of sample size and sample quality, as reflected in classification accuracy and F1 measure. Copyright © 2017 Elsevier Ltd. All rights reserved.
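
    A simplified illustration of sampling that preserves the distribution over binary attribute patterns (capturing only the spirit of the concept-analysis approach, not its formal machinery) is sketched below with invented data and an arbitrary binarization threshold.

        # Pattern-preserving sampling sketch: binarize features, group rows by pattern,
        # then sample each pattern group proportionally. Data and threshold are invented.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.random((1000, 6))
        patterns = (X > 0.5).astype(int)              # binary attribute patterns

        groups = {}
        for i, row in enumerate(patterns):
            groups.setdefault(tuple(row), []).append(i)

        sample_idx = []
        for members in groups.values():
            k = max(1, round(0.1 * len(members)))     # keep about 10% of each group
            sample_idx.extend(rng.choice(members, size=k, replace=False))

        print(len(sample_idx), "of", len(X), "rows retained in the sample")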

  16. Sampling procedures for throughfall monitoring: A simulation study

    NASA Astrophysics Data System (ADS)

    Zimmermann, Beate; Zimmermann, Alexander; Lark, Richard Murray; Elsenbeer, Helmut

    2010-01-01

    What is the most appropriate sampling scheme to estimate event-based average throughfall? A satisfactory answer to this seemingly simple question has yet to be found, a failure which we attribute to previous efforts' dependence on empirical studies. Here we try to answer this question by simulating stochastic throughfall fields based on parameters for statistical models of large monitoring data sets. We subsequently sampled these fields with different sampling designs and variable sample supports. We evaluated the performance of a particular sampling scheme with respect to the uncertainty of possible estimated means of throughfall volumes. Even for a relative error limit of 20%, an impractically large number of small, funnel-type collectors would be required to estimate mean throughfall, particularly for small events. While stratification of the target area is not superior to simple random sampling, cluster random sampling involves the risk of being less efficient. A larger sample support, e.g., the use of trough-type collectors, considerably reduces the necessary sample sizes and eliminates the sensitivity of the mean to outliers. Since the gain in time associated with the manual handling of troughs versus funnels depends on the local precipitation regime, the employment of automatically recording clusters of long troughs emerges as the most promising sampling scheme. Even so, a relative error of less than 5% appears out of reach for throughfall under heterogeneous canopies. We therefore suspect a considerable uncertainty of input parameters for interception models derived from measured throughfall, in particular, for those requiring data of small throughfall events.
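
    The flavor of such a simulation can be conveyed with a small Monte Carlo check of how the relative error of the estimated event mean shrinks with the number of collectors; the lognormal variability used below is invented for illustration rather than fitted to monitoring data as in the study.

        # Schematic Monte Carlo: relative error of the mean vs. number of collectors.
        # The lognormal variability (cv) is an invented stand-in for fitted parameters.
        import numpy as np

        rng = np.random.default_rng(0)

        def rel_error_q95(n_collectors, n_events=2000, cv=0.6):
            """95th percentile of |sample mean - true mean| / true mean."""
            sigma = np.sqrt(np.log(1.0 + cv ** 2))
            mu = -0.5 * sigma ** 2                    # lognormal with true mean 1
            draws = rng.lognormal(mu, sigma, size=(n_events, n_collectors))
            return np.quantile(np.abs(draws.mean(axis=1) - 1.0), 0.95)

        for n in (5, 10, 25, 50, 100):
            print(n, "collectors:", round(rel_error_q95(n), 3))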

  17. Sample presentation, sources of error and future perspectives on the application of vibrational spectroscopy in the wine industry.

    PubMed

    Cozzolino, Daniel

    2015-03-30

    Vibrational spectroscopy encompasses a number of techniques and methods including ultra-violet, visible, Fourier transform infrared or mid infrared, near infrared and Raman spectroscopy. The use and application of spectroscopy generates spectra containing hundreds of variables (absorbances at each wavenumber or wavelength), resulting in the production of large data sets representing the chemical and biochemical wine fingerprint. Multivariate data analysis techniques are then required to handle the large amount of data generated and to interpret the spectra in a meaningful way in order to develop a specific application. This paper focuses on developments in sample presentation and the main sources of error when vibrational spectroscopy methods are applied in wine analysis. Recent and novel applications will be discussed as examples of these developments. © 2014 Society of Chemical Industry.

  18. Gibbs Ensembles for Nearly Compatible and Incompatible Conditional Models

    PubMed Central

    Chen, Shyh-Huei; Wang, Yuchung J.

    2010-01-01

    The Gibbs sampler has been used exclusively for compatible conditionals that converge to a unique invariant joint distribution. However, conditional models are not always compatible. In this paper, a Gibbs sampling-based approach, the Gibbs ensemble, is proposed to search for a joint distribution that deviates least from a prescribed set of conditional distributions. The algorithm is easily scalable, such that it can handle large data sets of high dimensionality. Using simulated data, we show that the proposed approach provides joint distributions that are less discrepant from the incompatible conditionals than those obtained by other methods discussed in the literature. The ensemble approach is also applied to a data set regarding geno-polymorphism and response to chemotherapy in patients with metastatic colorectal cancer. PMID:21286232
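
    A minimal two-variable Gibbs scan over binary conditionals illustrates the machinery the ensemble builds on; the conditional tables below are arbitrary, and the snippet does not implement the discrepancy search itself.

        # Minimal two-variable Gibbs sampler for binary X, Y given conditional tables.
        # The tables are arbitrary; the Gibbs-ensemble discrepancy search is not shown.
        import numpy as np

        rng = np.random.default_rng(0)
        p_x_given_y = {0: 0.3, 1: 0.7}   # P(X = 1 | Y = y)
        p_y_given_x = {0: 0.4, 1: 0.6}   # P(Y = 1 | X = x)

        x, y = 0, 0
        counts = np.zeros((2, 2))
        for _ in range(100_000):
            x = int(rng.random() < p_x_given_y[y])
            y = int(rng.random() < p_y_given_x[x])
            counts[x, y] += 1

        print(counts / counts.sum())      # empirical joint implied by this scan order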

  19. The brominated flame retardants, PBDEs and HBCD, in Canadian human milk samples collected from 1992 to 2005; concentrations and trends.

    PubMed

    Ryan, John Jake; Rawn, Dorothea F K

    2014-09-01

    Human milk samples were collected from individuals residing in various regions across Canada mostly in the years 1992 to 2005. These included five large cities in southern Canada as well as samples from Nunavik in northern Quebec. Comparative samples were also collected from residents of Austin, Texas, USA in 2002 and 2004. More than 300 milk samples were analysed for the brominated flame retardants (BFRs), PBDEs and HBCD, by extraction, purification and quantification using either isotope dilution gas chromatography-mass spectrometry (GC-MS) or liquid chromatography-MS. The Canadian total PBDE values in the years 2002-2005 show median levels of about 20 μg/kg on a lipid basis, a value significantly higher than in the 1980s and 1990s. Milk samples from Inuit donors in the northern region of Nunavik were slightly lower in PBDE concentrations than those from populated regions in the south of Quebec. Milk samples from Ontario contained slightly lower amounts of PBDEs in two time periods than those from Texas. HBCD levels in most milk samples were usually less than 1 ppb on a milk-lipid basis and were dominated by the α-isomer. This large data set of BFRs in Canadian human milk demonstrates an increase in the last few decades in human exposure to BFRs, which now appears to have stabilized. Crown Copyright © 2014. Published by Elsevier Ltd. All rights reserved.

  20. Comparison of diagnostic techniques for the detection of Cryptosporidium oocysts in animal samples

    PubMed Central

    Mirhashemi, Marzieh Ezzaty; Zintl, Annetta; Grant, Tim; Lucy, Frances E.; Mulcahy, Grace; De Waal, Theo

    2015-01-01

    While a large number of laboratory methods for the detection of Cryptosporidium oocysts in faecal samples are now available, their efficacy for identifying asymptomatic cases of cryptosporidiosis is poorly understood. This study was carried out to determine a reliable screening test for epidemiological studies in livestock. In addition, three molecular tests were compared to identify Cryptosporidium species responsible for the infection in cattle, sheep and horses. A variety of diagnostic tests including microscopic (Kinyoun's staining), immunological (Direct Fluorescence Antibody tests or DFAT), enzyme-linked immunosorbent assay (ELISA), and molecular methods (nested PCR) were compared to assess their ability to detect Cryptosporidium in cattle, horse and sheep faecal samples. The results indicate that the sensitivity and specificity of each test are highly dependent on the input samples; while Kinyoun's and DFAT proved to be reliable screening tools for cattle samples, DFAT and PCR analysis (targeted at the 18S rRNA gene fragment) were more sensitive for screening sheep and horse samples. Finally, different PCR primer sets targeting the same region resulted in the preferential amplification of certain Cryptosporidium species when multiple species were present in the sample. Therefore, for identification of Cryptosporidium spp. in the event of asymptomatic cryptosporidiosis, the combination of different 18S rRNA nested PCR primer sets is recommended for further epidemiological applications and also for tracking the sources of infection. PMID:25662435

  1. A comparison of effectiveness of hepatitis B screening and linkage to care among foreign-born populations in clinical and nonclinical settings.

    PubMed

    Chandrasekar, Edwin; Kaur, Ravneet; Song, Sharon; Kim, Karen E

    2015-01-01

    Hepatitis B (HBV) is an urgent, unmet public health issue that affects Asian Americans disproportionately. Of the estimated 1.2 million living with chronic hepatitis B in the USA, more than 50% are of Asian ethnicity, despite the fact that Asian Americans constitute less than 6% of the total US population. The Centers for Disease Control and Prevention recommends HBV screening of persons who are at high risk for the disease. Yet, large numbers of Asian Americans have not been diagnosed or tested, in large part because of perceived cultural and linguistic barriers. Primary care physicians are at the front line of the US health care system, and are in a position to identify individuals and families at risk. Clinical settings integrated into Asian American communities, where physicians are on staff and wellness care is emphasized, can provide testing for HBV. In this study, the Asian Health Coalition and its community partners conducted HBV screenings and follow-up linkage to care in both clinical and nonclinical settings. The nonclinic settings included health fair events organized by churches and social services agencies, and were able to reach large numbers of individuals. Twice as many Asian Americans were screened in nonclinical settings as in health clinics. Chi-square and independent samples t-test showed that participants from the two settings did not differ in test positivity, sex, insurance status, years of residence in the USA, or education. Additionally, the same proportion of individuals found to be infected in the two groups underwent successful linkage to care. Nonclinical settings were as effective as clinical settings in screening for HBV, as well as in making treatment options available to those who tested positive; demographic factors did not confound the similarities. Further research is needed to evaluate if linkage to care can be accomplished equally efficiently on a larger scale.

  2. Predicting ambient aerosol thermal-optical reflectance (TOR) measurements from infrared spectra: organic carbon

    NASA Astrophysics Data System (ADS)

    Dillner, A. M.; Takahama, S.

    2015-03-01

    Organic carbon (OC) can constitute 50% or more of the mass of atmospheric particulate matter. Typically, organic carbon is measured from a quartz fiber filter that has been exposed to a volume of ambient air and analyzed using thermal methods such as thermal-optical reflectance (TOR). Here, methods are presented that show the feasibility of using Fourier transform infrared (FT-IR) absorbance spectra from polytetrafluoroethylene (PTFE or Teflon) filters to accurately predict TOR OC. This work marks an initial step in proposing a method that can reduce the operating costs of large air quality monitoring networks with an inexpensive, non-destructive analysis technique using routinely collected PTFE filter samples which, in addition to OC concentrations, can concurrently provide information regarding the composition of organic aerosol. This feasibility study suggests that the minimum detection limit and errors (or uncertainty) of FT-IR predictions are on par with TOR OC such that evaluation of long-term trends and epidemiological studies would not be significantly impacted. To develop and test the method, FT-IR absorbance spectra are obtained from 794 samples from seven Interagency Monitoring of PROtected Visual Environment (IMPROVE) sites collected during 2011. Partial least-squares regression is used to calibrate sample FT-IR absorbance spectra to TOR OC. The FT-IR spectra are divided into calibration and test sets by sampling site and date. The calibration produces precise and accurate TOR OC predictions of the test set samples by FT-IR as indicated by a high coefficient of determination (R2 = 0.96), low bias (0.02 μg m-3, the nominal IMPROVE sample volume is 32.8 m3), low error (0.08 μg m-3) and low normalized error (11%). These performance metrics can be achieved with various degrees of spectral pretreatment (e.g., including or excluding substrate contributions to the absorbances) and are comparable in precision to collocated TOR measurements. FT-IR spectra are also divided into calibration and test sets by OC mass and by OM / OC ratio, which reflects the organic composition of the particulate matter and is obtained from organic functional group composition; these divisions also lead to precise and accurate OC predictions. Low OC concentrations have higher bias and normalized error due to TOR analytical errors and artifact-correction errors, not due to the range of OC mass of the samples in the calibration set. However, samples with low OC mass can be used to predict samples with high OC mass, indicating that the calibration is linear. Using samples in the calibration set that have different OM / OC or ammonium / OC distributions than the test set leads to only a modest increase in bias and normalized error in the predicted samples. We conclude that FT-IR analysis with partial least-squares regression is a robust method for accurately predicting TOR OC in IMPROVE network samples, providing complementary information to the organic functional group composition and organic aerosol mass estimated previously from the same set of sample spectra (Ruthenburg et al., 2014).
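
    In the same spirit, a partial least-squares calibration from spectra to a reference quantity can be sketched as follows (assuming scikit-learn); the synthetic spectra, target and component count below are purely illustrative and do not reproduce the study's data or pretreatment.

        # Sketch of PLS calibration from absorbance spectra to a reference measurement.
        # Synthetic spectra and target; the number of components is arbitrary.
        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.metrics import r2_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n_samples, n_wavenumbers = 300, 1000
        spectra = rng.standard_normal((n_samples, n_wavenumbers))
        target = 0.01 * spectra[:, :50].sum(axis=1) + rng.normal(0.0, 0.05, n_samples)

        X_cal, X_test, y_cal, y_test = train_test_split(spectra, target, random_state=0)
        pls = PLSRegression(n_components=10)
        pls.fit(X_cal, y_cal)
        print("test R2:", round(r2_score(y_test, pls.predict(X_test).ravel()), 3))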

  3. Definitive Characterization of CA 19-9 in Resectable Pancreatic Cancer Using a Reference Set of Serum and Plasma Specimens.

    PubMed

    Haab, Brian B; Huang, Ying; Balasenthil, Seetharaman; Partyka, Katie; Tang, Huiyuan; Anderson, Michelle; Allen, Peter; Sasson, Aaron; Zeh, Herbert; Kaul, Karen; Kletter, Doron; Ge, Shaokui; Bern, Marshall; Kwon, Richard; Blasutig, Ivan; Srivastava, Sudhir; Frazier, Marsha L; Sen, Subrata; Hollingsworth, Michael A; Rinaudo, Jo Ann; Killary, Ann M; Brand, Randall E

    2015-01-01

    The validation of candidate biomarkers is often hampered by the lack of a reliable means of assessing and comparing performance. We present here a reference set of serum and plasma samples to facilitate the validation of biomarkers for resectable pancreatic cancer. The reference set includes a large cohort of stage I-II pancreatic cancer patients, recruited from 5 different institutions, and relevant control groups. We characterized the performance of the current best serological biomarker for pancreatic cancer, CA 19-9, using plasma samples from the reference set to provide a benchmark for future biomarker studies and to further our knowledge of CA 19-9 in early-stage pancreatic cancer and the control groups. CA 19-9 distinguished pancreatic cancers from the healthy and chronic pancreatitis groups with an average sensitivity and specificity of 70-74%, similar to previous studies using all stages of pancreatic cancer. Chronic pancreatitis patients did not show CA 19-9 elevations, but patients with benign biliary obstruction had elevations nearly as high as the cancer patients. We gained additional information about the biomarker by comparing two distinct assays. The two CA 19-9 assays agreed well in overall performance but diverged in measurements of individual samples, potentially due to subtle differences in antibody specificity as revealed by glycan array analysis. Thus, the reference set promises to be a valuable resource for biomarker validation and comparison, and the CA 19-9 data presented here will be useful for benchmarking and for exploring relationships to CA 19-9.

  4. Definitive Characterization of CA 19-9 in Resectable Pancreatic Cancer Using a Reference Set of Serum and Plasma Specimens

    PubMed Central

    Haab, Brian B.; Huang, Ying; Balasenthil, Seetharaman; Partyka, Katie; Tang, Huiyuan; Anderson, Michelle; Allen, Peter; Sasson, Aaron; Zeh, Herbert; Kaul, Karen; Kletter, Doron; Ge, Shaokui; Bern, Marshall; Kwon, Richard; Blasutig, Ivan; Srivastava, Sudhir; Frazier, Marsha L.; Sen, Subrata; Hollingsworth, Michael A.; Rinaudo, Jo Ann; Killary, Ann M.; Brand, Randall E.

    2015-01-01

    The validation of candidate biomarkers is often hampered by the lack of a reliable means of assessing and comparing performance. We present here a reference set of serum and plasma samples to facilitate the validation of biomarkers for resectable pancreatic cancer. The reference set includes a large cohort of stage I-II pancreatic cancer patients, recruited from 5 different institutions, and relevant control groups. We characterized the performance of the current best serological biomarker for pancreatic cancer, CA 19–9, using plasma samples from the reference set to provide a benchmark for future biomarker studies and to further our knowledge of CA 19–9 in early-stage pancreatic cancer and the control groups. CA 19–9 distinguished pancreatic cancers from the healthy and chronic pancreatitis groups with an average sensitivity and specificity of 70–74%, similar to previous studies using all stages of pancreatic cancer. Chronic pancreatitis patients did not show CA 19–9 elevations, but patients with benign biliary obstruction had elevations nearly as high as the cancer patients. We gained additional information about the biomarker by comparing two distinct assays. The two CA 19–9 assays agreed well in overall performance but diverged in measurements of individual samples, potentially due to subtle differences in antibody specificity as revealed by glycan array analysis. Thus, the reference set promises to be a valuable resource for biomarker validation and comparison, and the CA 19–9 data presented here will be useful for benchmarking and for exploring relationships to CA 19–9. PMID:26431551

  5. The influence of locus number and information content on species delimitation: an empirical test case in an endangered Mexican salamander.

    PubMed

    Hime, Paul M; Hotaling, Scott; Grewelle, Richard E; O'Neill, Eric M; Voss, S Randal; Shaffer, H Bradley; Weisrock, David W

    2016-12-01

    Perhaps the most important recent advance in species delimitation has been the development of model-based approaches to objectively diagnose species diversity from genetic data. Additionally, the growing accessibility of next-generation sequence data sets provides powerful insights into genome-wide patterns of divergence during speciation. However, applying complex models to large data sets is time-consuming and computationally costly, requiring careful consideration of the influence of both individual and population sampling, as well as the number and informativeness of loci on species delimitation conclusions. Here, we investigated how locus number and information content affect species delimitation results for an endangered Mexican salamander species, Ambystoma ordinarium. We compared results for an eight-locus, 137-individual data set and an 89-locus, seven-individual data set. For both data sets, we used species discovery methods to define delimitation models and species validation methods to rigorously test these hypotheses. We also used integrated demographic model selection tools to choose among delimitation models, while accounting for gene flow. Our results indicate that while cryptic lineages may be delimited with relatively few loci, sampling larger numbers of loci may be required to ensure that enough informative loci are available to accurately identify and validate shallow-scale divergences. These analyses highlight the importance of striking a balance between dense sampling of loci and individuals, particularly in shallowly diverged lineages. They also suggest the presence of a currently unrecognized, endangered species in the western part of A. ordinarium's range. © 2016 John Wiley & Sons Ltd.

  6. PCAN: Probabilistic Correlation Analysis of Two Non-normal Data Sets

    PubMed Central

    Zoh, Roger S.; Mallick, Bani; Ivanov, Ivan; Baladandayuthapani, Veera; Manyam, Ganiraju; Chapkin, Robert S.; Lampe, Johanna W.; Carroll, Raymond J.

    2016-01-01

    Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and micro RNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued, with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumption about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negative correlation pairs when compared to the standard approaches. PMID:27037601

  7. PCAN: Probabilistic correlation analysis of two non-normal data sets.

    PubMed

    Zoh, Roger S; Mallick, Bani; Ivanov, Ivan; Baladandayuthapani, Veera; Manyam, Ganiraju; Chapkin, Robert S; Lampe, Johanna W; Carroll, Raymond J

    2016-12-01

    Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and micro RNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued, with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumption about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negative correlation pairs when compared to the standard approaches. © 2016, The International Biometric Society.
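
    The following small simulation is not the PCAN model itself; it only illustrates, under assumed parameter values, the motivation stated above: Poisson counts generated from highly correlated natural parameters show a strongly attenuated sample Pearson correlation when the counts are low.

```python
# Illustration of the motivation behind PCAN (not the PCAN model): when two features
# are Poisson with correlated natural parameters but very low counts, the sample
# Pearson correlation of the observed counts is attenuated.
import numpy as np

rng = np.random.default_rng(1)
n = 200
# correlated log-rates (natural-parameter level), latent correlation 0.9
z = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.9], [0.9, 1]], size=n)

for offset in (3.0, -2.0):                    # high-count vs low-count regime
    rates = np.exp(z + offset)
    counts = rng.poisson(rates)               # observed expression counts
    r_obs = np.corrcoef(counts[:, 0], counts[:, 1])[0, 1]
    print(f"mean count={counts.mean():7.2f}  Pearson r on counts={r_obs:.2f}")
# The low-count regime gives a much smaller r even though the latent correlation is
# identical, which is why model-based estimation at the parameter level helps.
```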

  8. Searching for missing heritability: Designing rare variant association studies

    PubMed Central

    Zuk, Or; Schaffner, Stephen F.; Samocha, Kaitlin; Do, Ron; Hechter, Eliana; Kathiresan, Sekar; Daly, Mark J.; Neale, Benjamin M.; Sunyaev, Shamil R.; Lander, Eric S.

    2014-01-01

    Genetic studies have revealed thousands of loci predisposing to hundreds of human diseases and traits, revealing important biological pathways and defining novel therapeutic hypotheses. However, the genes discovered to date typically explain less than half of the apparent heritability. Because efforts have largely focused on common genetic variants, one hypothesis is that much of the missing heritability is due to rare genetic variants. Studies of common variants are typically referred to as genomewide association studies, whereas studies of rare variants are often simply called sequencing studies. Because they are actually closely related, we use the terms common variant association study (CVAS) and rare variant association study (RVAS). In this paper, we outline the similarities and differences between RVAS and CVAS and describe a conceptual framework for the design of RVAS. We apply the framework to address key questions about the sample sizes needed to detect association, the relative merits of testing disruptive alleles vs. missense alleles, frequency thresholds for filtering alleles, the value of predictors of the functional impact of missense alleles, the potential utility of isolated populations, the value of gene-set analysis, and the utility of de novo mutations. The optimal design depends critically on the selection coefficient against deleterious alleles and thus varies across genes. The analysis shows that common variant and rare variant studies require similarly large sample collections. In particular, a well-powered RVAS should involve discovery sets with at least 25,000 cases, together with a substantial replication set. PMID:24443550

  9. Type I error probabilities based on design-stage strategies with applications to noninferiority trials.

    PubMed

    Rothmann, Mark

    2005-01-01

    When testing the equality of means from two different populations, a t-test or a large-sample normal test tends to be performed. For these tests, when the sample size or design for the second sample is dependent on the results of the first sample, the type I error probability is altered for each specific possibility in the null hypothesis. We will examine the impact on the type I error probabilities for two confidence interval procedures and for procedures using test statistics when the design for the second sample or experiment is dependent on the results from the first sample or experiment (or series of experiments). Ways of controlling a desired maximum type I error probability or a desired type I error rate will be discussed. Results are applied to the setting of noninferiority comparisons in active controlled trials, where the use of a placebo is unethical.
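
    A minimal simulation sketch of the kind of situation described above, under assumptions not taken from the paper: a second experiment is run only when the first test is non-significant, and the pooled data are then re-tested at the same nominal level, which inflates the overall type I error probability.

```python
# Simulation sketch (illustrative assumptions, not the paper's setting): the design
# of the second sample depends on the first result, and the pooled data are re-tested
# at the nominal level without adjustment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n1, n2, n_sim = 0.05, 50, 50, 20_000
rejections = 0

for _ in range(n_sim):
    # both arms have the same mean, so the null hypothesis is true
    x1, y1 = rng.normal(size=n1), rng.normal(size=n1)
    if stats.ttest_ind(x1, y1).pvalue < alpha:
        rejections += 1
        continue
    # second experiment is run only because the first was non-significant
    x2, y2 = rng.normal(size=n2), rng.normal(size=n2)
    if stats.ttest_ind(np.r_[x1, x2], np.r_[y1, y2]).pvalue < alpha:
        rejections += 1

print(f"empirical type I error: {rejections / n_sim:.3f} (nominal {alpha})")
# The empirical rate exceeds the nominal 0.05, illustrating why such design-stage
# strategies require explicit control of the maximum type I error probability.
```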

  10. Scene recognition based on integrating active learning with dictionary learning

    NASA Astrophysics Data System (ADS)

    Wang, Chengxi; Yin, Xueyan; Yang, Lin; Gong, Chengrong; Zheng, Caixia; Yi, Yugen

    2018-04-01

    Scene recognition is a significant topic in the field of computer vision. Most of the existing scene recognition models require a large number of labeled training samples to achieve good performance. However, labeling images manually is time consuming and often unrealistic in practice. In order to gain satisfactory recognition results when labeled samples are insufficient, this paper proposes a scene recognition algorithm named Integrating Active Learning and Dictionary Learning (IALDL). IALDL adopts projective dictionary pair learning (DPL) as the classifier and introduces an active learning mechanism into DPL to improve its performance. When constructing the sampling criterion for active learning, IALDL considers both uncertainty and representativeness so as to effectively select useful unlabeled samples from a given sample set for expanding the training dataset. Experimental results on three standard databases demonstrate the feasibility and validity of the proposed IALDL.
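
    The sketch below is a generic active-learning selection step, not the authors' DPL-based IALDL: unlabeled samples are ranked by a weighted combination of classifier uncertainty (prediction entropy) and representativeness (proximity to cluster centres of the unlabeled pool). The data set, classifier, weights, and cluster count are illustrative assumptions.

```python
# Generic uncertainty-plus-representativeness sampling criterion (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)
labeled = np.arange(30)                      # small initial labeled set
unlabeled = np.arange(30, 600)

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
proba = clf.predict_proba(X[unlabeled])
uncertainty = -np.sum(proba * np.log(proba + 1e-12), axis=1)     # prediction entropy

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X[unlabeled])
dist_to_centre = np.min(km.transform(X[unlabeled]), axis=1)
representativeness = 1.0 / (1.0 + dist_to_centre)                # closer to a centre = more representative

score = 0.5 * uncertainty / uncertainty.max() + 0.5 * representativeness / representativeness.max()
query = unlabeled[np.argsort(score)[::-1][:10]]                  # 10 most useful samples to label next
print("indices selected for labeling:", query)
```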

  11. Experiments on the Effects of Confining Pressure During Reaction-Driven Cracking

    NASA Astrophysics Data System (ADS)

    Skarbek, R. M.; Savage, H. M.; Kelemen, P. B.; Lambart, S.; Robinson, B.

    2016-12-01

    Cracking caused by reaction-driven volume increase is an important process in many geological settings. In particular, the interaction of brittle rocks with reactive fluids can create fractures that modify the permeability and reactive surface area, leading to a large variety of feedbacks. The conditions controlling reaction-driven cracking are poorly understood, especially at geologically relevant confining pressures. We conducted two sets of experiments to study the effects of confining pressure on cracking during the formation of gypsum from anhydrite (CaSO4 + 2H2O = CaSO4•2H2O) and of portlandite from calcium oxide (CaO + H2O = Ca(OH)2). In the first set of experiments, we cold-pressed CaSO4 or CaO powder to form cylinders. Samples were confined in steel and compressed with an axial load of 0.1 to 4 MPa. Water was allowed to infiltrate the initially unsaturated samples through the bottom face via capillary and Darcian flow across a micro-porous frit. The height of the sample was recorded during the experiment and serves as a measure of volume change due to the hydration reaction. We also recorded acoustic emissions (AEs) using piezoelectric transducers (PZTs) to serve as a measure of cracking during an experiment. Experiments were stopped when the recorded volume change reached 80%-100% of the stoichiometrically calculated volume change of the reaction. In a second set of experiments, we pressed CaSO4 powder to form cylinders 8.9 cm in length and 3.5 cm in diameter for testing in a tri-axial press with ports for fluid input and output across the top and bottom faces of the sample. The tri-axial experiments were set up to investigate the reaction-driven cracking process over a range of confining pressures. Cracking during experiments was monitored using strain gauges and PZTs attached to the sample. We measured permeability during experiments by imposing a fluid pressure gradient across the sample. These experiments elucidate the role of cracking caused by crystallization pressure in many important hydration reactions.

  12. Randomly picked cosmid clones overlap the pyrB and oriC gap in the physical map of the E. coli chromosome.

    PubMed Central

    Knott, V; Rees, D J; Cheng, Z; Brownlee, G G

    1988-01-01

    Sets of overlapping cosmid clones generated by random sampling and fingerprinting methods complement data at pyrB (96.5') and oriC (84') in the published physical map of E. coli. A new cloning strategy using sheared DNA, and a low copy, inducible cosmid vector were used in order to reduce bias in libraries, in conjunction with micro-methods for preparing cosmid DNA from a large number of clones. Our results are relevant to the design of the best approach to the physical mapping of large genomes. PMID:2834694

  13. A novel dicyanoisophorone based red-emitting fluorescent probe with a large Stokes shift for detection of hydrazine in solution and living cells

    NASA Astrophysics Data System (ADS)

    Lv, Hongshui; Sun, Haiyan; Wang, Shoujuan; Kong, Fangong

    2018-05-01

    A novel dicyanoisophorone based fluorescent probe HP was developed to detect hydrazine. Upon the addition of hydrazine, probe HP displayed turn-on fluorescence in the red region with a large Stokes shift (180 nm). This probe exhibited high selectivity and high sensitivity to hydrazine in solution. The detection limit of HP was found to be 3.26 ppb, which was lower than the threshold limit value set by USEPA (10 ppb). Moreover, the probe was successfully applied to detect hydrazine in different water samples and living cells.

  14. Of Small Beauties and Large Beasts: The Quality of Distractors on Multiple-Choice Tests Is More Important than Their Quantity

    ERIC Educational Resources Information Center

    Papenberg, Martin; Musch, Jochen

    2017-01-01

    In multiple-choice tests, the quality of distractors may be more important than their number. We therefore examined the joint influence of distractor quality and quantity on test functioning by providing a sample of 5,793 participants with five parallel test sets consisting of items that differed in the number and quality of distractors.…

  15. Northwest Forest Plan—the first 15 years (1994–2008): watershed condition status and trend

    Treesearch

    Steven H. Lanigan; Sean N. Gordon; Peter Eldred; Mark Isley; Steve Wilcox; Chris Moyer; Heidi Andersen

    2012-01-01

    We used two data sets to evaluate stream and watershed condition for sixth-field watersheds in each aquatic province within the Northwest Forest Plan (NWFP) area: stream data and upslope data. The stream evaluation was based on inchannel data (e.g., substrate, pieces of large wood, water temperature, pool frequency, and macroinvertebrates) we sampled from 2002 to 2009...

  16. Implications of alternative field-sampling designs on Landsat-based mapping of stand age and carbon stocks in Oregon forests

    Treesearch

    Maureen V. Duane; Warren B. Cohen; John L. Campbell; Tara Hudiburg; David P. Turner; Dale Weyermann

    2010-01-01

    Empirical models relating forest attributes to remotely sensed metrics are widespread in the literature and underpin many of our efforts to map forest structure across complex landscapes. In this study we compared empirical models relating Landsat reflectance to forest age across Oregon using two alternate sets of ground data: one from a large (n ~ 1500) systematic...

  17. Stability and Change in Interests: A Longitudinal Study of Adolescents from Grades 8 through 12

    ERIC Educational Resources Information Center

    Tracey, Terence J. G.; Robbins, Steven B.; Hofsess, Christy D.

    2005-01-01

    The patterns of RIASEC interests and academic skills were assessed longitudinally from a large-scale national database at three time points: 8th grade, 10th grade, and 12th grade. Validation and cross-validation samples of 1000 males and 1000 females in each set were used to test the pattern of these scores over time relative to mean changes,…

  18. Cloud-based solution to identify statistically significant MS peaks differentiating sample categories.

    PubMed

    Ji, Jun; Ling, Jeffrey; Jiang, Helen; Wen, Qiaojun; Whitin, John C; Tian, Lu; Cohen, Harvey J; Ling, Xuefeng B

    2013-03-23

    Mass spectrometry (MS) has evolved to become the primary high-throughput tool for proteomics-based biomarker discovery. To date, multiple challenges in protein MS data analysis remain: management of large-scale and complex data sets; MS peak identification and indexing; and high-dimensional differential analysis of peaks with control of the false discovery rate (FDR) based on concurrent statistical tests. "Turnkey" solutions are needed for biomarker investigations to rapidly process MS data sets and identify statistically significant peaks for subsequent validation. Here we present an efficient and effective solution, which provides experimental biologists easy access to "cloud" computing capabilities to analyze MS data. The web portal can be accessed at http://transmed.stanford.edu/ssa/. The presented web application supports online uploading and analysis of large-scale MS data through a simple user interface. This bioinformatic tool will facilitate the discovery of potential protein biomarkers using MS.
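
    As a minimal sketch of the FDR step mentioned above, the code below applies the Benjamini-Hochberg procedure to a vector of per-peak p-values. It is not the web tool itself, and the p-values are simulated placeholders.

```python
# Benjamini-Hochberg FDR control on per-peak p-values (simulated placeholders).
import numpy as np

rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=950),           # null peaks
                        rng.uniform(0, 1e-3, size=50)])  # truly differential peaks

def benjamini_hochberg(p, q=0.05):
    """Return a boolean mask of p-values declared significant at FDR level q."""
    p = np.asarray(p)
    order = np.argsort(p)
    ranked = p[order]
    thresh = q * np.arange(1, len(p) + 1) / len(p)
    below = ranked <= thresh
    keep = np.zeros(len(p), dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()              # largest k with p_(k) <= k*q/m
        keep[order[:cutoff + 1]] = True
    return keep

significant = benjamini_hochberg(pvals, q=0.05)
print(f"{significant.sum()} of {len(pvals)} peaks pass the 5% FDR threshold")
```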

  19. A multilayer membrane amperometric glucose sensor fabricated using planar techniques for large-scale production.

    PubMed

    Matsumoto, T; Saito, S; Ikeda, S

    2006-03-23

    This paper reports on a multilayer membrane amperometric glucose sensor fabricated using planar techniques. It is characterized by good reproducibility and suitable for large-scale production. The glucose sensor has 82 electrode sets formed on a single glass substrate, each with a platinum working electrode (WE), a platinum counter electrode (CE) and an Ag/AgCl reference electrode (RE). The electrode sets are coated with a membrane consisting of five layers: gamma-aminopropyltriethoxysilane (gamma-APTES), Nafion, glucose oxidase (GOX), gamma-APTES and perfluorocarbon polymer (PFCP), in that order. Tests have shown that the sensor has acceptably low dispersion (relative standard deviation, R.S.D.=42.9%, n=82), a wide measurement range (1.11-111 mM) and measurement stability over a 27-day period. Measurements of the glucose concentration in a control human urine sample demonstrated that the sensor has very low dispersion (R.S.D.=2.49%, n=10).

  20. Distribution of subtidal sedimentary bedforms in a macrotidal setting: The Bay of Fundy, Atlantic Canada

    NASA Astrophysics Data System (ADS)

    Todd, Brian J.; Shaw, John; Li, Michael Z.; Kostylev, Vladimir E.; Wu, Yongsheng

    2014-07-01

    The Bay of Fundy, Canada, a large macrotidal embayment with the World's highest recorded tides, was mapped using multibeam sonar systems. High-resolution imagery of seafloor terrain and backscatter strength, combined with geophysical and sampling data, reveal for the first time the morphology, architecture, and spatial relationships of a spectrum of bedforms: (1) flow-transverse bedforms occur as both discrete large two-dimensional dunes and as three-dimensional dunes in sand sheets; (2) flow-parallel bedforms are numerous straight ridges described by others as horse mussel bioherms; (3) sets of banner banks that flank prominent headlands and major shoals. The suite of bedforms developed during the Holocene, as tidal energy increased due to the bay approaching resonance. We consider the evolution of these bedforms, their migration potential and how they may place limitations on future in-stream tidal power development in the Bay of Fundy.

  1. Evaluation of precipitation nowcasting techniques for the Alpine region

    NASA Astrophysics Data System (ADS)

    Panziera, L.; Mandapaka, P.; Atencia, A.; Hering, A.; Germann, U.; Gabella, M.; Buzzi, M.

    2010-09-01

    This study presents a large-sample evaluation of different nowcasting systems over the Southern Swiss Alps. Radar observations are taken as a reference against which to assess the performance of the following short-term quantitative precipitation forecasting methods:
    - Eulerian persistence: the current radar image is taken as the forecast.
    - Lagrangian persistence: precipitation patterns are advected following the field of storm motion (the MAPLE algorithm is used).
    - NORA: a novel nowcasting system that exploits the presence of orographic forcing; by comparing meteorological predictors estimated in real time with those from a large historical data set, the events with the highest resemblance are selected to produce the forecast.
    - COSMO2: the limited-area numerical model operationally used at MeteoSwiss.
    - Blending of the precipitation forecasts from the aforementioned nowcasting tools.
    The investigation aims to set up a probabilistic radar rainfall-runoff model experiment for steep Alpine catchments as part of the European research project IMPRINTS.

  2. Modeling read counts for CNV detection in exome sequencing data.

    PubMed

    Love, Michael I; Myšičková, Alena; Sun, Ruping; Kalscheuer, Vera; Vingron, Martin; Haas, Stefan A

    2011-11-08

    Varying depth of high-throughput sequencing reads along a chromosome makes it possible to observe copy number variants (CNVs) in a sample relative to a reference. In exome and other targeted sequencing projects, technical factors increase variation in read depth while reducing the number of observed locations, adding difficulty to the problem of identifying CNVs. We present a hidden Markov model for detecting CNVs from raw read count data, using background read depth from a control set as well as other positional covariates such as GC-content. The model, exomeCopy, is applied to a large chromosome X exome sequencing project identifying a list of large unique CNVs. CNVs predicted by the model and experimentally validated are then recovered using a cross-platform control set from publicly available exome sequencing data. Simulations show high sensitivity for detecting heterozygous and homozygous CNVs, outperforming normalization and state-of-the-art segmentation methods.
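
    The toy sketch below conveys the general idea of read-count HMMs for CNV calling, but it is not the exomeCopy model: sample counts are treated as Poisson with a rate proportional to a control (background) depth scaled by the copy-number state, and the state path is decoded with a hand-written Viterbi pass. All counts, states, and transition probabilities are illustrative assumptions.

```python
# Toy Poisson-emission HMM for CNV calling from read counts (not the exomeCopy model).
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(4)
n_targets = 300
background = rng.gamma(shape=5.0, scale=20.0, size=n_targets)   # control read depth per target
true_cn = np.full(n_targets, 2)
true_cn[100:140] = 1                                             # simulate a heterozygous deletion
counts = rng.poisson(background * true_cn / 2.0)                 # observed sample read counts

states = np.array([1, 2, 3])                                     # candidate copy numbers
log_emit = poisson.logpmf(counts[:, None], background[:, None] * states[None, :] / 2.0)

n_states = len(states)
log_trans = np.full((n_states, n_states), np.log(0.005))         # state changes are rare
np.fill_diagonal(log_trans, np.log(0.99))
log_start = np.log(np.full(n_states, 1.0 / n_states))

# Viterbi decoding of the most likely copy-number path
dp = log_start + log_emit[0]
back = np.zeros((n_targets, n_states), dtype=int)
for t in range(1, n_targets):
    cand = dp[:, None] + log_trans                               # previous state -> current state
    back[t] = np.argmax(cand, axis=0)
    dp = cand[back[t], np.arange(n_states)] + log_emit[t]

path = np.zeros(n_targets, dtype=int)
path[-1] = int(np.argmax(dp))
for t in range(n_targets - 1, 0, -1):
    path[t - 1] = back[t, path[t]]

called_cn = states[path]
print("targets called with copy number 1:", np.flatnonzero(called_cn == 1))
```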

  3. Automated X-Ray Diffraction of Irradiated Materials

    DOE PAGES

    Rodman, John; Lin, Yuewei; Sprouster, David; ...

    2017-10-26

    Synchrotron-based X-ray diffraction (XRD) and small-angle X-ray scattering (SAXS) characterization techniques used on unirradiated and irradiated reactor pressure vessel steels yield large amounts of data. Machine learning techniques, including PCA, offer a novel method of analyzing and visualizing these large data sets in order to determine the effects of chemistry and irradiation conditions on the formation of radiation-induced precipitates. In order to run analysis on these data sets, preprocessing must be carried out to convert the data to a usable format and mask the 2-D detector images to account for experimental variations. Once the data have been preprocessed, they can be organized and visualized using principal component analysis (PCA), multi-dimensional scaling, and k-means clustering. These techniques show that sample chemistry has a notable effect on the formation of radiation-induced precipitates in reactor pressure vessel steels.
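
    A minimal sketch of the analysis stage described above, with synthetic one-dimensional diffraction patterns standing in for the preprocessed detector images; it is not the authors' pipeline, and the pattern shapes, group structure, and cluster count are assumptions.

```python
# PCA followed by k-means clustering on synthetic diffraction patterns.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
two_theta = np.linspace(10, 90, 400)

def pattern(peak_centres):
    """Build a synthetic diffraction pattern from Gaussian peaks plus noise."""
    signal = sum(np.exp(-0.5 * ((two_theta - c) / 0.4) ** 2) for c in peak_centres)
    return signal + 0.05 * rng.normal(size=two_theta.size)

# two "chemistries" that differ by one shifted reflection (e.g., a precipitate peak)
group_a = np.array([pattern([28, 44, 65]) for _ in range(40)])
group_b = np.array([pattern([28, 46, 65]) for _ in range(40)])
X = np.vstack([group_a, group_b])

scores = PCA(n_components=2).fit_transform(X)              # low-dimensional visualization
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print("cluster sizes:", np.bincount(labels))
```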

  4. Deep learning for computational biology.

    PubMed

    Angermueller, Christof; Pärnamaa, Tanel; Parts, Leopold; Stegle, Oliver

    2016-07-29

    Technological advances in genomics and imaging have led to an explosion of molecular and cellular profiling data from large numbers of samples. This rapid increase in biological data dimension and acquisition rate is challenging conventional analysis strategies. Modern machine learning methods, such as deep learning, promise to leverage very large data sets for finding hidden structure within them, and for making accurate predictions. In this review, we discuss applications of this new breed of analysis approaches in regulatory genomics and cellular imaging. We provide background on what deep learning is and the settings in which it can be successfully applied to derive biological insights. In addition to presenting specific applications and providing tips for practical use, we also highlight possible pitfalls and limitations to guide computational biologists on when and how to make the best use of this new technology. © 2016 The Authors. Published under the terms of the CC BY 4.0 license.

  5. DaVIE: Database for the Visualization and Integration of Epigenetic data

    PubMed Central

    Fejes, Anthony P.; Jones, Meaghan J.; Kobor, Michael S.

    2014-01-01

    One of the challenges in the analysis of large data sets, particularly in a population-based setting, is the ability to perform comparisons across projects. This has to be done in such a way that the integrity of each individual project is maintained, while ensuring that the data are comparable across projects. These issues are beginning to be observed in human DNA methylation studies, as the Illumina 450k platform and next generation sequencing-based assays grow in popularity and decrease in price. This increase in productivity is enabling new insights into epigenetics, but also requires the development of pipelines and software capable of handling the large volumes of data. The specific problems inherent in creating a platform for the storage, comparison, integration, and visualization of DNA methylation data include data storage, algorithm efficiency, and the ability to interpret the results to derive biological meaning from them. Databases provide a ready-made solution to these issues, but as yet no tools exist that leverage these advantages while providing an intuitive user interface for interpreting results in a genomic context. We have addressed this void by integrating a database to store DNA methylation data with a web interface to query and visualize the database and a set of libraries for more complex analysis. The resulting platform is called DaVIE: Database for the Visualization and Integration of Epigenetics data. DaVIE can use data culled from a variety of sources, and the web interface includes the ability to group samples by sub-type, compare multiple projects and visualize genomic features in relation to sites of interest. We have used DaVIE to identify patterns of DNA methylation in specific projects and across different projects, identify outlier samples, and cross-check differentially methylated CpG sites identified in specific projects across large numbers of samples. A demonstration server has been set up using GEO data at http://echelon.cmmt.ubc.ca/dbaccess/, with login “guest” and password “guest.” Groups may download and install their own version of the server following the instructions on the project's wiki. PMID:25278960

  6. Rapid on-line detection and grading of wooden breast myopathy in chicken fillets by near-infrared spectroscopy.

    PubMed

    Wold, Jens Petter; Veiseth-Kent, Eva; Høst, Vibeke; Løvland, Atle

    2017-01-01

    The main objective of this work was to develop a method for rapid and non-destructive detection and grading of wooden breast (WB) syndrome in chicken breast fillets. Near-infrared (NIR) spectroscopy was chosen as the detection method, and an industrial NIR scanner was applied and tested for large-scale on-line detection of the syndrome. Two approaches were evaluated for discrimination of WB fillets: 1) linear discriminant analysis based on NIR spectra only, and 2) a regression model for protein based on NIR spectra, with the estimated protein concentrations used for discrimination. A sample set of 197 fillets was used for training and calibration. A test set was recorded under industrial conditions and contained spectra from 79 fillets. The classification methods obtained 99.5-100% correct classification of the calibration set and 100% correct classification of the test set. The NIR scanner was then installed in a commercial chicken processing plant and could detect incidence rates of WB in large batches of fillets. Examples of incidence are shown for three broiler flocks in which a high number of fillets (9063, 6330 and 10483) were effectively measured. Prevalences of WB of 0.1%, 6.6% and 8.5% were estimated for these flocks based on the complete sample volumes. Such an on-line system can be used to alleviate the challenges WB represents to the poultry meat industry. It enables automatic quality sorting of chicken fillets into different product categories. Laborious manual grading can be avoided. Incidences of WB from different farms and flocks can be tracked, and this information can be used to understand and identify the main causes of WB in chicken production. This knowledge can be used to improve production procedures and reduce today's extensive occurrence of WB.

  7. Multiple regression and Artificial Neural Network for long-term rainfall forecasting using large scale climate modes

    NASA Astrophysics Data System (ADS)

    Mekanik, F.; Imteaz, M. A.; Gato-Trinidad, S.; Elmahdi, A.

    2013-10-01

    In this study, the application of Artificial Neural Networks (ANN) and Multiple regression analysis (MR) to forecast long-term seasonal spring rainfall in Victoria, Australia, was investigated using lagged El Nino Southern Oscillation (ENSO) and Indian Ocean Dipole (IOD) indices as potential predictors. The use of dual (combined lagged ENSO-IOD) input sets for calibrating and validating the ANN and MR models is proposed to investigate the simultaneous effect of past values of these two major climate modes on long-term spring rainfall prediction. The MR models that did not violate the limits of statistical significance and multicollinearity were selected for future spring rainfall forecasts. The ANN was developed as a multilayer perceptron trained with the Levenberg-Marquardt algorithm. Both MR and ANN models were assessed statistically using mean square error (MSE), mean absolute error (MAE), Pearson correlation (r) and the Willmott index of agreement (d). The developed MR and ANN models were tested on out-of-sample test sets; the MR models showed very poor generalisation ability for east Victoria, with correlation coefficients of -0.99 to -0.90, compared to ANN with correlation coefficients of 0.42-0.93; ANN models also showed better generalisation ability for central and west Victoria, with correlation coefficients of 0.68-0.85 and 0.58-0.97 respectively. The ability of the multiple regression models to forecast out-of-sample sets is comparable with that of ANN for Daylesford in central Victoria and Kaniva in west Victoria (r = 0.92 and 0.67 respectively). The errors of the testing sets for ANN models are generally lower than those of the multiple regression models. The statistical analysis suggests the potential of ANN over MR models for rainfall forecasting using large-scale climate modes.
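
    As a simplified, analogous setup (not the models used in the study), the sketch below compares a linear regression with a small neural network on synthetic lagged climate indices; scikit-learn's MLPRegressor is used in place of the Levenberg-Marquardt multilayer perceptron, so all data and settings are illustrative.

```python
# Linear regression vs a small MLP on synthetic lagged climate-index predictors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(6)
n_years = 60
enso = rng.normal(size=n_years)                     # lagged ENSO index (synthetic)
iod = rng.normal(size=n_years)                      # lagged IOD index (synthetic)
rain = 40 + 8 * np.tanh(-enso) + 5 * iod ** 2 + 4 * rng.normal(size=n_years)  # nonlinear response

X = np.column_stack([enso, iod])
X_cal, X_test = X[:45], X[45:]
y_cal, y_test = rain[:45], rain[45:]

for name, model in [("MR", LinearRegression()),
                    ("ANN", MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0))]:
    model.fit(X_cal, y_cal)
    pred = model.predict(X_test)
    print(f"{name}: MAE={mean_absolute_error(y_test, pred):.2f}  "
          f"MSE={mean_squared_error(y_test, pred):.2f}  "
          f"r={np.corrcoef(y_test, pred)[0, 1]:.2f}")
```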

  8. Measure of functional independence dominates discharge outcome prediction after inpatient rehabilitation for stroke.

    PubMed

    Brown, Allen W; Therneau, Terry M; Schultz, Billie A; Niewczyk, Paulette M; Granger, Carl V

    2015-04-01

    Identifying clinical data acquired at inpatient rehabilitation admission for stroke that accurately predict key outcomes at discharge could inform the development of customized plans of care to achieve favorable outcomes. The purpose of this analysis was to use a large comprehensive national data set to consider a wide range of clinical elements known at admission to identify those that predict key outcomes at rehabilitation discharge. Sample data were obtained from the Uniform Data System for Medical Rehabilitation data set with the diagnosis of stroke for the years 2005 through 2007. This data set includes demographic, administrative, and medical variables collected at admission and discharge and uses the FIM (functional independence measure) instrument to assess functional independence. Primary outcomes of interest were functional independence measure gain, length of stay, and discharge to home. The sample included 148,367 people (75% white; mean age, 70.6±13.1 years; 97% with ischemic stroke) admitted to inpatient rehabilitation a mean of 8.2±12 days after symptom onset. The total functional independence measure score, the functional independence measure motor subscore, and the case-mix group were equally the strongest predictors for any of the primary outcomes. The most clinically relevant 3-variable model used the functional independence measure motor subscore, age, and walking distance at admission (R2 = 0.107). No important additional effect for any other variable was detected when added to this model. This analysis shows that a measure of functional independence in motor performance and age at rehabilitation hospital admission for stroke are predominant predictors of outcome at discharge in a uniquely large US national data set. © 2015 American Heart Association, Inc.

  9. A Novel Hybrid Dimension Reduction Technique for Undersized High Dimensional Gene Expression Data Sets Using Information Complexity Criterion for Cancer Classification

    PubMed Central

    Pamukçu, Esra; Bozdogan, Hamparsum; Çalık, Sinan

    2015-01-01

    Gene expression data typically are large, complex, and highly noisy. Their dimension is high, with several thousand genes (i.e., features) but only a limited number of observations (i.e., samples). Although the classical principal component analysis (PCA) method is widely used as a first standard step in dimension reduction and in supervised and unsupervised classification, it suffers from several shortcomings in the case of data sets involving undersized samples, since the sample covariance matrix degenerates and becomes singular. In this paper we address these limitations within the context of probabilistic PCA (PPCA) by introducing and developing a new and novel approach using the maximum entropy covariance matrix and its hybridized smoothed covariance estimators. To reduce the dimensionality of the data and to choose the number of probabilistic PCs (PPCs) to be retained, we further introduce and develop the celebrated Akaike information criterion (AIC), the consistent Akaike information criterion (CAIC), and the information-theoretic measure of complexity (ICOMP) criterion of Bozdogan. Six publicly available undersized benchmark data sets were analyzed to show the utility, flexibility, and versatility of our approach with hybridized smoothed covariance matrix estimators, which do not degenerate, in performing PPCA to reduce the dimension and to carry out supervised classification of cancer groups in high dimensions. PMID:25838836

  10. BloodSpot: a database of gene expression profiles and transcriptional programs for healthy and malignant haematopoiesis

    PubMed Central

    Bagger, Frederik Otzen; Sasivarevic, Damir; Sohi, Sina Hadi; Laursen, Linea Gøricke; Pundhir, Sachin; Sønderby, Casper Kaae; Winther, Ole; Rapin, Nicolas; Porse, Bo T.

    2016-01-01

    Research on human and murine haematopoiesis has resulted in a vast number of gene-expression data sets that can potentially answer questions regarding normal and aberrant blood formation. To researchers and clinicians with limited bioinformatics experience, these data have remained available, yet largely inaccessible. Current databases provide information about gene expression but fail to answer key questions regarding co-regulation, genetic programs or effect on patient survival. To address these shortcomings, we present BloodSpot (www.bloodspot.eu), which includes and greatly extends our previously released database HemaExplorer, a database of gene expression profiles from FACS-sorted healthy and malignant haematopoietic cells. A revised interactive interface simultaneously provides a plot of gene expression along with a Kaplan–Meier analysis and a hierarchical tree depicting the relationship between different cell types in the database. The database now includes 23 high-quality curated data sets relevant to normal and malignant blood formation and, in addition, we have assembled and built a unique integrated data set, BloodPool. BloodPool contains more than 2000 samples assembled from six independent studies on acute myeloid leukemia. Furthermore, we have devised a robust sample integration procedure that allows for sensitive comparison of user-supplied patient samples in a well-defined haematopoietic cellular space. PMID:26507857

  11. Machine learning of molecular properties: Locality and active learning

    NASA Astrophysics Data System (ADS)

    Gubaev, Konstantin; Podryabinkin, Evgeny V.; Shapeev, Alexander V.

    2018-06-01

    In recent years, machine learning techniques have shown great potential in various problems from a multitude of disciplines, including materials design and drug discovery. Their high computational speed on the one hand, and accuracy comparable to that of density functional theory on the other hand, make machine learning algorithms efficient for high-throughput screening through chemical and configurational space. However, the machine learning algorithms available in the literature require large training datasets to reach chemical accuracy and also show large errors for the so-called outliers: out-of-sample molecules that are not well represented in the training set. In the present paper, we propose a new machine learning algorithm for predicting molecular properties that addresses these two issues: it is based on a local model of interatomic interactions providing high accuracy when trained on relatively small training sets, and an active learning algorithm for optimally choosing the training set that significantly reduces the errors for the outliers. We compare our model to the other state-of-the-art algorithms from the literature on the widely used benchmark tests.

  12. Correlated Topic Vector for Scene Classification.

    PubMed

    Wei, Pengxu; Qin, Fei; Wan, Fang; Zhu, Yi; Jiao, Jianbin; Ye, Qixiang

    2017-07-01

    Scene images usually involve semantic correlations, particularly when considering large-scale image data sets. This paper proposes a novel generative image representation, the correlated topic vector, to model such semantic correlations. Derived from the correlated topic model, the correlated topic vector is intended to naturally utilize the correlations among topics, which are seldom considered in conventional feature encoding (e.g., the Fisher vector) but do exist in scene images. It is expected that the involvement of correlations can increase the discriminative capability of the learned generative model and consequently improve the recognition accuracy. Incorporated with the Fisher kernel method, the correlated topic vector inherits the advantages of the Fisher vector. The contributions of visual words to the topics are further employed within the Fisher kernel framework to indicate the differences among scenes. Combined with deep convolutional neural network (CNN) features and a Gibbs sampling solution, the correlated topic vector shows great potential when processing large-scale and complex scene image data sets. Experiments on two scene image data sets demonstrate that the correlated topic vector significantly improves on the deep CNN features and outperforms existing Fisher kernel-based features.

  13. Large-Scale Point-Cloud Visualization through Localized Textured Surface Reconstruction.

    PubMed

    Arikan, Murat; Preiner, Reinhold; Scheiblauer, Claus; Jeschke, Stefan; Wimmer, Michael

    2014-09-01

    In this paper, we introduce a novel scene representation for the visualization of large-scale point clouds accompanied by a set of high-resolution photographs. Many real-world applications deal with very densely sampled point-cloud data, which are augmented with photographs that often reveal lighting variations and inaccuracies in registration. Consequently, the high-quality representation of the captured data, i.e., both point clouds and photographs together, is a challenging and time-consuming task. We propose a two-phase approach, in which the first (preprocessing) phase generates multiple overlapping surface patches and handles the problem of seamless texture generation locally for each patch. The second phase stitches these patches at render-time to produce a high-quality visualization of the data. As a result of the proposed localization of the global texturing problem, our algorithm is more than an order of magnitude faster than equivalent mesh-based texturing techniques. Furthermore, since our preprocessing phase requires only a minor fraction of the whole data set at once, we provide maximum flexibility when dealing with growing data sets.

  14. Prediction of near-surface soil moisture at large scale by digital terrain modeling and neural networks.

    PubMed

    Lavado Contador, J F; Maneta, M; Schnabel, S

    2006-10-01

    The capability of Artificial Neural Network models to forecast near-surface soil moisture at fine spatial resolution has been tested for a 99.5 ha watershed located in SW Spain, using several readily obtainable digital models of topographic and land-cover variables as inputs and a series of soil moisture measurements as the training data set. The study methods were designed to determine the potential of the neural network model as a tool for gaining insight into the factors controlling soil moisture distribution, and to optimize the data sampling scheme by finding the optimum size of the training data set. Results suggest the efficiency of the methods in forecasting soil moisture, their usefulness as a tool for assessing the optimum number of field samples, and the importance of the selected variables in explaining the final map obtained.

  15. A machine learning approach for efficient uncertainty quantification using multiscale methods

    NASA Astrophysics Data System (ADS)

    Chan, Shing; Elsheikh, Ahmed H.

    2018-02-01

    Several multiscale methods account for sub-grid scale features using coarse scale basis functions. For example, in the Multiscale Finite Volume method the coarse scale basis functions are obtained by solving a set of local problems over dual-grid cells. We introduce a data-driven approach for the estimation of these coarse scale basis functions. Specifically, we employ a neural network predictor fitted using a set of solution samples from which it learns to generate subsequent basis functions at a lower computational cost than solving the local problems. The computational advantage of this approach is realized for uncertainty quantification tasks where a large number of realizations has to be evaluated. We attribute the ability to learn these basis functions to the modularity of the local problems and the redundancy of the permeability patches between samples. The proposed method is evaluated on elliptic problems yielding very promising results.

  16. The limitations of simple gene set enrichment analysis assuming gene independence.

    PubMed

    Tamayo, Pablo; Steinhardt, George; Liberzon, Arthur; Mesirov, Jill P

    2016-02-01

    Since its first publication in 2003, the Gene Set Enrichment Analysis method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently a simplified approach using a one-sample t-test score to assess enrichment and ignoring gene-gene correlations was proposed by Irizarry et al. 2009 as a serious contender. The argument criticizes Gene Set Enrichment Analysis's nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and its results, including a comparison with Gene Set Enrichment Analysis on a large benchmark set of 50 data sets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored, owing to the significant variance inflation they produce in the enrichment scores, and should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods. © The Author(s) 2012.
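
    The short null simulation below illustrates the variance-inflation point, under assumed set size and correlation values: when genes in a set are equicorrelated, a naive one-sample t-statistic over the set rejects far more often than the nominal level.

```python
# Null simulation: gene-gene correlation inflates the variance of a naive
# one-sample t-statistic over a gene set, making independence-based p-values
# anti-conservative. Set size and correlation are illustrative.
import numpy as np

rng = np.random.default_rng(7)
n_genes, n_sim = 50, 5000

def set_t_statistics(correlation):
    cov = np.full((n_genes, n_genes), correlation)
    np.fill_diagonal(cov, 1.0)
    # per-gene scores (e.g., differential-expression t-scores) drawn under the null
    scores = rng.multivariate_normal(np.zeros(n_genes), cov, size=n_sim)
    return np.sqrt(n_genes) * scores.mean(axis=1) / scores.std(axis=1, ddof=1)

for corr in (0.0, 0.3):
    t = set_t_statistics(corr)
    frac = np.mean(np.abs(t) > 1.96)
    print(f"gene-gene correlation {corr}: var(t)={t.var():.2f}, "
          f"fraction |t|>1.96 = {frac:.3f} (nominal ~0.05)")
```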

  17. Combination of automated high throughput platforms, flow cytometry, and hierarchical clustering to detect cell state.

    PubMed

    Kitsos, Christine M; Bhamidipati, Phani; Melnikova, Irena; Cash, Ethan P; McNulty, Chris; Furman, Julia; Cima, Michael J; Levinson, Douglas

    2007-01-01

    This study examined whether hierarchical clustering could be used to detect cell states induced by treatment combinations that were generated through automation and high-throughput (HT) technology. Data-mining techniques were used to analyze the large experimental data sets to determine whether nonlinear, non-obvious responses could be extracted from the data. Unary, binary, and ternary combinations of pharmacological factors (examples of stimuli) were used to induce differentiation of HL-60 cells using a HT automated approach. Cell profiles were analyzed by incorporating hierarchical clustering methods on data collected by flow cytometry. Data-mining techniques were used to explore the combinatorial space for nonlinear, unexpected events. Additional small-scale, follow-up experiments were performed on cellular profiles of interest. Multiple, distinct cellular profiles were detected using hierarchical clustering of expressed cell-surface antigens. Data-mining of this large, complex data set retrieved cases of both factor dominance and cooperativity, as well as atypical cellular profiles. Follow-up experiments found that treatment combinations producing "atypical cell types" made those cells more susceptible to apoptosis. In conclusion, hierarchical clustering and other data-mining techniques were applied to analyze large data sets from HT flow cytometry. From each sample, the data set was filtered and used to define discrete, usable states that were then related back to their original formulations. Analysis of the resultant cell populations induced by a multitude of treatments identified unexpected phenotypes and nonlinear response profiles.
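
    A compact sketch of the clustering step described above, not the authors' full high-throughput pipeline: per-sample marker profiles (e.g., summarized surface-antigen intensities) are clustered hierarchically and the tree is cut into a fixed number of groups. The profiles, marker count, and cluster number are simulated assumptions.

```python
# Hierarchical clustering of simulated per-sample marker profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(8)
# 30 treatment combinations x 6 surface markers; three underlying cell states
profiles = np.vstack([
    rng.normal(loc=[5, 1, 1, 4, 2, 0], scale=0.4, size=(10, 6)),
    rng.normal(loc=[1, 5, 2, 0, 4, 1], scale=0.4, size=(10, 6)),
    rng.normal(loc=[2, 2, 5, 1, 0, 4], scale=0.4, size=(10, 6)),
])

Z = linkage(profiles, method="ward")                  # agglomerative tree (Ward linkage)
labels = fcluster(Z, t=3, criterion="maxclust")       # cut the tree into 3 clusters
print("samples per cluster:", np.bincount(labels)[1:])
```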

  18. THE RADIO/GAMMA-RAY CONNECTION IN ACTIVE GALACTIC NUCLEI IN THE ERA OF THE FERMI LARGE AREA TELESCOPE

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ackermann, M.; Ajello, M.; Allafort, A.

    We present a detailed statistical analysis of the correlation between radio and gamma-ray emission of the active galactic nuclei (AGNs) detected by Fermi during its first year of operation, with the largest data sets ever used for this purpose. We use both archival interferometric 8.4 GHz data (from the Very Large Array and ATCA, for the full sample of 599 sources) and concurrent single-dish 15 GHz measurements from the Owens Valley Radio Observatory (OVRO, for a sub sample of 199 objects). Our unprecedentedly large sample permits us to assess with high accuracy the statistical significance of the correlation, using a surrogate data method designed to simultaneously account for common-distance bias and the effect of a limited dynamical range in the observed quantities. We find that the statistical significance of a positive correlation between the centimeter radio and the broadband (E > 100 MeV) gamma-ray energy flux is very high for the whole AGN sample, with a probability of <10^-7 for the correlation appearing by chance. Using the OVRO data, we find that concurrent data improve the significance of the correlation from 1.6 x 10^-6 to 9.0 x 10^-8. Our large sample size allows us to study the dependence of correlation strength and significance on specific source types and gamma-ray energy band. We find that the correlation is very significant (chance probability < 10^-7) for both flat spectrum radio quasars and BL Lac objects separately; a dependence of the correlation strength on the considered gamma-ray energy band is also present, but additional data will be necessary to constrain its significance.

  19. The radio/gamma-ray connection in active galactic nuclei in the era of the Fermi Large Area Telescope

    DOE PAGES

    Ackermann, M.; Ajello, M.; Allafort, A.; ...

    2011-10-12

    We present a detailed statistical analysis of the correlation between radio and gamma-ray emission of the active galactic nuclei (AGNs) detected by Fermi during its first year of operation, with the largest data sets ever used for this purpose. We use both archival interferometric 8.4 GHz data (from the Very Large Array and ATCA, for the full sample of 599 sources) and concurrent single-dish 15 GHz measurements from the Owens Valley Radio Observatory (OVRO, for a sub sample of 199 objects). Our unprecedentedly large sample permits us to assess with high accuracy the statistical significance of the correlation, using a surrogate data method designed to simultaneously account for common-distance bias and the effect of a limited dynamical range in the observed quantities. We find that the statistical significance of a positive correlation between the centimeter radio and the broadband (E > 100 MeV) gamma-ray energy flux is very high for the whole AGN sample, with a probability of <10^-7 for the correlation appearing by chance. Using the OVRO data, we find that concurrent data improve the significance of the correlation from 1.6 × 10^-6 to 9.0 × 10^-8. Our large sample size allows us to study the dependence of correlation strength and significance on specific source types and gamma-ray energy band. As a result, we find that the correlation is very significant (chance probability < 10^-7) for both flat spectrum radio quasars and BL Lac objects separately; a dependence of the correlation strength on the considered gamma-ray energy band is also present, but additional data will be necessary to constrain its significance.

  20. The Radio/Gamma-Ray Connection in Active Galactic Nuclei in the Era of the Fermi Large Area Telescope

    NASA Technical Reports Server (NTRS)

    Ackermann, M.; Ajello, M.; Allafort, A.; Angelakis, E.; Axelsson, M.; Baldini, L.; Ballet, J.; Barbiellini, G.; Bastieri, D.; Bellazzini, R.; ...

    2011-01-01

    We present a detailed statistical analysis of the correlation between radio and gamma-ray emission of the active galactic nuclei (AGNs) detected by Fermi during its first year of operation, with the largest data sets ever used for this purpose. We use both archival interferometric 8.4 GHz data (from the Very Large Array and ATCA, for the full sample of 599 sources) and concurrent single-dish 15 GHz measurements from the Owens Valley Radio Observatory (OVRO, for a sub sample of 199 objects). Our unprecedentedly large sample permits us to assess with high accuracy the statistical significance of the correlation, using a surrogate data method designed to simultaneously account for common-distance bias and the effect of a limited dynamical range in the observed quantities. We find that the statistical significance of a positive correlation between the centimeter radio and the broadband (E > 100 MeV) gamma-ray energy flux is very high for the whole AGN sample, with a probability of <10^-7 for the correlation appearing by chance. Using the OVRO data, we find that concurrent data improve the significance of the correlation from 1.6 × 10^-6 to 9.0 × 10^-8. Our large sample size allows us to study the dependence of correlation strength and significance on specific source types and gamma-ray energy band. We find that the correlation is very significant (chance probability < 10^-7) for both flat spectrum radio quasars and BL Lac objects separately; a dependence of the correlation strength on the considered gamma-ray energy band is also present, but additional data will be necessary to constrain its significance.
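
    As a deliberately simplified illustration of assessing a chance probability for a flux-flux correlation, the sketch below runs a plain permutation test on synthetic fluxes. The paper's surrogate-data method additionally accounts for common-distance bias and limited dynamical range, which this sketch does not.

```python
# Plain permutation test for a radio/gamma-ray flux correlation (synthetic data).
import numpy as np

rng = np.random.default_rng(9)
n_src = 199
log_radio = rng.normal(0.0, 0.5, size=n_src)
log_gamma = 0.6 * log_radio + rng.normal(0.0, 0.5, size=n_src)   # built-in correlation

r_obs = np.corrcoef(log_radio, log_gamma)[0, 1]
n_perm = 10_000
r_null = np.array([np.corrcoef(log_radio, rng.permutation(log_gamma))[0, 1]
                   for _ in range(n_perm)])
p_chance = (np.sum(r_null >= r_obs) + 1) / (n_perm + 1)
print(f"observed r = {r_obs:.2f}, chance probability < {p_chance:.4f}")
```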

  1. A comparison of the social competence of children with moderate intellectual disability in inclusive versus segregated school settings.

    PubMed

    Hardiman, Sharon; Guerin, Suzanne; Fitzsimons, Elaine

    2009-01-01

    This is the first study to compare the social competence of children with moderate intellectual disability in inclusive versus segregated school settings in the Republic of Ireland. A convenience sample was recruited through two large ID services. The sample comprised 45 children across two groups: Group 1 (n=20; inclusive school) and Group 2 (n=25; segregated school). Parents and teachers completed the Strengths and Difficulties Questionnaire and the Adaptive Behaviour Scale-School: 2nd edition. A series of 2 x 2 ANOVAs were carried out on social competence scores using educational placement type (inclusive vs segregated school) and proxy rater (parent vs teacher) as the independent variables. Key findings indicated that children in inclusive schools did not differ significantly from children in segregated schools on the majority of proxy ratings of social competence. This supports the belief that children with intellectual disabilities can function well in different educational settings. Present findings highlight the importance of utilising the functional model of ID when selecting and designing school placements for children with moderate ID.

  2. How to test validity in orthodontic research: a mixed dentition analysis example.

    PubMed

    Donatelli, Richard E; Lee, Shin-Jae

    2015-02-01

    The data used to test the validity of a prediction method should be different from the data used to generate the prediction model. In this study, we explored whether an independent data set is mandatory for testing the validity of a new prediction method and how validity can be tested without independent new data. Several validation methods were compared in an example using the data from a mixed dentition analysis with a regression model. The validation errors of real mixed dentition analysis data and simulation data were analyzed for increasingly large data sets. The validation results of both the real and the simulation studies demonstrated that the leave-1-out cross-validation method had the smallest errors. The largest errors occurred in the traditional simple validation method. The differences between the validation methods diminished as the sample size increased. The leave-1-out cross-validation method seems to be an optimal validation method for improving the prediction accuracy in a data set with limited sample sizes. Copyright © 2015 American Association of Orthodontists. Published by Elsevier Inc. All rights reserved.
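
    As a rough illustration of why leave-one-out cross-validation reuses limited data more efficiently than a single train/test split, the sketch below compares the two on a synthetic stand-in for mixed dentition measurements (scikit-learn assumed; variable names, sizes, and coefficients are illustrative, not the study's data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, train_test_split, cross_val_score

# Synthetic stand-in for mixed dentition data: predict unerupted canine/premolar
# widths (y) from erupted incisor widths (X).  Names and sizes are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(23.0, 1.5, size=(60, 1))           # summed incisor widths (mm)
y = 0.5 * X[:, 0] + 9.5 + rng.normal(0, 0.6, 60)  # summed canine+premolar widths (mm)

model = LinearRegression()

# Traditional simple validation: one random train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
simple_err = np.mean(np.abs(model.fit(X_tr, y_tr).predict(X_te) - y_te))

# Leave-one-out cross-validation: every record serves as the test set once.
loo_err = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_absolute_error").mean()

print(f"simple split MAE: {simple_err:.3f} mm, LOOCV MAE: {loo_err:.3f} mm")
```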

  3. Strategies for Interactive Visualization of Large Scale Climate Simulations

    NASA Astrophysics Data System (ADS)

    Xie, J.; Chen, C.; Ma, K.; Parvis

    2011-12-01

    With the advances in computational methods and supercomputing technology, climate scientists are able to perform large-scale simulations at unprecedented resolutions. These simulations produce data that are time-varying, multivariate, and volumetric, and the data may contain thousands of time steps with each time step having billions of voxels and each voxel recording dozens of variables. Visualizing such time-varying 3D data to examine correlations between different variables thus becomes a daunting task. We have been developing strategies for interactive visualization and correlation analysis of multivariate data. The primary task is to find connection and correlation among data. Given the many complex interactions among the Earth's oceans, atmosphere, land, ice and biogeochemistry, and the sheer size of observational and climate model data sets, interactive exploration helps identify which processes matter most for a particular climate phenomenon. We may consider time-varying data as a set of samples (e.g., voxels or blocks), each of which is associated with a vector of representative or collective values over time. We refer to such a vector as a temporal curve. Correlation analysis thus operates on temporal curves of data samples. A temporal curve can be treated as a two-dimensional function where the two dimensions are time and data value. It can also be treated as a point in the high-dimensional space. In this case, to facilitate effective analysis, it is often necessary to transform temporal curve data from the original space to a space of lower dimensionality. Clustering and segmentation of temporal curve data in the original or transformed space provides us a way to categorize and visualize data of different patterns, which reveals connection or correlation of data among different variables or at different spatial locations. We have employed the power of the GPU to enable interactive correlation visualization for studying the variability and correlations of a single or a pair of variables. It is desired to create a succinct volume classification that summarizes the connection among all correlation volumes with respect to various reference locations. Since a reference location must correspond to a voxel position, the number of correlation volumes equals the total number of voxels. A brute-force solution takes all correlation volumes as the input and classifies their corresponding voxels according to their correlation volumes' distance. For large-scale time-varying multivariate data, calculating all these correlation volumes on-the-fly and analyzing the relationships among them is not feasible. We have developed a sampling-based approach for volume classification in order to reduce the computation cost of computing the correlation volumes. Users are able to employ their domain knowledge in selecting important samples. The result is a static view that captures the essence of correlation relationships; i.e., for all voxels in the same cluster, their corresponding correlation volumes are similar. This sampling-based approach enables us to obtain an approximation of correlation relations in a cost-effective manner, thus leading to a scalable solution to investigate large-scale data sets. These techniques empower climate scientists to study large data from their simulations.
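
    A minimal sketch of the temporal-curve and sampling ideas described above is given below: each voxel's time series is a temporal curve, a correlation volume holds the correlation of every curve with one reference voxel, and a small set of sampled reference locations stands in for the full (infeasible) set before clustering. All names, sizes, and the k-means step are illustrative assumptions, not the authors' GPU implementation.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy time-varying volume: (T time steps, N voxels).  Real data would have
# billions of voxels; here everything is small and synthetic.
rng = np.random.default_rng(0)
T, N = 48, 500
base = np.sin(np.linspace(0, 4 * np.pi, T))
curves = np.outer(base, rng.choice([1.0, -1.0, 0.2], size=N))
curves += 0.3 * rng.normal(size=(T, N))          # temporal curves, one per voxel

def correlation_volume(curves, ref_index):
    """Correlation of every voxel's temporal curve with one reference voxel."""
    ref = curves[:, ref_index]
    c = curves - curves.mean(axis=0)
    r = ref - ref.mean()
    return (c * r[:, None]).sum(0) / (np.linalg.norm(c, axis=0) * np.linalg.norm(r) + 1e-12)

# Sampling-based classification: use a handful of sampled reference voxels
# instead of all N, then cluster voxels by their correlation signatures.
sample_refs = rng.choice(N, size=8, replace=False)
signatures = np.column_stack([correlation_volume(curves, i) for i in sample_refs])
_, labels = kmeans2(signatures, 3, seed=0, minit="++")
print("voxels per cluster:", np.bincount(labels))
```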

  4. Reducing Information Overload in Large Seismic Data Sets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    HAMPTON, JEFFERY W.; YOUNG, CHRISTOPHER J.; MERCHANT, BION J.

    2000-08-02

    Event catalogs for seismic data can become very large. Furthermore, as researchers collect multiple catalogs and reconcile them into a single catalog that is stored in a relational database, the reconciled set becomes even larger. The sheer number of these events makes searching for relevant events to compare with events of interest problematic. Information overload in this form can lead to the data sets being under-utilized and/or used incorrectly or inconsistently. Thus, efforts have been initiated to research techniques and strategies for helping researchers to make better use of large data sets. In this paper, the authors present their efforts to do so in two ways: (1) the Event Search Engine, which is a waveform correlation tool and (2) some content analysis tools, which are a combination of custom-built and commercial off-the-shelf tools for accessing, managing, and querying seismic data stored in a relational database. The current Event Search Engine is based on a hierarchical clustering tool known as the dendrogram tool, which is written as a MatSeis graphical user interface. The dendrogram tool allows the user to build dendrogram diagrams for a set of waveforms by controlling phase windowing, down-sampling, filtering, enveloping, and the clustering method (e.g. single linkage, complete linkage, flexible method). It also allows the clustering to be based on two or more stations simultaneously, which is important to bridge gaps in the sparsely recorded event sets anticipated in such a large reconciled event set. Current efforts are focusing on tools to help the researcher winnow the clusters defined using the dendrogram tool down to the minimum optimal identification set. This will become critical as the number of reference events in the reconciled event set continually grows. The dendrogram tool is part of the MatSeis analysis package, which is available on the Nuclear Explosion Monitoring Research and Engineering Program Web Site. As part of the research into how to winnow the reference events in these large reconciled event sets, additional database query approaches have been developed to provide windows into these datasets. These custom-built content analysis tools help identify dataset characteristics that can potentially aid in providing a basis for comparing similar reference events in these large reconciled event sets. Once these characteristics can be identified, algorithms can be developed to create and add to the reduced set of events used by the Event Search Engine. These content analysis tools have already been useful in providing information on station coverage of the referenced events and basic statistical information on events in the research datasets. The tools can also provide researchers with a quick way to find interesting and useful events within the research datasets. The tools could also be used as a means to review reference event datasets as part of a dataset delivery verification process. There has also been an effort to explore the usefulness of commercially available web-based software to help with this problem. The advantages of using off-the-shelf software applications, such as Oracle's WebDB, to manipulate, customize and manage research data are being investigated. These types of applications are being examined to provide access to large integrated data sets for regional seismic research in Asia. All of these software tools would provide the researcher with unprecedented power without having to learn the intricacies and complexities of relational database systems.
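
    The dendrogram tool itself is a MatSeis GUI, but the underlying waveform-clustering step can be sketched with standard hierarchical clustering, as below (Python with SciPy assumed; synthetic waveforms, and the correlation distance, complete linkage, and cut threshold are illustrative choices, not the tool's defaults).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Synthetic "event waveforms": three underlying source signatures plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
templates = [np.sin(2 * np.pi * f * t) * np.exp(-4 * t) for f in (5, 9, 14)]
waveforms = np.array([templates[i % 3] + 0.2 * rng.normal(size=t.size)
                      for i in range(24)])

# Correlation distance between waveforms, then agglomerative clustering.
# 'complete' linkage is one of the linkage choices mentioned for the tool.
dist = pdist(waveforms, metric="correlation")
tree = linkage(dist, method="complete")

# Cut the dendrogram so that clusters are tighter than a chosen
# correlation-distance threshold (0.5 here, purely illustrative).
labels = fcluster(tree, t=0.5, criterion="distance")
print("cluster sizes:", np.bincount(labels)[1:])
```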

  5. TOFSIMS-P: a web-based platform for analysis of large-scale TOF-SIMS data.

    PubMed

    Yun, So Jeong; Park, Ji-Won; Choi, Il Ju; Kang, Byeongsoo; Kim, Hark Kyun; Moon, Dae Won; Lee, Tae Geol; Hwang, Daehee

    2011-12-15

    Time-of-flight secondary ion mass spectrometry (TOF-SIMS) has been a useful tool to profile secondary ions from the near surface region of specimens with its high molecular specificity and submicrometer spatial resolution. However, the TOF-SIMS analysis of even a moderately large number of samples has been hampered due to the lack of tools for automatically analyzing the huge amount of TOF-SIMS data. Here, we present a computational platform to automatically identify and align peaks, find discriminatory ions, build a classifier, and construct networks describing differential metabolic pathways. To demonstrate the utility of the platform, we analyzed 43 data sets generated from seven gastric cancer and eight normal tissues using TOF-SIMS. A total of 87,138 ions were detected from the 43 data sets by TOF-SIMS. We selected and then aligned 1286 ions. Among them, we found the 66 ions discriminating gastric cancer tissues from normal ones. Using these 66 ions, we then built a partial least square-discriminant analysis (PLS-DA) model resulting in a misclassification error rate of 0.024. Finally, network analysis of the 66 ions showed dysregulation of amino acid metabolism in the gastric cancer tissues. The results show that the proposed framework was effective in analyzing TOF-SIMS data from a moderately large number of samples, resulting in discrimination of gastric cancer tissues from normal tissues and identification of biomarker candidates associated with the amino acid metabolism.
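
    A PLS-DA classifier of the kind described can be approximated with an ordinary PLS regression against 0/1 class labels, as in the hedged sketch below (scikit-learn assumed; the peak-intensity matrix, sample counts, and two-component choice are synthetic placeholders, not the TOFSIMS-P pipeline).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

# Toy peak-intensity matrix: 15 tissues x 66 discriminatory ions
# (7 cancer = 1, 8 normal = 0).  Values are synthetic.
rng = np.random.default_rng(0)
y = np.array([1] * 7 + [0] * 8)
X = rng.normal(size=(15, 66)) + y[:, None] * rng.normal(0.8, 0.1, size=66)

# PLS-DA = PLS regression against the 0/1 class label, thresholded at 0.5.
errors = 0
for train, test in LeaveOneOut().split(X):
    pls = PLSRegression(n_components=2)
    pls.fit(X[train], y[train])
    pred = (pls.predict(X[test]).ravel() > 0.5).astype(int)
    errors += int(pred[0] != y[test][0])

print(f"misclassification error rate: {errors / len(y):.3f}")
```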

  6. Molecular diagnosis of malaria by photo-induced electron transfer fluorogenic primers: PET-PCR.

    PubMed

    Lucchi, Naomi W; Narayanan, Jothikumar; Karell, Mara A; Xayavong, Maniphet; Kariuki, Simon; DaSilva, Alexandre J; Hill, Vincent; Udhayakumar, Venkatachalam

    2013-01-01

    There is a critical need for developing new malaria diagnostic tools that are sensitive, cost effective and capable of performing large scale diagnosis. The real-time PCR methods are particularly robust for large scale screening and they can be used in malaria control and elimination programs. We have designed novel self-quenching photo-induced electron transfer (PET) fluorogenic primers for the detection of P. falciparum and the Plasmodium genus by real-time PCR. A total of 119 samples consisting of different malaria species and mixed infections were used to test the utility of the novel PET-PCR primers in the diagnosis of clinical samples. The sensitivity and specificity were calculated using a nested PCR as the gold standard and the novel primer sets demonstrated 100% sensitivity and specificity. The limit of detection for P. falciparum was shown to be 3.2 parasites/µl using both Plasmodium genus and P. falciparum-specific primers and 5.8 parasites/µl for P. ovale, 3.5 parasites/µl for P. malariae and 5 parasites/µl for P. vivax using the genus specific primer set. Moreover, the reaction can be duplexed to detect both Plasmodium spp. and P. falciparum in a single reaction. The PET-PCR assay does not require internal probes or intercalating dyes which makes it convenient to use and less expensive than other real-time PCR diagnostic formats. Further validation of this technique in the field will help to assess its utility for large scale screening in malaria control and elimination programs.

  7. A global database of ant species abundances.

    PubMed

    Gibb, Heloise; Dunn, Rob R; Sanders, Nathan J; Grossman, Blair F; Photakis, Manoli; Abril, Silvia; Agosti, Donat; Andersen, Alan N; Angulo, Elena; Armbrecht, Inge; Arnan, Xavier; Baccaro, Fabricio B; Bishop, Tom R; Boulay, Raphaël; Brühl, Carsten; Castracani, Cristina; Cerda, Xim; Del Toro, Israel; Delsinne, Thibaut; Diaz, Mireia; Donoso, David A; Ellison, Aaron M; Enriquez, Martha L; Fayle, Tom M; Feener, Donald H; Fisher, Brian L; Fisher, Robert N; Fitzpatrick, Matthew C; Gómez, Crisanto; Gotelli, Nicholas J; Gove, Aaron; Grasso, Donato A; Groc, Sarah; Guenard, Benoit; Gunawardene, Nihara; Heterick, Brian; Hoffmann, Benjamin; Janda, Milan; Jenkins, Clinton; Kaspari, Michael; Klimes, Petr; Lach, Lori; Laeger, Thomas; Lattke, John; Leponce, Maurice; Lessard, Jean-Philippe; Longino, John; Lucky, Andrea; Luke, Sarah H; Majer, Jonathan; McGlynn, Terrence P; Menke, Sean; Mezger, Dirk; Mori, Alessandra; Moses, Jimmy; Munyai, Thinandavha Caswell; Pacheco, Renata; Paknia, Omid; Pearce-Duvet, Jessica; Pfeiffer, Martin; Philpott, Stacy M; Resasco, Julian; Retana, Javier; Silva, Rogerio R; Sorger, Magdalena D; Souza, Jorge; Suarez, Andrew; Tista, Melanie; Vasconcelos, Heraldo L; Vonshak, Merav; Weiser, Michael D; Yates, Michelle; Parr, Catherine L

    2017-03-01

    What forces structure ecological assemblages? A key limitation to general insights about assemblage structure is the availability of data that are collected at a small spatial grain (local assemblages) and a large spatial extent (global coverage). Here, we present published and unpublished data from 51,388 ant abundance and occurrence records of more than 2,693 species and 7,953 morphospecies from local assemblages collected at 4,212 locations around the world. Ants were selected because they are diverse and abundant globally, comprise a large fraction of animal biomass in most terrestrial communities, and are key contributors to a range of ecosystem functions. Data were collected between 1949 and 2014, and include, for each geo-referenced sampling site, both the identity of the ants collected and details of sampling design, habitat type, and degree of disturbance. The aim of compiling this data set was to provide comprehensive species abundance data in order to test relationships between assemblage structure and environmental and biogeographic factors. Data were collected using a variety of standardized methods, such as pitfall and Winkler traps, and will be valuable for studies investigating large-scale forces structuring local assemblages. Understanding such relationships is particularly critical under current rates of global change. We encourage authors holding additional data on systematically collected ant assemblages, especially those in dry, cold, and remote areas, to contact us and contribute their data to this growing data set. © 2016 by the Ecological Society of America.

  8. Identifying personal microbiomes using metagenomic codes

    PubMed Central

    Franzosa, Eric A.; Huang, Katherine; Meadow, James F.; Gevers, Dirk; Lemon, Katherine P.; Bohannan, Brendan J. M.; Huttenhower, Curtis

    2015-01-01

    Community composition within the human microbiome varies across individuals, but it remains unknown if this variation is sufficient to uniquely identify individuals within large populations or stable enough to identify them over time. We investigated this by developing a hitting set-based coding algorithm and applying it to the Human Microbiome Project population. Our approach defined body site-specific metagenomic codes: sets of microbial taxa or genes prioritized to uniquely and stably identify individuals. Codes capturing strain variation in clade-specific marker genes were able to distinguish among 100s of individuals at an initial sampling time point. In comparisons with follow-up samples collected 30–300 d later, ∼30% of individuals could still be uniquely pinpointed using metagenomic codes from a typical body site; coincidental (false positive) matches were rare. Codes based on the gut microbiome were exceptionally stable and pinpointed >80% of individuals. The failure of a code to match its owner at a later time point was largely explained by the loss of specific microbial strains (at current limits of detection) and was only weakly associated with the length of the sampling interval. In addition to highlighting patterns of temporal variation in the ecology of the human microbiome, this work demonstrates the feasibility of microbiome-based identifiability—a result with important ethical implications for microbiome study design. The datasets and code used in this work are available for download from huttenhower.sph.harvard.edu/idability. PMID:25964341
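
    The hitting-set flavor of the coding algorithm can be sketched with a simple greedy routine: repeatedly pick a feature carried by the target individual that rules out the most remaining individuals. The code below is an illustrative approximation under that assumption, not the published idability implementation; the presence/absence table and all names are synthetic.

```python
import numpy as np

def greedy_code(features, target, max_size=10):
    """Pick features present in `target` that jointly rule out every other
    individual (a greedy hitting set over the "distinguish me from X" sets).

    features : (n_individuals, n_features) boolean presence/absence matrix
    target   : row index of the individual to encode
    """
    present = np.flatnonzero(features[target])          # candidate code features
    others = np.delete(np.arange(features.shape[0]), target)
    uncovered = set(others)                              # individuals not yet ruled out
    code = []
    while uncovered and len(code) < max_size:
        # Feature that rules out (hits) the most still-uncovered individuals:
        # an individual is ruled out if it lacks a feature the target carries.
        best = max(present, key=lambda f: sum(not features[o, f] for o in uncovered))
        code.append(int(best))
        uncovered = {o for o in uncovered if features[o, best]}
    return code, len(uncovered) == 0

# Toy microbiome presence/absence table: 50 individuals x 200 marker features.
rng = np.random.default_rng(0)
table = rng.random((50, 200)) < 0.3
code, unique = greedy_code(table, target=0)
print("code features:", code, "uniquely identifying:", unique)
```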

  9. Comparison of Feature Selection Techniques in Machine Learning for Anatomical Brain MRI in Dementia.

    PubMed

    Tohka, Jussi; Moradi, Elaheh; Huttunen, Heikki

    2016-07-01

    We present a comparative split-half resampling analysis of various data driven feature selection and classification methods for the whole brain voxel-based classification analysis of anatomical magnetic resonance images. We compared support vector machines (SVMs), with or without filter based feature selection, several embedded feature selection methods and stability selection. While comparisons of the accuracy of various classification methods have been reported previously, the variability of the out-of-training sample classification accuracy and the set of selected features due to independent training and test sets have not been previously addressed in a brain imaging context. We studied two classification problems: 1) Alzheimer's disease (AD) vs. normal control (NC) and 2) mild cognitive impairment (MCI) vs. NC classification. In AD vs. NC classification, the variability in the test accuracy due to the subject sample did not vary between different methods and exceeded the variability due to different classifiers. In MCI vs. NC classification, particularly with a large training set, embedded feature selection methods outperformed SVM-based ones with the difference in the test accuracy exceeding the test accuracy variability due to the subject sample. The filter and embedded methods produced divergent feature patterns for MCI vs. NC classification that suggests the utility of the embedded feature selection for this problem when linked with the good generalization performance. The stability of the feature sets was strongly correlated with the number of features selected, weakly correlated with the stability of classification accuracy, and uncorrelated with the average classification accuracy.

  10. Reinforced dynamics for enhanced sampling in large atomic and molecular systems

    NASA Astrophysics Data System (ADS)

    Zhang, Linfeng; Wang, Han; E, Weinan

    2018-03-01

    A new approach for efficiently exploring the configuration space and computing the free energy of large atomic and molecular systems is proposed, motivated by an analogy with reinforcement learning. There are two major components in this new approach. Like metadynamics, it allows for an efficient exploration of the configuration space by adding an adaptively computed biasing potential to the original dynamics. Like deep reinforcement learning, this biasing potential is trained on the fly using deep neural networks, with data collected judiciously from the exploration and an uncertainty indicator from the neural network model playing the role of the reward function. Parameterization using neural networks makes it feasible to handle cases with a large set of collective variables. This has the potential advantage that selecting precisely the right set of collective variables has now become less critical for capturing the structural transformations of the system. The method is illustrated by studying the full-atom explicit solvent models of alanine dipeptide and tripeptide, as well as the system of a polyalanine-10 molecule with 20 collective variables.

  11. A robust method of thin plate spline and its application to DEM construction

    NASA Astrophysics Data System (ADS)

    Chen, Chuanfa; Li, Yanyan

    2012-11-01

    In order to avoid the ill-conditioning problem of thin plate spline (TPS), the orthogonal least squares (OLS) method was introduced, and a modified OLS (MOLS) was developed. The MOLS of TPS (TPS-M) can not only select significant points, termed knots, from large and dense sampling data sets, but also easily compute the weights of the knots in terms of back-substitution. For interpolating large sampling points, we developed a local TPS-M, where some neighbor sampling points around the point being estimated are selected for computation. Numerical tests indicate that irrespective of sampling noise level, the average performance of TPS-M compares favorably with that of smoothing TPS. Under the same simulation accuracy, the computational time of TPS-M decreases with the increase of the number of sampling points. The smooth fitting results on lidar-derived noise data indicate that TPS-M has an obvious smoothing effect, which is on par with smoothing TPS. The example of constructing a series of large scale DEMs, located in Shandong province, China, was employed to comparatively analyze the estimation accuracies of the two versions of TPS and the classical interpolation methods including inverse distance weighting (IDW), ordinary kriging (OK) and universal kriging with the second-order drift function (UK). Results show that regardless of sampling interval and spatial resolution, TPS-M is more accurate than the classical interpolation methods, except for the smoothing TPS at the finest sampling interval of 20 m, and the two versions of kriging at the spatial resolution of 15 m. In conclusion, TPS-M, which avoids the ill-conditioning problem, is considered as a robust method for DEM construction.
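
    TPS-M and its MOLS knot selection are not available in standard libraries; as a stand-in, the sketch below interpolates noisy scattered elevation samples with SciPy's thin-plate-spline radial basis functions, using a smoothing parameter and a local-neighbor restriction in the same spirit as the local TPS-M (all values, sizes, and the synthetic terrain are illustrative assumptions).

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Scattered elevation samples from a synthetic terrain surface (noise added).
rng = np.random.default_rng(0)
pts = rng.uniform(0, 1000, size=(400, 2))                        # x, y in metres
z = 50 * np.sin(pts[:, 0] / 200) + 30 * np.cos(pts[:, 1] / 150)  # "true" terrain
z_noisy = z + rng.normal(0, 2.0, size=z.size)                    # sampling noise

# Smoothing thin plate spline; `neighbors` restricts each evaluation to nearby
# samples, the same local idea used to keep large problems tractable.
tps = RBFInterpolator(pts, z_noisy, kernel="thin_plate_spline",
                      smoothing=1.0, neighbors=50)

# Evaluate on a regular DEM grid.
gx, gy = np.meshgrid(np.linspace(0, 1000, 101), np.linspace(0, 1000, 101))
dem = tps(np.column_stack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
print("DEM grid:", dem.shape, "RMSE at samples:",
      np.sqrt(np.mean((tps(pts) - z) ** 2)).round(2))
```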

  12. Effects of strongman training on salivary testosterone levels in a sample of trained men.

    PubMed

    Ghigiarelli, Jamie J; Sell, Katie M; Raddock, Jessica M; Taveras, Kurt

    2013-03-01

    Strongman exercises consist of multi-joint movements that incorporate large muscle mass groups and impose a substantial amount of neuromuscular stress. The purpose of this study was to examine salivary testosterone responses from 2 novel strongman training (ST) protocols in comparison with an established hypertrophic (H) protocol reported to acutely elevate testosterone levels. Sixteen men (24 ± 4.4 years, 181.2 ± 6.8 cm, and 95.3 ± 20.3 kg) volunteered to participate in this study. Subjects completed 3 protocols designed to ensure equal total volume (sets and repetitions), rest period, and intensity between the groups. Exercise sets were performed to failure. Exercise selection and intensity (3 sets × 10 repetitions at 75% 1 repetition maximum) were chosen as they reflected commonly prescribed resistance exercise protocols recognized to elicit a large acute hormonal response. In each of the protocols, subjects were required to perform 3 sets to muscle failure of 5 different exercises (tire flip, chain drag, farmers walk, keg carry, and atlas stone lift) with a 2-minute rest interval between sets and a 3-minute rest interval between exercises. Saliva samples were collected pre-exercise (PRE), immediate postexercise (PST), and 30 minutes postexercise (30PST). Delta scores indicated a significant difference between PRE and PST testosterone level within each group (p ≤ 0.05), with no significant difference between the groups. Testosterone levels spiked 136% (225.23 ± 148.01 pg·ml(-1)) for the H group, 74% (132.04 ± 98.09 pg·ml(-1)) for the ST group, and 54% (122.10 ± 140.67 pg·ml(-1)) for the mixed strongman/hypertrophy (XST) group. A significant difference for testosterone level occurred over time (PST to 30PST) for the H group (p ≤ 0.05). In conclusion, ST elicits an acute endocrine response similar to a recognized H protocol when equated for duration and exercise intensity.

  13. Methods and apparatus of analyzing electrical power grid data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hafen, Ryan P.; Critchlow, Terence J.; Gibson, Tara D.

    Apparatus and methods of processing large-scale data regarding an electrical power grid are described. According to one aspect, a method of processing large-scale data regarding an electrical power grid includes accessing a large-scale data set comprising information regarding an electrical power grid; processing data of the large-scale data set to identify a filter which is configured to remove erroneous data from the large-scale data set; using the filter, removing erroneous data from the large-scale data set; and after the removing, processing data of the large-scale data set to identify an event detector which is configured to identify events of interest in the large-scale data set.

  14. Period Estimation for Sparsely-sampled Quasi-periodic Light Curves Applied to Miras

    NASA Astrophysics Data System (ADS)

    He, Shiyuan; Yuan, Wenlong; Huang, Jianhua Z.; Long, James; Macri, Lucas M.

    2016-12-01

    We develop a nonlinear semi-parametric Gaussian process model to estimate periods of Miras with sparsely sampled light curves. The model uses a sinusoidal basis for the periodic variation and a Gaussian process for the stochastic changes. We use maximum likelihood to estimate the period and the parameters of the Gaussian process, while integrating out the effects of other nuisance parameters in the model with respect to a suitable prior distribution obtained from earlier studies. Since the likelihood is highly multimodal for period, we implement a hybrid method that applies the quasi-Newton algorithm for Gaussian process parameters and search the period/frequency parameter space over a dense grid. A large-scale, high-fidelity simulation is conducted to mimic the sampling quality of Mira light curves obtained by the M33 Synoptic Stellar Survey. The simulated data set is publicly available and can serve as a testbed for future evaluation of different period estimation methods. The semi-parametric model outperforms an existing algorithm on this simulated test data set as measured by period recovery rate and quality of the resulting period-luminosity relations.
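
    The dense-grid part of the period search can be illustrated without the Gaussian process component: fit a sinusoid by least squares at every trial period and keep the best fit, as in the sketch below (synthetic, sparsely sampled light curve; the grid spacing, noise level, and 310-day period are arbitrary assumptions, not the survey's values).

```python
import numpy as np

def sinusoid_grid_search(t, mag, periods):
    """Least-squares sinusoid fit over a dense period grid; returns the period
    that minimises the residual sum of squares (a simple stand-in for the
    likelihood surface scanned in the paper)."""
    best_p, best_rss = None, np.inf
    for p in periods:
        w = 2 * np.pi / p
        # Design matrix: mean level + sine + cosine at the trial frequency.
        A = np.column_stack([np.ones_like(t), np.sin(w * t), np.cos(w * t)])
        coef, rss, *_ = np.linalg.lstsq(A, mag, rcond=None)
        rss = rss[0] if rss.size else np.sum((A @ coef - mag) ** 2)
        if rss < best_rss:
            best_p, best_rss = p, rss
    return best_p

# Sparsely, irregularly sampled Mira-like light curve with a 310-day period.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1500, size=40))
mag = 18.0 + 1.5 * np.sin(2 * np.pi * t / 310.0 + 0.7) + rng.normal(0, 0.15, t.size)

grid = np.arange(100.0, 600.0, 0.5)
print("recovered period:", sinusoid_grid_search(t, mag, grid), "days")
```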

  15. Evaluation of Two Outlier-Detection-Based Methods for Detecting Tissue-Selective Genes from Microarray Data

    PubMed Central

    Kadota, Koji; Konishi, Tomokazu; Shimizu, Kentaro

    2007-01-01

    Large-scale expression profiling using DNA microarrays enables identification of tissue-selective genes for which expression is considerably higher and/or lower in some tissues than in others. Among numerous possible methods, only two outlier-detection-based methods (an AIC-based method and Sprent’s non-parametric method) can treat equally various types of selective patterns, but they produce substantially different results. We investigated the performance of these two methods for different parameter settings and for a reduced number of samples. We focused on their ability to detect selective expression patterns robustly. We applied them to public microarray data collected from 36 normal human tissue samples and analyzed the effects of both changing the parameter settings and reducing the number of samples. The AIC-based method was more robust in both cases. The findings confirm that the use of the AIC-based method in the recently proposed ROKU method for detecting tissue-selective expression patterns is correct and that Sprent’s method is not suitable for ROKU. PMID:19936074

  16. Fast Reduction Method in Dominance-Based Information Systems

    NASA Astrophysics Data System (ADS)

    Li, Yan; Zhou, Qinghua; Wen, Yongchuan

    2018-01-01

    In real world applications, there are often some data with continuous values or preference-ordered values. Rough sets based on dominance relations can effectively deal with these kinds of data. Attribute reduction can be done in the framework of the dominance-relation-based approach to better extract decision rules. However, the computational cost of the dominance classes greatly affects the efficiency of attribute reduction and rule extraction. This paper presents an efficient method of computing dominance classes, and further compares it with the traditional method for increasing numbers of attributes and samples. Experiments on UCI data sets show that the proposed algorithm significantly improves the efficiency of the traditional method, especially for large-scale data.

  17. Using Remote Sensing to Determine the Spatial Scales of Estuaries

    NASA Astrophysics Data System (ADS)

    Davis, C. O.; Tufillaro, N.; Nahorniak, J.

    2016-02-01

    One challenge facing Earth system science is to understand and quantify the complexity of rivers, estuaries, and coastal zone regions. Earlier studies using data from airborne hyperspectral imagers (Bissett et al., 2004, Davis et al., 2007) demonstrated from a very limited data set that the spatial scales of the coastal ocean could be resolved with spatial sampling of 100 m Ground Sample Distance (GSD) or better. To develop a much larger data set, Aurin et al. (2013) used MODIS 250 m data for a wide range of coastal regions. Their conclusion was that farther offshore 500 m GSD was adequate to resolve large river plume features while nearshore regions (a few kilometers from the coast) needed higher spatial resolution data not available from MODIS. Building on our airborne experience, the Hyperspectral Imager for the Coastal Ocean (HICO, Lucke et al., 2011) was designed to provide hyperspectral data for the coastal ocean at 100 m GSD. HICO operated on the International Space Station for 5 years and collected over 10,000 scenes of the coastal ocean and other regions around the world. Here we analyze HICO data from an example set of major river delta regions to assess the spatial scales of variability in those systems. In one system, the San Francisco Bay and Delta, we also analyze Landsat 8 OLI data at 30 m and 15 m to validate the 100 m GSD sampling scale for the Bay and assess spatial sampling needed as you move up river.

  18. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining

    PubMed Central

    Hero, Alfred O.; Rajaratnam, Bala

    2015-01-01

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimensions. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks. PMID:27087700

  19. Closure and ratio correlation analysis of lunar chemical and grain size data

    NASA Technical Reports Server (NTRS)

    Butler, J. C.

    1976-01-01

    Major element and major element plus trace element analyses were selected from the lunar data base for Apollo 11, 12 and 15 basalt and regolith samples. Summary statistics for each of the six data sets were compiled, and the effects of closure on the Pearson product moment correlation coefficient were investigated using the Chayes and Kruskal approximation procedure. In general, there are two types of closure effects evident in these data sets: negative correlations of intermediate size which are solely the result of closure, and correlations of small absolute value which depart significantly from their expected closure correlations which are of intermediate size. It is shown that a positive closure correlation will arise only when the product of the coefficients of variation is very small (less than 0.01 for most data sets) and, in general, trace elements in the lunar data sets exhibit relatively large coefficients of variation.

  20. Search for neutrinoless double-electron capture of 156Dy

    NASA Astrophysics Data System (ADS)

    Finch, S. W.; Tornow, W.

    2015-12-01

    Background: Multiple large collaborations are currently searching for neutrinoless double-β decay, with the ultimate goal of differentiating the Majorana or Dirac nature of the neutrino. Purpose: Investigate the feasibility of resonant neutrinoless double-electron capture, an experimental alternative to neutrinoless double-β decay. Method: Two clover germanium detectors were operated underground in coincidence to search for the de-excitation γ rays of 156Gd following the neutrinoless double-electron capture of 156Dy. 231.95 d of data were collected at the Kimballton underground research facility with a 231.57 mg enriched 156Dy sample. Results: No counts were seen above background and half-life limits are set at O(10^16-10^18) yr for the various decay modes of 156Dy. Conclusion: Low background spectra were efficiently collected in the search for neutrinoless double-electron capture of 156Dy, although the low natural abundance and associated lack of large quantities of enriched samples hinders the experimental reach.

  1. A k-Vector Approach to Sampling, Interpolation, and Approximation

    NASA Astrophysics Data System (ADS)

    Mortari, Daniele; Rogers, Jonathan

    2013-12-01

    The k-vector search technique is a method designed to perform extremely fast range searching of large databases at computational cost independent of the size of the database. k-vector search algorithms have historically found application in satellite star-tracker navigation systems which index very large star catalogues repeatedly in the process of attitude estimation. Recently, the k-vector search algorithm has been applied to numerous other problem areas including non-uniform random variate sampling, interpolation of 1-D or 2-D tables, nonlinear function inversion, and solution of systems of nonlinear equations. This paper presents algorithms in which the k-vector search technique is used to solve each of these problems in a computationally-efficient manner. In instances where these tasks must be performed repeatedly on a static (or nearly-static) data set, the proposed k-vector-based algorithms offer an extremely fast solution technique that outperforms standard methods.
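
    A minimal version of the k-vector index can be written in a few lines: sort the database once, record how many entries fall below a straight "ramp" line at each index, and answer range queries by inverting that line instead of searching, as sketched below. This is an illustrative implementation under those assumptions, not the reference algorithm; boundary handling is deliberately generous and the final mask trims the candidate slice to the exact answer.

```python
import numpy as np

class KVector:
    """Minimal k-vector range-search index (a sketch of the idea, not the
    reference implementation).  The index is built once in O(n log n); each
    range query then costs O(1) index arithmetic plus the candidate slice."""

    def __init__(self, values):
        self.sorted = np.sort(np.asarray(values, dtype=float))
        n = self.sorted.size
        pad = 1e-9 * (self.sorted[-1] - self.sorted[0] + 1.0)
        # Straight "ramp" line through (0, min - pad) and (n - 1, max + pad).
        self.q = self.sorted[0] - pad
        self.m = (self.sorted[-1] + pad - self.q) / (n - 1)
        # k[i] = how many sorted values lie at or below the line at index i.
        line = self.q + self.m * np.arange(n)
        self.k = np.searchsorted(self.sorted, line, side="right")

    def range(self, lo, hi):
        """Return all stored values v with lo <= v <= hi."""
        n = self.sorted.size
        # One extra line step on each side keeps the candidate slice a superset;
        # the final mask trims it to the exact answer.
        i_lo = int(np.clip(np.floor((lo - self.q) / self.m) - 1, 0, n - 1))
        i_hi = int(np.clip(np.ceil((hi - self.q) / self.m) + 1, 0, n - 1))
        candidates = self.sorted[self.k[i_lo]:self.k[i_hi]]
        return candidates[(candidates >= lo) & (candidates <= hi)]

# Usage: index a large static "catalogue" once, then query ranges repeatedly.
rng = np.random.default_rng(0)
kv = KVector(rng.uniform(0.0, 100.0, size=1_000_000))
print(kv.range(42.0, 42.0005))
```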

  2. Tools for phospho- and glycoproteomics of plasma membranes.

    PubMed

    Wiśniewski, Jacek R

    2011-07-01

    Analysis of plasma membrane proteins and their posttranslational modifications is considered as important for identification of disease markers and targets for drug treatment. Due to their insolubility in water, studying plasma membrane proteins using mass spectrometry has long been difficult. Recent technological developments in sample preparation together with important improvements in mass spectrometric analysis have facilitated analysis of these proteins and their posttranslational modifications. Now, large scale proteomic analyses allow identification of thousands of membrane proteins from minute amounts of sample. Optimized protocols for affinity enrichment of phosphorylated and glycosylated peptides have set new dimensions in the depth of characterization of these posttranslational modifications of plasma membrane proteins. Here, I summarize recent advances in proteomic technology for the characterization of the cell surface proteins and their modifications. The focus is on approaches allowing large scale mapping rather than analytical methods suitable for studying individual proteins or non-complex mixtures.

  3. Detecting a Weak Association by Testing its Multiple Perturbations: a Data Mining Approach

    NASA Astrophysics Data System (ADS)

    Lo, Min-Tzu; Lee, Wen-Chung

    2014-05-01

    Many risk factors/interventions in epidemiologic/biomedical studies are of minuscule effects. To detect such weak associations, one needs a study with a very large sample size (the number of subjects, n). The n of a study can be increased but unfortunately only to an extent. Here, we propose a novel method which hinges on increasing sample size in a different direction: the total number of variables (p). We construct a p-based 'multiple perturbation test', and conduct power calculations and computer simulations to show that it can achieve a very high power to detect weak associations when p can be made very large. As a demonstration, we apply the method to analyze a genome-wide association study on age-related macular degeneration and identify two novel genetic variants that are significantly associated with the disease. The p-based method may set a stage for a new paradigm of statistical tests.

  4. Combining physical galaxy models with radio observations to constrain the SFRs of high-z dusty star-forming galaxies

    NASA Astrophysics Data System (ADS)

    Lo Faro, B.; Silva, L.; Franceschini, A.; Miller, N.; Efstathiou, A.

    2015-03-01

    We complement our previous analysis of a sample of z ~ 1-2 luminous and ultraluminous infrared galaxies [(U)LIRGs], by adding deep Very Large Array radio observations at 1.4 GHz to a large data set from the far-UV to the submillimetre, including Spitzer and Herschel data. Given the relatively small number of (U)LIRGs in our sample with high signal-to-noise (S/N) radio data, and to extend our study to a different family of galaxies, we also include six well-sampled near-infrared (near-IR)-selected BzK galaxies at z ~ 1.5. From our analysis based on the radiative transfer spectral synthesis code GRASIL, we find that, while the IR luminosity may be a biased tracer of the star formation rate (SFR) depending on the age of stars dominating the dust heating, the inclusion of the radio flux offers significantly tighter constraints on SFR. Our predicted SFRs are in good agreement with the estimates based on rest-frame radio luminosity and the Bell calibration. The extensive spectrophotometric coverage of our sample allows us to set important constraints on the star formation (SF) history of individual objects. For essentially all galaxies, we find evidence for a rather continuous SFR and a peak epoch of SF preceding that of the observation by a few Gyr. This seems to correspond to a formation redshift of z ~ 5-6. We finally show that our physical analysis may affect the interpretation of the SFR-M⋆ diagram, by possibly shifting, with respect to previous works, the position of the most dust obscured objects to higher M⋆ and lower SFRs.

  5. Falcon: Visual analysis of large, irregularly sampled, and multivariate time series data in additive manufacturing

    DOE PAGES

    Steed, Chad A.; Halsey, William; Dehoff, Ryan; ...

    2017-02-16

    Flexible visual analysis of long, high-resolution, and irregularly sampled time series data from multiple sensor streams is a challenge in several domains. In the field of additive manufacturing, this capability is critical for realizing the full potential of large-scale 3D printers. Here, we propose a visual analytics approach that helps additive manufacturing researchers acquire a deep understanding of patterns in log and imagery data collected by 3D printers. Our specific goals include discovering patterns related to defects and system performance issues, optimizing build configurations to avoid defects, and increasing production efficiency. We introduce Falcon, a new visual analytics system that allows users to interactively explore large, time-oriented data sets from multiple linked perspectives. Falcon provides overviews, detailed views, and unique segmented time series visualizations, all with adjustable scale options. To illustrate the effectiveness of Falcon at providing thorough and efficient knowledge discovery, we present a practical case study involving experts in additive manufacturing and data from a large-scale 3D printer. The techniques described are applicable to the analysis of any quantitative time series, though the focus of this paper is on additive manufacturing.

  6. Falcon: Visual analysis of large, irregularly sampled, and multivariate time series data in additive manufacturing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Steed, Chad A.; Halsey, William; Dehoff, Ryan

    Flexible visual analysis of long, high-resolution, and irregularly sampled time series data from multiple sensor streams is a challenge in several domains. In the field of additive manufacturing, this capability is critical for realizing the full potential of large-scale 3D printers. Here, we propose a visual analytics approach that helps additive manufacturing researchers acquire a deep understanding of patterns in log and imagery data collected by 3D printers. Our specific goals include discovering patterns related to defects and system performance issues, optimizing build configurations to avoid defects, and increasing production efficiency. We introduce Falcon, a new visual analytics system that allows users to interactively explore large, time-oriented data sets from multiple linked perspectives. Falcon provides overviews, detailed views, and unique segmented time series visualizations, all with adjustable scale options. To illustrate the effectiveness of Falcon at providing thorough and efficient knowledge discovery, we present a practical case study involving experts in additive manufacturing and data from a large-scale 3D printer. The techniques described are applicable to the analysis of any quantitative time series, though the focus of this paper is on additive manufacturing.

  7. Multiple object redshift determinations in clusters of galaxies using OCTOPUS

    NASA Astrophysics Data System (ADS)

    Mazure, A.; Proust, D.; Sodre, L.; Capelato, H. V.; Lund, G.

    1988-04-01

    The ESO multiobject facility, Octopus, was used to observe a sample of galaxy clusters such as SC2008-565 in an attempt to collect a large set of individual radial velocities. A dispersion of 114 A/mm was used, providing spectral coverage from 3800 to 5180 A. Octopus was found to be a well-adapted instrument for the rapid and simultaneous determination of redshifts in cataloged galaxy clusters.

  8. Multiple object redshift determinations in clusters of galaxies using OCTOPUS

    NASA Astrophysics Data System (ADS)

    Mazure, A.; Proust, D.; Sodre, L.; Lund, G.; Capelato, H.

    1987-03-01

    The ESO multiobject facility, Octopus, was used to observe a sample of galaxy clusters such as SC2008-565 in an attempt to collect a large set of individual radial velocities. A dispersion of 114 A/mm was used, providing spectral coverage from 3800 to 5180 A. Octopus was found to be a well-adapted instrument for the rapid and simultaneous determination of redshifts in cataloged galaxy clusters.

  9. Intensive Behavioral Treatment of Urinary Incontinence of Children with Autism Spectrum Disorders: An Archival Analysis of Procedures and Outcomes from an Outpatient Clinic

    ERIC Educational Resources Information Center

    Hanney, Nicole M.; Jostad, Candice M.; LeBlanc, Linda A.; Carr, James E.; Castile, Allison J.

    2013-01-01

    LeBlanc, Crossett, Bennett, Detweiler, and Carr (2005) described an outpatient model for conducting intensive toilet training with young children with autism using a modified Azrin and Foxx protocol. In this article, we summarize the use of the protocol in an outpatient setting and the outcomes achieved with a large sample of children with autism…

  10. Classification of Clinically Relevant Microorganisms in Non-Medical Environments

    DTIC Science & Technology

    2004-05-06

    settings largely due to the rapidity of its evolutionary response to treatment. The first antibiotic-resistant strains of S. aureus were isolated only...studies have assigned isolates of the bacteria to known strains. The objectives of this study were to collect, isolate and characterize samples of S ...internal fragments of seven genes were obtained for 36 S. aureus isolates and assigned a unique allelic profile. These profiles, like fingerprints

  11. Pre-fire fuel reduction treatments influence plant communities and exotic species 9 years after a large wildfire

    Treesearch

    Kristen L. Shive; Amanda M. Kuenzi; Carolyn H. Sieg; Peter Z. Fule

    2013-01-01

    We used a multi-year data set from the 2002 Rodeo-Chediski Fire to detect post-fire trends in plant community response in burned ponderosa pine forests. Within the burn perimeter, we examined the effects of pre-fire fuels treatments on post-fire vegetation by comparing paired treated and untreated sites on the Apache-Sitgreaves National Forest. We sampled these paired...

  12. Greenstone belts: Their boundaries, surrounding rock terrains and interrelationships

    NASA Technical Reports Server (NTRS)

    Percival, J. A.; Card, K. D.

    1986-01-01

    Greenstone belts are an important part of the fragmented record of crustal evolution, representing samples of the magmatic activity that formed much of the Earth's crust. Most belts developed rapidly, in less than 100 Ma, leaving large gaps in the geological record. Surrounding terrains provide information on the context of greenstone belts. The effects of tectonic setting, structural geometry and evolution, associated plutonic activity and sedimentation are discussed.

  13. Connecting HL Tau to the observed exoplanet sample

    NASA Astrophysics Data System (ADS)

    Simbulan, Christopher; Tamayo, Daniel; Petrovich, Cristobal; Rein, Hanno; Murray, Norman

    2017-08-01

    The Atacama Large Millimeter/submillimeter Array (ALMA) recently revealed a set of nearly concentric gaps in the protoplanetary disc surrounding the young star HL Tauri (HL Tau). If these are carved by forming gas giants, this provides the first set of orbital initial conditions for planets as they emerge from their birth discs. Using N-body integrations, we have followed the evolution of the system for 5 Gyr to explore the possible outcomes. We find that HL Tau initial conditions scaled down to the size of typically observed exoplanet orbits naturally produce several populations in the observed exoplanet sample. First, for a plausible range of planetary masses, we can match the observed eccentricity distribution of dynamically excited radial velocity giant planets with eccentricities >0.2. Secondly, we roughly obtain the observed rate of hot Jupiters around FGK stars. Finally, we obtain a large efficiency of planetary ejections of ≈2 per HL Tau-like system, but the small fraction of stars observed to host giant planets makes it hard to match the rate of free-floating planets inferred from microlensing observations. In view of upcoming Gaia results, we also provide predictions for the expected mutual inclination distribution, which is significantly broader than the absolute inclination distributions typically considered by previous studies.

  14. Communication: importance sampling including path correlation in semiclassical initial value representation calculations for time correlation functions.

    PubMed

    Pan, Feng; Tao, Guohua

    2013-03-07

    Full semiclassical (SC) initial value representation (IVR) for time correlation functions involves a double phase space average over a set of two phase points, each of which evolves along a classical path. Conventionally, the two initial phase points are sampled independently for all degrees of freedom (DOF) in the Monte Carlo procedure. Here, we present an efficient importance sampling scheme by including the path correlation between the two initial phase points for the bath DOF, which greatly improves the performance of the SC-IVR calculations for large molecular systems. Satisfactory convergence in the study of quantum coherence in vibrational relaxation has been achieved for a benchmark system-bath model with up to 21 DOF.

  15. Efficient Robust Regression via Two-Stage Generalized Empirical Likelihood

    PubMed Central

    Bondell, Howard D.; Stefanski, Leonard A.

    2013-01-01

    Large- and finite-sample efficiency and resistance to outliers are the key goals of robust statistics. Although often not simultaneously attainable, we develop and study a linear regression estimator that comes close. Efficiency obtains from the estimator’s close connection to generalized empirical likelihood, and its favorable robustness properties are obtained by constraining the associated sum of (weighted) squared residuals. We prove maximum attainable finite-sample replacement breakdown point, and full asymptotic efficiency for normal errors. Simulation evidence shows that compared to existing robust regression estimators, the new estimator has relatively high efficiency for small sample sizes, and comparable outlier resistance. The estimator is further illustrated and compared to existing methods via application to a real data set with purported outliers. PMID:23976805

  16. Fast Ordered Sampling of DNA Sequence Variants.

    PubMed

    Greenberg, Anthony J

    2018-05-04

    Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects. Copyright © 2018 Greenberg.
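
    The flavor of such an on-line, order-preserving sampler can be sketched with classic selection sampling, which visits each record once and keeps the chosen loci in file order. The sketch below is an illustrative Python stand-in, not the author's C++ implementation; the stream and record names are made up.

```python
import random

def ordered_sample(stream, n_total, k, seed=0):
    """Draw k records from a stream of n_total records, in their original
    order, visiting each record at most once (selection-sampling style, in the
    spirit of the tape-drive era algorithms the paper builds on)."""
    rng = random.Random(seed)
    chosen = []
    remaining, needed = n_total, k
    for record in stream:
        if needed == 0:
            break
        # Include this record with probability needed / remaining, which keeps
        # every subset of size k equally likely.
        if rng.random() * remaining < needed:
            chosen.append(record)
            needed -= 1
        remaining -= 1
    return chosen

# Usage: sample 10 "loci" from a million-record genotype stream without
# holding the file in memory.  Here the stream is a simple generator.
loci = (f"locus_{i}" for i in range(1_000_000))
print(ordered_sample(loci, n_total=1_000_000, k=10))
```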

  17. Circum-Arctic petroleum systems identified using decision-tree chemometrics

    USGS Publications Warehouse

    Peters, K.E.; Ramos, L.S.; Zumberge, J.E.; Valin, Z.C.; Scotese, C.R.; Gautier, D.L.

    2007-01-01

    Source- and age-related biomarker and isotopic data were measured for more than 1000 crude oil samples from wells and seeps collected above approximately 55°N latitude. A unique, multitiered chemometric (multivariate statistical) decision tree was created that allowed automated classification of 31 genetically distinct circum-Arctic oil families based on a training set of 622 oil samples. The method, which we call decision-tree chemometrics, uses principal components analysis and multiple tiers of K-nearest neighbor and SIMCA (soft independent modeling of class analogy) models to classify and assign confidence limits for newly acquired oil samples and source rock extracts. Geochemical data for each oil sample were also used to infer the age, lithology, organic matter input, depositional environment, and identity of its source rock. These results demonstrate the value of large petroleum databases where all samples were analyzed using the same procedures and instrumentation. Copyright © 2007. The American Association of Petroleum Geologists. All rights reserved.
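
    SIMCA models are not part of common Python libraries, but one tier of the decision tree (principal components followed by K-nearest neighbors) can be sketched as below; the biomarker table, family labels, and component counts are synthetic assumptions, not the published training set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic training set standing in for the oil biomarker/isotope table:
# 3 "families", 20 geochemical variables each.  Values are illustrative only.
rng = np.random.default_rng(0)
centers = rng.normal(0, 2, size=(3, 20))
X = np.vstack([c + rng.normal(0, 0.5, size=(60, 20)) for c in centers])
families = np.repeat(["family_A", "family_B", "family_C"], 60)

# One tier of the decision tree: project onto principal components, then
# assign new oils to the nearest family by K-nearest neighbors.
tier = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=5))
print("cross-validated accuracy:", cross_val_score(tier, X, families, cv=5).mean())

# Classifying a newly acquired oil sample (synthetic here).
tier.fit(X, families)
new_oil = centers[1] + rng.normal(0, 0.5, size=20)
print("assigned family:", tier.predict(new_oil.reshape(1, -1))[0])
```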

  18. Acoustic Enrichment of Extracellular Vesicles from Biological Fluids.

    PubMed

    Ku, Anson; Lim, Hooi Ching; Evander, Mikael; Lilja, Hans; Laurell, Thomas; Scheding, Stefan; Ceder, Yvonne

    2018-06-11

    Extracellular vesicles (EVs) have emerged as a rich source of biomarkers providing diagnostic and prognostic information in diseases such as cancer. Large-scale investigations into the contents of EVs in clinical cohorts are warranted, but a major obstacle is the lack of a rapid, reproducible, efficient, and low-cost methodology to enrich EVs. Here, we demonstrate the applicability of an automated acoustic-based technique to enrich EVs, termed acoustic trapping. Using this technology, we have successfully enriched EVs from cell culture conditioned media and urine and blood plasma from healthy volunteers. The acoustically trapped samples contained EVs ranging from exosomes to microvesicles in size and contained detectable levels of intravesicular microRNAs. Importantly, this method showed high reproducibility and yielded sufficient quantities of vesicles for downstream analysis. The enrichment could be obtained from a sample volume of 300 μL or less, an equivalent to 30 min of enrichment time, depending on the sensitivity of downstream analysis. Taken together, acoustic trapping provides a rapid, automated, low-volume compatible, and robust method to enrich EVs from biofluids. Thus, it may serve as a novel tool for EV enrichment from large number of samples in a clinical setting with minimum sample preparation.

  19. Breathing life into fisheries stock assessments with citizen science

    PubMed Central

    Fairclough, D. V.; Brown, J. I.; Carlish, B. J.; Crisafulli, B. M.; Keay, I. S.

    2014-01-01

    Citizen science offers a potentially cost-effective way for researchers to obtain large data sets over large spatial scales. However, it is not used widely to support biological data collection for fisheries stock assessments. Overfishing of demersal fishes along 1,000 km of the west Australian coast led to restrictive management to recover stocks. This diminished opportunities for scientists to cost-effectively monitor stock recovery via fishery-dependent sampling, particularly of the recreational fishing sector. As fishery-independent methods would be too expensive and logistically-challenging to implement, a citizen science program, Send us your skeletons (SUYS), was developed. SUYS asks recreational fishers to voluntarily donate fish skeletons of important species from their catch to allow biological data extraction by scientists to produce age structures and conduct stock assessment analyses. During SUYS, recreational fisher involvement, sample sizes and spatial and temporal coverage of samples have dramatically increased, while the collection cost per skeleton has declined substantially. SUYS is ensuring sampling objectives for stock assessments are achieved via fishery-dependent collection and reliable and timely scientific advice can be provided to managers. The program is also encouraging public ownership through involvement in the monitoring process, which can lead to greater acceptance of management decisions. PMID:25431103

  20. Breathing life into fisheries stock assessments with citizen science.

    PubMed

    Fairclough, D V; Brown, J I; Carlish, B J; Crisafulli, B M; Keay, I S

    2014-11-28

    Citizen science offers a potentially cost-effective way for researchers to obtain large data sets over large spatial scales. However, it is not used widely to support biological data collection for fisheries stock assessments. Overfishing of demersal fishes along 1,000 km of the west Australian coast led to restrictive management to recover stocks. This diminished opportunities for scientists to cost-effectively monitor stock recovery via fishery-dependent sampling, particularly of the recreational fishing sector. As fishery-independent methods would be too expensive and logistically-challenging to implement, a citizen science program, Send us your skeletons (SUYS), was developed. SUYS asks recreational fishers to voluntarily donate fish skeletons of important species from their catch to allow biological data extraction by scientists to produce age structures and conduct stock assessment analyses. During SUYS, recreational fisher involvement, sample sizes and spatial and temporal coverage of samples have dramatically increased, while the collection cost per skeleton has declined substantially. SUYS is ensuring sampling objectives for stock assessments are achieved via fishery-dependent collection and reliable and timely scientific advice can be provided to managers. The program is also encouraging public ownership through involvement in the monitoring process, which can lead to greater acceptance of management decisions.

  1. Sampling with poling-based flux balance analysis: optimal versus sub-optimal flux space analysis of Actinobacillus succinogenes.

    PubMed

    Binns, Michael; de Atauri, Pedro; Vlysidis, Anestis; Cascante, Marta; Theodoropoulos, Constantinos

    2015-02-18

    Flux balance analysis is traditionally implemented to identify the maximum theoretical flux for some specified reaction and a single distribution of flux values for all the reactions present which achieve this maximum value. However it is well known that the uncertainty in reaction networks due to branches, cycles and experimental errors results in a large number of combinations of internal reaction fluxes which can achieve the same optimal flux value. In this work, we have modified the applied linear objective of flux balance analysis to include a poling penalty function, which pushes each new set of reaction fluxes away from previous solutions generated. Repeated poling-based flux balance analysis generates a sample of different solutions (a characteristic set), which represents all the possible functionality of the reaction network. Compared to existing sampling methods, for the purpose of generating a relatively "small" characteristic set, our new method is shown to obtain a higher coverage than competing methods under most conditions. The influence of the linear objective function on the sampling (the linear bias) constrains optimisation results to a subspace of optimal solutions all producing the same maximal fluxes. Visualisation of reaction fluxes plotted against each other in 2 dimensions with and without the linear bias indicates the existence of correlations between fluxes. This method of sampling is applied to the organism Actinobacillus succinogenes for the production of succinic acid from glycerol. A new method of sampling for the generation of different flux distributions (sets of individual fluxes satisfying constraints on the steady-state mass balances of intermediates) has been developed using a relatively simple modification of flux balance analysis to include a poling penalty function inside the resulting optimisation objective function. This new methodology can achieve a high coverage of the possible flux space and can be used with and without linear bias to show optimal versus sub-optimal solution spaces. Basic analysis of the Actinobacillus succinogenes system using sampling shows that in order to achieve the maximal succinic acid production CO₂ must be taken into the system. Solutions involving release of CO₂ all give sub-optimal succinic acid production.
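
    To make the poling idea concrete, the sketch below repeatedly maximises a target flux on a toy stoichiometric network while a Gaussian repulsion term pushes each new solution away from those already collected. The toy network, bounds, penalty form, and weights are assumptions for illustration only, not the authors' formulation.

```python
# Minimal sketch of poling-based sampling of a flux space: maximise a target flux
# minus a penalty that repels each new solution from previously found ones.
# The toy network and the Gaussian penalty are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

S = np.array([[1, -1, -1,  0],   # toy stoichiometric matrix (rows: metabolites A, B, C)
              [0,  1,  0, -1],
              [0,  0,  1, -1]], dtype=float)
n = S.shape[1]
bounds = [(0.0, 10.0)] * n
target = 3                        # index of the "product" flux to maximise
solutions = []
rng = np.random.default_rng(0)

def objective(v, weight=5.0, width=1.0):
    poling = sum(np.exp(-np.sum((v - s) ** 2) / width ** 2) for s in solutions)
    return -v[target] + weight * poling   # maximise target flux, repel prior solutions

for _ in range(10):
    v0 = rng.uniform(0, 10, size=n)
    res = minimize(objective, v0, method="SLSQP", bounds=bounds,
                   constraints=[{"type": "eq", "fun": lambda v: S @ v}])   # steady state S v = 0
    if res.success:
        solutions.append(res.x)

print(np.round(np.array(solutions), 2))   # a small "characteristic set" of flux distributions
```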

  2. X-ray micro-CT and neutron CT as complementary imaging tools for non-destructive 3D imaging of rare silicified fossil plants

    NASA Astrophysics Data System (ADS)

    Karch, J.; Dudák, J.; Žemlička, J.; Vavřík, D.; Kumpová, I.; Kvaček, J.; Heřmanová, Z.; Šoltés, J.; Viererbl, L.; Morgano, M.; Kaestner, A.; Trtík, P.

    2017-12-01

    Computed tomography provides 3D information about the inner structures of investigated objects. The information obtained, however, depends strongly on the type of radiation used. Because X-rays interact with the electron cloud and neutrons with the atomic nucleus, the two modalities often provide different contrast of sample structures. In this work we present a set of comparative radiographic and CT measurements of rare fossil plant samples using X-rays and thermal neutrons. The X-ray measurements were performed using large-area photon-counting Timepix detectors at IEAP CTU in Prague and a Perkin Elmer flat-panel detector at the Center of Excellence Telč. The neutron CT measurement was carried out at the Paul Scherrer Institute using the BOA beamline. Furthermore, neutron radiography of fossil samples provided by the National Museum was performed using a large-area Timepix detector with a neutron-sensitive 6LiF converter layer at Research Centre Rez, Czech Republic. The results show the different capabilities of the two imaging approaches: while X-ray micro-CT provides very high resolution and enables visualization of fine cracks or small cavities in the samples, neutron imaging provides high contrast of morphological structures of fossil plant samples where X-ray imaging provides insufficient contrast.

  3. Datasets for Ostrava PMF paper

    EPA Pesticide Factsheets

    These data support a published journal paper described as follows: A 14-week investigation during warm and cold seasons was conducted to improve understanding of air pollution sources that might be impacting air quality in Ostrava, the Czech Republic. Fine particulate matter (PM2.5) samples were collected in consecutive 12-h day and night increments during spring and fall 2012 sampling campaigns. Sampling sites were strategically located to evaluate conditions in close proximity to a large steel works industrial complex, as well as away from direct influence of the industrial complex. These samples were analyzed for metals and other elements, organic and elemental (black) carbon, and polycyclic aromatic hydrocarbons (PAHs). The PM2.5 samples were supplemented with pollutant gases and meteorological parameters. We applied the EPA PMF v5.1 model with uncertainty estimate features to the Ostrava data set. Using the model's bootstrapping procedure and other considerations, six factors were determined to provide the optimum solution. Each model run consisted of 100 iterations to ensure that the solution represents a global minimum. The resulting factors were identified as representing coal (power plants), mixed Cl, crustal, industrial 1 (alkali metals and PAHs), industrial 2 (transition metals), and home heat/transportation. The home heating source is thought to be largely domestic boilers burning low quality fuels such as lignite, wood, and domestic waste. Transportation-r
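
    The record names EPA PMF v5.1, which is not reproduced here. As a loose, hedged illustration of the underlying idea, the snippet below factors a synthetic species-by-sample matrix into six non-negative factor profiles and contributions with scikit-learn's NMF, a simpler relative of PMF that omits PMF's uncertainty weighting and bootstrapping; all data and settings are made up.

```python
# Hedged illustration only: plain non-negative matrix factorization (NMF) as a
# simplified stand-in for EPA PMF (PMF additionally weights residuals by
# measurement uncertainties and supports bootstrap error estimation).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n_samples, n_species, n_factors = 120, 25, 6
true_G = rng.gamma(2.0, 1.0, size=(n_samples, n_factors))   # factor contributions per sample
true_F = rng.gamma(2.0, 1.0, size=(n_factors, n_species))   # factor chemical profiles
X = true_G @ true_F + rng.normal(0, 0.05, size=(n_samples, n_species)).clip(min=0)

model = NMF(n_components=n_factors, init="nndsvda", max_iter=1000, random_state=0)
G = model.fit_transform(X)   # estimated contributions (samples x factors)
F = model.components_        # estimated profiles (factors x species)
print("reconstruction error:", round(model.reconstruction_err_, 3))
```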

  4. Digitally programmable microfluidic automaton for multiscale combinatorial mixing and sample processing†

    PubMed Central

    Jensen, Erik C.; Stockton, Amanda M.; Chiesl, Thomas N.; Kim, Jungkyu; Bera, Abhisek; Mathies, Richard A.

    2013-01-01

    A digitally programmable microfluidic Automaton consisting of a 2-dimensional array of pneumatically actuated microvalves is programmed to perform new multiscale mixing and sample processing operations. Large (µL-scale) volume processing operations are enabled by precise metering of multiple reagents within individual nL-scale valves followed by serial repetitive transfer to programmed locations in the array. A novel process exploiting new combining valve concepts is developed for continuous rapid and complete mixing of reagents in less than 800 ms. Mixing, transfer, storage, and rinsing operations are implemented combinatorially to achieve complex assay automation protocols. The practical utility of this technology is demonstrated by performing automated serial dilution for quantitative analysis as well as the first demonstration of on-chip fluorescent derivatization of biomarker targets (carboxylic acids) for microchip capillary electrophoresis on the Mars Organic Analyzer. A language is developed to describe how unit operations are combined to form a microfluidic program. Finally, this technology is used to develop a novel microfluidic 6-sample processor for combinatorial mixing of large sets (>26 unique combinations) of reagents. The digitally programmable microfluidic Automaton is a versatile programmable sample processor for a wide range of process volumes, for multiple samples, and for different types of analyses. PMID:23172232

  5. METHOD FOR MICRORNA ISOLATION FROM CLINICAL SERUM SAMPLES

    PubMed Central

    Li, Yu; Kowdley, Kris V.

    2012-01-01

    MicroRNAs are a group of intracellular non-coding RNA molecules that have been implicated in a variety of human diseases. Due to their high stability in blood, microRNAs released into circulation could be potentially utilized as non-invasive biomarkers for diagnosis or prognosis. Current microRNA isolation protocols are specifically designed for solid tissues and are impractical for biomarker development utilizing small-volume serum samples on a large scale. Thus, a protocol for microRNA isolation from serum is needed to accommodate these conditions in biomarker development. To establish such a protocol, we developed a simplified approach to normalize sample input by using single synthetic spike-in microRNA. We evaluated three commonly used commercial microRNA isolation kits for the best performance by comparing RNA quality and yield. The manufacturer’s protocol was further modified to improve the microRNA yield from 200 μL of human serum. MicroRNAs isolated from a large set of clinical serum samples were tested on the miRCURY LNA real-time PCR panel and confirmed to be suitable for high-throughput microRNA profiling. In conclusion, we have established a proven method for microRNA isolation from clinical serum samples suitable for microRNA biomarker development. PMID:22982505

  6. Computational dissection of human episodic memory reveals mental process-specific genetic profiles

    PubMed Central

    Luksys, Gediminas; Fastenrath, Matthias; Coynel, David; Freytag, Virginie; Gschwind, Leo; Heck, Angela; Jessen, Frank; Maier, Wolfgang; Milnik, Annette; Riedel-Heller, Steffi G.; Scherer, Martin; Spalek, Klara; Vogler, Christian; Wagner, Michael; Wolfsgruber, Steffen; Papassotiropoulos, Andreas; de Quervain, Dominique J.-F.

    2015-01-01

    Episodic memory performance is the result of distinct mental processes, such as learning, memory maintenance, and emotional modulation of memory strength. Such processes can be effectively dissociated using computational models. Here we performed gene set enrichment analyses of model parameters estimated from the episodic memory performance of 1,765 healthy young adults. We report robust and replicated associations of the amine compound SLC (solute-carrier) transporters gene set with the learning rate, of the collagen formation and transmembrane receptor protein tyrosine kinase activity gene sets with the modulation of memory strength by negative emotional arousal, and of the L1 cell adhesion molecule (L1CAM) interactions gene set with the repetition-based memory improvement. Furthermore, in a large functional MRI sample of 795 subjects we found that the association between L1CAM interactions and memory maintenance revealed large clusters of differences in brain activity in frontal cortical areas. Our findings provide converging evidence that distinct genetic profiles underlie specific mental processes of human episodic memory. They also provide empirical support to previous theoretical and neurobiological studies linking specific neuromodulators to the learning rate and linking neural cell adhesion molecules to memory maintenance. Furthermore, our study suggests additional memory-related genetic pathways, which may contribute to a better understanding of the neurobiology of human memory. PMID:26261317

  7. Computational dissection of human episodic memory reveals mental process-specific genetic profiles.

    PubMed

    Luksys, Gediminas; Fastenrath, Matthias; Coynel, David; Freytag, Virginie; Gschwind, Leo; Heck, Angela; Jessen, Frank; Maier, Wolfgang; Milnik, Annette; Riedel-Heller, Steffi G; Scherer, Martin; Spalek, Klara; Vogler, Christian; Wagner, Michael; Wolfsgruber, Steffen; Papassotiropoulos, Andreas; de Quervain, Dominique J-F

    2015-09-01

    Episodic memory performance is the result of distinct mental processes, such as learning, memory maintenance, and emotional modulation of memory strength. Such processes can be effectively dissociated using computational models. Here we performed gene set enrichment analyses of model parameters estimated from the episodic memory performance of 1,765 healthy young adults. We report robust and replicated associations of the amine compound SLC (solute-carrier) transporters gene set with the learning rate, of the collagen formation and transmembrane receptor protein tyrosine kinase activity gene sets with the modulation of memory strength by negative emotional arousal, and of the L1 cell adhesion molecule (L1CAM) interactions gene set with the repetition-based memory improvement. Furthermore, in a large functional MRI sample of 795 subjects we found that the association between L1CAM interactions and memory maintenance revealed large clusters of differences in brain activity in frontal cortical areas. Our findings provide converging evidence that distinct genetic profiles underlie specific mental processes of human episodic memory. They also provide empirical support to previous theoretical and neurobiological studies linking specific neuromodulators to the learning rate and linking neural cell adhesion molecules to memory maintenance. Furthermore, our study suggests additional memory-related genetic pathways, which may contribute to a better understanding of the neurobiology of human memory.

  8. ReactionMap: an efficient atom-mapping algorithm for chemical reactions.

    PubMed

    Fooshee, David; Andronico, Alessio; Baldi, Pierre

    2013-11-25

    Large databases of chemical reactions provide new data-mining opportunities and challenges. Key challenges result from the imperfect quality of the data and the fact that many of these reactions are not properly balanced or atom-mapped. Here, we describe ReactionMap, an efficient atom-mapping algorithm. Our approach uses a combination of maximum common chemical subgraph search and minimization of an assignment cost function derived empirically from training data. We use a set of over 259,000 balanced atom-mapped reactions from the SPRESI commercial database to train the system, and we validate it on random sets of 1000 and 17,996 reactions sampled from this pool. These large test sets represent a broad range of chemical reaction types, and ReactionMap correctly maps about 99% of the atoms and about 96% of the reactions, with a mean time per mapping of 2 s. Most correctly mapped reactions are mapped with high confidence. Mapping accuracy compares favorably with ChemAxon's AutoMapper, versions 5 and 6.1, and the DREAM Web tool. These approaches correctly map 60.7%, 86.5%, and 90.3% of the reactions, respectively, on the same data set. A ReactionMap server is available on the ChemDB Web portal at http://cdb.ics.uci.edu .
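
    The cost-minimization step of such an atom mapper can be illustrated in isolation: given a reactant-atom by product-atom cost matrix, the Hungarian algorithm returns the cheapest one-to-one assignment. The random cost matrix below is a toy stand-in; in ReactionMap the costs are derived empirically and constrained by the common-subgraph search, neither of which is shown here.

```python
# Toy illustration of the assignment step only: find the minimum-cost one-to-one
# atom mapping for a made-up cost matrix using the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(42)
n_atoms = 8
cost = rng.uniform(0, 1, size=(n_atoms, n_atoms))   # cost[i, j]: map reactant atom i to product atom j
np.fill_diagonal(cost, cost.diagonal() * 0.1)       # pretend the "obvious" matches are cheap

rows, cols = linear_sum_assignment(cost)            # minimum-cost assignment
mapping = dict(zip(rows.tolist(), cols.tolist()))
print("atom mapping:", mapping, "total cost:", round(cost[rows, cols].sum(), 3))
```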

  9. Constructing a Reward-Related Quality of Life Statistic in Daily Life—a Proof of Concept Study Using Positive Affect

    PubMed Central

    Verhagen, Simone J. W.; Simons, Claudia J. P.; van Zelst, Catherine; Delespaul, Philippe A. E. G.

    2017-01-01

    Background: Mental healthcare needs person-tailored interventions. Experience Sampling Method (ESM) can provide daily life monitoring of personal experiences. This study aims to operationalize and test a measure of momentary reward-related Quality of Life (rQoL). Intuitively, quality of life improves by spending more time on rewarding experiences. ESM clinical interventions can use this information to coach patients to find a realistic, optimal balance of positive experiences (maximize reward) in daily life. rQoL combines the frequency of engaging in a relevant context (a ‘behavior setting’) with concurrent (positive) affect. High rQoL occurs when the most frequent behavior settings are combined with positive affect or infrequent behavior settings co-occur with low positive affect. Methods: Resampling procedures (Monte Carlo experiments) were applied to assess the reliability of rQoL using various behavior setting definitions under different sampling circumstances, for real or virtual subjects with low-, average- and high contextual variability. Furthermore, resampling was used to assess whether rQoL is a distinct concept from positive affect. Virtual ESM beep datasets were extracted from 1,058 valid ESM observations for virtual and real subjects. Results: Behavior settings defined by Who-What contextual information were most informative. Simulations of at least 100 ESM observations are needed for reliable assessment. Virtual ESM beep datasets of a real subject can be defined by Who-What-Where behavior setting combinations. Large sample sizes are necessary for reliable rQoL assessments, except for subjects with low contextual variability. rQoL is distinct from positive affect. Conclusion: rQoL is a feasible concept. Monte Carlo experiments should be used to assess the reliable implementation of an ESM statistic. Future research in ESM should assess the behavior of summary statistics under different sampling situations. This exploration is especially relevant in clinical implementation, where often only small datasets are available. PMID:29163294
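
    One plausible way to operationalize the statistic described above is as frequency-weighted positive affect over behavior settings, with Monte Carlo resampling of beeps used to gauge how many observations give a stable estimate. The scoring rule and the synthetic beep data below are assumptions, not the paper's implementation.

```python
# Hedged sketch: rQoL as frequency-weighted positive affect over behavior settings,
# plus a resampling check of estimate stability at different numbers of ESM beeps.
import numpy as np

rng = np.random.default_rng(3)
settings = rng.integers(0, 6, size=1058)                    # Who-What behavior setting per beep (synthetic)
affect = rng.normal(4.0, 1.0, size=1058) + 0.3 * settings   # concurrent positive affect (synthetic)

def rqol(settings, affect):
    score = 0.0
    for s in np.unique(settings):
        mask = settings == s
        score += mask.mean() * affect[mask].mean()          # frequency of setting x mean affect within it
    return score

full = rqol(settings, affect)
for n_beeps in (30, 100, 300):
    draws = []
    for _ in range(500):                                    # Monte Carlo resampling of beeps
        idx = rng.choice(settings.size, size=n_beeps, replace=True)
        draws.append(rqol(settings[idx], affect[idx]))
    print(f"{n_beeps:4d} beeps: mean={np.mean(draws):.2f}, sd={np.std(draws):.2f} (full sample: {full:.2f})")
```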

  10. A large-scale study of the ultrawideband microwave dielectric properties of normal breast tissue obtained from reduction surgeries.

    PubMed

    Lazebnik, Mariya; McCartney, Leah; Popovic, Dijana; Watkins, Cynthia B; Lindstrom, Mary J; Harter, Josephine; Sewall, Sarah; Magliocco, Anthony; Booske, John H; Okoniewski, Michal; Hagness, Susan C

    2007-05-21

    The efficacy of emerging microwave breast cancer detection and treatment techniques will depend, in part, on the dielectric properties of normal breast tissue. However, knowledge of these properties at microwave frequencies has been limited due to gaps and discrepancies in previously reported small-scale studies. To address these issues, we experimentally characterized the wideband microwave-frequency dielectric properties of a large number of normal breast tissue samples obtained from breast reduction surgeries at the University of Wisconsin and University of Calgary hospitals. The dielectric spectroscopy measurements were conducted from 0.5 to 20 GHz using a precision open-ended coaxial probe. The tissue composition within the probe's sensing region was quantified in terms of percentages of adipose, fibroconnective and glandular tissues. We fit a one-pole Cole-Cole model to the complex permittivity data set obtained for each sample and determined median Cole-Cole parameters for three groups of normal breast tissues, categorized by adipose tissue content (0-30%, 31-84% and 85-100%). Our analysis of the dielectric properties data for 354 tissue samples reveals that there is a large variation in the dielectric properties of normal breast tissue due to substantial tissue heterogeneity. We observed no statistically significant difference between the within-patient and between-patient variability in the dielectric properties.
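
    The one-pole Cole-Cole model mentioned above can be fitted with a generic least-squares routine by stacking the real and imaginary parts of the spectrum. The sketch below does this on a synthetic 0.5-20 GHz spectrum; the "measured" data, starting values, and noise level are assumptions, and only the model form follows the description above.

```python
# Hedged sketch: fit a one-pole Cole-Cole model (with a static-conductivity term)
# to synthetic complex permittivity data by stacking real and imaginary parts.
import numpy as np
from scipy.optimize import curve_fit

eps0 = 8.854e-12                                         # vacuum permittivity, F/m
f = np.logspace(np.log10(0.5e9), np.log10(20e9), 60)     # 0.5-20 GHz
w = 2 * np.pi * f

def cole_cole(w, eps_inf, d_eps, tau_ps, alpha, sigma_s):
    tau = tau_ps * 1e-12                                 # relaxation time in seconds
    return eps_inf + d_eps / (1 + (1j * w * tau) ** (1 - alpha)) + sigma_s / (1j * w * eps0)

def stacked(w, *p):                                      # curve_fit needs a real-valued model
    e = cole_cole(w, *p)
    return np.concatenate([e.real, e.imag])

rng = np.random.default_rng(0)
true_p = (7.0, 40.0, 10.0, 0.1, 0.7)                     # eps_inf, delta_eps, tau (ps), alpha, sigma_s (S/m)
data = cole_cole(w, *true_p) + rng.normal(0, 0.3, w.size) + 1j * rng.normal(0, 0.3, w.size)

p0 = (5.0, 30.0, 8.0, 0.05, 0.5)
popt, _ = curve_fit(stacked, w, np.concatenate([data.real, data.imag]), p0=p0, maxfev=20000)
print("fitted (eps_inf, delta_eps, tau_ps, alpha, sigma_s):", popt)
```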

  11. Haematology and plasma chemistry of the red top ice blue mbuna cichlid (Metriaclima greshakei).

    PubMed

    Snellgrove, Donna L; Alexander, Lucille G

    2011-10-01

    Clinical haematology and blood plasma chemistry can be used as a valuable tool to provide substantial diagnostic information for fish. A wide range of parameters can be used to assess nutritional status, digestive function, disease identification, routine metabolic levels, general physiological status and even the assessment and management of wild fish populations. However, to evaluate such data accurately, baseline reference intervals for each measurable parameter must be established for the species of fish in question. Baseline data for ornamental fish species are limited, as research is more commonly conducted using commercially cultured fish. Blood samples were collected from sixteen red top ice blue cichlids (Metriaclima greshakei), an ornamental freshwater fish, to describe a range of haematology and plasma chemistry parameters. Since this cichlid is fairly large in comparison with most tropical ornamental fish, two independent blood samples were taken to assess a large range of parameters. No significant differences were noted between sample periods for any parameter. Values obtained for a large number of parameters were similar to those established for other closely related fish species such as tilapia (Oreochromis spp.). In addition to reporting what is, to our knowledge, the first set of blood values for M. greshakei, this study highlights the possibility of using previously established data for cultured cichlid species in studies with ornamental cichlid fish.

  12. Learning maximum entropy models from finite-size data sets: A fast data-driven algorithm allows sampling from the posterior distribution.

    PubMed

    Ferrari, Ulisse

    2016-08-01

    Maximum entropy models provide the least constrained probability distributions that reproduce statistical properties of experimental datasets. In this work we characterize the learning dynamics that maximizes the log-likelihood in the case of large but finite datasets. We first show how the steepest descent dynamics is not optimal as it is slowed down by the inhomogeneous curvature of the model parameters' space. We then provide a way for rectifying this space which relies only on dataset properties and does not require large computational efforts. We conclude by solving the long-time limit of the parameters' dynamics including the randomness generated by the systematic use of Gibbs sampling. In this stochastic framework, rather than converging to a fixed point, the dynamics reaches a stationary distribution, which for the rectified dynamics reproduces the posterior distribution of the parameters. We sum up all these insights in a "rectified" data-driven algorithm that is fast and by sampling from the parameters' posterior avoids both under- and overfitting along all the directions of the parameters' space. Through the learning of pairwise Ising models from the recording of a large population of retina neurons, we show how our algorithm outperforms the steepest descent method.
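
    The basic moment-matching loop that the paper improves on can be written down compactly: ascend the log-likelihood of a pairwise Ising model, estimating the model's moments by Gibbs sampling at each step. The sketch below shows only that plain steepest-ascent baseline on synthetic data; the paper's rectification of parameter space and posterior sampling are not implemented, and all sizes and rates are assumptions.

```python
# Hedged sketch: plain gradient ascent for a pairwise Ising (maximum entropy) model,
# with model expectations estimated by Gibbs sampling. This is the baseline loop the
# paper refines, not the "rectified" algorithm itself.
import numpy as np

rng = np.random.default_rng(0)
N, n_data = 8, 2000
data = np.where(rng.random((n_data, N)) < 0.4, 1, -1)        # synthetic +/-1 "spike words"
emp_mean, emp_corr = data.mean(0), (data.T @ data) / n_data  # empirical moments to match

h = np.zeros(N)            # fields
J = np.zeros((N, N))       # couplings (symmetric, zero diagonal)

def gibbs_sample(h, J, n_sweeps=1000, burn=200):
    s = rng.choice([-1, 1], size=N)
    out = []
    for sweep in range(n_sweeps):
        for i in range(N):
            field = h[i] + J[i] @ s - J[i, i] * s[i]         # local field excluding self-coupling
            s[i] = 1 if rng.random() < 1 / (1 + np.exp(-2 * field)) else -1
        if sweep >= burn:
            out.append(s.copy())
    return np.array(out)

lr = 0.1
for step in range(30):
    samples = gibbs_sample(h, J)
    mod_mean, mod_corr = samples.mean(0), (samples.T @ samples) / len(samples)
    h += lr * (emp_mean - mod_mean)                          # steepest-ascent updates
    dJ = lr * (emp_corr - mod_corr)
    np.fill_diagonal(dJ, 0.0)
    J += dJ
print("max mismatch in means after training:", np.abs(emp_mean - mod_mean).max())
```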

  13. Vaginal microbial flora analysis by next generation sequencing and microarrays; can microbes indicate vaginal origin in a forensic context?

    PubMed

    Benschop, Corina C G; Quaak, Frederike C A; Boon, Mathilde E; Sijen, Titia; Kuiper, Irene

    2012-03-01

    Forensic analysis of biological traces generally encompasses the investigation of both the person who contributed to the trace and the body site(s) from which the trace originates. For instance, for sexual assault cases, it can be beneficial to distinguish vaginal samples from skin or saliva samples. In this study, we explored the use of microbial flora to indicate vaginal origin. First, we explored the vaginal microbiome for a large set of clinical vaginal samples (n = 240) by next generation sequencing (n = 338,184 sequence reads) and found 1,619 different sequences. Next, we selected 389 candidate probes targeting genera or species and designed a microarray, with which we analysed a diverse set of samples; 43 DNA extracts from vaginal samples and 25 DNA extracts from samples from other body sites, including sites in close proximity of or in contact with the vagina. Finally, we used the microarray results and next generation sequencing dataset to assess the potential for a future approach that uses microbial markers to indicate vaginal origin. Since no candidate genera/species were found to positively identify all vaginal DNA extracts on their own, while excluding all non-vaginal DNA extracts, we deduce that a reliable statement about the cellular origin of a biological trace should be based on the detection of multiple species within various genera. Microarray analysis of a sample will then render a microbial flora pattern that is probably best analysed in a probabilistic approach.

  14. Analysis of suspicious powders following the post 9/11 anthrax scare.

    PubMed

    Wills, Brandon; Leikin, Jerrold; Rhee, James; Saeedi, Bijan

    2008-06-01

    Following the 9/11 terrorist attacks, SET Environmental, Inc., a Chicago-based environmental and hazardous materials management company, received a large number of suspicious powders for analysis. Samples of powders were submitted to SET for anthrax screening and/or unknown identification (UI). Anthrax screening was performed on-site using a ruggedized analytical pathogen identification device (R.A.P.I.D.) (Idaho Technologies, Salt Lake City, UT). UI was performed at SET headquarters (Wheeling, IL) utilizing a combination of wet chemistry techniques, infrared spectroscopy, and gas chromatography/mass spectrometry. Turnaround time was approximately 2-3 hours for either anthrax or UI. Between October 10, 2001 and October 11, 2002, 161 samples were analyzed. Of these, 57 were for anthrax screening only, 78 were for anthrax and UI, and 26 were for UI only. Sources of suspicious powders included industries (66%), U.S. Postal Service (19%), law enforcement (9%), and municipalities (7%). None of the 135 anthrax screens was positive. There were no positive anthrax screens performed by SET in the Chicago area following the post-9/11 anthrax scare. The only potential biological or chemical warfare agent identified (cyanide) was provided by law enforcement. Rapid anthrax screening and identification of unknown substances at the scene are useful to prevent costly interruption of services and potential referral for medical evaluation.

  15. Interactive Exploration on Large Genomic Datasets.

    PubMed

    Tu, Eric

    2016-01-01

    The prevalence of large genomics datasets has made the need to explore these data more important. Large sequencing projects like the 1000 Genomes Project [1], which reconstructed the genomes of 2,504 individuals sampled from 26 populations, have produced over 200TB of publicly available data. Meanwhile, existing genomic visualization tools have been unable to scale with the growing amount of larger, more complex data. This difficulty is acute when viewing large regions (over 1 megabase, or 1,000,000 bases of DNA), or when concurrently viewing multiple samples of data. While genomic processing pipelines have shifted towards using distributed computing techniques, such as with ADAM [4], genomic visualization tools have not. In this work we present Mango, a scalable genome browser built on top of ADAM that can run both locally and on a cluster. Mango presents a combination of different optimizations that can be combined in a single application to drive novel genomic visualization techniques over terabytes of genomic data. By building visualization on top of a distributed processing pipeline, we can perform visualization queries over large regions that are not possible with current tools, and decrease the time for viewing large data sets. Mango is part of the Big Data Genomics project at the University of California, Berkeley [25] and is published under the Apache 2 license. Mango is available at https://github.com/bigdatagenomics/mango.

  16. Active Learning to Overcome Sample Selection Bias: Application to Photometric Variable Star Classification

    NASA Astrophysics Data System (ADS)

    Richards, Joseph W.; Starr, Dan L.; Brink, Henrik; Miller, Adam A.; Bloom, Joshua S.; Butler, Nathaniel R.; James, J. Berian; Long, James P.; Rice, John

    2012-01-01

    Despite the great promise of machine-learning algorithms to classify and predict astrophysical parameters for the vast numbers of astrophysical sources and transients observed in large-scale surveys, the peculiarities of the training data often manifest as strongly biased predictions on the data of interest. Typically, training sets are derived from historical surveys of brighter, more nearby objects than those from more extensive, deeper surveys (testing data). This sample selection bias can cause catastrophic errors in predictions on the testing data because (1) standard assumptions for machine-learned model selection procedures break down and (2) dense regions of testing space might be completely devoid of training data. We explore possible remedies to sample selection bias, including importance weighting, co-training, and active learning (AL). We argue that AL—where the data whose inclusion in the training set would most improve predictions on the testing set are queried for manual follow-up—is an effective approach and is appropriate for many astronomical applications. For a variable star classification problem on a well-studied set of stars from Hipparcos and Optical Gravitational Lensing Experiment, AL is the optimal method in terms of error rate on the testing data, beating the off-the-shelf classifier by 3.4% and the other proposed methods by at least 3.0%. To aid with manual labeling of variable stars, we developed a Web interface which allows for easy light curve visualization and querying of external databases. Finally, we apply AL to classify variable stars in the All Sky Automated Survey, finding dramatic improvement in our agreement with the ASAS Catalog of Variable Stars, from 65.5% to 79.5%, and a significant increase in the classifier's average confidence for the testing set, from 14.6% to 42.9%, after a few AL iterations.
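
    A minimal pool-based active-learning loop with uncertainty (margin) sampling conveys the core of the approach: at each iteration the unlabeled object about which the classifier is least certain is queried and added to the training set. The data, classifier, and query rule below are generic assumptions, not the authors' variable-star pipeline.

```python
# Hedged sketch of pool-based active learning by margin (uncertainty) sampling.
# Here "manual follow-up" is simulated by simply revealing the true label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
labeled = list(range(30))                            # small initial training set
pool = [i for i in range(len(y)) if i not in labeled]
test, pool = pool[:500], pool[500:]                  # held-out "testing data"

clf = RandomForestClassifier(n_estimators=200, random_state=0)
for _ in range(40):                                  # 40 active-learning queries
    clf.fit(X[labeled], y[labeled])
    proba = np.sort(clf.predict_proba(X[pool]), axis=1)
    margin = proba[:, -1] - proba[:, -2]             # small margin = uncertain prediction
    query = pool[int(np.argmin(margin))]
    labeled.append(query)
    pool.remove(query)

clf.fit(X[labeled], y[labeled])
print(f"accuracy on held-out data after {len(labeled)} labels: {clf.score(X[test], y[test]):.3f}")
```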

  17. Utilizing Maximal Independent Sets as Dominating Sets in Scale-Free Networks

    NASA Astrophysics Data System (ADS)

    Derzsy, N.; Molnar, F., Jr.; Szymanski, B. K.; Korniss, G.

    Dominating sets provide key solution to various critical problems in networked systems, such as detecting, monitoring, or controlling the behavior of nodes. Motivated by graph theory literature [Erdos, Israel J. Math. 4, 233 (1966)], we studied maximal independent sets (MIS) as dominating sets in scale-free networks. We investigated the scaling behavior of the size of MIS in artificial scale-free networks with respect to multiple topological properties (size, average degree, power-law exponent, assortativity), evaluated its resilience to network damage resulting from random failure or targeted attack [Molnar et al., Sci. Rep. 5, 8321 (2015)], and compared its efficiency to previously proposed dominating set selection strategies. We showed that, despite its small set size, MIS provides very high resilience against network damage. Using extensive numerical analysis on both synthetic and real-world (social, biological, technological) network samples, we demonstrate that our method effectively satisfies four essential requirements of dominating sets for their practical applicability on large-scale real-world systems: 1.) small set size, 2.) minimal network information required for their construction scheme, 3.) fast and easy computational implementation, and 4.) resiliency to network damage. Supported by DARPA, DTRA, and NSF.
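
    Because every maximal independent set is also a dominating set, the construction can be demonstrated in a few lines with networkx on a synthetic scale-free graph; the graph size and parameters below are arbitrary choices for illustration.

```python
# Hedged sketch: build a maximal independent set (MIS) of a Barabasi-Albert
# scale-free graph and verify that it dominates the graph.
import networkx as nx

G = nx.barabasi_albert_graph(n=10_000, m=3, seed=1)

mis = nx.maximal_independent_set(G, seed=1)   # greedy randomized MIS
assert nx.is_dominating_set(G, mis)           # any maximal independent set is dominating

print(f"nodes: {G.number_of_nodes()}, MIS size: {len(mis)} "
      f"({100 * len(mis) / G.number_of_nodes():.1f}% of nodes)")
```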

  18. Recognition Using Hybrid Classifiers.

    PubMed

    Osadchy, Margarita; Keren, Daniel; Raviv, Dolev

    2016-04-01

    A canonical problem in computer vision is category recognition (e.g., find all instances of human faces, cars etc., in an image). Typically, the input for training a binary classifier is a relatively small sample of positive examples, and a huge sample of negative examples, which can be very diverse, consisting of images from a large number of categories. The difficulty of the problem sharply increases with the dimension and size of the negative example set. We propose to alleviate this problem by applying a "hybrid" classifier, which replaces the negative samples by a prior, and then finds a hyperplane which separates the positive samples from this prior. The method is extended to kernel space and to an ensemble-based approach. The resulting binary classifiers achieve an identical or better classification rate than SVM, while requiring far smaller memory and lower computational complexity to train and apply.
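
    A loose sketch of the idea, under strong simplifying assumptions, is to replace the huge negative set with a background distribution, draw a modest number of samples from it, and learn a hyperplane separating the positives from those samples. The Gaussian prior, sample counts, and linear SVM below are illustrative stand-ins, not the authors' hybrid formulation.

```python
# Loose illustration only: separate a small positive sample from samples drawn
# from a background prior, instead of from an enormous real negative set.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d = 50
positives = rng.normal(loc=1.5, scale=1.0, size=(200, d))    # small positive sample

# Background prior standing in for "everything else" (an assumption; the paper
# works with an analytically specified prior rather than drawn samples).
prior_samples = rng.normal(loc=0.0, scale=1.0, size=(2000, d))

X = np.vstack([positives, prior_samples])
y = np.concatenate([np.ones(len(positives)), np.zeros(len(prior_samples))])

clf = LinearSVC(C=1.0, max_iter=20000).fit(X, y)             # hyperplane: positives vs. prior
print("training accuracy vs. prior samples:", round(clf.score(X, y), 3))
```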

  19. Probing the underlying physics of ejecta production from shocked Sn samples

    NASA Astrophysics Data System (ADS)

    Zellner, M. B.; McNeil, W. Vogan; Hammerberg, J. E.; Hixson, R. S.; Obst, A. W.; Olson, R. T.; Payton, J. R.; Rigg, P. A.; Routley, N.; Stevens, G. D.; Turley, W. D.; Veeser, L.; Buttler, W. T.

    2008-06-01

    This effort investigates the underlying physics of ejecta production for high explosive (HE) shocked Sn surfaces prepared with finishes typical to those roughened by tool marks left from machining processes. To investigate the physical mechanisms of ejecta production, we compiled and re-examined ejecta data from two experimental campaigns [W. S. Vogan et al., J. Appl. Phys. 98, 113508 (1998); M. B. Zellner et al., ibid. 102, 013522 (2007)] to form a self-consistent data set spanning a large parameter space. In the first campaign, ejecta created upon shock release at the back side of HE shocked Sn samples were characterized for samples with varying surface finishes but at similar shock-breakout pressures PSB. In the second campaign, ejecta were characterized for HE shocked Sn samples with a constant surface finish but at varying PSB.

  20. Incorporating partially identified sample segments into acreage estimation procedures: Estimates using only observations from the current year

    NASA Technical Reports Server (NTRS)

    Sielken, R. L., Jr. (Principal Investigator)

    1981-01-01

    Several methods of estimating individual crop acreages using a mixture of completely identified and partially identified (generic) segments from a single growing year are derived and discussed. A small Monte Carlo study of eight estimators is presented. The relative empirical behavior of these estimators is discussed, as are the effects of segment sample size and amount of partial identification. The principal recommendations are (1) not to exclude, but rather to incorporate, partially identified sample segments into the estimation procedure, (2) to avoid having a large percentage (say 80%) of only partially identified segments in the sample, and (3) to use the maximum likelihood estimator, although the weighted least squares estimator and least squares ratio estimator both perform almost as well. Sets of spring small grains (North Dakota) data were used.

  1. Treatment of atomic and molecular line blanketing by opacity sampling

    NASA Technical Reports Server (NTRS)

    Johnson, H. R.; Krupp, B. M.

    1976-01-01

    A sampling technique for treating the radiative opacity of large numbers of atomic and molecular lines in cool stellar atmospheres is subjected to several tests. In this opacity sampling (OS) technique, the global opacity is sampled at only a selected set of frequencies, and at each of these frequencies the total monochromatic opacity is obtained by summing the contribution of every relevant atomic and molecular line. In accord with previous results, we find that the structure of atmospheric models is accurately fixed by the use of 1000 frequency points, and 100 frequency points are adequate for many purposes. The effects of atomic and molecular lines are separately studied. A test model computed using the OS method agrees very well with a model having identical atmospheric parameters, but computed with the giant line (opacity distribution function) method.
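
    The essence of opacity sampling is easy to show numerically: choose a fixed set of sample frequencies and, at each one, sum the monochromatic contributions of every line. The line list, Lorentzian profiles, and units below are invented for illustration and carry no astrophysical calibration.

```python
# Toy numerical illustration of opacity sampling (OS): sum all line contributions
# at each of a small, fixed set of sampled frequencies.
import numpy as np

rng = np.random.default_rng(0)
n_lines = 50_000
line_nu = rng.uniform(2.0e14, 6.0e14, n_lines)                 # line centres (Hz), synthetic
line_strength = rng.lognormal(mean=-2.0, sigma=1.5, size=n_lines)
gamma = 1.0e10                                                 # common Lorentzian half-width (Hz)

def total_opacity(nu):
    # total monochromatic opacity at one sampled frequency: sum over every line
    return np.sum(line_strength * (gamma / np.pi) / ((nu - line_nu) ** 2 + gamma ** 2))

for n_samples in (100, 1000):
    nu_grid = np.linspace(2.0e14, 6.0e14, n_samples)           # the OS frequency set
    kappa = np.array([total_opacity(nu) for nu in nu_grid])
    print(f"{n_samples:5d} sample frequencies -> mean sampled opacity {kappa.mean():.3e}")
```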

  2. Testing for Salmonella in raw meat and poultry products collected at federally inspected establishments in the United States, 1998 through 2000.

    PubMed

    Rose, Bonnie E; Hill, Walter E; Umholtz, Robert; Ransom, Gerri M; James, William O

    2002-06-01

    The Food Safety and Inspection Service (FSIS) issued Pathogen Reduction; Hazard Analysis and Critical Control Point (HACCP) Systems; Final Rule (the PR/HACCP rule) on 25 July 1996. To verify that industry PR/HACCP systems are effective in controlling the contamination of raw meat and poultry products with human disease-causing bacteria, this rule sets product-specific Salmonella performance standards that must be met by slaughter establishments and establishments producing raw ground products. These performance standards are based on the prevalence of Salmonella as determined from the FSIS's nationwide microbial baseline studies and are expressed in terms of the maximum number of Salmonella-positive samples that are allowed in a given sample set. From 26 January 1998 through 31 December 2000, federal inspectors collected 98,204 samples and 1,502 completed sample sets for Salmonella analysis from large, small, and very small establishments that produced at least one of seven raw meat and poultry products: broilers, market hogs, cows and bulls, steers and heifers, ground beef, ground chicken, and ground turkey. Salmonella prevalence in most of the product categories was lower after the implementation of PR/HACCP than in pre-PR/HACCP baseline studies and surveys conducted by the FSIS. The results of 3 years of testing at establishments of all sizes combined show that >80% of the sample sets met the following Salmonella prevalence performance standards: 20.0% for broilers, 8.7% for market hogs, 2.7% for cows and bulls, 1.0% for steers and heifers, 7.5% for ground beef, 44.6% for ground chicken, and 49.9% for ground turkey. The decreased Salmonella prevalences may partly reflect industry improvements, such as improved process control, incorporation of antimicrobial interventions, and increased microbial-process control monitoring, in conjunction with PR/HACCP implementation.

  3. The evolution of phylogeographic data sets.

    PubMed

    Garrick, Ryan C; Bonatelli, Isabel A S; Hyseni, Chaz; Morales, Ariadna; Pelletier, Tara A; Perez, Manolo F; Rice, Edwin; Satler, Jordan D; Symula, Rebecca E; Thomé, Maria Tereza C; Carstens, Bryan C

    2015-03-01

    Empirical phylogeographic studies have progressively sampled greater numbers of loci over time, in part motivated by theoretical papers showing that estimates of key demographic parameters improve as the number of loci increases. Recently, next-generation sequencing has been applied to questions about organismal history, with the promise of revolutionizing the field. However, no systematic assessment of how phylogeographic data sets have changed over time with respect to overall size and information content has been performed. Here, we quantify the changing nature of these genetic data sets over the past 20 years, focusing on papers published in Molecular Ecology. We found that the number of independent loci, the total number of alleles sampled and the total number of single nucleotide polymorphisms (SNPs) per data set has improved over time, with particularly dramatic increases within the past 5 years. Interestingly, uniparentally inherited organellar markers (e.g. animal mitochondrial and plant chloroplast DNA) continue to represent an important component of phylogeographic data. Single-species studies (cf. comparative studies) that focus on vertebrates (particularly fish and to some extent, birds) represent the gold standard of phylogeographic data collection. Based on the current trajectory seen in our survey data, forecast modelling indicates that the median number of SNPs per data set for studies published by the end of the year 2016 may approach ~20,000. This survey provides baseline information for understanding the evolution of phylogeographic data sets and underscores the fact that development of analytical methods for handling very large genetic data sets will be critical for facilitating growth of the field. © 2015 John Wiley & Sons Ltd.

  4. Estimating the probability that the sample mean is within a desired fraction of the standard deviation of the true mean.

    PubMed

    Schillaci, Michael A; Schillaci, Mario E

    2009-02-01

    The use of small sample sizes in human and primate evolutionary research is commonplace. Estimating how well small samples represent the underlying population, however, is not commonplace. Because the accuracy of determinations of taxonomy, phylogeny, and evolutionary process is dependent upon how well the study sample represents the population of interest, characterizing the uncertainty, or potential error, associated with analyses of small sample sizes is essential. We present a method for estimating the probability that the sample mean is within a desired fraction of the standard deviation of the true mean using small (n < 10) or very small (n ≤ 5) sample sizes. This method can be used by researchers to determine post hoc the probability that their sample is a meaningful approximation of the population parameter. We tested the method using a large craniometric data set commonly used by researchers in the field. Given our results, we suggest that sample estimates of the population mean can be reasonable and meaningful even when based on small, and perhaps even very small, sample sizes.
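
    Under a normality assumption the calculation is direct: the sample mean has standard deviation sigma/sqrt(n), so the probability that it lies within k standard deviations (k times sigma) of the true mean is 2*Phi(k*sqrt(n)) - 1. The sketch below reproduces that idea; it is not necessarily the exact procedure of the paper.

```python
# Hedged sketch: probability that the sample mean falls within k standard
# deviations (k * sigma) of the true mean, for a normal population.
from scipy.stats import norm

def prob_within(k, n):
    """P(|sample mean - true mean| <= k * sigma) when X ~ Normal(mu, sigma^2)."""
    return 2 * norm.cdf(k * n ** 0.5) - 1

for n in (3, 5, 10):
    for k in (0.25, 0.5, 1.0):
        print(f"n={n:2d}, k={k:4.2f}:  P = {prob_within(k, n):.3f}")
```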

  5. Valid statistical inference methods for a case-control study with missing data.

    PubMed

    Tian, Guo-Liang; Zhang, Chi; Jiang, Xuejun

    2018-04-01

    The main objective of this paper is to derive the valid sampling distribution of the observed counts in a case-control study with missing data under the assumption of missing at random by employing the conditional sampling method and the mechanism augmentation method. The proposed sampling distribution, called the case-control sampling distribution, can be used to calculate the standard errors of the maximum likelihood estimates of parameters via the Fisher information matrix and to generate independent samples for constructing small-sample bootstrap confidence intervals. Theoretical comparisons of the new case-control sampling distribution with two existing sampling distributions exhibit a large difference. Simulations are conducted to investigate the influence of the three different sampling distributions on statistical inferences. One finding is that the conclusion by the Wald test for testing independency under the two existing sampling distributions could be completely different (even contradictory) from the Wald test for testing the equality of the success probabilities in control/case groups under the proposed distribution. A real cervical cancer data set is used to illustrate the proposed statistical methods.

  6. Clonal evolution in relapsed and refractory diffuse large B-cell lymphoma is characterized by high dynamics of subclones.

    PubMed

    Melchardt, Thomas; Hufnagl, Clemens; Weinstock, David M; Kopp, Nadja; Neureiter, Daniel; Tränkenschuh, Wolfgang; Hackl, Hubert; Weiss, Lukas; Rinnerthaler, Gabriel; Hartmann, Tanja N; Greil, Richard; Weigert, Oliver; Egle, Alexander

    2016-08-09

    Little information is available about the role of certain mutations for clonal evolution and the clinical outcome during relapse in diffuse large B-cell lymphoma (DLBCL). Therefore, we analyzed formalin-fixed-paraffin-embedded tumor samples from first diagnosis, relapsed or refractory disease from 28 patients using next-generation sequencing of the exons of 104 coding genes. Non-synonymous mutations were present in 74 of the 104 genes tested. Primary tumor samples showed a median of 8 non-synonymous mutations (range: 0-24) with the used gene set. Lower numbers of non-synonymous mutations in the primary tumor were associated with a better median OS compared with higher numbers (28 versus 15 months, p=0.031). We observed three patterns of clonal evolution during relapse of disease: large global change, subclonal selection and no or minimal change possibly suggesting preprogrammed resistance. We conclude that targeted re-sequencing is a feasible and informative approach to characterize the molecular pattern of relapse and it creates novel insights into the role of dynamics of individual genes.

  7. An Alternative to the Search for Single Polymorphisms: Toward Molecular Personality Scales for the Five-Factor Model

    PubMed Central

    McCrae, Robert R.; Scally, Matthew; Terracciano, Antonio; Abecasis, Gonçalo R.; Costa, Paul T.

    2011-01-01

    There is growing evidence that personality traits are affected by many genes, all of which have very small effects. As an alternative to the largely-unsuccessful search for individual polymorphisms associated with personality traits, we identified large sets of potentially related single nucleotide polymorphisms (SNPs) and summed them to form molecular personality scales (MPSs) with from 4 to 2,497 SNPs. Scales were derived from two-thirds of a large (N = 3,972) sample of individuals from Sardinia who completed the Revised NEO Personality Inventory and were assessed in a genome-wide association scan. When MPSs were correlated with the phenotype in the remaining third of the sample, very small but significant associations were found for four of the five personality factors when the longest scales were examined. These data suggest that MPSs for Neuroticism, Openness to Experience, Agreeableness, and Conscientiousness (but not Extraversion) contain genetic information that can be refined in future studies, and the procedures described here should be applicable to other quantitative traits. PMID:21114353
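
    The scale-building step can be sketched in a few lines: orient each SNP by the sign of its association with the trait in a training split, sum the oriented allele counts into a molecular scale, and correlate the scale with the trait in a held-out split. The synthetic genotypes, effect sizes, and SNP-selection threshold below are assumptions chosen only to mimic the "many tiny effects" setting.

```python
# Hedged sketch of building a molecular personality scale from many weak SNP effects
# in a training split and evaluating it in a held-out split (synthetic data).
import numpy as np

rng = np.random.default_rng(7)
n, p = 3972, 2000
geno = rng.binomial(2, 0.3, size=(n, p)).astype(float)      # allele counts 0/1/2 (synthetic)
beta = np.zeros(p)
beta[:200] = rng.normal(0, 0.02, 200)                       # many very small true effects
trait = geno @ beta + rng.normal(0, 1, n)                   # synthetic personality score

split = 2 * n // 3
train, test = np.arange(split), np.arange(split, n)

# per-SNP correlation with the trait in the training two-thirds
g = geno[train] - geno[train].mean(0)
t = trait[train] - trait[train].mean()
r = (g * t[:, None]).sum(0) / (np.sqrt((g ** 2).sum(0)) * np.sqrt((t ** 2).sum()))

top = np.argsort(-np.abs(r))[:500]                          # the 500 most associated SNPs
scale = geno[test][:, top] @ np.sign(r[top])                # sum of oriented allele counts
print("held-out correlation with trait:", round(np.corrcoef(scale, trait[test])[0, 1], 3))
```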

  8. Somatic mutation of EZH2 (Y641) in follicular and diffuse large B-cell lymphomas of germinal center origin | Office of Cancer Genomics

    Cancer.gov

    Morin et al. describe recurrent somatic mutations in EZH2, a polycomb group oncogene. The mutation, found in the SET domain of this gene encoding a histone methyltransferase, is found only in a subset of lymphoma samples. Specifically, EZH2 mutations are found in about 12% of follicular lymphomas (FL) and almost 23% of diffuse large B-cell lymphomas (DLBCL) of germinal center origin. This paper goes on to demonstrate that altered EZH2 proteins, corresponding to the most frequent mutations found in human lymphomas, have reduced activity using in vitro histone methylation assays.

  9. The effects of inference method, population sampling, and gene sampling on species tree inferences: an empirical study in slender salamanders (Plethodontidae: Batrachoseps).

    PubMed

    Jockusch, Elizabeth L; Martínez-Solano, Iñigo; Timpe, Elizabeth K

    2015-01-01

    Species tree methods are now widely used to infer the relationships among species from multilocus data sets. Many methods have been developed, which differ in whether gene and species trees are estimated simultaneously or sequentially, and in how gene trees are used to infer the species tree. While these methods perform well on simulated data, less is known about what impacts their performance on empirical data. We used a data set including five nuclear genes and one mitochondrial gene for 22 species of Batrachoseps to compare the effects of method of analysis, within-species sampling and gene sampling on species tree inferences. For this data set, the choice of inference method had the largest effect on the species tree topology. Exclusion of individual loci had large effects in *BEAST and STEM, but not in MP-EST. Different loci carried the greatest leverage in these different methods, showing that the causes of their disproportionate effects differ. Even though substantial information was present in the nuclear loci, the mitochondrial gene dominated the *BEAST species tree. This leverage is inherent to the mtDNA locus and results from its high variation and lower assumed ploidy. This mtDNA leverage may be problematic when mtDNA has undergone introgression, as is likely in this data set. By contrast, the leverage of RAG1 in STEM analyses does not reflect properties inherent to the locus, but rather results from a gene tree that is strongly discordant with all others, and is best explained by introgression between distantly related species. Within-species sampling was also important, especially in *BEAST analyses, as shown by differences in tree topology across 100 subsampled data sets. Despite the sensitivity of the species tree methods to multiple factors, five species groups, the relationships among these, and some relationships within them, are generally consistently resolved for Batrachoseps. © The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  10. Measuring consistent masses for 25 Milky Way globular clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kimmig, Brian; Seth, Anil; Ivans, Inese I.

    2015-02-01

    We present central velocity dispersions, masses, mass-to-light ratios (M/Ls), and rotation strengths for 25 Galactic globular clusters (GCs). We derive radial velocities of 1951 stars in 12 GCs from single order spectra taken with Hectochelle on the MMT telescope. To this sample we add an analysis of available archival data of individual stars. For the full set of data we fit King models to derive consistent dynamical parameters for the clusters. We find good agreement between single-mass King models and the observed radial dispersion profiles. The large, uniform sample of dynamical masses we derive enables us to examine trends of M/L with cluster mass and metallicity. The overall values of M/L and the trends with mass and metallicity are consistent with existing measurements from a large sample of M31 clusters. This includes a clear trend of increasing M/L with cluster mass and lower than expected M/Ls for the metal-rich clusters. We find no clear evidence for the trend of increasing rotation with increasing cluster metallicity suggested in previous work.

  11. From large-eddy simulation to multi-UAVs sampling of shallow cumulus clouds

    NASA Astrophysics Data System (ADS)

    Lamraoui, Fayçal; Roberts, Greg; Burnet, Frédéric

    2016-04-01

    In-situ sampling of clouds at spatio-temporal resolutions sufficient to capture small-scale 3D physical processes continues to present challenges. This project (SKYSCANNER) aims to bring together cloud sampling strategies that use a swarm of unmanned aerial vehicles (UAVs) guided by large-eddy simulation (LES). Multi-UAV field campaigns with a personalized sampling strategy for individual clouds and cloud fields will significantly improve the understanding of unresolved cloud physical processes. An extensive set of LES experiments for case studies from the ARM-SGP site has been performed using the MesoNH model at high resolutions down to 10 m. These simulations led to a macroscopic model that quantifies the interrelationship between micro- and macrophysical properties of shallow convective clouds. Both the geometry and evolution of individual clouds are critical to multi-UAV cloud sampling and path planning. The preliminary findings of the project reveal several linear relationships that relate cloud geometric parameters to cloud-related meteorological variables. In addition, horizontal wind speed has a proportional impact on cloud number concentration and on triggering and prolonging the occurrence of cumulus clouds. Within the joint collaboration, which involves a multidisciplinary team of institutes specializing in aviation, robotics and atmospheric science, this model will serve as a reference point for multi-UAV sampling strategies and path planning.

  12. Using regression methods to estimate stream phosphorus loads at the Illinois River, Arkansas

    USGS Publications Warehouse

    Haggard, B.E.; Soerens, T.S.; Green, W.R.; Richards, R.P.

    2003-01-01

    The development of total maximum daily loads (TMDLs) requires evaluating existing constituent loads in streams. Accurate estimates of constituent loads are needed to calibrate watershed and reservoir models for TMDL development. The best approach to estimate constituent loads is high frequency sampling, particularly during storm events, and mass integration of constituents passing a point in a stream. Most often, resources are limited and discrete water quality samples are collected on fixed intervals and sometimes supplemented with directed sampling during storm events. When resources are limited, mass integration is not an accurate means to determine constituent loads and other load estimation techniques such as regression models are used. The objective of this work was to determine a minimum number of water-quality samples needed to provide constituent concentration data adequate to estimate constituent loads at a large stream. Twenty sets of water quality samples with and without supplemental storm samples were randomly selected at various fixed intervals from a database at the Illinois River, northwest Arkansas. The random sets were used to estimate total phosphorus (TP) loads using regression models. The regression-based annual TP loads were compared to the integrated annual TP load estimated using all the data. At a minimum, monthly sampling plus supplemental storm samples (six samples per year) was needed to produce a root mean square error of less than 15%. Water quality samples should be collected at least semi-monthly (every 15 days) in studies less than two years if seasonal time factors are to be used in the regression models. Annual TP loads estimated from independently collected discrete water quality samples further demonstrated the utility of using regression models to estimate annual TP loads in this stream system.
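
    A rating-curve style regression of the kind described above can be sketched with a log-log model plus seasonal terms, a smearing bias correction, and integration of predicted concentration times discharge over the year. The synthetic discharge record, model terms, and bias-correction choice are assumptions and do not reproduce the study's calibration.

```python
# Hedged sketch of a regression (rating-curve) estimate of annual total phosphorus load:
# regress log(concentration) on log(discharge) and seasonal terms from sparse samples,
# then integrate predicted concentration x discharge over the year (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(365)
Q = np.exp(rng.normal(3.0, 0.8, 365))                           # daily discharge, m^3/s (synthetic)
true_c = 0.05 * Q ** 0.4 * np.exp(0.2 * np.sin(2 * np.pi * days / 365))   # "true" TP, mg/L
sampled = np.sort(rng.choice(365, size=24, replace=False))      # roughly semi-monthly sampling
c_obs = true_c[sampled] * np.exp(rng.normal(0, 0.2, sampled.size))

def design(d, q):   # intercept, log Q, seasonal sine/cosine terms
    return np.column_stack([np.ones_like(q), np.log(q),
                            np.sin(2 * np.pi * d / 365), np.cos(2 * np.pi * d / 365)])

X = design(sampled, Q[sampled])
beta, *_ = np.linalg.lstsq(X, np.log(c_obs), rcond=None)
resid = np.log(c_obs) - X @ beta
smearing = np.mean(np.exp(resid))                               # Duan's smearing bias correction

c_hat = np.exp(design(days, Q) @ beta) * smearing               # predicted daily TP, mg/L
load = np.sum(c_hat * Q * 86400) / 1000.0                       # mg/L * m^3/s * s -> g; /1000 -> kg
true_load = np.sum(true_c * Q * 86400) / 1000.0
print(f"regression-estimated annual TP load: {load:.0f} kg (true synthetic load: {true_load:.0f} kg)")
```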

  13. Large-scale diversity of slope fishes: pattern inconsistency between multiple diversity indices.

    PubMed

    Gaertner, Jean-Claude; Maiorano, Porzia; Mérigot, Bastien; Colloca, Francesco; Politou, Chrissi-Yianna; Gil De Sola, Luis; Bertrand, Jacques A; Murenu, Matteo; Durbec, Jean-Pierre; Kallianiotis, Argyris; Mannini, Alessandro

    2013-01-01

    Large-scale studies focused on the diversity of continental slope ecosystems are still rare, usually restricted to a limited number of diversity indices and mainly based on the empirical comparison of heterogeneous local data sets. In contrast, we investigate large-scale fish diversity on the basis of multiple diversity indices and using 1454 standardized trawl hauls collected throughout the upper and middle slope of the whole northern Mediterranean Sea (36°3'- 45°7' N; 5°3'W - 28°E). We have analyzed (1) the empirical relationships between a set of 11 diversity indices in order to assess their degree of complementarity/redundancy and (2) the consistency of spatial patterns exhibited by each of the complementary groups of indices. Regarding species richness, our results contrasted both the traditional view based on the hump-shaped theory for bathymetric pattern and the commonly-admitted hypothesis of a large-scale decreasing trend correlated with a similar gradient of primary production in the Mediterranean Sea. More generally, we found that the components of slope fish diversity we analyzed did not always show a consistent pattern of distribution according either to depth or to spatial areas, suggesting that they are not driven by the same factors. These results, which stress the need to extend the number of indices traditionally considered in diversity monitoring networks, could provide a basis for rethinking not only the methodological approach used in monitoring systems, but also the definition of priority zones for protection. Finally, our results call into question the feasibility of properly investigating large-scale diversity patterns using a widespread approach in ecology, which is based on the compilation of pre-existing heterogeneous and disparate data sets, in particular when focusing on indices that are very sensitive to sampling design standardization, such as species richness.
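
    The complementarity/redundancy screen described above can be illustrated by computing a few common diversity indices per haul and inspecting their pairwise correlations. The index list below is only a subset of the eleven used in the study, and the species-abundance matrix is simulated.

```python
# Hedged sketch: compute richness, Shannon, Simpson, and Pielou evenness per haul
# from a synthetic species-abundance matrix and look at their correlations.
import numpy as np

rng = np.random.default_rng(0)
n_hauls, n_species = 200, 120
abund = (rng.negative_binomial(1, 0.08, size=(n_hauls, n_species))
         * rng.binomial(1, 0.3, size=(n_hauls, n_species)))     # synthetic counts per haul

def indices(counts):
    counts = counts[counts > 0]
    p = counts / counts.sum()
    richness = counts.size
    shannon = -(p * np.log(p)).sum()
    simpson = 1.0 - (p ** 2).sum()
    evenness = shannon / np.log(richness) if richness > 1 else 0.0
    return richness, shannon, simpson, evenness

table = np.array([indices(row) for row in abund if row.sum() > 0])
names = ["richness", "Shannon", "Simpson", "Pielou evenness"]
corr = np.corrcoef(table.T)                                     # index-by-index correlation matrix
for name, row in zip(names, corr):
    print(f"{name:16s}", np.round(row, 2))
```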

  14. The Spike-and-Slab Lasso Generalized Linear Models for Prediction and Associated Genes Detection.

    PubMed

    Tang, Zaixiang; Shen, Yueping; Zhang, Xinyan; Yi, Nengjun

    2017-01-01

    Large-scale "omics" data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, there are considerable challenges in analyzing high-dimensional molecular data, including the large number of potential molecular predictors, limited number of samples, and small effect of each predictor. We propose new Bayesian hierarchical generalized linear models, called spike-and-slab lasso GLMs, for prognostic prediction and detection of associated genes using large-scale molecular data. The proposed model employs a spike-and-slab mixture double-exponential prior for coefficients that can induce weak shrinkage on large coefficients and strong shrinkage on irrelevant coefficients. We have developed a fast and stable algorithm to fit large-scale hierarchical GLMs by incorporating expectation-maximization (EM) steps into the fast cyclic coordinate descent algorithm. The proposed approach integrates nice features of two popular methods, i.e., penalized lasso and Bayesian spike-and-slab variable selection. The performance of the proposed method is assessed via extensive simulation studies. The results show that the proposed approach can provide not only more accurate estimates of the parameters, but also better prediction. We demonstrate the proposed procedure on two cancer data sets: a well-known breast cancer data set consisting of 295 tumors and expression data for 4919 genes, and the ovarian cancer data set from TCGA with 362 tumors and expression data for 5336 genes. Our analyses show that the proposed procedure can generate powerful models for predicting outcomes and detecting associated genes. The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/). Copyright © 2017 by the Genetics Society of America.
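
    The key ingredient of the prior described above is a mixture of two double-exponential components, a narrow spike and a wide slab. As a hedged sketch (not the BhGLM implementation), the snippet below computes the posterior probability that a coefficient belongs to the slab and the resulting coefficient-specific L1 penalty that an EM-within-coordinate-descent scheme would use; the scale values and mixing proportion are illustrative assumptions.

    ```python
    # A minimal sketch of the spike-and-slab double-exponential prior's E-step,
    # assuming illustrative scales s0 (spike) and s1 (slab) and mixing weight theta;
    # this is not the BhGLM implementation, only the idea behind the adaptive penalty.
    import numpy as np

    def de_density(beta, scale):
        """Double-exponential (Laplace) density centered at zero."""
        return np.exp(-np.abs(beta) / scale) / (2.0 * scale)

    def slab_probability(beta, theta=0.5, s0=0.01, s1=1.0):
        """Posterior probability that a coefficient comes from the slab component."""
        slab = theta * de_density(beta, s1)
        spike = (1.0 - theta) * de_density(beta, s0)
        return slab / (slab + spike)

    def effective_l1_penalty(beta, theta=0.5, s0=0.01, s1=1.0):
        """Coefficient-specific L1 penalty: near-zero coefficients feel the strong
        spike shrinkage (1/s0), large coefficients the weak slab shrinkage (1/s1)."""
        p = slab_probability(beta, theta, s0, s1)
        return p / s1 + (1.0 - p) / s0

    print(effective_l1_penalty(np.array([0.001, 0.05, 1.0])))
    ```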

  15. Why Do Phylogenomic Data Sets Yield Conflicting Trees? Data Type Influences the Avian Tree of Life more than Taxon Sampling.

    PubMed

    Reddy, Sushma; Kimball, Rebecca T; Pandey, Akanksha; Hosner, Peter A; Braun, Michael J; Hackett, Shannon J; Han, Kin-Lan; Harshman, John; Huddleston, Christopher J; Kingston, Sarah; Marks, Ben D; Miglia, Kathleen J; Moore, William S; Sheldon, Frederick H; Witt, Christopher C; Yuri, Tamaki; Braun, Edward L

    2017-09-01

    Phylogenomics, the use of large-scale data matrices in phylogenetic analyses, has been viewed as the ultimate solution to the problem of resolving difficult nodes in the tree of life. However, it has become clear that analyses of these large genomic data sets can also result in conflicting estimates of phylogeny. Here, we use the early divergences in Neoaves, the largest clade of extant birds, as a "model system" to understand the basis for incongruence among phylogenomic trees. We were motivated by the observation that trees from two recent avian phylogenomic studies exhibit conflicts. Those studies used different strategies: 1) collecting many characters [~42 megabase pairs (Mbp) of sequence data] from 48 birds, sometimes including only one taxon for each major clade; and 2) collecting fewer characters (~0.4 Mbp) from 198 birds, selected to subdivide long branches. However, the studies also used different data types: the taxon-poor data matrix comprised 68% non-coding sequences whereas coding exons dominated the taxon-rich data matrix. This difference raises the question of whether the primary reason for incongruence is the number of sites, the number of taxa, or the data type. To test among these alternative hypotheses we assembled a novel, large-scale data matrix comprising 90% non-coding sequences from 235 bird species. Although increased taxon sampling appeared to have a positive impact on phylogenetic analyses, the most important variable was data type. Indeed, by analyzing different subsets of the taxa in our data matrix we found that increased taxon sampling actually resulted in increased congruence with the tree from the previous taxon-poor study (which had a majority of non-coding data) instead of the taxon-rich study (which largely used coding data). We suggest that the observed differences in the estimates of topology for these studies reflect data-type effects due to violations of the models used in phylogenetic analyses, some of which may be difficult to detect. If incongruence among trees estimated using phylogenomic methods largely reflects problems with model fit, developing more "biologically realistic" models is likely to be critical for efforts to reconstruct the tree of life. [Birds; coding exons; GTR model; model fit; Neoaves; non-coding DNA; phylogenomics; taxon sampling.]. © The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  16. Power and instrument strength requirements for Mendelian randomization studies using multiple genetic variants.

    PubMed

    Pierce, Brandon L; Ahsan, Habibul; Vanderweele, Tyler J

    2011-06-01

    Mendelian Randomization (MR) studies assess the causality of an exposure-disease association using genetic determinants [i.e. instrumental variables (IVs)] of the exposure. Power and IV strength requirements for MR studies using multiple genetic variants have not been explored. We simulated cohort data sets consisting of a normally distributed disease trait, a normally distributed exposure, which affects this trait, and a biallelic genetic variant that affects the exposure. We estimated power to detect an effect of exposure on disease for varying allele frequencies, effect sizes and sample sizes (using two-stage least squares regression on 10,000 data sets: Stage 1 is a regression of exposure on the variant; Stage 2 is a regression of disease on the fitted exposure). Similar analyses were conducted using multiple genetic variants (5, 10, 20) as independent or combined IVs. We assessed IV strength using the first-stage F statistic. Simulations of realistic scenarios indicate that MR studies will require large (n > 1000), often very large (n > 10,000), sample sizes. In many cases, so-called 'weak IV' problems arise when using multiple variants as independent IVs (even with as few as five), resulting in biased effect estimates. Combining genetic factors into fewer IVs results in modest power decreases, but alleviates weak IV problems. Ideal methods for combining genetic factors depend upon knowledge of the genetic architecture underlying the exposure. The feasibility of well-powered, unbiased MR studies will depend upon the amount of variance in the exposure that can be explained by known genetic factors and the 'strength' of the IV set derived from these genetic factors.
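
    As a hedged illustration of the simulation design summarized above, the sketch below generates a biallelic instrument, an exposure, and a trait, runs the two stages of two-stage least squares, and reports the first-stage F statistic used to gauge instrument strength. Sample size, allele frequency, and effect sizes are illustrative choices, not the values explored in the paper.

    ```python
    # A minimal sketch of a single-variant MR simulation with two-stage least squares;
    # all parameter values are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_two_stage(n=5000, maf=0.3, beta_gx=0.3, beta_xy=0.2):
        g = rng.binomial(2, maf, n)                  # biallelic instrument (0/1/2 copies)
        x = beta_gx * g + rng.normal(size=n)         # normally distributed exposure
        y = beta_xy * x + rng.normal(size=n)         # normally distributed disease trait
        # Stage 1: regress exposure on the variant; F statistic gauges instrument strength
        slope1, intercept1 = np.polyfit(g, x, 1)
        x_hat = slope1 * g + intercept1
        r2 = 1.0 - np.sum((x - x_hat) ** 2) / np.sum((x - x.mean()) ** 2)
        f_stat = r2 / ((1.0 - r2) / (n - 2))         # rule of thumb: F > 10 is a "strong" IV
        # Stage 2: regress the trait on the fitted exposure
        slope2, _ = np.polyfit(x_hat, y, 1)
        return slope2, f_stat

    results = [simulate_two_stage() for _ in range(200)]
    print(np.mean([b for b, _ in results]), np.mean([f for _, f in results]))
    ```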

  17. Angular dispersion of oblique phonon modes in BiFeO3 from micro-Raman scattering

    NASA Astrophysics Data System (ADS)

    Hlinka, J.; Pokorny, J.; Karimi, S.; Reaney, I. M.

    2011-01-01

    The angular dispersion of oblique phonon modes in multiferroic BiFeO3 has been obtained from a micro-Raman spectroscopic investigation of a coarse-grain ceramic sample. Continuity of the measured angular dispersion curves allows conclusive identification of all pure zone-center polar modes. The method employed here to reconstruct the anisotropic crystal property from a large set of independent local measurements on a macroscopically isotropic ceramic sample profits from the considerable dispersion of the oblique modes in ferroelectric perovskites, and it can in principle be conveniently applied to any other optically uniaxial ferroelectric material.

  18. Molecular dynamics based enhanced sampling of collective variables with very large time steps.

    PubMed

    Chen, Pei-Yang; Tuckerman, Mark E

    2018-01-14

    Enhanced sampling techniques that target a set of collective variables and that use molecular dynamics as the driving engine have seen widespread application in the computational molecular sciences as a means to explore the free-energy landscapes of complex systems. The use of molecular dynamics as the fundamental driver of the sampling requires the introduction of a time step whose magnitude is limited by the fastest motions in a system. While standard multiple time-stepping methods allow larger time steps to be employed for the slower and computationally more expensive forces, the maximum achievable increase in time step is limited by resonance phenomena, which inextricably couple fast and slow motions. Recently, we introduced deterministic and stochastic resonance-free multiple time step algorithms for molecular dynamics that solve this resonance problem and allow ten- to twenty-fold gains in the large time step compared to standard multiple time step algorithms [P. Minary et al., Phys. Rev. Lett. 93, 150201 (2004); B. Leimkuhler et al., Mol. Phys. 111, 3579-3594 (2013)]. These methods are based on the imposition of isokinetic constraints that couple the physical system to Nosé-Hoover chains or Nosé-Hoover Langevin schemes. In this paper, we show how to adapt these methods for collective variable-based enhanced sampling techniques, specifically adiabatic free-energy dynamics/temperature-accelerated molecular dynamics, unified free-energy dynamics, and by extension, metadynamics, thus allowing simulations employing these methods to employ similarly very large time steps. The combination of resonance-free multiple time step integrators with free-energy-based enhanced sampling significantly improves the efficiency of conformational exploration.

  19. Molecular dynamics based enhanced sampling of collective variables with very large time steps

    NASA Astrophysics Data System (ADS)

    Chen, Pei-Yang; Tuckerman, Mark E.

    2018-01-01

    Enhanced sampling techniques that target a set of collective variables and that use molecular dynamics as the driving engine have seen widespread application in the computational molecular sciences as a means to explore the free-energy landscapes of complex systems. The use of molecular dynamics as the fundamental driver of the sampling requires the introduction of a time step whose magnitude is limited by the fastest motions in a system. While standard multiple time-stepping methods allow larger time steps to be employed for the slower and computationally more expensive forces, the maximum achievable increase in time step is limited by resonance phenomena, which inextricably couple fast and slow motions. Recently, we introduced deterministic and stochastic resonance-free multiple time step algorithms for molecular dynamics that solve this resonance problem and allow ten- to twenty-fold gains in the large time step compared to standard multiple time step algorithms [P. Minary et al., Phys. Rev. Lett. 93, 150201 (2004); B. Leimkuhler et al., Mol. Phys. 111, 3579-3594 (2013)]. These methods are based on the imposition of isokinetic constraints that couple the physical system to Nosé-Hoover chains or Nosé-Hoover Langevin schemes. In this paper, we show how to adapt these methods for collective variable-based enhanced sampling techniques, specifically adiabatic free-energy dynamics/temperature-accelerated molecular dynamics, unified free-energy dynamics, and by extension, metadynamics, thus allowing simulations employing these methods to employ similarly very large time steps. The combination of resonance-free multiple time step integrators with free-energy-based enhanced sampling significantly improves the efficiency of conformational exploration.

  20. Optimal spatial sampling techniques for ground truth data in microwave remote sensing of soil moisture

    NASA Technical Reports Server (NTRS)

    Rao, R. G. S.; Ulaby, F. T.

    1977-01-01

    The paper examines optimal sampling techniques for obtaining accurate spatial averages of soil moisture, at various depths and for cell sizes in the range 2.5-40 acres, with a minimum number of samples. Both simple random sampling and stratified sampling procedures are used to reach a set of recommended sample sizes for each depth and for each cell size. Major conclusions from statistical sampling test results are that (1) the number of samples required decreases with increasing depth; (2) when the total number of samples cannot be prespecified or the moisture in only a single layer is of interest, then a simple random sampling procedure should be used, based on the observed mean and SD for data from a single field; (3) when the total number of samples can be prespecified and the objective is to measure the soil moisture profile with depth, then stratified random sampling based on optimal allocation should be used; and (4) decreasing the sensor resolution cell size leads to fairly large decreases in sample sizes with stratified sampling procedures, whereas only a moderate decrease is obtained in simple random sampling procedures.
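
    The two sample-size rules contrasted above can be made concrete with a short sketch: a precision-based sample size for simple random sampling, and Neyman (optimal) allocation of a fixed total across strata such as depth layers. The tolerance, confidence level, and strata statistics below are illustrative assumptions, not values from the paper.

    ```python
    # A minimal sketch of the two sample-size rules, assuming illustrative values for
    # the error tolerance, confidence level, and per-stratum standard deviations.
    import numpy as np

    def srs_sample_size(sd, tolerance, z=1.96):
        """Simple random sampling: n so the mean is within +/- tolerance at ~95% confidence."""
        return int(np.ceil((z * sd / tolerance) ** 2))

    def neyman_allocation(n_total, stratum_sizes, stratum_sds):
        """Stratified sampling: allocate n_total across strata (e.g., depth layers)
        in proportion to N_h * S_h (optimal, or Neyman, allocation)."""
        weights = np.asarray(stratum_sizes, float) * np.asarray(stratum_sds, float)
        return np.round(n_total * weights / weights.sum()).astype(int)

    # e.g., soil moisture (% by volume): one layer via SRS, three layers via stratification
    print(srs_sample_size(sd=4.0, tolerance=1.5))
    print(neyman_allocation(30, stratum_sizes=[100, 100, 100], stratum_sds=[5.0, 3.0, 1.5]))
    ```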

  1. On the use of total aerobic spore bacteria to make treatment decisions due to Cryptosporidium risk at public water system wells.

    PubMed

    Berger, Philip; Messner, Michael J; Crosby, Jake; Vacs Renwick, Deborah; Heinrich, Austin

    2018-05-01

    Spore reduction can be used as a surrogate measure of Cryptosporidium natural filtration efficiency. Estimates of log10 (log) reduction were derived from spore measurements in paired surface and well water samples in Casper, Wyoming, and Kearney, Nebraska. We found that these data were suitable for testing the hypothesis (H0) that the average reduction at each site was 2 log or less, using a one-sided Student's t-test. After establishing data quality objectives for the test (expressed as tolerable Type I and Type II error rates), we evaluated the test's performance as a function of the (a) true log reduction, (b) number of paired samples assayed and (c) variance of observed log reductions. We found that 36 paired spore samples are sufficient to achieve the objectives over a wide range of variance, including the variances observed in the two data sets. We also explored the feasibility of using smaller numbers of paired spore samples to supplement bioparticle counts for screening purposes in alluvial aquifers, to differentiate wells with large volume surface water induced recharge from wells with negligible surface water induced recharge. With key assumptions, we propose a normal statistical test of the same hypothesis (H0), but with different performance objectives. As few as six paired spore samples appear adequate as a screening metric to supplement bioparticle counts to differentiate wells in alluvial aquifers with large volume surface water induced recharge. For the case when all available information (including failure to reject H0 based on the limited paired spore data) leads to the conclusion that wells have large surface water induced recharge, we recommend further evaluation using additional paired biweekly spore samples. Published by Elsevier GmbH.
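
    The core test described above is a one-sided, one-sample t-test of whether the mean log reduction exceeds 2. A hedged sketch follows; the data values are made up for illustration, and the call assumes a recent SciPy (the alternative keyword of ttest_1samp).

    ```python
    # A minimal sketch of the one-sided test described above, assuming made-up
    # log10 reduction values; requires SciPy >= 1.6 for the `alternative` keyword.
    import numpy as np
    from scipy import stats

    # paired surface/well log10 spore reductions (illustrative values only)
    log_reductions = np.array([2.4, 2.1, 2.8, 3.0, 2.2, 2.6, 2.5, 2.9])

    # H0: mean reduction <= 2 log; H1: mean reduction > 2 log
    t_stat, p_value = stats.ttest_1samp(log_reductions, popmean=2.0, alternative="greater")
    if p_value < 0.05:
        print(f"Reject H0: evidence the mean reduction exceeds 2 log (p = {p_value:.3f})")
    else:
        print(f"Fail to reject H0 (p = {p_value:.3f})")
    ```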

  2. An Excel Workbook for Identifying Redox Processes in Ground Water

    USGS Publications Warehouse

    Jurgens, Bryant C.; McMahon, Peter B.; Chapelle, Francis H.; Eberts, Sandra M.

    2009-01-01

    The reduction/oxidation (redox) condition of ground water affects the concentration, transport, and fate of many anthropogenic and natural contaminants. The redox state of a ground-water sample is defined by the dominant type of reduction/oxidation reaction, or redox process, occurring in the sample, as inferred from water-quality data. However, because of the difficulty in defining and applying a systematic redox framework to samples from diverse hydrogeologic settings, many regional water-quality investigations do not attempt to determine the predominant redox process in ground water. Recently, McMahon and Chapelle (2008) devised a redox framework that was applied to a large number of samples from 15 principal aquifer systems in the United States to examine the effect of redox processes on water quality. This framework was expanded by Chapelle and others (in press) to use measured sulfide data to differentiate between iron(III)- and sulfate-reducing conditions. These investigations showed that a systematic approach to characterize redox conditions in ground water could be applied to datasets from diverse hydrogeologic settings using water-quality data routinely collected in regional water-quality investigations. This report describes the Microsoft Excel workbook, RedoxAssignment_McMahon&Chapelle.xls, that assigns the predominant redox process to samples using the framework created by McMahon and Chapelle (2008) and expanded by Chapelle and others (in press). Assignment of redox conditions is based on concentrations of dissolved oxygen (O2), nitrate (NO3-), manganese (Mn2+), iron (Fe2+), sulfate (SO42-), and sulfide (sum of dihydrogen sulfide [aqueous H2S], hydrogen sulfide [HS-], and sulfide [S2-]). The logical arguments for assigning the predominant redox process to each sample are performed by a program written in Microsoft Visual Basic for Applications (VBA). The program is called from buttons on the main worksheet. The number of samples that can be analyzed is only limited by the number of rows in Excel (65,536 for Excel 2003 and XP; and 1,048,576 for Excel 2007), and is therefore appropriate for large datasets.
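
    A hedged sketch of how such a threshold-based assignment can be expressed in code is shown below. The threshold values and category labels are illustrative placeholders only; they are not the published McMahon and Chapelle (2008) criteria implemented in the workbook.

    ```python
    # A minimal sketch of a threshold-based redox assignment, assuming placeholder
    # thresholds and category names; these are NOT the published McMahon and
    # Chapelle (2008) criteria, only an illustration of the decision logic.
    def assign_redox(o2, no3, mn, fe, sulfide,
                     o2_thr=0.5, no3_thr=0.5, mn_thr=0.05, fe_thr=0.1):
        """Return an inferred dominant redox process for one sample (all values in mg/L)."""
        if o2 >= o2_thr:
            return "oxic (O2 reduction)"
        if no3 >= no3_thr:
            return "nitrate reduction"
        if mn >= mn_thr and fe < fe_thr:
            return "manganese reduction"
        if fe >= fe_thr:
            # measured sulfide is used to separate iron(III) from sulfate reduction
            return "sulfate reduction" if sulfide > 0 else "iron(III) reduction"
        return "indeterminate"

    print(assign_redox(o2=0.1, no3=0.2, mn=0.02, fe=0.8, sulfide=0.0))
    ```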

  3. Programs to reduce teen pregnancy, sexually transmitted infections, and associated sexual risk behaviors: a systematic review.

    PubMed

    Goesling, Brian; Colman, Silvie; Trenholm, Christopher; Terzian, Mary; Moore, Kristin

    2014-05-01

    This systematic review provides a comprehensive, updated assessment of programs with evidence of effectiveness in reducing teen pregnancy, sexually transmitted infections (STIs), or associated sexual risk behaviors. The review was conducted in four steps. First, multiple literature search strategies were used to identify relevant studies released from 1989 through January 2011. Second, identified studies were screened against prespecified eligibility criteria. Third, studies were assessed by teams of two trained reviewers for the quality and execution of their research designs. Fourth, for studies that passed the quality assessment, the review team extracted and analyzed information on the research design, study sample, evaluation setting, and program impacts. A total of 88 studies met the review criteria for study quality and were included in the data extraction and analysis. The studies examined a range of programs delivered in diverse settings. Most studies had mixed-gender and predominately African-American research samples (70% and 51%, respectively). Randomized controlled trials accounted for the large majority (87%) of included studies. Most studies (76%) included multiple follow-ups, with sample sizes ranging from 62 to 5,244. Analysis of the study impact findings identified 31 programs with evidence of effectiveness. Research conducted since the late 1980s has identified more than two dozen teen pregnancy and STI prevention programs with evidence of effectiveness. Key strengths of this research are the large number of randomized controlled trials, the common use of multiple follow-up periods, and attention to a broad range of programs delivered in diverse settings. Two main gaps are a lack of replication studies and the need for more research on Latino youth and other high-risk populations. In addressing these gaps, researchers must overcome common limitations in study design, analysis, and reporting that have negatively affected prior research. Copyright © 2014 Society for Adolescent Health and Medicine. All rights reserved.

  4. Absolute Isotopic Abundance Ratios and the Accuracy of Δ47 Measurements

    NASA Astrophysics Data System (ADS)

    Daeron, M.; Blamart, D.; Peral, M.; Affek, H. P.

    2016-12-01

    Conversion from raw IRMS data to clumped isotope anomalies in CO2 (Δ47) relies on four external parameters: the (13C/12C) ratio of VPDB, the (17O/16O) and (18O/16O) ratios of VSMOW (or VPDB-CO2), and the slope of the triple oxygen isotope line (λ). Here we investigate the influence that these isotopic parameters exert on measured Δ47 values, using real-world data corresponding to 7 months of measurements; simulations based on randomly generated data; precise comparisons between water-equilibrated CO2 samples and between carbonate standards believed to share quasi-identical Δ47 values; and reprocessing of two carbonate calibration data sets with different slopes of Δ47 versus T. Using different sets of isotopic parameters generally produces systematic offsets as large as 0.04 ‰ in final Δ47 values. What's more, even using a single set of isotopic parameters can produce intra- and inter-laboratory discrepancies in final Δ47 values, if some of these parameters are inaccurate. Depending on the isotopic compositions of the standards used for conversion to "absolute" values, these errors should correlate strongly with either δ13C or δ18O, or more weakly with both. Based on measurements of samples expected to display identical Δ47 values, such as 25°C water-equilibrated CO2 with different carbon and oxygen isotope compositions, or high-temperature standards ETH-1 and ETH-2, we conclude that the isotopic parameters used so far in most clumped isotope studies produce large, systematic errors controlled by the relative bulk isotopic compositions of samples and standards, which should be one of the key factors responsible for current inter-laboratory discrepancies. By contrast, the isotopic parameters of Brand et al. [2010] appear to yield accurate Δ47 values regardless of bulk isotopic composition. References: Brand, Assonov and Coplen [2010] http://dx.doi.org/10.1351/PAC-REP-09-01-05

  5. How to Handle Speciose Clades? Mass Taxon-Sampling as a Strategy towards Illuminating the Natural History of Campanula (Campanuloideae)

    PubMed Central

    Mansion, Guilhem; Parolly, Gerald; Crowl, Andrew A.; Mavrodiev, Evgeny; Cellinese, Nico; Oganesian, Marine; Fraunhofer, Katharina; Kamari, Georgia; Phitos, Dimitrios; Haberle, Rosemarie; Akaydin, Galip; Ikinci, Nursel; Raus, Thomas; Borsch, Thomas

    2012-01-01

    Background: Speciose clades usually harbor species with a broad spectrum of adaptive strategies and complex distribution patterns, and thus constitute ideal systems to disentangle biotic and abiotic causes underlying species diversification. The delimitation of such study systems to test evolutionary hypotheses is difficult because they often rely on artificial genus concepts as starting points. One of the most prominent examples is the bellflower genus Campanula with some 420 species, but up to 600 species when including all lineages to which Campanula is paraphyletic. We generated a large alignment of petD group II intron sequences to include more than 70% of described species as a reference. By comparison with partial data sets we could then assess the impact of selective taxon sampling strategies on phylogenetic reconstruction and subsequent evolutionary conclusions. Methodology/Principal Findings: Phylogenetic analyses based on maximum parsimony (PAUP, PRAP), Bayesian inference (MrBayes), and maximum likelihood (RAxML) were first carried out on the large reference data set (D680). Parameters, including tree topology, branch support, and age estimates, were then compared to those obtained from smaller data sets resulting from “classification-guided” (D088) and “phylogeny-guided sampling” (D101). Analyses of D088 failed to fully recover the phylogenetic diversity in Campanula, whereas D101 inferred significantly different branch support and age estimates. Conclusions/Significance: A short genomic region with high phylogenetic utility allowed us to easily generate a comprehensive phylogenetic framework for the speciose Campanula clade. Our approach recovered 17 well-supported and circumscribed sub-lineages. Knowing that these will be instrumental for developing more specific evolutionary hypotheses and for guiding future research, we highlight the predictive value of a mass taxon-sampling strategy as a first essential step towards illuminating the detailed evolutionary history of diverse clades. PMID:23209646

  6. Alchemical prediction of hydration free energies for SAMPL

    PubMed Central

    Mobley, David L.; Liu, Shaui; Cerutti, David S.; Swope, William C.; Rice, Julia E.

    2013-01-01

    Hydration free energy calculations have become important tests of force fields. Alchemical free energy calculations based on molecular dynamics simulations provide a rigorous way to calculate these free energies for a particular force field, given sufficient sampling. Here, we report results of alchemical hydration free energy calculations for the set of small molecules comprising the 2011 Statistical Assessment of Modeling of Proteins and Ligands (SAMPL) challenge. Our calculations are largely based on the Generalized Amber Force Field (GAFF) with several different charge models, and we achieved RMS errors in the 1.4-2.2 kcal/mol range depending on charge model, marginally higher than what we typically observed in previous studies [1-5]. The test set consists of ethane, biphenyl, and a dibenzyl dioxin, as well as a series of chlorinated derivatives of each. We found that, for this set, using high-quality partial charges from MP2/cc-PVTZ SCRF RESP fits provided marginally improved agreement with experiment over using AM1-BCC partial charges as we have more typically done, in keeping with our recent findings [5]. Switching to OPLS Lennard-Jones parameters with AM1-BCC charges also improves agreement with experiment. We also find a number of chemical trends within each molecular series which we can explain, but there are also some surprises, including some that are captured by the calculations and some that are not. PMID:22198475

  7. BloodSpot: a database of gene expression profiles and transcriptional programs for healthy and malignant haematopoiesis.

    PubMed

    Bagger, Frederik Otzen; Sasivarevic, Damir; Sohi, Sina Hadi; Laursen, Linea Gøricke; Pundhir, Sachin; Sønderby, Casper Kaae; Winther, Ole; Rapin, Nicolas; Porse, Bo T

    2016-01-04

    Research on human and murine haematopoiesis has resulted in a vast number of gene-expression data sets that can potentially answer questions regarding normal and aberrant blood formation. To researchers and clinicians with limited bioinformatics experience, these data have remained available, yet largely inaccessible. Current databases provide information about gene-expression but fail to answer key questions regarding co-regulation, genetic programs or effect on patient survival. To address these shortcomings, we present BloodSpot (www.bloodspot.eu), which includes and greatly extends our previously released database HemaExplorer, a database of gene expression profiles from FACS sorted healthy and malignant haematopoietic cells. A revised interactive interface simultaneously provides a plot of gene expression along with a Kaplan-Meier analysis and a hierarchical tree depicting the relationship between different cell types in the database. The database now includes 23 high-quality curated data sets relevant to normal and malignant blood formation and, in addition, we have assembled and built a unique integrated data set, BloodPool. Bloodpool contains more than 2000 samples assembled from six independent studies on acute myeloid leukemia. Furthermore, we have devised a robust sample integration procedure that allows for sensitive comparison of user-supplied patient samples in a well-defined haematopoietic cellular space. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  8. Assessing Pictograph Recognition: A Comparison of Crowdsourcing and Traditional Survey Approaches.

    PubMed

    Kuang, Jinqiu; Argo, Lauren; Stoddard, Greg; Bray, Bruce E; Zeng-Treitler, Qing

    2015-12-17

    Compared to traditional methods of participant recruitment, online crowdsourcing platforms provide a fast and low-cost alternative. Amazon Mechanical Turk (MTurk) is a large and well-known crowdsourcing service. It has developed into the leading platform for crowdsourcing recruitment. Our objective was to explore the application of online crowdsourcing for health informatics research, specifically the testing of medical pictographs. A set of pictographs created for cardiovascular hospital discharge instructions was tested for recognition. This set of illustrations (n=486) was first tested through an in-person survey in a hospital setting (n=150) and then using online MTurk participants (n=150). We analyzed these survey results to determine their comparability. Both the demographics and the pictograph recognition rates of online participants were different from those of the in-person participants. In the multivariable linear regression model comparing the 2 groups, the MTurk group scored significantly higher than the hospital sample after adjusting for potential demographic characteristics (adjusted mean difference 0.18, 95% CI 0.08-0.28, P<.001). The adjusted mean ratings were 2.95 (95% CI 2.89-3.02) for the in-person hospital sample and 3.14 (95% CI 3.07-3.20) for the online MTurk sample on a 4-point Likert scale (1=totally incorrect, 4=totally correct). The findings suggest that crowdsourcing is a viable complement to traditional in-person surveys, but it cannot replace them.

  9. Just the right age: well-clustered exposure ages from a global glacial 10Be compilation

    NASA Astrophysics Data System (ADS)

    Heyman, Jakob; Margold, Martin

    2017-04-01

    Cosmogenic exposure dating has been used extensively for defining glacial chronologies, both in ice sheet and alpine settings, and the global set of published ages today reaches well beyond 10,000 samples. Over the last few years, a number of important developments have improved the measurements (with well-defined AMS standards) and exposure age calculations (with updated data and methods for calculating production rates), in the best case enabling high precision dating of past glacial events. A remaining problem, however, is the fact that a large portion of all dated samples have been affected by prior and/or incomplete exposure, yielding erroneous exposure ages under the standard assumptions. One way to address this issue is to only use exposure ages that can be confidently considered as unaffected by prior/incomplete exposure, such as groups of samples with statistically identical ages. Here we use objective statistical criteria to identify groups of well-clustered exposure ages from the global glacial "expage" 10Be compilation. Out of ˜1700 groups with at least 3 individual samples, ˜30% are well-clustered, increasing to ˜45% if allowing outlier rejection of a maximum of 1/3 of the samples (still requiring a minimum of 3 well-clustered ages). The dataset of well-clustered ages is heavily dominated by ages <30 ka, showing that well-defined cosmogenic chronologies primarily exist for the last glaciation. We observe a large-scale global synchronicity in the timing of the last deglaciation from ˜20 to 10 ka. There is also a general correlation between the timing of deglaciation and latitude (or size of the individual ice mass), with earlier deglaciation in lower latitudes and later deglaciation towards the poles. Grouping the data into regions and comparing with available paleoclimate data, we can start to untangle regional differences in the last deglaciation and the climate events controlling the ice mass loss. The extensive dataset and the statistical analysis enable an unprecedented global view on the last deglaciation.
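
    One common way to formalize "statistically identical ages" is a chi-square test of the scatter about the uncertainty-weighted mean age; the hedged sketch below illustrates that general approach. The exact clustering criterion applied to the expage compilation may differ, and the ages and uncertainties shown are illustrative.

    ```python
    # A minimal sketch of a clustering test, assuming a chi-square criterion on the
    # scatter about the uncertainty-weighted mean; not necessarily the expage criterion.
    import numpy as np
    from scipy import stats

    def is_well_clustered(ages, sigmas, alpha=0.05):
        """Return (True/False, weighted mean age) for a group of exposure ages."""
        ages = np.asarray(ages, float)
        sigmas = np.asarray(sigmas, float)
        w = 1.0 / sigmas ** 2
        mean = np.sum(w * ages) / np.sum(w)             # uncertainty-weighted mean
        chi2 = np.sum(((ages - mean) / sigmas) ** 2)    # scatter about the mean
        dof = len(ages) - 1
        return bool(chi2 <= stats.chi2.ppf(1.0 - alpha, dof)), mean

    print(is_well_clustered([14.2, 14.8, 15.1], [0.6, 0.7, 0.6]))   # ages in ka (illustrative)
    ```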

  10. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver.

    PubMed

    Wymant, Chris; Blanquart, François; Golubchik, Tanya; Gall, Astrid; Bakker, Margreet; Bezemer, Daniela; Croucher, Nicholas J; Hall, Matthew; Hillebregt, Mariska; Ong, Swee Hoe; Ratmann, Oliver; Albert, Jan; Bannert, Norbert; Fellay, Jacques; Fransen, Katrien; Gourlay, Annabelle; Grabowski, M Kate; Gunsenheimer-Bartmeyer, Barbara; Günthard, Huldrych F; Kivelä, Pia; Kouyos, Roger; Laeyendecker, Oliver; Liitsola, Kirsi; Meyer, Laurence; Porter, Kholoud; Ristola, Matti; van Sighem, Ard; Berkhout, Ben; Cornelissen, Marion; Kellam, Paul; Reiss, Peter; Fraser, Christophe

    2018-01-01

    Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from https://github.com/ChrisHIV/shiver.

  11. Household Energy Consumption Segmentation Using Hourly Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kwac, J; Flora, J; Rajagopal, R

    2014-01-01

    The increasing US deployment of residential advanced metering infrastructure (AMI) has made hourly energy consumption data widely available. Using CA smart meter data, we investigate a household electricity segmentation methodology that uses an encoding system with a pre-processed load shape dictionary. Structured approaches using features derived from the encoded data drive five sample program- and policy-relevant energy lifestyle segmentation strategies. We also ensure that the methodologies developed scale to large data sets.
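
    A hedged sketch of the load-shape-dictionary idea is given below: each daily profile is normalized to remove magnitude, a k-means dictionary of representative shapes is learned, and every day is encoded by its nearest dictionary entry. The number of shapes and the normalization scheme are illustrative assumptions, not those of the study.

    ```python
    # A minimal sketch of load-shape encoding with a k-means dictionary; the number
    # of shapes and the normalization are illustrative, not the paper's choices.
    import numpy as np
    from sklearn.cluster import KMeans

    def encode_load_shapes(hourly, n_shapes=16, random_state=0):
        """hourly: array of shape (n_days, 24) of household consumption (kWh)."""
        totals = hourly.sum(axis=1, keepdims=True)
        shapes = hourly / np.where(totals > 0, totals, 1.0)   # keep the shape, drop the magnitude
        km = KMeans(n_clusters=n_shapes, n_init=10, random_state=random_state).fit(shapes)
        return km.labels_, km.cluster_centers_                # per-day code + shape dictionary

    rng = np.random.default_rng(1)
    demo_days = rng.gamma(2.0, 0.5, size=(200, 24))           # stand-in smart-meter data
    codes, dictionary = encode_load_shapes(demo_days)
    print(np.bincount(codes))                                 # how often each dictionary shape is used
    ```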

  12. 2017 ARL Summer Student Program Volume 2: Compendium of Abstracts

    DTIC Science & Technology

    2017-12-01

    useful for equipping quadrotors with advanced capabilities, such as running deep learning networks. A second purpose of this project is to quantify the...Multiple samples were run in the LEAP 5000-XR generating large data sets (hundreds of millions of ions composing hundreds of cubic nanometers of...produce viable walking and running gaits on the final product. Even further, the monetary and time cost of this increases significantly when working

  13. 7 CFR 27.23 - Duplicate sets of samples of cotton.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 7 Agriculture 2 2011-01-01 2011-01-01 false Duplicate sets of samples of cotton. 27.23 Section 27... REGULATIONS COTTON CLASSIFICATION UNDER COTTON FUTURES LEGISLATION Regulations Inspection and Samples § 27.23 Duplicate sets of samples of cotton. The duplicate sets of samples shall be inclosed in wrappers or...

  14. 7 CFR 27.23 - Duplicate sets of samples of cotton.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 7 Agriculture 2 2010-01-01 2010-01-01 false Duplicate sets of samples of cotton. 27.23 Section 27... REGULATIONS COTTON CLASSIFICATION UNDER COTTON FUTURES LEGISLATION Regulations Inspection and Samples § 27.23 Duplicate sets of samples of cotton. The duplicate sets of samples shall be inclosed in wrappers or...

  15. 7 CFR 27.23 - Duplicate sets of samples of cotton.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 7 Agriculture 2 2014-01-01 2014-01-01 false Duplicate sets of samples of cotton. 27.23 Section 27... REGULATIONS COTTON CLASSIFICATION UNDER COTTON FUTURES LEGISLATION Regulations Inspection and Samples § 27.23 Duplicate sets of samples of cotton. The duplicate sets of samples shall be inclosed in wrappers or...

  16. 7 CFR 27.23 - Duplicate sets of samples of cotton.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... 7 Agriculture 2 2013-01-01 2013-01-01 false Duplicate sets of samples of cotton. 27.23 Section 27... REGULATIONS COTTON CLASSIFICATION UNDER COTTON FUTURES LEGISLATION Regulations Inspection and Samples § 27.23 Duplicate sets of samples of cotton. The duplicate sets of samples shall be inclosed in wrappers or...

  17. 7 CFR 27.23 - Duplicate sets of samples of cotton.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... 7 Agriculture 2 2012-01-01 2012-01-01 false Duplicate sets of samples of cotton. 27.23 Section 27... REGULATIONS COTTON CLASSIFICATION UNDER COTTON FUTURES LEGISLATION Regulations Inspection and Samples § 27.23 Duplicate sets of samples of cotton. The duplicate sets of samples shall be inclosed in wrappers or...

  18. Solar-thermal complex sample processing for nucleic acid based diagnostics in limited resource settings

    PubMed Central

    Gumus, Abdurrahman; Ahsan, Syed; Dogan, Belgin; Jiang, Li; Snodgrass, Ryan; Gardner, Andrea; Lu, Zhengda; Simpson, Kenneth; Erickson, David

    2016-01-01

    The use of point-of-care (POC) devices in limited resource settings where access to commonly used infrastructure, such as water and electricity, can be restricted represents simultaneously one of the best application fits for POC systems as well as one of the most challenging places to deploy them. Of the many challenges involved in these systems, the preparation and processing of complex samples like stool, vomit, and biopsies are particularly difficult due to the high number and varied nature of mechanical and chemical interferents present in the sample. Previously we have demonstrated the ability to use solar-thermal energy to perform PCR based nucleic acid amplifications. In this work we demonstrate how the technique, using similar infrastructure, can also be used to perform solar-thermal sample processing for extracting and isolating Vibrio cholerae nucleic acids from fecal samples. The use of opto-thermal energy enables the use of sunlight to drive thermal lysing reactions in large volumes without the need for external electrical power. Using the system, we demonstrate the ability to reach a 95°C threshold in less than 5 minutes and maintain a stable sample temperature to within +/− 2°C following the ramp-up. The system is demonstrated to provide linear results between 10⁴ and 10⁸ CFU/mL when the released nucleic acids were quantified via traditional means. Additionally, we couple the sample processing unit with our previously demonstrated solar-thermal PCR and tablet-based detection system to demonstrate very low power sample-in-answer-out detection. PMID:27231636

  19. Molecular Diagnosis of Malaria by Photo-Induced Electron Transfer Fluorogenic Primers: PET-PCR

    PubMed Central

    Lucchi, Naomi W.; Narayanan, Jothikumar; Karell, Mara A.; Xayavong, Maniphet; Kariuki, Simon; DaSilva, Alexandre J.; Hill, Vincent; Udhayakumar, Venkatachalam

    2013-01-01

    There is a critical need for developing new malaria diagnostic tools that are sensitive, cost effective and capable of performing large scale diagnosis. The real-time PCR methods are particularly robust for large scale screening and they can be used in malaria control and elimination programs. We have designed novel self-quenching photo-induced electron transfer (PET) fluorogenic primers for the detection of P. falciparum and the Plasmodium genus by real-time PCR. A total of 119 samples consisting of different malaria species and mixed infections were used to test the utility of the novel PET-PCR primers in the diagnosis of clinical samples. The sensitivity and specificity were calculated using a nested PCR as the gold standard, and the novel primer sets demonstrated 100% sensitivity and specificity. The limit of detection for P. falciparum was shown to be 3.2 parasites/µl using both the Plasmodium genus and P. falciparum-specific primers, and 5.8 parasites/µl for P. ovale, 3.5 parasites/µl for P. malariae and 5 parasites/µl for P. vivax using the genus-specific primer set. Moreover, the reaction can be duplexed to detect both Plasmodium spp. and P. falciparum in a single reaction. The PET-PCR assay does not require internal probes or intercalating dyes, which makes it convenient to use and less expensive than other real-time PCR diagnostic formats. Further validation of this technique in the field will help to assess its utility for large scale screening in malaria control and elimination programs. PMID:23437209

  20. Automated selected reaction monitoring data analysis workflow for large-scale targeted proteomic studies.

    PubMed

    Surinova, Silvia; Hüttenhain, Ruth; Chang, Ching-Yun; Espona, Lucia; Vitek, Olga; Aebersold, Ruedi

    2013-08-01

    Targeted proteomics based on selected reaction monitoring (SRM) mass spectrometry is commonly used for accurate and reproducible quantification of protein analytes in complex biological mixtures. Strictly hypothesis-driven, SRM assays quantify each targeted protein by collecting measurements on its peptide fragment ions, called transitions. To achieve sensitive and accurate quantitative results, experimental design and data analysis must consistently account for the variability of the quantified transitions. This consistency is especially important in large experiments, which increasingly require profiling up to hundreds of proteins over hundreds of samples. Here we describe a robust and automated workflow for the analysis of large quantitative SRM data sets that integrates data processing, statistical protein identification and quantification, and dissemination of the results. The integrated workflow combines three software tools: mProphet for peptide identification via probabilistic scoring; SRMstats for protein significance analysis with linear mixed-effect models; and PASSEL, a public repository for storage, retrieval and query of SRM data. The input requirements for the protocol are files with SRM traces in mzXML format, and a file with a list of transitions in a text tab-separated format. The protocol is especially suited for data with heavy isotope-labeled peptide internal standards. We demonstrate the protocol on a clinical data set in which the abundances of 35 biomarker candidates were profiled in 83 blood plasma samples of subjects with ovarian cancer or benign ovarian tumors. The time frame to realize the protocol is 1-2 weeks, depending on the number of replicates used in the experiment.

  1. Identifying Microlensing Events in Large, Non-Uniformly Sampled Surveys: The Case of the Palomar Transient Factory

    NASA Astrophysics Data System (ADS)

    Price-Whelan, Adrian M.; Agueros, M. A.; Fournier, A.; Street, R.; Ofek, E.; Levitan, D. B.; PTF Collaboration

    2013-01-01

    Many current photometric, time-domain surveys are driven by specific goals such as searches for supernovae or transiting exoplanets, or studies of stellar variability. These goals in turn set the cadence with which individual fields are re-imaged. In the case of the Palomar Transient Factory (PTF), several such sub-surveys are being conducted in parallel, leading to extremely non-uniform sampling over the survey's nearly 20,000 sq. deg. footprint. While the typical 7.26 sq. deg. PTF field has been imaged 20 times in R-band, ~2300 sq. deg. have been observed more than 100 times. We use the existing PTF data (6.4 x 10⁷ light curves) to study the trade-off that occurs when searching for microlensing events when one has access to a large survey footprint with irregular sampling. To examine the probability that microlensing events can be recovered in these data, we also test previous statistics used on uniformly sampled data to identify variables and transients. We find that one such statistic, the von Neumann ratio, performs best for identifying simulated microlensing events. We develop a selection method using this statistic and apply it to data from all PTF fields with >100 observations to uncover a number of interesting candidate events. This work can help constrain all-sky event rate predictions and test microlensing signal recovery in large datasets, both of which will be useful to future wide-field, time-domain surveys such as the LSST.
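
    The von Neumann ratio mentioned above is simple to compute: the mean squared difference between successive measurements divided by the variance of the light curve. Uncorrelated noise gives values near 2, while a smooth, correlated brightening (as in a microlensing event) pulls the ratio down. The sketch below is a hedged illustration on synthetic data, not the PTF selection pipeline.

    ```python
    # A minimal sketch of the von Neumann ratio on synthetic light curves; the noise
    # level and bump shape are illustrative, not PTF data.
    import numpy as np

    def von_neumann_ratio(mag):
        """Mean squared successive difference divided by the variance of the series."""
        mag = np.asarray(mag, float)
        return np.mean(np.diff(mag) ** 2) / np.var(mag)

    rng = np.random.default_rng(2)
    n = 200
    noise = rng.normal(0.0, 0.05, n)                                        # flat light curve: ratio ~ 2
    bump = noise + 0.5 * np.exp(-0.5 * ((np.arange(n) - 100) / 15.0) ** 2)  # smooth brightening: ratio << 2
    print(von_neumann_ratio(noise), von_neumann_ratio(bump))
    ```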

  2. Quantitative Assessment of Molecular Dynamics Sampling for Flexible Systems.

    PubMed

    Nemec, Mike; Hoffmann, Daniel

    2017-02-14

    Molecular dynamics (MD) simulation is a natural method for the study of flexible molecules but at the same time is limited by the large size of the conformational space of these molecules. We ask by how much the MD sampling quality for flexible molecules can be improved by two means: the use of diverse sets of trajectories starting from different initial conformations to detect deviations between samples and sampling with enhanced methods such as accelerated MD (aMD) or scaled MD (sMD) that distort the energy landscape in controlled ways. To this end, we test the effects of these approaches on MD simulations of two flexible biomolecules in aqueous solution, Met-Enkephalin (5 amino acids) and HIV-1 gp120 V3 (a cycle of 35 amino acids). We assess the convergence of the sampling quantitatively with known, extensive measures of cluster number N_c and cluster distribution entropy S_c and with two new quantities, conformational overlap O_conf and density overlap O_dens, both conveniently ranging from 0 to 1. These new overlap measures quantify self-consistency of sampling in multitrajectory MD experiments, a necessary condition for converged sampling. A comprehensive assessment of sampling quality of MD experiments identifies the combination of diverse trajectory sets and aMD as the most efficient approach among those tested. However, analysis of O_dens between conventional and aMD trajectories also reveals that we have not completely corrected aMD sampling for the distorted energy landscape. Moreover, for V3, the courses of N_c and O_dens indicate that much higher resources than those generally invested today will probably be needed to achieve convergence. The comparative analysis also shows that conventional MD simulations with insufficient sampling can be easily misinterpreted as being converged.
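
    Two of the measures named above are straightforward to compute from clustered or histogrammed trajectory data. The hedged sketch below shows a cluster distribution entropy from cluster populations and a density overlap computed as a histogram intersection between two samples; the paper's exact definitions of O_conf and O_dens may differ from this illustration.

    ```python
    # A minimal sketch of two sampling-quality measures: cluster distribution entropy
    # and a histogram-intersection density overlap. These are illustrative stand-ins,
    # not necessarily the paper's exact definitions of O_conf and O_dens.
    import numpy as np

    def cluster_entropy(counts):
        """Cluster distribution entropy S_c from cluster populations (in nats)."""
        p = np.asarray(counts, float)
        p = p[p > 0] / p.sum()
        return float(-np.sum(p * np.log(p)))

    def density_overlap(sample_a, sample_b, bins=50, hist_range=None):
        """Histogram-intersection overlap between two 1D samples, in [0, 1]."""
        ha, edges = np.histogram(sample_a, bins=bins, range=hist_range)
        hb, _ = np.histogram(sample_b, bins=edges)
        pa, pb = ha / ha.sum(), hb / hb.sum()
        return float(np.minimum(pa, pb).sum())

    rng = np.random.default_rng(3)
    a, b = rng.normal(0.0, 1.0, 5000), rng.normal(0.5, 1.0, 5000)
    print(cluster_entropy([120, 40, 15, 5]), density_overlap(a, b, hist_range=(-5.0, 6.0)))
    ```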

  3. BOREAS TE-2 NSA Soil Lab Data

    NASA Technical Reports Server (NTRS)

    Veldhuis, Hugo; Hall, Forrest G. (Editor); Knapp, David E. (Editor)

    2000-01-01

    This data set contains the major soil properties of soil samples collected in 1994 at the tower flux sites in the Northern Study Area (NSA). The soil samples were collected by Hugo Veldhuis and his staff from the University of Manitoba. The mineral soil samples were largely analyzed by Barry Goetz, under the supervision of Dr. Harold Rostad at the University of Saskatchewan. The organic soil samples were largely analyzed by Peter Haluschak, under the supervision of Hugo Veldhuis at the Centre for Land and Biological Resources Research in Winnipeg, Manitoba. During the course of field investigation and mapping, selected surface and subsurface soil samples were collected for laboratory analysis. These samples were used as benchmark references for specific soil attributes in general soil characterization. Detailed soil sampling, description, and laboratory analysis were performed on selected modal soils to provide examples of common soil physical and chemical characteristics in the study area. The soil properties that were determined include soil horizon; dry soil color; pH; bulk density; total, organic, and inorganic carbon; electric conductivity; cation exchange capacity; exchangeable sodium, potassium, calcium, magnesium, and hydrogen; water content at 0.01, 0.033, and 1.5 MPa; nitrogen; phosphorus; particle size distribution; texture; pH of the mineral soil and of the organic soil; extractable acid; and sulfur. These data are stored in ASCII text files. The data files are available on a CD-ROM (see document number 20010000884), or from the Oak Ridge National Laboratory (ORNL) Distributed Active Archive Center (DAAC).

  4. Identifying airborne fungi in Seoul, Korea using metagenomics.

    PubMed

    Oh, Seung-Yoon; Fong, Jonathan J; Park, Myung Soo; Chang, Limseok; Lim, Young Woon

    2014-06-01

    Fungal spores are widespread and common in the atmosphere. In this study, we use a metagenomic approach to study the fungal diversity in six total air samples collected from April to May 2012 in Seoul, Korea. This springtime period is important in Korea because of the peak in fungal spore concentration and Asian dust storms, although the year of this study (2012) was unique in that there were no major Asian dust events. Clustering sequences for operational taxonomic unit (OTU) identification recovered 1,266 unique OTUs in the combined dataset, with between 223᾿96 OTUs present in individual samples. OTUs from three fungal phyla were identified. For Ascomycota, Davidiella (anamorph: Cladosporium) was the most common genus in all samples, often accounting for more than 50% of all sequences in a sample. Other common Ascomycota genera identified were Alternaria, Didymella, Khuskia, Geosmitha, Penicillium, and Aspergillus. While several Basidiomycota genera were observed, Chytridiomycota OTUs were only present in one sample. Consistency was observed within sampling days, but there was a large shift in species composition from Ascomycota dominant to Basidiomycota dominant in the middle of the sampling period. This marked change may have been caused by meteorological events. A potential set of 40 allergy-inducing genera were identified, accounting for a large proportion of the diversity present (22.5᾿7.2%). Our study identifies high fungal diversity and potentially high levels of fungal allergens in springtime air of Korea, and provides a good baseline for future comparisons with Asian dust storms.

  5. Telling plant species apart with DNA: from barcodes to genomes

    PubMed Central

    Li, De-Zhu; van der Bank, Michelle

    2016-01-01

    Land plants underpin a multitude of ecosystem functions, support human livelihoods and represent a critically important component of terrestrial biodiversity—yet many tens of thousands of species await discovery, and plant identification remains a substantial challenge, especially where material is juvenile, fragmented or processed. In this opinion article, we tackle two main topics. Firstly, we provide a short summary of the strengths and limitations of plant DNA barcoding for addressing these issues. Secondly, we discuss options for enhancing current plant barcodes, focusing on increasing discriminatory power via either gene capture of nuclear markers or genome skimming. The former has the advantage of establishing a defined set of target loci maximizing efficiency of sequencing effort, data storage and analysis. The challenge is developing a probe set for large numbers of nuclear markers that works over sufficient phylogenetic breadth. Genome skimming has the advantage of using existing protocols and being backward compatible with existing barcodes; and the depth of sequence coverage can be increased as sequencing costs fall. Its non-targeted nature does, however, present a major informatics challenge for upscaling to large sample sets. This article is part of the themed issue ‘From DNA barcodes to biomes’. PMID:27481790

  6. Effects of Active Sting Damping on Common Research Model Data Quality

    NASA Technical Reports Server (NTRS)

    Acheson, Michael J.; Balakrishna, S.

    2011-01-01

    Recent tests using the Common Research Model (CRM) at the Langley National Transonic Facility (NTF) and the Ames 11-foot Transonic Wind Tunnel (11' TWT) produced large sets of data that have been used to examine the effects of active damping on transonic tunnel aerodynamic data quality. In particular, large statistically significant sets of repeat data demonstrate that the active damping system had no apparent effect on drag, lift and pitching moment repeatability during warm testing conditions, while simultaneously enabling aerodynamic data to be obtained post stall. A small set of cryogenic (high Reynolds number) repeat data was obtained at the NTF and again showed a negligible effect on data repeatability. However, due to a degradation of control power in the active damping system cryogenically, the ability to obtain test data post-stall was not achieved during cryogenic testing. Additionally, comparisons of data repeatability between NTF and 11-ft TWT CRM data led to further (warm) testing at the NTF, which demonstrated that, for a modest increase in data sampling time, a factor of 2-3 improvement in drag and pitching moment repeatability was readily achieved, independent of the active damping system.

  7. A large-scale, long-term study of scale drift: The micro view and the macro view

    NASA Astrophysics Data System (ADS)

    He, W.; Li, S.; Kingsbury, G. G.

    2016-11-01

    The development of measurement scales for use across years and grades in educational settings provides unique challenges, as instructional approaches, instructional materials, and content standards all change periodically. This study examined the measurement stability of a set of Rasch measurement scales that have been in place for almost 40 years. In order to investigate the stability of these scales, item responses were collected from a large set of students who took operational adaptive tests using items calibrated to the measurement scales. For the four scales that were examined, item samples ranged from 2183 to 7923 items. Each item was administered to at least 500 students in each grade level, resulting in approximately 3000 responses per item. Stability was examined at the micro level analysing change in item parameter estimates that have occurred since the items were first calibrated. It was also examined at the macro level, involving groups of items and overall test scores for students. Results indicated that individual items had changes in their parameter estimates, which require further analysis and possible recalibration. At the same time, the results at the total score level indicate substantial stability in the measurement scales over the span of their use.

  8. Evaluating data mining algorithms using molecular dynamics trajectories.

    PubMed

    Tatsis, Vasileios A; Tjortjis, Christos; Tzirakis, Panagiotis

    2013-01-01

    Molecular dynamics simulations provide a sample of a molecule's conformational space. Experiments on the µs time scale, resulting in large amounts of data, are nowadays routine. Data mining techniques such as classification provide a way to analyse such data. In this work, we evaluate and compare several classification algorithms using three data sets which resulted from computer simulations of a potential enzyme mimetic biomolecule. We evaluated 65 classifiers available in the well-known data mining toolkit Weka, using 'classification' errors to assess algorithmic performance. Results suggest that: (i) 'meta' classifiers perform better than the other groups, when applied to molecular dynamics data sets; (ii) Random Forest and Rotation Forest are the best classifiers for all three data sets; and (iii) classification via clustering yields the highest classification error. Our findings are consistent with bibliographic evidence, suggesting a 'roadmap' for dealing with such data.
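
    The evaluation loop described above (train many classifiers, score each by cross-validated classification error) can be sketched compactly. The example below uses scikit-learn as a stand-in for the Weka toolkit actually used in the study, and synthetic features standing in for the MD-derived descriptors; it illustrates the comparison strategy, not the paper's exact setup.

    ```python
    # A minimal sketch of a classifier comparison by cross-validated error, using
    # scikit-learn as a stand-in for Weka and synthetic features as a stand-in for
    # MD-derived descriptors.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # synthetic stand-in for conformational descriptors and class labels
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

    classifiers = {
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=10)            # 10-fold cross-validated accuracy
        print(f"{name}: classification error = {1.0 - scores.mean():.3f}")
    ```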

  9. Large-scale transcriptome analysis reveals arabidopsis metabolic pathways are frequently influenced by different pathogens.

    PubMed

    Jiang, Zhenhong; He, Fei; Zhang, Ziding

    2017-07-01

    Through large-scale transcriptional data analyses, we highlighted the importance of plant metabolism in plant immunity and identified 26 metabolic pathways that were frequently influenced by infection with 14 different pathogens. Reprogramming of plant metabolism is a common phenomenon in plant defense responses. Currently, a large number of transcriptional profiles of infected tissues in Arabidopsis (Arabidopsis thaliana) have been deposited in public databases, which provides a great opportunity to understand the expression patterns of metabolic pathways during plant defense responses at the systems level. Here, we performed a large-scale transcriptome analysis based on 135 previously published expression samples, covering 14 different pathogens, to explore the expression patterns of Arabidopsis metabolic pathways. Overall, metabolic genes are significantly changed in expression during plant defense responses. Upregulated metabolic genes are enriched in defense responses, and downregulated genes are enriched in photosynthesis and in fatty acid and lipid metabolic processes. Gene set enrichment analysis (GSEA) identified 26 frequently differentially expressed metabolic pathways (FreDE_Paths) that are differentially expressed in more than 60% of infected samples. These pathways are involved in the generation of energy, fatty acid and lipid metabolism, and secondary metabolite biosynthesis. Clustering analysis based on the expression levels of these 26 metabolic pathways clearly distinguishes infected and control samples, further suggesting the importance of these pathways in plant defense responses. By comparing with FreDE_Paths from abiotic stresses, we find that the expression patterns of the 26 FreDE_Paths from biotic stresses are more consistent across different infected samples. By investigating the expression correlation between transcription factors (TFs) and FreDE_Paths, we identify several notable relationships. Collectively, the current study will deepen our understanding of plant metabolism in plant immunity and provide new insights into disease-resistant crop improvement.
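
    The "frequently differentially expressed" criterion is essentially a per-pathway frequency count across samples. The sketch below illustrates that tally; the pathway list and the DE-call matrix are synthetic, with only the 135-sample count and the 60% threshold taken from the abstract.

```python
# Illustrative tally of frequently differentially expressed pathways.
# Pathway names and the boolean DE-call matrix are made up for the sketch.
import numpy as np

rng = np.random.default_rng(2)
pathways = [f"pathway_{i:03d}" for i in range(400)]
n_samples = 135                                   # infected expression samples

# Boolean matrix: was pathway p called differentially expressed in sample s?
de_prob = rng.uniform(0.2, 0.8, (len(pathways), 1))
de_calls = rng.random((len(pathways), n_samples)) < de_prob

frequency = de_calls.mean(axis=1)
frequent = [p for p, f in zip(pathways, frequency) if f > 0.60]
print(f"pathways differentially expressed in >60% of samples: {len(frequent)}")
```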

  10. Stability, resolution, and ultra-low wear amplitude modulation atomic force microscopy of DNA: Small amplitude small set-point imaging

    NASA Astrophysics Data System (ADS)

    Santos, Sergio; Barcons, Victor; Christenson, Hugo K.; Billingsley, Daniel J.; Bonass, William A.; Font, Josep; Thomson, Neil H.

    2013-08-01

    A way to operate fundamental mode amplitude modulation atomic force microscopy is introduced which optimizes stability and resolution for a given tip size and shows negligible tip wear over extended time periods (˜24 h). In small amplitude small set-point (SASS) imaging, the cantilever oscillates with sub-nanometer amplitudes in the proximity of the sample, without the requirement of using large drive forces, as the dynamics smoothly lead the tip to the surface through the water layer. SASS is demonstrated on single molecules of double-stranded DNA in ambient conditions where sharp silicon tips (R ˜ 2-5 nm) can resolve the right-handed double helix.

  11. The "DGPPN-Cohort": A national collaboration initiative by the German Association for Psychiatry and Psychotherapy (DGPPN) for establishing a large-scale cohort of psychiatric patients.

    PubMed

    Anderson-Schmidt, Heike; Adler, Lothar; Aly, Chadiga; Anghelescu, Ion-George; Bauer, Michael; Baumgärtner, Jessica; Becker, Joachim; Bianco, Roswitha; Becker, Thomas; Bitter, Cosima; Bönsch, Dominikus; Buckow, Karoline; Budde, Monika; Bührig, Martin; Deckert, Jürgen; Demiroglu, Sara Y; Dietrich, Detlef; Dümpelmann, Michael; Engelhardt, Uta; Fallgatter, Andreas J; Feldhaus, Daniel; Figge, Christian; Folkerts, Here; Franz, Michael; Gade, Katrin; Gaebel, Wolfgang; Grabe, Hans-Jörgen; Gruber, Oliver; Gullatz, Verena; Gusky, Linda; Heilbronner, Urs; Helbing, Krister; Hegerl, Ulrich; Heinz, Andreas; Hensch, Tilman; Hiemke, Christoph; Jäger, Markus; Jahn-Brodmann, Anke; Juckel, Georg; Kandulski, Franz; Kaschka, Wolfgang P; Kircher, Tilo; Koller, Manfred; Konrad, Carsten; Kornhuber, Johannes; Krause, Marina; Krug, Axel; Lee, Mahsa; Leweke, Markus; Lieb, Klaus; Mammes, Mechthild; Meyer-Lindenberg, Andreas; Mühlbacher, Moritz; Müller, Matthias J; Nieratschker, Vanessa; Nierste, Barbara; Ohle, Jacqueline; Pfennig, Andrea; Pieper, Marlenna; Quade, Matthias; Reich-Erkelenz, Daniela; Reif, Andreas; Reitt, Markus; Reininghaus, Bernd; Reininghaus, Eva Z; Riemenschneider, Matthias; Rienhoff, Otto; Roser, Patrik; Rujescu, Dan; Schennach, Rebecca; Scherk, Harald; Schmauss, Max; Schneider, Frank; Schosser, Alexandra; Schott, Björn H; Schwab, Sybille G; Schwanke, Jens; Skrowny, Daniela; Spitzer, Carsten; Stierl, Sebastian; Stöckel, Judith; Stübner, Susanne; Thiel, Andreas; Volz, Hans-Peter; von Hagen, Martin; Walter, Henrik; Witt, Stephanie H; Wobrock, Thomas; Zielasek, Jürgen; Zimmermann, Jörg; Zitzelsberger, Antje; Maier, Wolfgang; Falkai, Peter G; Rietschel, Marcella; Schulze, Thomas G

    2013-12-01

    The German Association for Psychiatry and Psychotherapy (DGPPN) has committed itself to establish a prospective national cohort of patients with major psychiatric disorders, the so-called DGPPN-Cohort. This project will enable the scientific exploitation of high-quality data and biomaterial from psychiatric patients for research. It will be set up using harmonised data sets and procedures for sample generation and guided by transparent rules for data access and data sharing regarding the central research database. While the main focus lies on biological research, it will be open to all kinds of scientific investigations, including epidemiological, clinical or health-service research.

  12. Multivariate statistical techniques for the evaluation of groundwater quality of Amaravathi River Basin: South India

    NASA Astrophysics Data System (ADS)

    Loganathan, K.; Ahamed, A. Jafar

    2017-12-01

    The study of groundwater in the Amaravathi River basin of Karur District resulted in a large geochemical data set. A total of 24 water samples were collected and analyzed for physico-chemical parameters; the abundance order of the major ions was Na+ > Ca2+ > Mg2+ > K+ for cations and Cl- > HCO3- > SO42- for anions. The correlation matrix shows that the basic ionic chemistry is influenced by Na+, Ca2+, Mg2+, and Cl-, and also suggests that the samples contain Na+-Cl-, Ca2+-Cl-, and mixed Ca2+-Mg2+-Cl- types of water. The associations of HCO3-, SO42-, and F- are weaker than those of the other parameters owing to the limited availability of the corresponding minerals. PCA extracted six components that account for 81% of the total variance of the data set and allowed the selected parameters to be grouped according to common features and the contribution of each group to the overall variation in water quality to be evaluated. Cluster analysis results show that groundwater quality does not vary extensively as a function of season, but reveals two main clusters.
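
    The multivariate part of this workflow (standardization, PCA, and cluster analysis) is sketched below with scikit-learn and SciPy on a made-up 24-sample matrix; the parameter count, the distributions, and the Ward/two-cluster choice are assumptions for illustration, not the study's settings.

```python
# Sketch of a PCA + hierarchical clustering workflow on synthetic data
# standing in for the 24 groundwater samples and their measured parameters.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
samples = rng.lognormal(mean=1.0, sigma=0.5, size=(24, 12))  # 24 samples x 12 parameters

z = StandardScaler().fit_transform(samples)

pca = PCA()
pca.fit(z)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.81) + 1)
print(f"components needed to explain ~81% of the variance: {n_components}")

clusters = fcluster(linkage(z, method="ward"), t=2, criterion="maxclust")
print(f"cluster sizes: {np.bincount(clusters)[1:]}")
```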

  13. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pyatina, Tatiana; Sugama, Toshifumi; Moon, Juhyuk

    An alkali-activated blend of calcium aluminate cement and class F fly ash is an attractive solution for geothermal wells, where cement is exposed to significant thermal shocks and aggressive environments. Set-control additives enable safe cement placement in a well but may compromise its mechanical properties. This work evaluates the effect of a tartaric-acid set retarder on the phase composition, microstructure, and strength development of a sodium-metasilicate-activated calcium aluminate/class F fly ash blend after curing at 85 °C, 200 °C, or 300 °C. The hardened materials were characterized with X-ray diffraction, thermogravimetric analysis, X-ray computed tomography, and combined scanning electron microscopy/energy-dispersive X-ray spectroscopy, and tested for mechanical strength. With increasing temperature, a larger number of phase transitions was found in the non-retarded specimens as a result of fast cement hydration. The differences in phase composition were also attributed to interactions of tartaric acid with metal ions released by the blend in the retarded samples. The retarded samples showed higher total porosity but a reduced percentage of large pores (above 500 µm) and greater compressive strength after curing at 300 °C. Overall, the mechanical properties of the set cements were not compromised by the retarder.

  14. True Randomness from Big Data.

    PubMed

    Papakonstantinou, Periklis A; Woodruff, David P; Yang, Guang

    2016-09-26

    Generating random bits is a difficult task, which is important for physical systems simulation, cryptography, and many applications that rely on high-quality random bits. Our contribution is to show how to generate provably random bits from uncertain events whose outcomes are routinely recorded in the form of massive data sets. These include scientific data sets, such as in astronomics, genomics, as well as data produced by individuals, such as internet search logs, sensor networks, and social network feeds. We view the generation of such data as the sampling process from a big source, which is a random variable of size at least a few gigabytes. Our view initiates the study of big sources in the randomness extraction literature. Previous approaches for big sources rely on statistical assumptions about the samples. We introduce a general method that provably extracts almost-uniform random bits from big sources and extensively validate it empirically on real data sets. The experimental findings indicate that our method is efficient enough to handle large enough sources, while previous extractor constructions are not efficient enough to be practical. Quality-wise, our method at least matches quantum randomness expanders and classical world empirical extractors as measured by standardized tests.
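
    For orientation only, a seeded extractor in its simplest textbook form multiplies the raw input bits by a random GF(2) matrix. This is not the authors' construction and makes no attempt at their provable guarantees or efficiency claims; it only shows the input/output shape of an extraction step. The block size and output length are arbitrary.

```python
# Generic seeded-extractor sketch: a random linear map over GF(2) applied
# to a block of raw bytes. Illustrative only; not the paper's method.
import numpy as np

rng = np.random.default_rng(4)

def extract_bits(raw_bytes: bytes, n_out: int, seed_matrix: np.ndarray) -> np.ndarray:
    """Map raw input bits to n_out output bits via a random GF(2) matrix."""
    bits = np.unpackbits(np.frombuffer(raw_bytes, dtype=np.uint8)).astype(np.int64)
    return (seed_matrix[:n_out, : bits.size] @ bits) % 2

raw = rng.bytes(4096)                              # stand-in "big source" block
seed = rng.integers(0, 2, size=(128, 8 * 4096))    # public random seed
out = extract_bits(raw, 128, seed)
print("extracted bits:", "".join(map(str, out[:32])), "...")
```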

  15. ForceGen 3D structure and conformer generation: from small lead-like molecules to macrocyclic drugs

    NASA Astrophysics Data System (ADS)

    Cleves, Ann E.; Jain, Ajay N.

    2017-05-01

    We introduce the ForceGen method for 3D structure generation and conformer elaboration of drug-like small molecules. ForceGen is novel, avoiding use of distance geometry, molecular templates, or simulation-oriented stochastic sampling. The method is primarily driven by the molecular force field, implemented using an extension of MMFF94s and a partial charge estimator based on electronegativity-equalization. The force field is coupled to algorithms for direct sampling of realistic physical movements made by small molecules. Results are presented on a standard benchmark from the Cambridge Crystallographic Database of 480 drug-like small molecules, including full structure generation from SMILES strings. Reproduction of protein-bound crystallographic ligand poses is demonstrated on four carefully curated data sets: the ConfGen Set (667 ligands), the PINC cross-docking benchmark (1062 ligands), a large set of macrocyclic ligands (182 total with typical ring sizes of 12-23 atoms), and a commonly used benchmark for evaluating macrocycle conformer generation (30 ligands total). Results compare favorably to alternative methods, and performance on macrocyclic compounds approaches that observed on non-macrocycles while yielding a roughly 100-fold speed improvement over alternative MD-based methods with comparable performance.
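
    Reproduction of crystallographic poses is conventionally scored by heavy-atom RMSD after optimal superposition. The sketch below implements that standard Kabsch-based metric on made-up coordinates; it is an evaluation convention commonly used for such benchmarks, not part of the ForceGen algorithm itself.

```python
# Heavy-atom RMSD after optimal (Kabsch) superposition, on fake coordinates.
import numpy as np

def kabsch_rmsd(p: np.ndarray, q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rotation."""
    p_c, q_c = p - p.mean(axis=0), q - q.mean(axis=0)
    u, s, vt = np.linalg.svd(p_c.T @ q_c)
    d = np.sign(np.linalg.det(u @ vt))        # fix reflection if needed
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return float(np.sqrt(np.mean(np.sum((p_c @ rot - q_c) ** 2, axis=1))))

rng = np.random.default_rng(5)
crystal = rng.normal(size=(40, 3))                        # hypothetical ligand
generated = crystal + rng.normal(scale=0.3, size=(40, 3)) # perturbed conformer
print(f"heavy-atom RMSD: {kabsch_rmsd(generated, crystal):.2f} Å")
```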

  16. True Randomness from Big Data

    NASA Astrophysics Data System (ADS)

    Papakonstantinou, Periklis A.; Woodruff, David P.; Yang, Guang

    2016-09-01

    Generating random bits is a difficult task, which is important for physical systems simulation, cryptography, and many applications that rely on high-quality random bits. Our contribution is to show how to generate provably random bits from uncertain events whose outcomes are routinely recorded in the form of massive data sets. These include scientific data sets, such as in astronomics, genomics, as well as data produced by individuals, such as internet search logs, sensor networks, and social network feeds. We view the generation of such data as the sampling process from a big source, which is a random variable of size at least a few gigabytes. Our view initiates the study of big sources in the randomness extraction literature. Previous approaches for big sources rely on statistical assumptions about the samples. We introduce a general method that provably extracts almost-uniform random bits from big sources and extensively validate it empirically on real data sets. The experimental findings indicate that our method is efficient enough to handle large enough sources, while previous extractor constructions are not efficient enough to be practical. Quality-wise, our method at least matches quantum randomness expanders and classical world empirical extractors as measured by standardized tests.

  17. A field-to-desktop toolchain for X-ray CT densitometry enables tree ring analysis

    PubMed Central

    De Mil, Tom; Vannoppen, Astrid; Beeckman, Hans; Van Acker, Joris; Van den Bulcke, Jan

    2016-01-01

    Background and Aims: Disentangling tree growth requires more than ring width data only. Densitometry is considered a valuable proxy, yet laborious wood sample preparation and a lack of dedicated software limit the widespread use of density profiling for tree ring analysis. An X-ray computed tomography-based toolchain for tree increment cores is presented, which results in profile data sets suitable for visual exploration as well as density-based pattern matching. Methods: Two temperate species (Quercus petraea, Fagus sylvatica) and one tropical species (Terminalia superba) were used for density profiling in an X-ray computed tomography facility with custom-made sample holders and dedicated processing software. Key Results: Density-based pattern matching is developed and able to detect anomalies in ring series that can be corrected via interactive software. Conclusions: A digital workflow allows generation of structure-corrected profiles of large sets of cores in a short time span that provide sufficient intra-annual density information for tree ring analysis. Furthermore, visual exploration of such data sets is of high value. The dated profiles can be used for high-resolution chronologies and also offer opportunities for fast screening of lesser-studied tropical tree species. PMID:27107414
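
    Density-based pattern matching of a measured profile against a reference series is typically done by sliding the segment along the reference and scoring the overlap, for example with a normalized cross-correlation. The sketch below shows that idea on synthetic profiles; it is not the toolchain's actual matching code, and the profile shapes and lengths are invented.

```python
# Sliding normalized cross-correlation between a density-profile segment
# and a reference series, on synthetic sine-plus-noise stand-in data.
import numpy as np

rng = np.random.default_rng(6)
reference = np.sin(np.linspace(0, 40 * np.pi, 4000)) + 0.1 * rng.standard_normal(4000)
segment = reference[1230:1830] + 0.1 * rng.standard_normal(600)

def best_offset(segment: np.ndarray, reference: np.ndarray) -> int:
    """Offset in the reference where the segment correlates best."""
    n = segment.size
    seg = (segment - segment.mean()) / segment.std()
    scores = []
    for start in range(reference.size - n + 1):
        win = reference[start:start + n]
        win = (win - win.mean()) / win.std()
        scores.append(float(np.dot(seg, win)) / n)
    return int(np.argmax(scores))

print("recovered offset:", best_offset(segment, reference))   # expect ~1230
```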

  18. The Mira-Titan Universe. II. Matter Power Spectrum Emulation

    DOE PAGES

    Lawrence, Earl; Heitmann, Katrin; Kwan, Juliana; ...

    2017-09-20

    We introduce a new cosmic emulator for the matter power spectrum covering eight cosmological parameters. Targeted at optical surveys, the emulator provides accurate predictions out to a wavenumber k ~ 5 Mpc^-1 and redshift z ≤ 2. Besides covering the standard set of ΛCDM parameters, massive neutrinos and a dynamical dark energy equation of state are included. The emulator is built on a sample set of 36 cosmological models, carefully chosen to provide accurate predictions over the wide and large parameter space. For each model, we have performed a high-resolution simulation, augmented with sixteen medium-resolution simulations and TimeRG perturbation theory results to provide accurate coverage of a wide k-range; the data set generated as part of this project is more than 1.2 Pbyte. With the current set of simulated models, we achieve an accuracy of approximately 4%. Because the sampling approach used here has established convergence and error-control properties, follow-on results with more than a hundred cosmological models will soon achieve ~1% accuracy. We compare our approach with other prediction schemes that are based on halo model ideas and remapping approaches. The new emulator code is publicly available.
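
    Emulators built on a modest design of expensive simulations are commonly realized as Gaussian-process regression over the cosmological parameters; whether or not that matches the exact Mira-Titan scheme, the sketch below shows the basic pattern. The design, parameters, and scalar target are synthetic stand-ins, not the project's simulation outputs.

```python
# Hedged sketch of simulation emulation via Gaussian-process regression
# over a small design of (normalized) cosmological parameters.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(7)
n_models, n_params = 36, 8
design = rng.uniform(0.0, 1.0, size=(n_models, n_params))   # normalized cosmologies

# Stand-in target: a smooth scalar function of the parameters
# (think "P(k) at one wavenumber and redshift").
target = np.sin(2 * np.pi * design[:, 0]) + design[:, 1] ** 2 + 0.1 * design[:, 2]

gp = GaussianProcessRegressor(
    kernel=ConstantKernel() * RBF(length_scale=np.ones(n_params)),
    normalize_y=True,
)
gp.fit(design, target)

test_point = rng.uniform(0.0, 1.0, size=(1, n_params))
mean, std = gp.predict(test_point, return_std=True)
print(f"emulated value: {mean[0]:.3f} +/- {std[0]:.3f}")
```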

  19. True Randomness from Big Data

    PubMed Central

    Papakonstantinou, Periklis A.; Woodruff, David P.; Yang, Guang

    2016-01-01

    Generating random bits is a difficult task, which is important for physical systems simulation, cryptography, and many applications that rely on high-quality random bits. Our contribution is to show how to generate provably random bits from uncertain events whose outcomes are routinely recorded in the form of massive data sets. These include scientific data sets, such as in astronomics, genomics, as well as data produced by individuals, such as internet search logs, sensor networks, and social network feeds. We view the generation of such data as the sampling process from a big source, which is a random variable of size at least a few gigabytes. Our view initiates the study of big sources in the randomness extraction literature. Previous approaches for big sources rely on statistical assumptions about the samples. We introduce a general method that provably extracts almost-uniform random bits from big sources and extensively validate it empirically on real data sets. The experimental findings indicate that our method is efficient enough to handle large enough sources, while previous extractor constructions are not efficient enough to be practical. Quality-wise, our method at least matches quantum randomness expanders and classical world empirical extractors as measured by standardized tests. PMID:27666514

  20. The Mira-Titan Universe. II. Matter Power Spectrum Emulation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lawrence, Earl; Heitmann, Katrin; Kwan, Juliana

    We introduce a new cosmic emulator for the matter power spectrum covering eight cosmological parameters. Targeted at optical surveys, the emulator provides accurate predictions out to a wavenumber k ~ 5 Mpc^-1 and redshift z ≤ 2. Besides covering the standard set of ΛCDM parameters, massive neutrinos and a dynamical dark energy equation of state are included. The emulator is built on a sample set of 36 cosmological models, carefully chosen to provide accurate predictions over the wide and large parameter space. For each model, we have performed a high-resolution simulation, augmented with sixteen medium-resolution simulations and TimeRG perturbation theory results to provide accurate coverage of a wide k-range; the data set generated as part of this project is more than 1.2 Pbyte. With the current set of simulated models, we achieve an accuracy of approximately 4%. Because the sampling approach used here has established convergence and error-control properties, follow-on results with more than a hundred cosmological models will soon achieve ~1% accuracy. We compare our approach with other prediction schemes that are based on halo model ideas and remapping approaches. The new emulator code is publicly available.
