Statistical Reference Datasets
National Institute of Standards and Technology Data Gateway
Statistical Reference Datasets (Web, free access). The Statistical Reference Datasets project is also supported by the Standard Reference Data Program. Its purpose is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable objective evaluation of statistical software.
Accuracy of Digital vs. Conventional Implant Impressions
Lee, Sang J.; Betensky, Rebecca A.; Gianneschi, Grace E.; Gallucci, German O.
2015-01-01
The accuracy of digital impressions greatly influences the clinical viability of implant restorations. The aim of this study is to compare the accuracy of gypsum models acquired from conventional implant impressions to digitally milled models created by direct digitalization, using three-dimensional analysis. Thirty gypsum and 30 digitally milled models impressed directly from a reference model were prepared. The models were scanned by a laboratory scanner, and 30 STL datasets from each group were imported into an inspection software. The datasets were aligned to the reference dataset by a repeated best-fit algorithm, and 10 specified contact locations of interest were measured as mean volumetric deviations. The areas were pooled by cusps, fossae, interproximal contacts, and horizontal and vertical axes of implant position and angulation. The pooled areas were statistically analysed by comparing each group to the reference model, using the mean volumetric deviations as a measure of accuracy and the standard deviations as a measure of precision. Milled models from digital impressions had accuracy comparable to gypsum models from conventional impressions. However, the differences in fossae and in vertical displacement of the implant position between the gypsum and digitally milled models and the reference model were statistically significant (p<0.001 and p=0.020, respectively). PMID:24720423
NASA Astrophysics Data System (ADS)
Pernot, Pascal; Savin, Andreas
2018-06-01
Benchmarking studies in computational chemistry use reference datasets to assess the accuracy of a method through error statistics. The commonly used error statistics, such as the mean signed and mean unsigned errors, do not inform end-users about the expected amplitude of prediction errors attached to these methods. We show that, because the distributions of model errors are neither normal nor zero-centered, these error statistics cannot be used to infer prediction error probabilities. To overcome this limitation, we advocate the use of more informative statistics, based on the empirical cumulative distribution function of unsigned errors, namely, (1) the probability for a new calculation to have an absolute error below a chosen threshold and (2) the maximal amplitude of errors one can expect with a chosen high confidence level. These statistics are also shown to be well suited for benchmarking and ranking studies. Moreover, the standard error of all benchmarking statistics depends on the size of the reference dataset; systematic publication of these standard errors would be very helpful for assessing the statistical reliability of benchmarking conclusions.
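The two cumulative-distribution-based statistics advocated above can be computed directly from a vector of model errors. The following is a minimal sketch, not the authors' code; the threshold, confidence level, and bootstrap settings are illustrative assumptions.

```python
# Illustrative sketch: empirical error statistics from a vector of signed
# model errors, using the empirical CDF of unsigned errors.
import numpy as np

def empirical_error_stats(errors, threshold=1.0, confidence=0.95):
    """Return P(|error| < threshold), the |error| amplitude not exceeded with
    the given confidence, and a bootstrap standard error of the first statistic."""
    abs_err = np.abs(np.asarray(errors, dtype=float))
    p_below = np.mean(abs_err < threshold)        # statistic (1)
    q_conf = np.quantile(abs_err, confidence)     # statistic (2)
    # Bootstrap standard error, which depends on the reference dataset size.
    rng = np.random.default_rng(0)
    boots = [np.mean(np.abs(rng.choice(abs_err, abs_err.size)) < threshold)
             for _ in range(1000)]
    return p_below, q_conf, np.std(boots)

if __name__ == "__main__":
    errs = np.random.default_rng(1).normal(0.2, 0.8, 200)  # synthetic errors
    print(empirical_error_stats(errs, threshold=1.0))
```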
Improved Statistical Method For Hydrographic Climatic Records Quality Control
NASA Astrophysics Data System (ADS)
Gourrion, J.; Szekely, T.
2016-02-01
Climate research benefits from the continuous development of global in-situ hydrographic networks over the last decades. Apart from the increasing volume of observations available on a large range of temporal and spatial scales, a critical aspect concerns the ability to constantly improve the quality of the datasets. In the context of the Coriolis Dataset for ReAnalysis (CORA) version 4.2, a new quality control method based on a local comparison to historical extreme values ever observed is developed, implemented and validated. Temperature, salinity and potential density validity intervals are directly estimated from minimum and maximum values from an historical reference dataset, rather than from traditional mean and standard deviation estimates. Such an approach avoids strong statistical assumptions on the data distributions such as unimodality, absence of skewness and spatially homogeneous kurtosis. As a new feature, it also allows simultaneously addressing the two main objectives of a quality control strategy, i.e. maximizing the number of good detections while minimizing the number of false alarms. The reference dataset is presently built from the fusion of 1) all ARGO profiles up to early 2014, 2) 3 historical CTD datasets and 3) the Sea Mammals CTD profiles from the MEOP database. All datasets are extensively and manually quality controlled. In this communication, the latest method validation results are also presented. The method has been implemented in the latest version of the CORA dataset and will benefit the next version of the Copernicus CMEMS dataset.
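A minimal sketch of the kind of check described above, contrasting an extreme-value validity interval with a traditional mean ± k·sigma test; the thresholds and variable names are illustrative assumptions.

```python
# Sketch: flag an observation against a historical [min, max] validity interval
# versus a traditional Gaussian k-sigma criterion.
import numpy as np

def qc_extremes(value, hist_min, hist_max, margin=0.0):
    """Suspect if the value falls outside the historical extreme-value interval,
    optionally widened by a margin."""
    return not (hist_min - margin <= value <= hist_max + margin)

def qc_gaussian(value, hist_mean, hist_std, k=3.0):
    """Traditional check assuming near-normal data: suspect beyond k sigma."""
    return abs(value - hist_mean) > k * hist_std

# Example: a salinity value tested against a hypothetical local climatology.
print(qc_extremes(38.9, hist_min=34.2, hist_max=38.5))   # True -> suspect
print(qc_gaussian(38.9, hist_mean=36.0, hist_std=1.2))   # False -> passes
```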
Reference datasets for 2-treatment, 2-sequence, 2-period bioequivalence studies.
Schütz, Helmut; Labes, Detlew; Fuglsang, Anders
2014-11-01
It is difficult to validate statistical software used to assess bioequivalence since very few datasets with known results are in the public domain, and the few that are published are of moderate size and balanced. The purpose of this paper is therefore to introduce reference datasets of varying complexity in terms of dataset size and characteristics (balance, range, outlier presence, residual error distribution) for 2-treatment, 2-period, 2-sequence bioequivalence studies and to report their point estimates and 90% confidence intervals which companies can use to validate their installations. The results for these datasets were calculated using the commercial packages EquivTest, Kinetica, SAS and WinNonlin, and the non-commercial package R. The results of three of these packages mostly agree, but imbalance between sequences seems to provoke questionable results with one package, which illustrates well the need for proper software validation.
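For illustration only (this is not one of the packages evaluated above), a simplified 90% confidence interval for the test/reference ratio in a 2x2x2 crossover can be computed from per-subject log differences; this shortcut ignores period and sequence effects, so it matches a full crossover ANOVA only for balanced data without such effects.

```python
# Hedged sketch: simplified point estimate and 90% CI for a balanced
# 2x2x2 bioequivalence study, using paired log differences.
import numpy as np
from scipy import stats

def simple_2x2_ci(test_auc, ref_auc, alpha=0.10):
    d = np.log(np.asarray(test_auc)) - np.log(np.asarray(ref_auc))
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    point = np.exp(d.mean())
    lo, hi = np.exp(d.mean() - t_crit * se), np.exp(d.mean() + t_crit * se)
    return point * 100, lo * 100, hi * 100   # expressed in percent

# Hypothetical AUC values for 12 subjects.
rng = np.random.default_rng(2)
ref = rng.lognormal(3.0, 0.25, 12)
test = ref * rng.lognormal(0.02, 0.10, 12)
print(simple_2x2_ci(test, ref))
```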
Improved statistical method for temperature and salinity quality control
NASA Astrophysics Data System (ADS)
Gourrion, Jérôme; Szekely, Tanguy
2017-04-01
Climate research and ocean monitoring benefit from the continuous development of global in-situ hydrographic networks over the last decades. Apart from the increasing volume of observations available on a large range of temporal and spatial scales, a critical aspect concerns the ability to constantly improve the quality of the datasets. In the context of the Coriolis Dataset for ReAnalysis (CORA) version 4.2, a new quality control method based on a local comparison to historical extreme values ever observed is developed, implemented and validated. Temperature, salinity and potential density validity intervals are directly estimated from minimum and maximum values from an historical reference dataset, rather than from traditional mean and standard deviation estimates. Such an approach avoids strong statistical assumptions on the data distributions such as unimodality, absence of skewness and spatially homogeneous kurtosis. As a new feature, it also allows simultaneously addressing the two main objectives of an automatic quality control strategy, i.e. maximizing the number of good detections while minimizing the number of false alarms. The reference dataset is presently built from the fusion of 1) all ARGO profiles up to late 2015, 2) 3 historical CTD datasets and 3) the Sea Mammals CTD profiles from the MEOP database. All datasets are extensively and manually quality controlled. In this communication, the latest method validation results are also presented. The method has already been implemented in the latest version of the delayed-time CMEMS in-situ dataset and will be deployed soon in the equivalent near-real time products.
Integrative missing value estimation for microarray data.
Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine
2006-10-12
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking the reference datasets into consideration. To determine whether the given reference datasets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Squares (LLS) imputation algorithm by up to 15% in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
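As a rough illustration of neighbor-based imputation aided by an external reference, the sketch below averages a gene's k most-correlated neighbors; this is not the iMISS algorithm, and the synthetic data, gene indices and simple averaging step are assumptions.

```python
# Sketch: impute a gene's missing value from its k most-correlated neighbor
# genes, where the neighbor list is ranked on a reference expression dataset.
import numpy as np

def knn_impute_gene(target, gene_idx, sample_idx, neighbor_source, k=5):
    """target: genes x samples matrix with NaNs; neighbor_source: matrix used
    to rank neighbor genes (e.g. a larger reference expression dataset)."""
    corr = np.array([
        np.corrcoef(neighbor_source[gene_idx], neighbor_source[g])[0, 1]
        if g != gene_idx else -np.inf
        for g in range(neighbor_source.shape[0])
    ])
    neighbors = np.argsort(corr)[::-1][:k]
    values = target[neighbors, sample_idx]
    return np.nanmean(values)  # simple average; LLS would fit a local model

rng = np.random.default_rng(3)
ref = rng.normal(size=(50, 40))               # hypothetical reference dataset
tgt = ref[:, :6] + rng.normal(0, .3, (50, 6))  # small target dataset
tgt[10, 2] = np.nan
print(knn_impute_gene(tgt, 10, 2, ref))
```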
Dental age assessment of southern Chinese using the United Kingdom Caucasian reference dataset.
Jayaraman, Jayakumar; Roberts, Graham J; King, Nigel M; Wong, Hai Ming
2012-03-10
Dental age assessment is one of the most accurate methods for estimating the age of an unknown person. Demirjian's dataset on a French-Canadian population has been widely tested for its applicability to various ethnic groups, including southern Chinese. Following inaccurate results from these studies, investigators are now confronted with using alternative datasets for comparison. Testing the applicability of other reliable datasets that yield accurate findings might limit the need to develop population-specific standards. Recently, a Reference Data Set (RDS) similar to Demirjian's was prepared in the United Kingdom (UK) and has subsequently been validated. The advantages of the UK Caucasian RDS include its versatility from including both the maxillary and mandibular dentitions, the involvement of a wide age range of subjects for evaluation, and the possibility of precise age estimation with the mathematical technique of meta-analysis. The aim of this study was to evaluate the applicability of the United Kingdom Caucasian RDS to southern Chinese subjects. Dental panoramic tomographs (DPT) of 266 subjects (133 males and 133 females) aged 2-21 years that had previously been taken for clinical diagnostic purposes were selected and scored by a single calibrated examiner based on Demirjian's classification of tooth developmental stages (A-H). The ages corresponding to each tooth developmental stage were obtained from the UK dataset. Intra-examiner reproducibility was tested and the Cohen kappa (0.88) showed that the level of agreement was 'almost perfect'. The estimated dental age was then compared with the chronological age using a paired t-test, with statistical significance set at p<0.01. The results showed that the UK dataset underestimated the age of southern Chinese subjects by 0.24 years, but the results were not statistically significant. In conclusion, the UK Caucasian RDS may not be suitable for estimating the age of southern Chinese subjects, and there is a need for an ethnic-specific reference dataset for southern Chinese. Copyright © 2011. Published by Elsevier Ireland Ltd.
Using high-resolution variant frequencies to empower clinical genome interpretation.
Whiffin, Nicola; Minikel, Eric; Walsh, Roddy; O'Donnell-Luria, Anne H; Karczewski, Konrad; Ing, Alexander Y; Barton, Paul J R; Funke, Birgit; Cook, Stuart A; MacArthur, Daniel; Ware, James S
2017-10-01
Purpose: Whole-exome and whole-genome sequencing have transformed the discovery of genetic variants that cause human Mendelian disease, but discriminating pathogenic from benign variants remains a daunting challenge. Rarity is recognized as a necessary, although not sufficient, criterion for pathogenicity, but frequency cutoffs used in Mendelian analysis are often arbitrary and overly lenient. Recent very large reference datasets, such as the Exome Aggregation Consortium (ExAC), provide an unprecedented opportunity to obtain robust frequency estimates even for very rare variants. Methods: We present a statistical framework for the frequency-based filtering of candidate disease-causing variants, accounting for disease prevalence, genetic and allelic heterogeneity, inheritance mode, penetrance, and sampling variance in reference datasets. Results: Using the example of cardiomyopathy, we show that our approach reduces by two-thirds the number of candidate variants under consideration in the average exome, without removing true pathogenic variants (false-positive rate < 0.001). Conclusion: We outline a statistically robust framework for assessing whether a variant is "too common" to be causative for a Mendelian disorder of interest. We present precomputed allele frequency cutoffs for all variants in the ExAC dataset.
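A hedged sketch of a frequency-based filter in the spirit of that framework follows; the exact factors and parameter values in the published method may differ, and the numbers below are hypothetical, assuming a dominant disorder with heterozygous pathogenic variants.

```python
# Hedged sketch: maximum credible population allele frequency and the highest
# allele count in a reference dataset still consistent with it, allowing for
# sampling variance via a Poisson quantile.
from scipy.stats import poisson

def max_tolerated_allele_count(prevalence, max_allelic_contribution,
                               penetrance, ref_allele_number,
                               confidence=0.95):
    # One plausible form of the maximum credible population allele frequency
    # for a single causal heterozygous variant (factors are assumptions).
    max_af = prevalence * max_allelic_contribution / (2 * penetrance)
    # Highest allele count consistent with max_af at the chosen confidence.
    return poisson.ppf(confidence, max_af * ref_allele_number)

# Hypothetical cardiomyopathy-like numbers: prevalence 1/500, no variant
# accounts for >2% of cases, penetrance >= 50%, ExAC-scale AN ~120,000.
print(max_tolerated_allele_count(1/500, 0.02, 0.5, 120_000))
```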
Comparing distinct ground-based lightning location networks covering the Netherlands
NASA Astrophysics Data System (ADS)
de Vos, Lotte; Leijnse, Hidde; Schmeits, Maurice; Beekhuis, Hans; Poelman, Dieter; Evers, Läslo; Smets, Pieter
2015-04-01
Lightning can be detected using a ground-based sensor network. The Royal Netherlands Meteorological Institute (KNMI) monitors lightning activity in the Netherlands with the so-called FLITS system, a network combining SAFIR-type sensors that makes use of Very High Frequency (VHF) as well as Low Frequency (LF) sensors. KNMI has recently decided to replace FLITS by data from a sub-continental network operated by Météorage which makes use of LF sensors only (KNMI Lightning Detection Network, or KLDN). KLDN is compared to the FLITS system, as well as to the Met Office's long-range Arrival Time Difference network (ATDnet), which operates at Very Low Frequency (VLF). Special focus lies on the ability to detect Cloud-to-Ground (CG) and Cloud-to-Cloud (CC) lightning in the Netherlands. The relative detection efficiency of individual flashes, and of lightning activity in a more general sense, is calculated over a period of almost 5 years. Additionally, the detection efficiency of each system is compared to a ground truth constructed from flashes that are detected by both of the other datasets. Finally, infrasound data are used as a fourth lightning data source for several case studies. Relative performance is found to vary strongly with location and time. As expected, FLITS detects significantly more CC lightning (because of the strong aptitude of VHF antennas for detecting CC), whereas KLDN and ATDnet detect more CG lightning. We analyze statistics computed over the entire 5-year period, looking at CG as well as total lightning (CC and CG combined). The statistics considered are the Probability of Detection (POD) and the so-called Lightning Activity Detection (LAD). POD is defined as the percentage of reference flashes that the system detects. LAD is defined as the fraction of predefined area boxes and time periods in which the system records one or more flashes, given that the reference records at least one flash there. The reference for these statistics is taken to be either another dataset, or a dataset consisting of flashes detected by both other datasets. Evaluation of extreme thunderstorm cases shows that the weather alert criterion for severe thunderstorms is reached by FLITS when it is not reached by KLDN and ATDnet, suggesting the need for KNMI to modify that weather alert criterion when using KLDN.
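A minimal sketch of the two scores as defined above, for detections aggregated onto a common grid of area boxes and time intervals; the grid, flash counts and miss rate are synthetic, and the per-box matching in pod() is a simplification.

```python
# Sketch: Probability of Detection (POD) and Lightning Activity Detection (LAD)
# computed from per-box flash counts for a system and a reference.
import numpy as np

def pod(system_counts, reference_counts):
    """Fraction of reference flashes also seen by the system, capped per box
    so extra system flashes do not inflate the score."""
    matched = np.minimum(system_counts, reference_counts).sum()
    return matched / reference_counts.sum()

def lad(system_counts, reference_counts):
    """Among boxes where the reference records at least one flash, the
    fraction where the system records at least one flash too."""
    active = reference_counts > 0
    return np.mean(system_counts[active] > 0)

rng = np.random.default_rng(4)
ref = rng.poisson(0.5, size=(100, 100))
sys_ = np.where(rng.random((100, 100)) < 0.8, ref, 0)  # system misses ~20%
print(pod(sys_, ref), lad(sys_, ref))
```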
Common pitfalls in statistical analysis: The perils of multiple testing
Ranganathan, Priya; Pramesh, C. S.; Buyse, Marc
2016-01-01
Multiple testing refers to situations where a dataset is subjected to statistical testing multiple times - either at multiple time points, in multiple subgroups, or for multiple end-points. This inflates the probability of a false-positive finding. In this article, we look at the consequences of multiple testing and explore various methods to deal with this issue. PMID:27141478
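A small worked example of the problem and of the simplest remedy, a Bonferroni correction; the numbers are illustrative and assume independent tests.

```python
# With m independent tests at level alpha, the family-wise error rate grows as
# 1 - (1 - alpha)**m; a Bonferroni correction tests each hypothesis at alpha/m.
m, alpha = 20, 0.05
fwer_uncorrected = 1 - (1 - alpha) ** m
fwer_bonferroni = 1 - (1 - alpha / m) ** m
print(f"Chance of >=1 false positive over {m} tests: {fwer_uncorrected:.2f}")
print(f"After Bonferroni (alpha/m per test):          {fwer_bonferroni:.3f}")
```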
a Critical Review of Automated Photogrammetric Processing of Large Datasets
NASA Astrophysics Data System (ADS)
Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.
2017-08-01
The paper reports some comparisons between commercial software packages able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of images of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a different number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of (photogrammetric) terminology is also provided, in order to establish rigorous terms of reference for comparisons and critical analyses.
NASA Astrophysics Data System (ADS)
Luchko, Tyler; Blinov, Nikolay; Limon, Garrett C.; Joyce, Kevin P.; Kovalenko, Andriy
2016-11-01
Implicit solvent methods for classical molecular modeling are frequently used to provide fast, physics-based hydration free energies of macromolecules. Less commonly considered is the transferability of these methods to other solvents. The Statistical Assessment of Modeling of Proteins and Ligands 5 (SAMPL5) distribution coefficient dataset and the accompanying explicit solvent partition coefficient reference calculations provide a direct test of solvent model transferability. Here we use the 3D reference interaction site model (3D-RISM) statistical-mechanical solvation theory, with a well tested water model and a new united atom cyclohexane model, to calculate partition coefficients for the SAMPL5 dataset. The cyclohexane model performed well in training and testing (R=0.98 for amino acid neutral side chain analogues) but only if a parameterized solvation free energy correction was used. In contrast, the same protocol, using single solute conformations, performed poorly on the SAMPL5 dataset, obtaining R=0.73 compared to the reference partition coefficients, likely due to the much larger solute sizes. Including solute conformational sampling through molecular dynamics coupled with 3D-RISM (MD/3D-RISM) improved agreement with the reference calculation to R=0.93. Since our initial calculations only considered partition coefficients and not distribution coefficients, solute sampling provided little benefit comparing against experiment, where ionized and tautomer states are more important. Applying a simple pKa correction improved agreement with experiment from R=0.54 to R=0.66, despite a small number of outliers. Better agreement is possible by accounting for tautomers and improving the ionization correction.
NASA Astrophysics Data System (ADS)
Abul Ehsan Bhuiyan, Md; Nikolopoulos, Efthymios I.; Anagnostou, Emmanouil N.; Quintana-Seguí, Pere; Barella-Ortiz, Anaïs
2018-02-01
This study investigates the use of a nonparametric, tree-based model, quantile regression forests (QRF), for combining multiple global precipitation datasets and characterizing the uncertainty of the combined product. We used the Iberian Peninsula as the study area, with a study period spanning 11 years (2000-2010). Inputs to the QRF model included three satellite precipitation products, CMORPH, PERSIANN, and 3B42 (V7); an atmospheric reanalysis precipitation and air temperature dataset; satellite-derived near-surface daily soil moisture data; and a terrain elevation dataset. We calibrated the QRF model for two seasons and two terrain elevation categories and used it to generate ensembles for these conditions. Evaluation of the combined product was based on a high-resolution, ground-reference precipitation dataset (SAFRAN) available at 5 km, 1 h resolution. Furthermore, to evaluate relative improvements and the overall impact of the combined product on hydrological response, we used the generated ensemble to force a distributed hydrological model (the SURFEX land surface model and the RAPID river routing scheme) and compared its streamflow simulation results with the corresponding simulations from the individual global precipitation and reference datasets. We concluded that the proposed technique can generate realizations that successfully encapsulate the reference precipitation and provide significant improvement in streamflow simulations, with reductions in systematic and random error on the order of 20-99% and 44-88%, respectively, when considering the ensemble mean.
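A hedged sketch of the dataset-combination idea follows. The study used quantile regression forests; as a stand-in, the example uses scikit-learn's gradient boosting with a quantile loss, which likewise predicts conditional quantiles, and the predictors and response are synthetic.

```python
# Sketch: combine several precipitation predictors into quantile estimates
# (5th, 50th, 95th percentiles) and check the coverage of the resulting band.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
n = 2000
# Hypothetical predictors: three satellite products, reanalysis P and T,
# soil moisture, elevation.
X = rng.gamma(2.0, 2.0, size=(n, 7))
y = 0.4 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.gamma(1.0, 1.0, n)

quantiles = {}
for q in (0.05, 0.50, 0.95):
    model = GradientBoostingRegressor(loss="quantile", alpha=q,
                                      n_estimators=200, max_depth=3)
    quantiles[q] = model.fit(X[:1500], y[:1500]).predict(X[1500:])

coverage = np.mean((y[1500:] >= quantiles[0.05]) & (y[1500:] <= quantiles[0.95]))
print(f"Empirical coverage of the 5-95% band: {coverage:.2f}")
```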
NASA Astrophysics Data System (ADS)
Schumacher, R.; Schimpf, H.; Schiller, J.
2011-06-01
The most challenging problem of Automatic Target Recognition (ATR) is the extraction of robust and independent target features which describe the target unambiguously. These features have to be robust and invariant in different senses: in time, between aspect views (azimuth and elevation angle), between target motions (translation and rotation) and between different target variants. Especially for ground moving targets in military applications an irregular target motion is typical, so that a strong variation of the backscattered radar signal with azimuth and elevation angle makes the extraction of stable and robust features most difficult. For ATR based on High Range Resolution (HRR) profiles and/or Inverse Synthetic Aperture Radar (ISAR) images it is crucial that the reference dataset consists of stable and robust features, which will depend, among other factors, on the target aspect and depression angle. Here it is important to find an adequate data grid for efficient data coverage in the reference dataset for ATR. In this paper the variability of the backscattered radar signals of target scattering centers is analyzed for different HRR profiles and ISAR images from measured turntable datasets of ground targets under controlled conditions. In particular, the dependency of the features on the elevation angle is analyzed with regard to the ATR of large-strip SAR data covering a wide range of depression angles, using available (I)SAR datasets as reference. The robustness of these scattering centers is analyzed by extracting their amplitude, phase and position. To this end, turntable measurements were performed under controlled conditions on an artificial military reference object called STANDCAM. Measures referring to the variability, similarity, robustness and separability of the scattering centers are defined, and the dependency of the scattering behaviour on azimuth and elevation variations is analyzed. Additionally, generic types of features (geometrical, statistical), which can be derived especially from (I)SAR images, are applied to the ATR task; the dependence of individual feature values as well as of the feature statistics on aspect (i.e. azimuth and elevation) is presented. The Kolmogorov-Smirnov distance is used to show how the feature statistics are influenced by varying elevation angles. Finally, confusion matrices are computed for the STANDCAM target at all eleven elevation angles. This helps to assess the robustness of ATR performance under the influence of aspect angle deviations between training set and test set.
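A minimal sketch of using the Kolmogorov-Smirnov distance to quantify how a feature's distribution shifts between two elevation angles; the feature samples below are synthetic placeholders.

```python
# Sketch: KS distance (maximum CDF difference) between the same feature
# observed at two elevation angles.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(6)
feature_at_5deg = rng.normal(1.00, 0.20, 500)   # e.g. scatterer amplitude
feature_at_20deg = rng.normal(1.15, 0.25, 500)  # same feature, new elevation

stat, pvalue = ks_2samp(feature_at_5deg, feature_at_20deg)
print(f"KS distance = {stat:.3f}, p = {pvalue:.1e}")
```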
Zhang, Shu-Dong; Gant, Timothy W
2009-07-31
Connectivity mapping is a process for recognizing novel pharmacological and toxicological properties in small molecules by comparing their gene expression signatures with others in a database. A simple and robust method for connectivity mapping with increased specificity and sensitivity was recently developed, and its utility demonstrated using experimentally derived gene signatures. This paper introduces sscMap (statistically significant connections' map), a Java application designed to undertake connectivity mapping tasks using the recently published method. The software is bundled with a default collection of reference gene-expression profiles based on the publicly available dataset from the Broad Institute Connectivity Map 02, which includes data from over 7000 Affymetrix microarrays, for over 1000 small-molecule compounds, and 6100 treatment instances in 5 human cell lines. In addition, the application allows users to add their custom collections of reference profiles and is applicable to a wide range of other 'omics technologies. The utility of sscMap is twofold. First, it serves to make statistically significant connections between a user-supplied gene signature and the 6100 core reference profiles based on the Broad Institute expanded dataset. Second, it allows users to apply the same improved method to custom-built reference profiles which can be added to the database for future referencing. The software can be freely downloaded from http://purl.oclc.org/NET/sscMap.
Baritaux, Jean-Charles; Simon, Anne-Catherine; Schultz, Emmanuelle; Emain, C; Laurent, P; Dinten, Jean-Marc
2016-05-01
We report on our recent efforts towards identifying bacteria in environmental samples by means of Raman spectroscopy. We established a database of Raman spectra from bacteria submitted to various environmental conditions. This dataset was used to verify that Raman typing is possible from measurements performed in non-ideal conditions. Starting from the same dataset, we then varied the phenotype and matrix diversity content included in the reference library used to train the statistical model. The results show that it is possible to obtain models with an extended coverage of spectral variabilities, compared to environment-specific models trained on spectra from a restricted set of conditions. Broad coverage models are desirable for environmental samples since the exact conditions of the bacteria cannot be controlled.
Fine-tuning satellite-based rainfall estimates
NASA Astrophysics Data System (ADS)
Harsa, Hastuadi; Buono, Agus; Hidayat, Rahmat; Achyar, Jaumil; Noviati, Sri; Kurniawan, Roni; Praja, Alfan S.
2018-05-01
Rainfall datasets are available from various sources, including satellite estimates and ground observations. Ground observation locations are sparsely scattered. Therefore, the use of satellite estimates is advantageous, because they can provide data in places where ground observations are not present. In general, however, satellite estimates contain bias, since they are products of algorithms that transform the sensor response into rainfall values. Another cause may be the limited number of ground observations used by the algorithms as the reference in determining the rainfall values. This paper describes the application of a bias correction method that modifies the satellite-based dataset by adding a number of ground observation locations that had not previously been used by the algorithm. The bias correction was performed using a Quantile Mapping procedure between ground observation data and satellite estimates. Since Quantile Mapping requires the mean and standard deviation of both the reference data and the data being corrected, an Inverse Distance Weighting scheme was first applied to the mean and standard deviation of the observation data, which are originally scattered, in order to provide a spatial composite of them. It was therefore possible to provide a reference data point at the same location as each satellite estimate. The results show that the new dataset represents the rainfall values recorded by the ground observations statistically better than the previous dataset.
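A minimal sketch of the mean/standard-deviation form of quantile mapping described above, with Inverse Distance Weighting used to bring scattered station statistics to a satellite grid point; station coordinates, values and function names are hypothetical.

```python
# Sketch: Gaussian-type quantile mapping of a satellite series onto IDW-
# interpolated observed mean and standard deviation at a grid point.
import numpy as np

def quantile_map_gaussian(sat, sat_mean, sat_std, obs_mean, obs_std):
    """Map satellite values so their distribution matches the observed mean
    and standard deviation (assumes both are adequately described by them)."""
    return obs_mean + obs_std * (sat - sat_mean) / sat_std

def idw(xy_obs, values, xy_target, power=2.0):
    """Inverse Distance Weighting of scattered station statistics to a point."""
    d = np.linalg.norm(xy_obs - xy_target, axis=1)
    w = 1.0 / np.maximum(d, 1e-6) ** power
    return np.sum(w * values) / np.sum(w)

# Hypothetical example: three stations around one satellite pixel.
stations = np.array([[0.0, 0.0], [1.0, 0.2], [0.3, 0.9]])
obs_means, obs_stds = np.array([5.1, 6.0, 4.4]), np.array([2.0, 2.4, 1.7])
pixel = np.array([0.4, 0.4])
mu, sd = idw(stations, obs_means, pixel), idw(stations, obs_stds, pixel)
sat_series = np.array([3.0, 8.5, 12.0])
print(quantile_map_gaussian(sat_series, sat_series.mean(), sat_series.std(), mu, sd))
```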
Tsybovskii, I S; Veremeichik, V M; Kotova, S A; Kritskaya, S V; Evmenenko, S A; Udina, I G
2017-02-01
For the Republic of Belarus, the development of a forensic reference database based on 18 autosomal microsatellites (STR) is described, using a population dataset (N = 1040), a "familial" genotypic dataset (N = 2550) obtained from paternity-testing casework, and a dataset of genotypes from a criminal registration database (N = 8756). The population samples studied consist of 80% ethnic Belarusians and 20% individuals of other nationality or of mixed origin (by questionnaire data). Genotypes of 12,346 inhabitants of the Republic of Belarus from 118 regional samples were studied at 18 autosomal microsatellites: 16 tetranucleotide STR (D2S1338, TPOX, D3S1358, CSF1PO, D5S818, D8S1179, D7S820, THO1, vWA, D13S317, D16S539, D18S51, D19S433, D21S11, F13B, and FGA) and two pentanucleotide STR (Penta D and Penta E). The samples studied are in Hardy–Weinberg equilibrium according to the distribution of genotypes at the 18 STR. No significant differences were detected between discrete populations or between samples from the various historical ethnographic regions of the Republic of Belarus (Western and Eastern Polesie, Podneprovye, Ponemanye, Poozerye, and Center), which indicates the absence of pronounced genetic differentiation. Statistically significant differences between the studied genotypic datasets were also not detected, which made it possible to combine the datasets and treat the total sample as a unified forensic reference database for the 18 "criminalistic" STR loci. No differences were detected between the reference database of the Republic of Belarus and Russians and Ukrainians in the distribution of autosomal STR alleles, consistent with the close genetic relationship of the three Eastern Slavic nations mediated by common origin and intense mutual migrations. Significant differences at individual STR loci were observed between the reference database of the Republic of Belarus and populations of Southern and Western Slavs. The necessity of using an original reference database to support forensic expertise practice in the Republic of Belarus was demonstrated.
Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration.
Deelen, Patrick; Bonder, Marc Jan; van der Velde, K Joeri; Westra, Harm-Jan; Winder, Erwin; Hendriksen, Dennis; Franke, Lude; Swertz, Morris A
2014-12-11
To gain statistical power or to allow fine mapping, researchers typically want to pool data before meta-analyses or genotype imputation. However, the necessary harmonization of genetic datasets is currently error-prone because of the many different file formats and a lack of clarity about which genomic strand is used as reference. Genotype Harmonizer (GH) is a command-line tool to harmonize genetic datasets by automatically solving issues concerning genomic strand and file format. GH solves the unknown-strand issue by aligning ambiguous A/T and G/C SNPs to a specified reference, using linkage disequilibrium patterns without prior knowledge of the strands used. GH supports many common GWAS/NGS genotype formats including PLINK, binary PLINK, VCF, SHAPEIT2 & Oxford GEN. GH is implemented in Java, and a large part of the functionality can also be used through the Java 'Genotype-IO' API. All software is open source under license LGPLv3 and available from http://www.molgenis.org/systemsgenetics. GH can be used to harmonize genetic datasets across different file formats and can be easily integrated as a step in routine meta-analysis and imputation pipelines.
Learning discriminative functional network features of schizophrenia
NASA Astrophysics Data System (ADS)
Gheiratmand, Mina; Rish, Irina; Cecchi, Guillermo; Brown, Matthew; Greiner, Russell; Bashivan, Pouya; Polosecki, Pablo; Dursun, Serdar
2017-03-01
Associating schizophrenia with disrupted functional connectivity is a central idea in schizophrenia research. However, identifying neuroimaging-based features that can serve as reliable "statistical biomarkers" of the disease remains a challenging open problem. We argue that generalization accuracy and stability of candidate features ("biomarkers") must be used as additional criteria on top of standard significance tests in order to discover more robust biomarkers. Generalization accuracy refers to the utility of biomarkers for making predictions about individuals, for example discriminating between patients and controls, in novel datasets. Feature stability refers to the reproducibility of the candidate features across different datasets. Here, we extracted functional connectivity network features from fMRI data at both high resolution (voxel level) and a spatially down-sampled lower resolution ("supervoxel" level). At the supervoxel level, we used whole-brain network links, while at the voxel level, due to the intractably large number of features, we sampled a subset of them. We compared the statistical significance, stability and discriminative utility of both feature types in a multi-site fMRI dataset composed of schizophrenia patients and healthy controls. For both feature types, a considerable fraction of features showed significant differences between the two groups. Also, both feature types were similarly stable across multiple data subsets. However, the whole-brain supervoxel functional connectivity features showed a higher cross-validation classification accuracy of 78.7% vs. 72.4% for the voxel-level features. Cross-site variability and heterogeneity in the patient samples in the multi-site FBIRN dataset made the task more challenging compared to single-site studies. The use of the above methodology, in combination with the fully data-driven approach using whole-brain information, has the potential to shed light on "biomarker discovery" in schizophrenia.
Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses
Park, Danny S.; Brown, Brielin; Eng, Celeste; Huntsman, Scott; Hu, Donglei; Torgerson, Dara G.; Burchard, Esteban G.; Zaitlen, Noah
2015-01-01
Motivation: Approaches to identifying new risk loci, training risk prediction models, imputing untyped variants and fine-mapping causal variants from summary statistics of genome-wide association studies are playing an increasingly important role in the human genetics community. Current summary statistics-based methods rely on global ‘best guess’ reference panels to model the genetic correlation structure of the dataset being studied. This approach, especially in admixed populations, has the potential to produce misleading results, ignores variation in local structure and is not feasible when appropriate reference panels are missing or small. Here, we develop a method, Adapt-Mix, that combines information across all available reference panels to produce estimates of local genetic correlation structure for summary statistics-based methods in arbitrary populations. Results: We applied Adapt-Mix to estimate the genetic correlation structure of both admixed and non-admixed individuals using simulated and real data. We evaluated our method by measuring the performance of two summary statistics-based methods: imputation and joint-testing. When using our method as opposed to the current standard of ‘best guess’ reference panels, we observed a 28% decrease in mean-squared error for imputation and a 73.7% decrease in mean-squared error for joint-testing. Availability and implementation: Our method is publicly available in a software package called ADAPT-Mix available at https://github.com/dpark27/adapt_mix. Contact: noah.zaitlen@ucsf.edu PMID:26072481
Observational uncertainty and regional climate model evaluation: A pan-European perspective
NASA Astrophysics Data System (ADS)
Kotlarski, Sven; Szabó, Péter; Herrera, Sixto; Räty, Olle; Keuler, Klaus; Soares, Pedro M.; Cardoso, Rita M.; Bosshard, Thomas; Pagé, Christian; Boberg, Fredrik; Gutiérrez, José M.; Jaczewski, Adam; Kreienkamp, Frank; Liniger, Mark A.; Lussana, Cristian; Szepszo, Gabriella
2017-04-01
Local and regional climate change assessments based on downscaling methods crucially depend on the existence of accurate and reliable observational reference data. In dynamical downscaling via regional climate models (RCMs), observational data can influence model development itself and, later on, model evaluation, parameter calibration and added value assessment. In empirical-statistical downscaling, observations serve as predictand data and directly influence model calibration with corresponding effects on downscaled climate change projections. Focusing on the evaluation of RCMs, we here analyze the influence of uncertainties in observational reference data on evaluation results in a well-defined performance assessment framework and on a European scale. For this purpose we employ three different gridded observational reference datasets, namely (1) the well-established EOBS dataset, (2) the recently developed EURO4M-MESAN regional re-analysis, and (3) several national high-resolution and quality-controlled gridded datasets that recently became available. In terms of climate models, five reanalysis-driven experiments carried out by five different RCMs within the EURO-CORDEX framework are used. Two variables (temperature and precipitation) and a range of evaluation metrics that reflect different aspects of RCM performance are considered. We furthermore include an illustrative model ranking exercise and relate observational spread to RCM spread. The results obtained indicate a varying influence of observational uncertainty on model evaluation depending on the variable, the season, the region and the specific performance metric considered. Over most parts of the continent, the influence of the choice of the reference dataset for temperature is rather small for seasonal mean values and inter-annual variability. Here, model uncertainty (as measured by the spread between the five RCM simulations considered) is typically much larger than reference data uncertainty. For parameters of the daily temperature distribution and for the spatial pattern correlation, however, important dependencies on the reference dataset can arise. The related evaluation uncertainties can be as large as or even larger than model uncertainty. For precipitation, the influence of observational uncertainty is, in general, larger than for temperature. It often dominates model uncertainty, especially for the evaluation of the wet day frequency, the spatial correlation and the shape and location of the distribution of daily values. But even the evaluation of large-scale seasonal mean values can be considerably affected by the choice of the reference. When employing a simple and illustrative model ranking scheme on these results, it is found that RCM ranking in many cases depends on the reference dataset employed.
Towards interoperable and reproducible QSAR analyses: Exchange of datasets.
Spjuth, Ola; Willighagen, Egon L; Guha, Rajarshi; Eklund, Martin; Wikberg, Jarl Es
2010-06-30
QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises the addition of chemical structures as well as the selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies the setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates the addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problem of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.
Watson, Nathanial E; Parsons, Brendon A; Synovec, Robert E
2016-08-12
Performance of tile-based Fisher Ratio (F-ratio) data analysis, recently developed for discovery-based studies using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC-TOFMS), is evaluated with a metabolomics dataset that had previously been analyzed in great detail, but using a brute-force approach. The previously analyzed data (referred to herein as the benchmark dataset) were intracellular extracts from Saccharomyces cerevisiae (yeast), either metabolizing glucose (repressed) or ethanol (derepressed), which define the two classes in the discovery-based analysis to find metabolites that are statistically different in concentration between the two classes. Beneficially, this previously analyzed dataset provides a concrete means to validate the tile-based F-ratio software. Herein, we demonstrate and validate the significant benefits of applying tile-based F-ratio analysis. The yeast metabolomics data are analyzed more rapidly, in about one week versus one year for the prior studies with this dataset. Furthermore, a null distribution analysis is implemented to statistically determine an adequate F-ratio threshold, whereby variables with F-ratio values below the threshold can be ignored as not class-distinguishing, which provides the analyst with confidence when analyzing the hit table. Forty-six of the fifty-four benchmarked changing metabolites were discovered by the new methodology, while all but one of the nineteen benchmarked false-positive metabolites previously identified were consistently excluded. Copyright © 2016 Elsevier B.V. All rights reserved.
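An illustrative sketch of the two statistical ingredients mentioned above, a per-variable Fisher ratio and a permutation-based null distribution used to pick the threshold; the data, class sizes and 95th-percentile cut-off are synthetic assumptions.

```python
# Sketch: one-way F-ratio per variable plus a class-label permutation null
# distribution to set a threshold below which variables are ignored.
import numpy as np

def f_ratio(x, labels):
    groups = [x[labels == c] for c in np.unique(labels)]
    grand = x.mean()
    between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (len(groups) - 1)
    within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (len(x) - len(groups))
    return between / within

def null_threshold(X, labels, n_perm=200, percentile=95, seed=0):
    rng = np.random.default_rng(seed)
    null = [f_ratio(X[:, j], rng.permutation(labels))
            for _ in range(n_perm) for j in range(X.shape[1])]
    return np.percentile(null, percentile)

rng = np.random.default_rng(7)
labels = np.repeat([0, 1], 12)                   # repressed vs derepressed
X = rng.normal(size=(24, 100))
X[labels == 1, :5] += 1.5                        # five truly changing variables
thr = null_threshold(X, labels)
hits = [j for j in range(X.shape[1]) if f_ratio(X[:, j], labels) > thr]
print("threshold:", round(thr, 2), "hit variables:", hits)
```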
NASA Astrophysics Data System (ADS)
Kotlarski, Sven; Gutiérrez, José M.; Boberg, Fredrik; Bosshard, Thomas; Cardoso, Rita M.; Herrera, Sixto; Maraun, Douglas; Mezghani, Abdelkader; Pagé, Christian; Räty, Olle; Stepanek, Petr; Soares, Pedro M. M.; Szabo, Peter
2016-04-01
VALUE is an open European network to validate and compare downscaling methods for climate change research (http://www.value-cost.eu). A key deliverable of VALUE is the development of a systematic validation framework to enable the assessment and comparison of downscaling methods. Such assessments can be expected to crucially depend on the existence of accurate and reliable observational reference data. In dynamical downscaling, observational data can influence model development itself and, later on, model evaluation, parameter calibration and added value assessment. In empirical-statistical downscaling, observations serve as predictand data and directly influence model calibration with corresponding effects on downscaled climate change projections. We here present a comprehensive assessment of the influence of uncertainties in observational reference data and of scale-related issues on several of the above-mentioned aspects. First, temperature and precipitation characteristics as simulated by a set of reanalysis-driven EURO-CORDEX RCM experiments are validated against three different gridded reference data products, namely (1) the EOBS dataset, (2) the recently developed EURO4M-MESAN regional re-analysis, and (3) several national high-resolution and quality-controlled gridded datasets that recently became available. The analysis reveals a considerable influence of the choice of the reference data on the evaluation results, especially for precipitation. It is also illustrated how differences between the reference datasets influence the ranking of RCMs according to a comprehensive set of performance measures.
FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web.
Probst, Daniel; Reymond, Jean-Louis
2018-04-15
During the past decade, big data have become a major tool in scientific endeavors. Although statistical methods and algorithms are well-suited for analyzing and summarizing enormous amounts of data, the results do not allow for a visual inspection of the entire data. Current scientific software, including R packages and Python libraries such as ggplot2, matplotlib and plot.ly, does not support interactive visualizations of datasets exceeding 100 000 data points on the web. Other solutions enable the web-based visualization of big data only through data reduction or statistical representations. However, recent hardware developments, especially advancements in graphical processing units, allow for the rendering of millions of data points on a wide range of consumer hardware such as laptops, tablets and mobile phones. Similar to the challenges and opportunities brought to virtually every scientific field by big data, the visualization of and interaction with copious amounts of data are both demanding and hold great promise. Here we present FUn, a framework consisting of a client (Faerun) and server (Underdark) module, facilitating the creation of web-based, interactive 3D visualizations of large datasets, enabling record-level visual inspection. We also introduce a reference implementation providing access to SureChEMBL, a database containing patent information on more than 17 million chemical compounds. The source code and the most recent builds of Faerun and Underdark, Lore.js and the data preprocessing toolchain used in the reference implementation, are available on the project website (http://doc.gdb.tools/fun/). daniel.probst@dcb.unibe.ch or jean-louis.reymond@dcb.unibe.ch.
Statistical Analysis of Sport Movement Observations: the Case of Orienteering
NASA Astrophysics Data System (ADS)
Amouzandeh, K.; Karimipour, F.
2017-09-01
The study of movement observations is becoming more popular in several applications. In particular, analyzing sport movement time series has been considered a demanding area. However, most attempts at analyzing sport movement data have focused on spatial aspects of movement, extracting movement characteristics such as spatial patterns and similarities. This paper proposes a statistical analysis of sport movement observations, which refers to analyzing changes in the spatial movement attributes (e.g. distance, altitude and slope) and non-spatial movement attributes (e.g. speed and heart rate) of athletes. As a case study, an example dataset of movement observations acquired during the "orienteering" sport is presented and statistically analyzed.
Wu, Jiayi; Ma, Yong-Bei; Congdon, Charles; Brett, Bevin; Chen, Shuobing; Xu, Yaofang; Ouyang, Qi; Mao, Youdong
2017-01-01
Structural heterogeneity in single-particle cryo-electron microscopy (cryo-EM) data represents a major challenge for high-resolution structure determination. Unsupervised classification may serve as the first step in the assessment of structural heterogeneity. However, traditional algorithms for unsupervised classification, such as K-means clustering and maximum likelihood optimization, may classify images into the wrong classes as the signal-to-noise ratio (SNR) of the image data decreases, while demanding increased computational costs. Overcoming these limitations requires further development of clustering algorithms for high-performance cryo-EM data processing. Here we introduce an unsupervised single-particle clustering algorithm derived from a statistical manifold learning framework called generative topographic mapping (GTM). We show that unsupervised GTM clustering improves classification accuracy by about 40% in the absence of input references for data with lower SNRs. Applications to several experimental datasets suggest that our algorithm can detect subtle structural differences among classes via a hierarchical clustering strategy. After code optimization over a high-performance computing (HPC) environment, our software implementation was able to generate thousands of reference-free class averages within hours in a massively parallel fashion, allowing a significant improvement in ab initio 3D reconstruction and assisting in the computational purification of homogeneous datasets for high-resolution visualization.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Josse, Florent; Lefebvre, Yannick; Todeschini, Patrick
2006-07-01
Assessing the structural integrity of a nuclear Reactor Pressure Vessel (RPV) subjected to pressurized-thermal-shock (PTS) transients is extremely important to safety. In addition to conventional deterministic calculations to confirm RPV integrity, Electricite de France (EDF) carries out probabilistic analyses. Probabilistic analyses are interesting because some key variables, albeit conventionally taken at conservative values, can be modeled more accurately through statistical variability. One variable which significantly affects RPV structural integrity assessment is cleavage fracture initiation toughness. The reference fracture toughness method currently in use at EDF is the RCC-M and ASME Code lower-bound K_IC based on the indexing parameter RT_NDT. However, in order to quantify the toughness scatter for probabilistic analyses, the master curve method is being analyzed at present. Furthermore, the master curve method is a direct means of evaluating fracture toughness based on K_JC data. In the framework of the master curve investigation undertaken by EDF, this article deals with the following two statistical items: building a master curve from an extract of a fracture toughness dataset (from the European project 'Unified Reference Fracture Toughness Design curves for RPV Steels') and controlling statistical uncertainty for both mono-temperature and multi-temperature tests. Concerning the first point, master curve temperature dependence is empirical in nature. To determine the 'original' master curve, Wallin postulated that a unified description of fracture toughness temperature dependence for ferritic steels is possible, and used a large number of data corresponding to nuclear-grade pressure vessel steels and welds. Our working hypothesis is that some ferritic steels may behave in slightly different ways. Therefore we focused exclusively on the basic French reactor vessel metals of types A508 Class 3 and A533 Grade B Class 1, taking the sampling level and direction into account as well as the test specimen type. As for the second point, the emphasis is placed on the uncertainties in applying the master curve approach. For a toughness dataset based on different specimens of a single product, application of the master curve methodology requires the statistical estimation of one parameter: the reference temperature T_0. Because of the limited number of specimens, estimation of this temperature is uncertain. The ASTM standard provides a rough evaluation of this statistical uncertainty through an approximate confidence interval. In this paper, a thorough study is carried out to build more meaningful confidence intervals (for both mono-temperature and multi-temperature tests). These results ensure better control over uncertainty, and allow rigorous analysis of the impact of its influencing factors: the number of specimens and the temperatures at which they have been tested. (authors)
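A hedged sketch of a single-temperature master curve T_0 estimate in the spirit of ASTM E1921 follows; it is simplified (no censoring of invalid tests and no small-sample bias correction), and the K_JC values and test temperature are hypothetical.

```python
# Hedged sketch: estimate the master curve reference temperature T0 from
# K_Jc data measured at one temperature. Toughness in MPa*sqrt(m), T in C.
import numpy as np

def estimate_t0_single_temperature(k_jc, test_temp_c):
    k_jc = np.asarray(k_jc, dtype=float)
    # Weibull scale parameter (threshold 20, shape 4), simplest estimator.
    k0 = (np.mean((k_jc - 20.0) ** 4)) ** 0.25 + 20.0
    # Median toughness from the scale parameter.
    k_med = 20.0 + (k0 - 20.0) * np.log(2.0) ** 0.25
    # Invert the master curve K_med = 30 + 70*exp(0.019*(T - T0)).
    return test_temp_c - np.log((k_med - 30.0) / 70.0) / 0.019

# Hypothetical K_Jc dataset measured at -60 C.
print(estimate_t0_single_temperature([72, 95, 110, 84, 130, 101], -60.0))
```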
Meyer, Patrick E; Lafitte, Frédéric; Bontempi, Gianluca
2008-10-29
This paper presents the R/Bioconductor package minet (version 1.1.6), which provides a set of functions to infer mutual information networks from a dataset. Once fed with a microarray dataset, the package returns a network where nodes denote genes, edges model statistical dependencies between genes and the weight of an edge quantifies the statistical evidence of a specific (e.g. transcriptional) gene-to-gene interaction. Four different entropy estimators are made available in the package minet (empirical, Miller-Madow, Schurmann-Grassberger and shrink) as well as four different inference methods, namely relevance networks, ARACNE, CLR and MRNET. Also, the package integrates accuracy assessment tools, like F-scores, PR-curves and ROC-curves, in order to compare the inferred network with a reference one. The package minet provides a series of tools for inferring transcriptional networks from microarray data. It is freely available from the Comprehensive R Archive Network (CRAN) as well as from the Bioconductor website.
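The package itself is R code; purely as an illustration of the relevance-network idea it implements, a small Python analogue is sketched below (the binning, estimator and threshold choices are arbitrary placeholders, not minet's defaults):

# Toy relevance network: pairwise mutual information on binned expression data.
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_matrix(expr, n_bins=10):
    """expr: samples x genes array; returns a genes x genes mutual-information matrix."""
    binned = np.stack([np.digitize(col, np.histogram_bin_edges(col, n_bins)[1:-1])
                       for col in expr.T])
    g = binned.shape[0]
    mim = np.zeros((g, g))
    for i in range(g):
        for j in range(i + 1, g):
            mim[i, j] = mim[j, i] = mutual_info_score(binned[i], binned[j])
    return mim

expr = np.random.default_rng(1).normal(size=(50, 8))   # toy microarray: 50 samples, 8 genes
edges = mi_matrix(expr) > 0.2                          # crude threshold -> adjacency matrix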
Genotype Imputation with Thousands of Genomes
Howie, Bryan; Marchini, Jonathan; Stephens, Matthew
2011-01-01
Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package. PMID:22384356
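To illustrate the panel-selection idea only (this is not the IMPUTE2 algorithm), a toy sketch that picks a custom reference panel for one study haplotype by Hamming similarity over a local window of typed sites:

# Choose the k reference haplotypes most similar to a study haplotype in one region.
import numpy as np

def custom_panel(study_hap, ref_haps, k=100):
    """study_hap: 0/1 vector at typed sites; ref_haps: n_ref x n_sites 0/1 matrix."""
    dist = (ref_haps != study_hap).sum(axis=1)      # Hamming distance to each reference haplotype
    return np.argsort(dist)[:k]                     # indices of the k closest haplotypes

rng = np.random.default_rng(2)
ref = rng.integers(0, 2, size=(1000, 200))          # toy reference panel
study = ref[0] ^ (rng.random(200) < 0.02)           # a study haplotype close to reference 0
print(custom_panel(study, ref, k=5))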
Using Third Party Data to Update a Reference Dataset in a Quality Evaluation Service
NASA Astrophysics Data System (ADS)
Xavier, E. M. A.; Ariza-López, F. J.; Ureña-Cámara, M. A.
2016-06-01
Nowadays it is easy to find many data sources for various regions around the globe. In this 'data overload' scenario there is little, if any, information available about the quality of these data sources. In order to easily provide this data quality information, we presented the architecture of a web service for the automation of quality control of spatial datasets running over a Web Processing Service (WPS). For quality procedures that require an external reference dataset, like positional accuracy or completeness, the architecture permits using a reference dataset. However, this reference dataset is not ageless, since it suffers from the natural time degradation inherent to geospatial features. In order to mitigate this problem we propose the Time Degradation & Updating Module, which intends to apply assessed data as a tool to keep the reference database updated. The main idea is to utilize datasets sent to the quality evaluation service as a source of 'candidate data elements' for the updating of the reference database. After the evaluation, if some elements of a candidate dataset reach a determined quality level, they can be used as input data to improve the current reference database. In this work we present the first design of the Time Degradation & Updating Module. We believe that the outcomes can be applied in the pursuit of a fully automatic on-line quality evaluation platform.
A comparative evaluation of intraoral and extraoral digital impressions: An in vivo study.
Sason, Gursharan Kaur; Mistry, Gaurang; Tabassum, Rubina; Shetty, Omkar
2018-01-01
The accuracy of a dental impression is determined by two factors: "trueness" and "precision." The scanners used in dentistry are relatively new in the market, and very few studies have compared the "precision" and "trueness" of the intraoral scanner with the extraoral scanner. The aim of this study was to evaluate and compare the accuracy of intraoral and extraoral digital impressions. Ten dentulous participants (male/female) aged 18-45 years with an asymptomatic, endodontically treated mandibular first molar with adjacent teeth present were selected for this study. The prepared test tooth was measured using a digital Vernier caliper to obtain reference datasets. The tooth was then scanned using the intraoral scanner, and the extraoral scans were obtained using the casts made from the impressions. The datasets were divided into four groups and then statistically analyzed. The test tooth preparation was done, and dimples were made using a round diamond point on the bucco-occlusal, mesio-occlusal, disto-occlusal, and linguo-occlusal line angles, and these were used to obtain reference datasets intraorally using a digital Vernier caliper. The test tooth was then scanned with the IO scanner (CS 3500, Carestream dental) thrice and also impressions were made using addition silicone impression material (3M™ ESPE) and dental casts were poured in Type IV dental stone (Kalrock-Kalabhai Karson India Pvt. Ltd., India) which were later scanned with the EO scanner (LAVA™ Scan ST Design system [3M™ ESPE]) thrice. The datasets obtained from the intraoral and extraoral scanners were exported to Dental Wings software and readings were obtained. A repeated-measures ANOVA test was used to compare differences between the groups and an independent t-test for comparison between the readings of the intraoral and extraoral scanner. The least significant difference test was used for comparison between reference datasets with the intraoral and extraoral scanner, respectively. A level of statistical significance of P < 0.05 was set. The precision values ranged from 20.7 to 33.35 μm for the intraoral scanner and 19.5 to 37 μm for the extraoral scanner. The mean deviations for the intraoral scanner were 19.6 μm mesiodistally (MD) and 16.4 μm buccolingually (BL), and 24.0 μm MD and 22.5 μm BL for the extraoral scanner. The mean values of the intraoral scanner (413 μm) for trueness were closer to the actual measurements (459 μm) than those of the extraoral scanner (396 μm). The intraoral scanner showed higher "precision" and "trueness" values when compared with the extraoral scanner.
EnviroAtlas - Average Annual Precipitation 1981-2010 by HUC12 for the Conterminous United States
This EnviroAtlas dataset provides the average annual precipitation by 12-digit Hydrologic Unit (HUC). The values were estimated from maps produced by the PRISM Climate Group, Oregon State University. The original data was at the scale of 800 m grid cells representing average precipitation from 1981-2010 in mm. The data was converted to inches of precipitation and then zonal statistics were estimated for a final value of average annual precipitation for each 12 digit HUC. For more information about the original dataset please refer to the PRISM website at http://www.prism.oregonstate.edu/. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Elshehawi, Waleed; Alsaffar, Hani; Roberts, Graham; Lucas, Victoria; McDonald, Fraser; Camilleri, Simon
2016-04-01
The purpose of this study was to develop and validate a Reference Data Set for Dental Age Assessment of the Maltese population and compare the mean Age of Attainment to a UK Caucasian Reference Data Set. The Maltese Reference Data Set was developed from 1593 Dental Panoramic Tomograms of patients aged between 4 and 26 years, taken from the radiographic archives of the Dental Department, Mater Dei Hospital, Malta. Tooth Development Stages were recorded for all 16 maxillary and mandibular permanent teeth on the left side and both permanent third molars on the right, according to Demirjian's staging method. Summary and percentile data were calculated for each Tooth Development Stage, including the mean Age of Attainment. These means were used to estimate the Dental Age of each subject in the study sample using the simple unweighted average method. The estimated Dental Age was compared to the gold standard of the Chronological Age. Comparison of the Maltese and UK Caucasian Reference Data Sets was performed by a series of t-tests, carried out for each paired Tooth Development Stage by gender. The mean Age of Attainment was slightly higher for the Maltese than for the UK Caucasians in both males and females. However, there was no statistically significant difference between the Chronological Age and Dental Age for either sex. Copyright © 2016 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.
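As an illustration of the simple unweighted average method mentioned above (the ages of attainment below are made-up placeholders, not values from the Maltese RDS):

# Dental age = mean of the reference mean ages of attainment of the observed stages.
mean_age_of_attainment = {                 # hypothetical RDS entries: (tooth, Demirjian stage) -> years
    ("LL6", "G"): 9.8,
    ("LL7", "E"): 10.4,
    ("LL3", "F"): 10.1,
}

def dental_age(observed_stages):
    ages = [mean_age_of_attainment[s] for s in observed_stages if s in mean_age_of_attainment]
    return sum(ages) / len(ages)

print(round(dental_age([("LL6", "G"), ("LL7", "E"), ("LL3", "F")]), 2))   # -> 10.1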
Ambiguity of Quality in Remote Sensing Data
NASA Technical Reports Server (NTRS)
Lynnes, Christopher; Leptoukh, Greg
2010-01-01
This slide presentation reviews some of the issues in quality of remote sensing data. Data "quality" is used in several different contexts in remote sensing data, with quite different meanings. At the pixel level, quality typically refers to a quality control process exercised by the processing algorithm, not an explicit declaration of accuracy or precision. File level quality is usually a statistical summary of the pixel-level quality but is of doubtful use for scenes covering large areal extents. Quality at the dataset or product level, on the other hand, usually refers to how accurately the dataset is believed to represent the physical quantities it purports to measure. This assessment often bears but an indirect relationship at best to pixel level quality. In addition to ambiguity at different levels of granularity, ambiguity is endemic within levels. Pixel-level quality terms vary widely, as do recommendations for use of these flags. At the dataset/product level, quality for low-resolution gridded products is often extrapolated from validation campaigns using high spatial resolution swath data, a suspect practice at best. Making use of quality at all levels is complicated by the dependence on application needs. We will present examples of the various meanings of quality in remote sensing data and possible ways forward toward a more unified and usable quality framework.
Undersampling strategies for compressed sensing accelerated MR spectroscopic imaging
NASA Astrophysics Data System (ADS)
Vidya Shankar, Rohini; Hu, Houchun Harry; Bikkamane Jayadev, Nutandev; Chang, John C.; Kodibagkar, Vikram D.
2017-03-01
Compressed sensing (CS) can accelerate magnetic resonance spectroscopic imaging (MRSI), facilitating its widespread clinical integration. The objective of this study was to assess the effect of different undersampling strategies on CS-MRSI reconstruction quality. Phantom data were acquired on a Philips 3 T Ingenia scanner. Four types of undersampling masks, corresponding to each strategy, namely low resolution, variable density, iterative design, and a priori, were simulated in Matlab and retrospectively applied to the test 1X MRSI data to generate undersampled datasets corresponding to 2X-5X and 7X accelerations for each type of mask. Reconstruction parameters were kept the same in each case (all masks and accelerations) to ensure that any resulting differences could be attributed to the type of mask being employed. The reconstructed datasets from each mask were statistically compared with the reference 1X, and assessed using metrics like the root mean square error and metabolite ratios. Simulation results indicate that both the a priori and variable density undersampling masks maintain high fidelity with the 1X up to five-fold acceleration. The low-resolution mask-based reconstructions showed statistically significant differences from the 1X, with the reconstruction failing at 3X, while the iterative design reconstructions maintained fidelity with the 1X up to 4X acceleration. In summary, a pilot study was conducted to identify an optimal sampling mask in CS-MRSI. Simulation results demonstrate that the a priori and variable density masks can provide statistically similar results to the fully sampled reference. Future work would involve implementing these two masks prospectively on a clinical scanner.
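As an illustration of one of the strategies above (this is not the study's Matlab code), a sketch that builds a variable-density undersampling mask for a target acceleration factor:

# Variable-density k-space undersampling mask for a 2D phase-encode grid.
import numpy as np

def variable_density_mask(ny, nx, accel, power=3.0, seed=0):
    """Sample each point with probability decaying with distance from the k-space centre."""
    ky, kx = np.meshgrid(np.linspace(-1, 1, ny), np.linspace(-1, 1, nx), indexing="ij")
    prob = (1.0 - np.sqrt(ky**2 + kx**2) / np.sqrt(2)) ** power
    prob *= (ny * nx / accel) / prob.sum()          # rescale so the expected sampling rate is 1/accel
    rng = np.random.default_rng(seed)
    return rng.random((ny, nx)) < np.clip(prob, 0, 1)

mask = variable_density_mask(32, 32, accel=4)
print(mask.mean())                                   # roughly 0.25 for 4-fold acceleration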
Dong, Yingying; Luo, Ruisen; Feng, Haikuan; Wang, Jihua; Zhao, Jinling; Zhu, Yining; Yang, Guijun
2014-01-01
Differences exist among analysis results of agriculture monitoring and crop production based on remote sensing observations, which are obtained at different spatial scales from multiple remote sensors in the same time period, and processed by the same algorithms, models or methods. These differences can be mainly quantitatively described from three aspects, i.e. multiple remote sensing observations, crop parameter estimation models, and spatial scale effects of surface parameters. Our research proposed a new method to analyse and correct the differences between multi-source and multi-scale spatial remote sensing surface reflectance datasets, aiming to provide references for further studies in agricultural applications with multiple remotely sensed observations from different sources. The new method was constructed on the basis of physical and mathematical properties of multi-source and multi-scale reflectance datasets. Statistical theory was used to extract statistical characteristics of multiple surface reflectance datasets, and to further quantitatively analyse spatial variations of these characteristics at multiple spatial scales. Then, taking the surface reflectance at the small spatial scale as the baseline data, Gaussian distribution theory was used to correct the multiple surface reflectance datasets based on the above obtained physical characteristics and mathematical distribution properties, and their spatial variations. This proposed method was verified by two sets of multiple satellite images, which were obtained in two experimental fields located in Inner Mongolia and Beijing, China, with different degrees of homogeneity of underlying surfaces. Experimental results indicate that differences of surface reflectance datasets at multiple spatial scales could be effectively corrected over non-homogeneous underlying surfaces, which provides a database for further multi-source and multi-scale crop growth monitoring and yield prediction, and their corresponding consistency analysis and evaluation. PMID:25405760
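A much-simplified sketch of the correction idea, assuming a plain Gaussian mean/standard-deviation match to the fine-scale baseline (the paper's full multi-scale treatment is not reproduced here):

# Rescale a coarse-scale reflectance band so its distribution matches the fine-scale baseline.
import numpy as np

def match_to_baseline(coarse, baseline):
    """Linear correction so the corrected values have the baseline mean and std."""
    return (coarse - coarse.mean()) / coarse.std() * baseline.std() + baseline.mean()

rng = np.random.default_rng(3)
fine = rng.normal(0.25, 0.05, 10000)        # toy fine-scale surface reflectance (baseline)
coarse = rng.normal(0.30, 0.08, 10000)      # toy coarse-scale reflectance with a scale bias
corrected = match_to_baseline(coarse, fine)
print(corrected.mean().round(3), corrected.std().round(3))   # ~0.25, ~0.05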
The 3D Reference Earth Model: Status and Preliminary Results
NASA Astrophysics Data System (ADS)
Moulik, P.; Lekic, V.; Romanowicz, B. A.
2017-12-01
In the 20th century, seismologists constructed models of how average physical properties (e.g. density, rigidity, compressibility, anisotropy) vary with depth in the Earth's interior. These one-dimensional (1D) reference Earth models (e.g. PREM) have proven indispensable in earthquake location, imaging of interior structure, understanding material properties under extreme conditions, and as a reference in other fields, such as particle physics and astronomy. Over the past three decades, new datasets motivated more sophisticated efforts that yielded models of how properties vary both laterally and with depth in the Earth's interior. Though these three-dimensional (3D) models exhibit compelling similarities at large scales, differences in the methodology, representation of structure, and dataset upon which they are based have prevented the creation of 3D community reference models. As part of the REM-3D project, we are compiling and reconciling reference seismic datasets of body wave travel-time measurements, fundamental mode and overtone surface wave dispersion measurements, and normal mode frequencies and splitting functions. These reference datasets are being inverted for a long-wavelength, 3D reference Earth model that describes the robust long-wavelength features of mantle heterogeneity. As a community reference model with fully quantified uncertainties and tradeoffs and an associated publicly available dataset, REM-3D will facilitate Earth imaging studies, earthquake characterization, inferences on temperature and composition in the deep interior, and be of improved utility to emerging scientific endeavors, such as neutrino geoscience. Here, we summarize progress made in the construction of the reference long-period dataset and present a preliminary version of REM-3D in the upper mantle. In order to determine the level of detail warranted for inclusion in REM-3D, we analyze the spectrum of discrepancies between models inverted with different subsets of the reference dataset. This procedure allows us to evaluate the extent of consistency in imaging heterogeneity at various depths and between spatial scales.
NASA Astrophysics Data System (ADS)
Srivastava, Prashant K.; Han, Dawei; Islam, Tanvir; Petropoulos, George P.; Gupta, Manika; Dai, Qiang
2016-04-01
Reference evapotranspiration (ETo) is an important variable in hydrological modeling, which is not always available, especially for ungauged catchments. Satellite data, such as those available from the MODerate Resolution Imaging Spectroradiometer (MODIS), and global datasets via the European Centre for Medium Range Weather Forecasts (ECMWF) reanalysis (ERA) interim and National Centers for Environmental Prediction (NCEP) reanalysis are important sources of information for ETo. This study explored the seasonal performances of MODIS (MOD16) and Weather Research and Forecasting (WRF) model downscaled global reanalysis datasets, such as ERA interim and NCEP-derived ETo, against ground-based datasets. Overall, on the basis of the statistical metrics computed, ETo derived from ERA interim and MODIS were more accurate in comparison to the estimates from NCEP for all the seasons. The pooled datasets also revealed a similar performance to the seasonal assessment with higher agreement for the ERA interim (r = 0.96, RMSE = 2.76 mm/8 days; bias = 0.24 mm/8 days), followed by MODIS (r = 0.95, RMSE = 7.66 mm/8 days; bias = -7.17 mm/8 days) and NCEP (r = 0.76, RMSE = 11.81 mm/8 days; bias = -10.20 mm/8 days). The only limitation with downscaling ERA interim reanalysis datasets using WRF is that it is time-consuming in contrast to the readily available MODIS operational product for use in mesoscale studies and practical applications.
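For reference, a minimal sketch of the agreement metrics quoted above (r, RMSE and bias) computed for an estimated ETo series against ground observations; the arrays below are placeholders, not the study data:

# Correlation, RMSE and bias between estimated and observed 8-day ETo (toy values).
import numpy as np

def agreement(est, obs):
    est, obs = np.asarray(est, float), np.asarray(obs, float)
    r = np.corrcoef(est, obs)[0, 1]
    rmse = np.sqrt(np.mean((est - obs) ** 2))
    bias = np.mean(est - obs)
    return r, rmse, bias

obs = np.array([12.0, 18.5, 25.1, 30.4, 22.3])      # mm / 8 days (toy values)
est = np.array([12.4, 18.0, 26.0, 31.1, 22.0])
print("r=%.2f RMSE=%.2f bias=%.2f" % agreement(est, obs))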
Statistical Significance of Optical Map Alignments
Sarkar, Deepayan; Goldstein, Steve; Schwartz, David C.
2012-01-01
The Optical Mapping System constructs ordered restriction maps spanning entire genomes through the assembly and analysis of large datasets comprising individually analyzed genomic DNA molecules. Such restriction maps uniquely reveal mammalian genome structure and variation, but also raise computational and statistical questions beyond those that have been solved in the analysis of smaller, microbial genomes. We address the problem of how to filter maps that align poorly to a reference genome. We obtain map-specific thresholds that control errors and improve iterative assembly. We also show how an optimal self-alignment score provides an accurate approximation to the probability of alignment, which is useful in applications seeking to identify structural genomic abnormalities. PMID:22506568
NASA Astrophysics Data System (ADS)
Harting, Ronald; Bosch, Aleid; Gunnink, Jan
2014-05-01
Society has an increasing demand from the subsurface, which in the Dutch shallow subsurface (upper 30 to 40 meters) mainly focuses on natural aggregate resources, groundwater, infrastructure and dike safety. This stimulates the demand for knowledge about the composition and heterogeneity of the subsurface and its physical and chemical properties, including the uncertainties involved. Physical and chemical properties of sediments in the subsurface have been under investigation for decades; however, the usefulness of these data for applied research and the understanding of these properties is limited. This is due to several factors: studies consist mainly of separately collected datasets, targeted at a limited number of parameters, focused on a small number of geological units, distributed unevenly with depth and usually collected from clustered drillings with limited spatial extent, or are analysed with different techniques and methods, often on disturbed samples. These factors result in a heterogeneous and biased dataset not suitable to function as a reference dataset or to statistically determine regional characteristics of geological units. To overcome these shortcomings, the Geological Survey of the Netherlands is establishing a nation-wide reference dataset for physical and chemical properties. In 2006, a drilling campaign was started using cone penetration tests, cored drillings and geophysical well logs, choosing the sites for a good geographical distribution. The lithological properties of the undisturbed cores are visually described and interpreted for lithostratigraphy and inferred sedimentary environment based on lithofacies. The locations of the samples in the cores are chosen based on this description and interpretation, resulting in an evenly distributed dataset of in situ samples with respect to geological units as well as an adequate number of samples suitable for statistical analysis. Analyses are uniformly performed for grain size distribution, permeability (both high and low permeable lithologies) and geochemical methods (X-Ray Fluorescence, Thermo-Gravimetric Analysis, Total Carbon, Total Sulphur and Total Organic Carbon). These analyses result in a large number of lithological, hydrological and geochemical parameters, e.g. clay content, sand median, vertical and horizontal permeability and CaCO3-content. We present the results from the analysis of lithological properties for the Northern Netherlands. Besides geology, these properties can be applied directly in studies concerning (amongst others) groundwater, natural aggregates and dike safety. We demonstrate the use of sedimentary environments based on lithofacies as a useful tool for comparison between lithostratigraphic units and lithofacies. These lithofacies match distinct parts of the marine, fluvial, glacial, eolian or organogenic environment, e.g. tidal channel sand, floodbasin clay and subglacial till. This results in lithological properties illustrating the heterogeneity within a geological unit and between equal depositional environments in different lithostratigraphic units. The acquired data have so far been used in several applied studies, e.g. improving the parameterisation of 3D models leading to increased accuracy in groundwater models, and dike safety studies concerning dike failure due to undermining.
Recently, grain size distributions measured with different methods were recalibrated into a homogeneous dataset using this reference set, which greatly enlarged the dataset to be incorporated in the parameterisation of a 3D voxel model.
Jayaraman, Jayakumar; Wong, Hai Ming; King, Nigel M; Roberts, Graham J
2016-10-01
Many countries have recently experienced a rapid increase in the demand for forensic age estimates of unaccompanied minors. Hong Kong is a major tourist and business center where there has been an increase in the number of people intercepted with false travel documents. An accurate estimation of age is only possible when the dataset used for age estimation has been derived from the corresponding ethnic population. Thus, the aim of this study was to develop and validate a Reference Data Set (RDS) for dental age estimation for southern Chinese. A total of 2306 subjects were selected from the patient archives of a large dental hospital and the chronological age for each subject was recorded. This age was assigned to each specific stage of dental development for each tooth to create an RDS. To validate this RDS, a further 484 subjects were randomly chosen from the patient archives and their dental age was assessed based on the scores from the RDS. Dental age was estimated using a meta-analysis command corresponding to a random-effects statistical model. Chronological age (CA) and Dental Age (DA) were compared using the paired t-test. The overall difference between the chronological and dental age (CA-DA) was 0.05 years (2.6 weeks) for males and 0.03 years (1.6 weeks) for females. The paired t-test indicated that there was no statistically significant difference between the chronological and dental age (p > 0.05). The validated southern Chinese reference dataset based on dental maturation accurately estimated the chronological age. Copyright © 2016 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.
Kilborn, Joshua P; Jones, David L; Peebles, Ernst B; Naar, David F
2017-04-01
Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
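A sketch of the UPGMA step only (DISPROF itself is not reproduced here), clustering simulated count data from a Bray-Curtis dissimilarity matrix with SciPy:

# UPGMA (average-linkage) clustering on a Bray-Curtis dissimilarity matrix.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = np.vstack([rng.poisson(5, (20, 8)), rng.poisson(15, (20, 8))]).astype(float)   # two simulated groups
d = pdist(X, metric="braycurtis")          # dissimilarity commonly used for ecological count data
tree = linkage(d, method="average")        # UPGMA
groups = fcluster(tree, t=2, criterion="maxclust")
print(groups)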
genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools.
Lemieux Perreault, Louis-Philippe; Legault, Marc-André; Asselin, Géraldine; Dubé, Marie-Pierre
2016-12-01
Genotype imputation is now commonly performed following genome-wide genotyping experiments. Imputation increases the density of analyzed genotypes in the dataset, enabling fine-mapping across the genome. However, the process of imputation using the most recent publicly available reference datasets can require considerable computation power and the management of hundreds of large intermediate files. We have developed genipe, a complete genome-wide imputation pipeline which includes automatic reporting, imputed data indexing and management, and a suite of statistical tests for imputed data commonly used in genetic epidemiology (Sequence Kernel Association Test, Cox proportional hazards for survival analysis, and linear mixed models for repeated measurements in longitudinal studies). The genipe package is open-source Python software and is freely available for non-commercial use (CC BY-NC 4.0) at https://github.com/pgxcentre/genipe. Documentation and tutorials are available at http://pgxcentre.github.io/genipe. Contact: louis-philippe.lemieux.perreault@statgen.org or marie-pierre.dube@statgen.org. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Oechsner, Markus; Chizzali, Barbara; Devecka, Michal; Combs, Stephanie Elisabeth; Wilkens, Jan Jakob; Duma, Marciana Nona
2016-10-26
The aim of this study was to analyze differences in couch shifts (setup errors) resulting from image registration of different CT datasets with free breathing cone beam CTs (FB-CBCT). Both automatic and manual image registrations were performed, and registration results were correlated with tumor characteristics. FB-CBCT image registration was performed for 49 patients with lung lesions using slow planning CT (PCT), average intensity projection (AIP), maximum intensity projection (MIP) and mid-ventilation CTs (MidV) as reference images. Both automatic and manual image registrations were applied. Shift differences were evaluated between the registered CT datasets for automatic and manual registration, respectively. Furthermore, differences between automatic and manual registration were analyzed for the same CT datasets. The registration results were statistically analyzed and correlated with tumor characteristics (3D tumor motion, tumor volume, superior-inferior (SI) distance, tumor environment). Median 3D shift differences over all patients were between 0.5 mm (AIPvsMIP) and 1.9 mm (MIPvsPCT and MidVvsPCT) for the automatic registration and between 1.8 mm (AIPvsPCT) and 2.8 mm (MIPvsPCT and MidVvsPCT) for the manual registration. For some patients, large shift differences (>5.0 mm) were found (maximum 10.5 mm, automatic registration). Comparing automatic vs manual registrations for the same reference CTs, ∆AIP achieved the smallest (1.1 mm) and ∆MIP the largest (1.9 mm) median 3D shift differences. The standard deviation (variability) for the 3D shift differences was also the smallest for ∆AIP (1.1 mm). Significant correlations (p < 0.01) between 3D shift difference and 3D tumor motion (AIPvsMIP, MIPvsMidV) and SI distance (AIPvsMIP) (automatic), and also for 3D tumor motion (∆PCT, ∆MidV; automatic vs manual), were found. Using different CT datasets for image registration with FB-CBCTs can result in different 3D couch shifts. Manual registrations achieved partly different 3D shifts from automatic registrations. AIP CTs yielded the smallest shift differences and might be the most appropriate CT dataset for registration with 3D FB-CBCTs.
A Sorting Statistic with Application in Neurological Magnetic Resonance Imaging of Autism.
Levman, Jacob; Takahashi, Emi; Forgeron, Cynthia; MacDonald, Patrick; Stewart, Natalie; Lim, Ashley; Martel, Anne
2018-01-01
Effect size refers to the assessment of the extent of differences between two groups of samples on a single measurement. Assessing effect size in medical research is typically accomplished with Cohen's d statistic. Cohen's d statistic assumes that average values are good estimators of the position of a distribution of numbers and also assumes Gaussian (or bell-shaped) underlying data distributions. In this paper, we present an alternative evaluative statistic that can quantify differences between two data distributions in a manner that is similar to traditional effect size calculations; however, the proposed approach avoids making assumptions regarding the shape of the underlying data distribution. The proposed sorting statistic is compared with Cohen's d statistic and is demonstrated to be capable of identifying feature measurements of potential interest for which Cohen's d statistic implies the measurement would be of little use. This proposed sorting statistic has been evaluated on a large clinical autism dataset from Boston Children's Hospital, Harvard Medical School, demonstrating that it can potentially play a constructive role in future healthcare technologies.
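For comparison, a sketch of the conventional Cohen's d that the paper contrasts against (the proposed sorting statistic itself is not reproduced here):

# Cohen's d: standardized mean difference using the pooled standard deviation.
import numpy as np

def cohens_d(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_sd = np.sqrt(((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1))
                        / (a.size + b.size - 2))
    return (a.mean() - b.mean()) / pooled_sd

group1 = np.random.default_rng(5).normal(0.0, 1.0, 60)    # toy feature measurements
group2 = np.random.default_rng(6).normal(0.5, 1.0, 60)
print(round(cohens_d(group2, group1), 2))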
Nilsson, R. Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M.; Bengtsson-Palme, Johan; Walker, Donald M.; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C.; Abarenkov, Kessy
2015-01-01
The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric—artificially joined—DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation. PMID:25786896
Reference-tissue correction of T2-weighted signal intensity for prostate cancer detection
NASA Astrophysics Data System (ADS)
Peng, Yahui; Jiang, Yulei; Oto, Aytekin
2014-03-01
The purpose of this study was to investigate whether correction with respect to reference tissue of T2-weighted MR image signal intensity (SI) improves its effectiveness for classification of regions of interest (ROIs) as prostate cancer (PCa) or normal prostatic tissue. Two image datasets collected retrospectively were used in this study: 71 cases acquired with GE scanners (dataset A), and 59 cases acquired with Philips scanners (dataset B). Through a consensus histology-MR correlation review, 175 PCa and 108 normal-tissue ROIs were identified and drawn manually. Reference-tissue ROIs were selected in each case from the levator ani muscle, urinary bladder, and pubic bone. T2-weighted image SI was corrected as the ratio of the average T2-weighted image SI within an ROI to that of a reference-tissue ROI. Area under the receiver operating characteristic curve (AUC) was used to evaluate the effectiveness of T2-weighted image SIs for differentiation of PCa from normal-tissue ROIs. AUC (+/- standard error) for uncorrected T2-weighted image SIs was 0.78+/-0.04 (dataset A) and 0.65+/-0.05 (dataset B). AUC for corrected T2-weighted image SIs with respect to muscle, bladder, and bone reference was 0.77+/-0.04 (p=1.0), 0.77+/-0.04 (p=1.0), and 0.75+/-0.04 (p=0.8), respectively, for dataset A; and 0.81+/-0.04 (p=0.002), 0.78+/-0.04 (p<0.001), and 0.79+/-0.04 (p<0.001), respectively, for dataset B. Correction in reference to the levator ani muscle yielded the most consistent results between GE and Philips images. Correction of T2-weighted image SI in reference to three types of extra-prostatic tissue can improve its effectiveness for differentiation of PCa from normal-tissue ROIs, and correction in reference to the levator ani muscle produces consistent T2-weighted image SIs between GE and Philips MR images.
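A schematic sketch of the reference-tissue correction and its AUC evaluation, assuming mean T2-weighted SI values per ROI have already been extracted (the arrays below are toy values, not the study data):

# Reference-tissue SI ratio and ROC AUC for cancer vs normal-tissue ROIs (toy data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
si_cancer = rng.normal(300, 60, 100)       # mean T2w SI in cancer ROIs
si_normal = rng.normal(450, 60, 80)        # mean T2w SI in normal-tissue ROIs
si_muscle = rng.normal(150, 20, 180)       # matched reference-tissue (muscle) ROI per ROI

si = np.concatenate([si_cancer, si_normal])
labels = np.r_[np.ones(100), np.zeros(80)]
corrected = si / si_muscle                  # SI ratio to the reference tissue

print("AUC uncorrected:", round(roc_auc_score(labels, -si), 2))        # lower SI -> cancer
print("AUC corrected:  ", round(roc_auc_score(labels, -corrected), 2))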
The Genomic HyperBrowser: an analysis web server for genome-scale data
Sandve, Geir K.; Gundersen, Sveinung; Johansen, Morten; Glad, Ingrid K.; Gunathasan, Krishanthi; Holden, Lars; Holden, Marit; Liestøl, Knut; Nygård, Ståle; Nygaard, Vegard; Paulsen, Jonas; Rydbeck, Halfdan; Trengereid, Kai; Clancy, Trevor; Drabløs, Finn; Ferkingstad, Egil; Kalaš, Matúš; Lien, Tonje; Rye, Morten B.; Frigessi, Arnoldo; Hovig, Eivind
2013-01-01
The immense increase in availability of genomic scale datasets, such as those provided by the ENCODE and Roadmap Epigenomics projects, presents unprecedented opportunities for individual researchers to pose novel falsifiable biological questions. With this opportunity, however, researchers are faced with the challenge of how to best analyze and interpret their genome-scale datasets. A powerful way of representing genome-scale data is as feature-specific coordinates relative to reference genome assemblies, i.e. as genomic tracks. The Genomic HyperBrowser (http://hyperbrowser.uio.no) is an open-ended web server for the analysis of genomic track data. Through the provision of several highly customizable components for processing and statistical analysis of genomic tracks, the HyperBrowser opens for a range of genomic investigations, related to, e.g., gene regulation, disease association or epigenetic modifications of the genome. PMID:23632163
X-MATE: a flexible system for mapping short read data
Pearson, John V.; Cloonan, Nicole; Grimmond, Sean M.
2011-01-01
Summary: Accurate and complete mapping of short-read sequencing data to a reference genome greatly enhances the discovery of biological results and improves statistical predictions. We recently presented RNA-MATE, a pipeline for the recursive mapping of RNA-Seq datasets. With the rapid increase in genome re-sequencing projects, progression of available mapping software and the evolution of file formats, we now present X-MATE, an updated version of RNA-MATE, capable of mapping both RNA-Seq and DNA datasets and with improved performance, output file formats, configuration files, and flexibility in core mapping software. Availability: Executables, source code, junction libraries, test data and results and the user manual are available from http://grimmond.imb.uq.edu.au/X-MATE/. Contact: n.cloonan@uq.edu.au; s.grimmond@uq.edu.au. Supplementary information: Supplementary data are available at Bioinformatics Online. PMID:21216778
Schoof, Rosalind A; Johnson, Dina L; Handziuk, Emma R; Landingham, Cynthia Van; Feldpausch, Alma M; Gallagher, Alexa E; Dell, Linda D; Kephart, Amy
2016-10-01
Lead exposure and blood lead levels (BLLs) in the United States have declined dramatically since the 1970s as many widespread lead uses have been discontinued. Large-scale mining and mineral processing represents an additional localized source of potential lead exposure in many historical mining communities, such as Butte, Montana. After 25 years of ongoing remediation efforts and a residential metals abatement program that includes blood lead monitoring of Butte children, examination of blood lead trends offers a unique opportunity to assess the effectiveness of Butte's lead source and exposure reduction measures. This study examined BLL trends in Butte children ages 1-5 (n = 2796) from 2003-2010 as compared to a reference dataset matched for similar demographic characteristics over the same period. Blood lead differences across Butte during the same period are also examined. Findings are interpreted with respect to effectiveness of remediation and other factors potentially contributing to ongoing exposure concerns. BLLs from Butte were compared with a reference dataset (n = 2937) derived from the National Health and Nutrition Examination Survey. The reference dataset was initially matched for child age and sample dates. Additional demographic factors associated with higher BLLs were then evaluated. Weights were applied to make the reference dataset more consistent with the Butte dataset for the three factors that were most disparate (poverty-to-income ratio, house age, and race/ethnicity). A weighted linear mixed regression model showed Butte geometric mean BLLs were higher than reference BLLs for 2003-2004 (3.48 vs. 2.05 µg/dL), 2005-2006 (2.65 vs. 1.80 µg/dL), and 2007-2008 (2.2 vs. 1.72 µg/dL), but comparable for 2009-2010 (1.53 vs. 1.51 µg/dL). This trend suggests that, over time, the impact of other factors that may be associated with Butte BLLs has been reduced. Neighborhood differences were examined by dividing the Butte dataset into the older area called "Uptown", located at higher elevation atop historical mine workings, and "the Flats", at lower elevation and more recently developed. Significant declines in BLLs were observed over time in both areas, though Uptown had slightly higher BLLs than the Flats (2003-2004: 3.57 vs. 3.45 µg/dL, p=0.7; 2005-2006: 2.84 vs. 2.52 µg/dL, p=0.1; 2007-2008: 2.58 vs. 1.99 µg/dL, p=0.001; 2009-2010: 1.71 vs. 1.44 µg/dL, p=0.02). BLLs were higher when tested in summer/fall than in winter/spring for both neighborhoods, and statistically higher BLLs were found for children in Uptown living in properties built before 1940. Neighborhood differences and the persistence of a greater percentage of high BLLs (>5 µg/dL) in Butte vs. the reference dataset support continuation of the home lead abatement program. Butte BLL declines likely reflect the cumulative effectiveness of screening efforts, community-wide remediation, and the ongoing metals abatement program in Butte in addition to other factors not accounted for by this study. As evidenced in Butte, abatement programs that include home evaluations and assistance in addressing multiple sources of lead exposure can be an important complement to community-wide soil remediation activities. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Jiang, Yueyang; Kim, John B.; Still, Christopher J.; Kerns, Becky K.; Kline, Jeffrey D.; Cunningham, Patrick G.
2018-01-01
Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets within the Pacific Northwest climate change impact studies. PMID:29461513
Technical note: Space-time analysis of rainfall extremes in Italy: clues from a reconciled dataset
NASA Astrophysics Data System (ADS)
Libertino, Andrea; Ganora, Daniele; Claps, Pierluigi
2018-05-01
Like other Mediterranean areas, Italy is prone to the development of events with significant rainfall intensity, lasting for several hours. The main triggering mechanisms of these events are quite well known, but the aim of developing rainstorm hazard maps compatible with their actual probability of occurrence is still far from being reached. A systematic frequency analysis of these occasional highly intense events would require a complete countrywide dataset of sub-daily rainfall records, but this kind of information was still lacking for the Italian territory. In this work several sources of data are gathered, for assembling the first comprehensive and updated dataset of extreme rainfall of short duration in Italy. The resulting dataset, referred to as the Italian Rainfall Extreme Dataset (I-RED), includes the annual maximum rainfalls recorded in 1 to 24 consecutive hours from more than 4500 stations across the country, spanning the period between 1916 and 2014. A detailed description of the spatial and temporal coverage of the I-RED is presented, together with an exploratory statistical analysis aimed at providing preliminary information on the climatology of extreme rainfall at the national scale. Due to some legal restrictions, the database can be provided only under certain conditions. Taking into account the potentialities emerging from the analysis, a description of the ongoing and planned future work activities on the database is provided.
REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations
NASA Astrophysics Data System (ADS)
Moulik, P.; Lekic, V.; Romanowicz, B. A.
2017-12-01
A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic, with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history. By assessing inter-dataset consistency across similar paths, we quantify travel-time measurement errors for both surface and body waves. Finally, we discuss challenges associated with combining high-frequency (~1 Hz) and long-period (10-20 s) body-wave measurements into the REM-3D reference dataset.
Data Sharing and the Development of the Cleveland Clinic Statistical Education Dataset Repository
ERIC Educational Resources Information Center
Nowacki, Amy S.
2013-01-01
Examples are highly sought by both students and teachers. This is particularly true as many statistical instructors aim to engage their students and increase active participation. While simulated datasets are functional, they lack real perspective and the intricacies of actual data. In order to obtain real datasets, the principal investigator of a…
SHARE: system design and case studies for statistical health information release
Gardner, James; Xiong, Li; Xiao, Yonghui; Gao, Jingjing; Post, Andrew R; Jiang, Xiaoqian; Ohno-Machado, Lucila
2013-01-01
Objectives We present SHARE, a new system for statistical health information release with differential privacy. We present two case studies that evaluate the software on real medical datasets and demonstrate the feasibility and utility of applying the differential privacy framework on biomedical data. Materials and Methods SHARE releases statistical information in electronic health records with differential privacy, a strong privacy framework for statistical data release. It includes a number of state-of-the-art methods for releasing multidimensional histograms and longitudinal patterns. We performed a variety of experiments on two real datasets, the surveillance, epidemiology and end results (SEER) breast cancer dataset and the Emory electronic medical record (EeMR) dataset, to demonstrate the feasibility and utility of SHARE. Results Experimental results indicate that SHARE can deal with heterogeneous data present in medical data, and that the released statistics are useful. The Kullback–Leibler divergence between the released multidimensional histograms and the original data distribution is below 0.5 and 0.01 for seven-dimensional and three-dimensional data cubes generated from the SEER dataset, respectively. The relative error for longitudinal pattern queries on the EeMR dataset varies between 0 and 0.3. While the results are promising, they also suggest that challenges remain in applying statistical data release using the differential privacy framework for higher dimensional data. Conclusions SHARE is one of the first systems to provide a mechanism for custodians to release differentially private aggregate statistics for a variety of use cases in the medical domain. This proof-of-concept system is intended to be applied to large-scale medical data warehouses. PMID:23059729
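As background, a minimal sketch of the core differential-privacy primitive underlying such releases, a Laplace mechanism for a histogram of counts (this is illustrative only, not the SHARE code):

# Laplace mechanism for a disjoint histogram: each record affects one bin by 1,
# so the L1 sensitivity is 1 and each bin receives Laplace(1/epsilon) noise.
import numpy as np

def laplace_histogram(counts, epsilon):
    counts = np.asarray(counts, float)
    noisy = counts + np.random.default_rng(8).laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0, None)          # optional post-processing keeps counts non-negative

print(laplace_histogram([120, 45, 7, 0], epsilon=0.5).round(1))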
Dose coverage calculation using a statistical shape model—applied to cervical cancer radiotherapy
NASA Astrophysics Data System (ADS)
Tilly, David; van de Schoot, Agustinus J. A. J.; Grusell, Erik; Bel, Arjan; Ahnesjö, Anders
2017-05-01
A comprehensive methodology for treatment simulation and evaluation of dose coverage probabilities is presented, in which a population-based statistical shape model (SSM) provides samples of fraction-specific patient geometry deformations. The learning data consist of vector fields from deformable image registration of repeated imaging, giving intra-patient deformations that are mapped to an average patient serving as a common frame of reference. The SSM is created by extracting the most dominating eigenmodes through principal component analysis of the deformations from all patients. Sampling a deformation is thus reduced to sampling weights for enough of the most dominating eigenmodes to describe the deformations. For the cervical cancer patient datasets in this work, we found seven eigenmodes to be sufficient to capture 90% of the variance in the deformations, and only three eigenmodes were needed for stability in the simulated dose coverage probabilities. The normality assumption of the eigenmode weights was tested and found relevant for the 20 most dominating eigenmodes except for the first. Individualization of the SSM is demonstrated to be improved using two deformation samples from a new patient. The probabilistic evaluation provided additional information about the trade-offs compared to conventional single-dataset treatment planning.
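The eigenmode construction and sampling step described above can be illustrated with a minimal sketch: PCA over flattened deformation fields followed by drawing normally distributed mode weights. The array sizes and variable names are hypothetical, and the registration, mapping to an average patient, and dose-accumulation steps of the paper are not reproduced.

```python
import numpy as np

def build_ssm(deformations, n_modes=7):
    """PCA-based statistical shape model over flattened deformation fields.

    deformations: array of shape (n_samples, n_voxels*3), one row per
    registered intra-patient deformation mapped to a common reference.
    """
    mean = deformations.mean(axis=0)
    centered = deformations - mean
    # SVD gives the dominant eigenmodes without forming the huge covariance matrix
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    std = s[:n_modes] / np.sqrt(len(deformations) - 1)   # mode standard deviations
    return mean, vt[:n_modes], std

def sample_deformation(mean, modes, std, rng=None):
    """Draw one synthetic deformation by sampling normally distributed mode weights."""
    rng = np.random.default_rng(rng)
    weights = rng.normal(0.0, std)
    return mean + weights @ modes

# Toy data: 25 deformations over a 10x10x10 grid (3 components per voxel)
rng = np.random.default_rng(1)
defs = rng.normal(size=(25, 10 * 10 * 10 * 3))
mean, modes, std = build_ssm(defs, n_modes=7)
new_def = sample_deformation(mean, modes, std, rng=rng)
```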
Sparse intervertebral fence composition for 3D cervical vertebra segmentation
NASA Astrophysics Data System (ADS)
Liu, Xinxin; Yang, Jian; Song, Shuang; Cong, Weijian; Jiao, Peifeng; Song, Hong; Ai, Danni; Jiang, Yurong; Wang, Yongtian
2018-06-01
Statistical shape models are capable of extracting shape prior information and are usually utilized to assist the segmentation of medical images. However, such models require large training datasets in the case of multi-object structures, and it is also difficult to achieve satisfactory results for complex shapes. This study proposes a novel statistical model for cervical vertebra segmentation, called sparse intervertebral fence composition (SiFC), which can reconstruct the boundary between adjacent vertebrae by modeling intervertebral fences. The complex shape of the cervical spine is replaced by a simple intervertebral fence, which considerably reduces the difficulty of cervical segmentation. The final segmentation results are obtained by using a 3D active contour deformation model without shape constraint, which substantially enhances the recognition capability of the proposed method for objects with complex shapes. The proposed segmentation framework is tested on a dataset with CT images from 20 patients. A quantitative comparison against corresponding reference vertebral segmentations yields an overall mean absolute surface distance of 0.70 mm and a Dice similarity index of 95.47% for cervical vertebral segmentation. The experimental results show that the SiFC method achieves competitive cervical vertebral segmentation performance and completely eliminates inter-process overlap.
Injury profiles related to mortality in patients with a low Injury Severity Score: a case-mix issue?
Joosse, Pieter; Schep, Niels W L; Goslings, J Carel
2012-07-01
Outcome prediction models are widely used to evaluate trauma care. External benchmarking provides individual institutions with a tool to compare survival with a reference dataset. However, these models do have limitations. In this study, the hypothesis was tested whether specific injuries are associated with increased mortality and whether differences in the case-mix of these injuries influence outcome comparison. A retrospective study was conducted in a Dutch trauma region. Injury profiles, based on the injuries most frequently sustained by patients who died unexpectedly, were determined. The association between these injury profiles and mortality was studied by logistic regression in patients with a low Injury Severity Score. The standardized survival of our population (Ws statistic) was compared with North American and British reference databases, with and without patients suffering from the previously defined injury profiles. In total, 14,811 patients were included. Hip fractures, minor pelvic fractures, femur fractures, and minor thoracic injuries were significantly associated with mortality, corrected for age, sex, and physiologic derangement, in patients with a low injury severity. Odds ratios ranged from 2.42 to 2.92. The Ws statistic for comparison with North American databases significantly improved after exclusion of patients with these injuries. The Ws statistic for comparison with a British reference database remained unchanged. Hip fractures, minor pelvic fractures, femur fractures, and minor thoracic wall injuries are associated with increased mortality. Comparative outcome analysis of a population against a reference database that differs in case-mix with respect to these injuries should be interpreted cautiously. Prognostic study, level II.
ESSG-based global spatial reference frame for datasets interrelation
NASA Astrophysics Data System (ADS)
Yu, J. Q.; Wu, L. X.; Jia, Y. J.
2013-10-01
To know the highly complex Earth system well, a large volume, as well as a large variety, of datasets on the planet Earth are being obtained, distributed, and shared worldwide every day. However, few existing systems concentrate on the distribution and interrelation of different datasets in a common Global Spatial Reference Frame (GSRF), which constitutes an invisible obstacle to data sharing and scientific collaboration. The Group on Earth Observations (GEO) has recently established a new GSRF, named the Earth System Spatial Grid (ESSG), for global dataset distribution, sharing and interrelation in its 2012-2015 Working Plan. The ESSG may bridge the gap among different spatial datasets and hence overcome this obstacle. This paper presents the implementation of the ESSG-based GSRF. A reference spheroid, a grid subdivision scheme, and a suitable encoding system are required to implement it. The radius of the ESSG reference spheroid was set to double the approximate Earth radius so that datasets from the different areas of Earth system science are covered. The same positioning and orientation parameters as Earth Centred Earth Fixed (ECEF) were adopted for the ESSG reference spheroid so that any other GSRF can be freely transformed into the ESSG-based GSRF. The spheroid degenerated octree grid with radius refinement (SDOG-R) and its encoding method were taken as the grid subdivision and encoding scheme for their good performance in many aspects. A triple (C, T, A) model is introduced to represent and link different datasets based on the ESSG-based GSRF. Finally, methods of coordinate transformation between the ESSG-based GSRF and other GSRFs are presented to make the ESSG-based GSRF operable and propagable.
Downscaling global precipitation for local applications - a case for the Rhine basin
NASA Astrophysics Data System (ADS)
Sperna Weiland, Frederiek; van Verseveld, Willem; Schellekens, Jaap
2017-04-01
Within the EU FP7 project eartH2Observe a global Water Resources Re-analysis (WRR) is being developed. This re-analysis consists of meteorological and hydrological water balance variables with global coverage, spanning the period 1979-2014 at 0.25 degrees resolution (Schellekens et al., 2016). The dataset can be of special interest in regions with limited in-situ data availability, yet for local scale analysis, particularly in mountainous regions, a resolution of 0.25 degrees may be too coarse and downscaling the data to a higher resolution may be required. A downscaling toolbox has been made that includes spatial downscaling of precipitation based on the global WorldClim dataset, which is available at 1 km resolution as a monthly climatology (Hijmans et al., 2005). The inputs of the downscaling tool are either the global eartH2Observe WRR1 and WRR2 datasets based on the WFDEI correction methodology (Weedon et al., 2014) or the global Multi-Source Weighted-Ensemble Precipitation (MSWEP) dataset (Beck et al., 2016). Here we present a validation of the datasets over the Rhine catchment by means of a distributed hydrological model (wflow, Schellekens et al., 2014) using a number of precipitation scenarios. (1) We start by running the model using the local reference dataset derived by spatial interpolation of gauge observations. Furthermore we use (2) the MSWEP dataset at the native 0.25-degree resolution, followed by (3) MSWEP downscaled with the WorldClim dataset and finally (4) MSWEP downscaled with the local reference dataset. The validation will be based on comparison of the modeled river discharges as well as rainfall statistics. We expect that downscaling the MSWEP dataset with the WorldClim data to higher resolution will increase its performance. To test the performance of the downscaling routine we have added a run with MSWEP data downscaled with the local dataset and compare this with the run based on the local dataset itself. - Beck, H. E. et al., 2016. MSWEP: 3-hourly 0.25° global gridded precipitation (1979-2015) by merging gauge, satellite, and reanalysis data, Hydrol. Earth Syst. Sci. Discuss., doi:10.5194/hess-2016-236, accepted for final publication. - Hijmans, R.J. et al., 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978. - Schellekens, J. et al., 2016. A global water resources ensemble of hydrological models: the eartH2Observe Tier-1 dataset, Earth Syst. Sci. Data Discuss., doi:10.5194/essd-2016-55, under review. - Schellekens, J. et al., 2014. Rapid setup of hydrological and hydraulic models using OpenStreetMap and the SRTM derived digital elevation model. Environmental Modelling & Software. - Weedon, G.P. et al., 2014. The WFDEI meteorological forcing data set: WATCH Forcing Data methodology applied to ERA-Interim reanalysis data. Water Resources Research, 50, doi:10.1002/2014WR015638.
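One common way to perform the climatology-based spatial downscaling referred to above is to redistribute each coarse cell's precipitation according to the ratio between the high-resolution climatology and its coarse-cell mean. The sketch below illustrates that idea on synthetic arrays; it is not necessarily the exact algorithm of the eartH2Observe downscaling toolbox, and the grids and refinement factor are hypothetical.

```python
import numpy as np

def downscale_precip(coarse, clim_hr, factor):
    """Downscale a coarse precipitation field with a high-resolution climatology.

    coarse:  (ny, nx) precipitation at 0.25-degree-like resolution
    clim_hr: (ny*factor, nx*factor) monthly climatology at the target resolution
    factor:  integer refinement factor between the two grids
    """
    ny, nx = coarse.shape
    # Aggregate the climatology back to the coarse grid (block mean)
    clim_coarse = clim_hr.reshape(ny, factor, nx, factor).mean(axis=(1, 3))
    # Ratio of high-resolution climatology to its coarse block mean
    ratio = clim_hr / np.kron(np.maximum(clim_coarse, 1e-6), np.ones((factor, factor)))
    # Replicate the coarse field and rescale it with the climatological ratio
    return np.kron(coarse, np.ones((factor, factor))) * ratio

coarse = np.random.default_rng(2).gamma(2.0, 2.0, size=(4, 4))      # mm/day
clim_hr = np.random.default_rng(3).gamma(2.0, 2.0, size=(16, 16))   # climatology
fine = downscale_precip(coarse, clim_hr, factor=4)
print(fine.shape)  # (16, 16); block means of the output reproduce the coarse values
```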
Evaluation of precipitation extremes over the Asian domain: observation and modelling studies
NASA Astrophysics Data System (ADS)
Kim, In-Won; Oh, Jaiho; Woo, Sumin; Kripalani, R. H.
2018-04-01
In this study, a comparison of the precipitation extremes exhibited by seven reference datasets is made to ascertain whether the inferences based on these datasets agree or differ. These seven datasets, roughly grouped into three categories, i.e. rain-gauge based (APHRODITE, CPC-UNI), satellite-based (TRMM, GPCP1DD) and reanalysis based (ERA-Interim, MERRA, and JRA55), having a common data period 1998-2007, are considered. The focus is to examine precipitation extremes in the summer monsoon rainfall over South Asia, East Asia and Southeast Asia. Measures of extreme precipitation include percentile thresholds, the frequency of extreme precipitation events and other quantities. Results reveal that the differences in displaying extremes among the datasets are small over South Asia and East Asia, but large differences among the datasets are displayed over the Southeast Asian region including the maritime continent. Furthermore, precipitation data appear to be more consistent over East Asia among the seven datasets. Decadal trends in extreme precipitation are consistent with known results over South and East Asia. No trends in extreme precipitation events are exhibited over Southeast Asia. Outputs of the Coupled Model Intercomparison Project Phase 5 (CMIP5) simulations are categorized as high, medium and low-resolution models. The regions displaying the maximum intensity of extreme precipitation appear to depend on model resolution. High-resolution models simulate maximum intensity of extreme precipitation over the Indian sub-continent, medium-resolution models over northeast India and South China, and the low-resolution models over Bangladesh, Myanmar and Thailand. In summary, there are differences in the extreme precipitation statistics displayed among the seven datasets considered here and among the 29 CMIP5 model data outputs.
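A typical way to compute the percentile-threshold measures mentioned above is to derive a wet-day percentile per station or grid cell and count exceedances, applying the identical definition to every dataset being compared. The sketch below shows such a calculation on synthetic daily series; the wet-day threshold, percentile choice, and datasets are hypothetical and not taken from the study.

```python
import numpy as np

def extreme_precip_stats(daily_pr, wet_day=1.0, q=95):
    """Percentile threshold and exceedance count for one grid cell or station.

    daily_pr: 1-D array of daily precipitation (mm/day) over the study period
    wet_day:  minimum amount defining a wet day
    q:        percentile computed over wet days only
    """
    wet = daily_pr[daily_pr >= wet_day]
    if wet.size == 0:
        return np.nan, 0
    threshold = np.percentile(wet, q)
    n_extreme = int((daily_pr > threshold).sum())
    return threshold, n_extreme

# Compare two hypothetical datasets over the same 10-year daily record
rng = np.random.default_rng(4)
for name, scale in [("dataset_A", 6.0), ("dataset_B", 8.0)]:
    pr = rng.gamma(0.5, scale, size=3650)
    thr, n = extreme_precip_stats(pr)
    print(name, round(thr, 1), "mm/day threshold,", n, "extreme days")
```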
Estimating flow-duration and low-flow frequency statistics for unregulated streams in Oregon.
DOT National Transportation Integrated Search
2008-08-01
Flow statistical datasets, basin-characteristic datasets, and regression equations were developed to provide decision makers with surface-water information needed for activities such as water-quality regulation, water-rights adjudication, biological ...
Plant selection for ethnobotanical uses on the Amalfi Coast (Southern Italy).
Savo, V; Joy, R; Caneva, G; McClatchey, W C
2015-07-15
Many ethnobotanical studies have investigated selection criteria for medicinal and non-medicinal plants. In this paper we test several statistical methods using different ethnobotanical datasets in order to 1) define to which extent the nature of the datasets can affect the interpretation of results; 2) determine if the selection for different plant uses is based on phylogeny, or other selection criteria. We considered three different ethnobotanical datasets: two datasets of medicinal plants and a dataset of non-medicinal plants (handicraft production, domestic and agro-pastoral practices) and two floras of the Amalfi Coast. We performed residual analysis from linear regression, the binomial test and the Bayesian approach for calculating under-used and over-used plant families within ethnobotanical datasets. Percentages of agreement were calculated to compare the results of the analyses. We also analyzed the relationship between plant selection and phylogeny, chorology, life form and habitat using the chi-square test. Pearson's residuals for each of the significant chi-square analyses were examined for investigating alternative hypotheses of plant selection criteria. The three statistical analysis methods differed within the same dataset, and between different datasets and floras, but with some similarities. In the two medicinal datasets, only Lamiaceae was identified in both floras as an over-used family by all three statistical methods. All statistical methods in one flora agreed that Malvaceae was over-used and Poaceae under-used, but this was not found to be consistent with results of the second flora in which one statistical result was non-significant. All other families had some discrepancy in significance across methods, or floras. Significant over- or under-use was observed in only a minority of cases. The chi-square analyses were significant for phylogeny, life form and habitat. Pearson's residuals indicated a non-random selection of woody species for non-medicinal uses and an under-use of plants of temperate forests for medicinal uses. Our study showed that selection criteria for plant uses (including medicinal) are not always based on phylogeny. The comparison of different statistical methods (regression, binomial and Bayesian) under different conditions led to the conclusion that the most conservative results are obtained using regression analysis.
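For the binomial-test component of the comparison above, a minimal sketch is given below: the number of used species in a family is tested against the expectation from that family's share of the flora. The counts are hypothetical, and the regression-residual and Bayesian analyses of the paper are not reproduced.

```python
from scipy.stats import binomtest

def family_use_test(used_in_family, used_total, flora_in_family, flora_total):
    """Two-sided binomial test for over-/under-representation of one family.

    The null hypothesis is that species used from this family occur in
    proportion to the family's share of the regional flora.
    """
    expected_p = flora_in_family / flora_total
    result = binomtest(used_in_family, n=used_total, p=expected_p)
    direction = "over-used" if used_in_family / used_total > expected_p else "under-used"
    return result.pvalue, direction

# Hypothetical counts: 18 Lamiaceae among 250 medicinal species,
# 60 Lamiaceae among 1800 species in the flora
p, direction = family_use_test(18, 250, 60, 1800)
print(direction, "p =", round(p, 4))
```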
Statistical and Spatial Analysis of Bathymetric Data for the St. Clair River, 1971-2007
Bennion, David
2009-01-01
To address questions concerning ongoing geomorphic processes in the St. Clair River, selected bathymetric datasets spanning 36 years were analyzed. Comparisons of recent high-resolution datasets covering the upper river indicate a highly variable, active environment. Although statistical and spatial comparisons of the datasets show that some changes to the channel size and shape have taken place during the study period, the uncertainty associated with various survey methods and interpolation processes limits the statistical certainty of the results. The methods used to spatially compare the datasets are sensitive to small variations in position and depth that are within the range of uncertainty associated with the datasets. Characteristics of the data, such as the density of measured points and the range of values surveyed, can also influence the results of spatial comparison. With due consideration of these limitations, apparently active and ongoing areas of elevation change in the river are mapped and discussed.
NASA Astrophysics Data System (ADS)
Xiong, Qiufen; Hu, Jianglin
2013-05-01
The minimum/maximum (Min/Max) temperature in the Yangtze River valley is decomposed into a climatic mean and an anomaly component. A spatial interpolation is developed which combines the 3D thin-plate spline scheme for the climatological mean and the 2D Barnes scheme for the anomaly component to create a daily Min/Max temperature dataset. The climatic mean field is obtained by the 3D thin-plate spline scheme because the relationship between the decrease in Min/Max temperature and elevation is robust and reliable on a long time-scale. The characteristics of the anomaly field tend to be only weakly related to elevation variation, and the anomaly component is adequately analyzed by the 2D Barnes procedure, which is computationally efficient and readily tunable. With this hybridized interpolation method, a daily Min/Max temperature dataset that covers the domain from 99°E to 123°E and from 24°N to 36°N with 0.1° longitudinal and latitudinal resolution is obtained by utilizing daily Min/Max temperature data from three kinds of station observations, namely the national reference climatological stations, the basic meteorological observing stations and the ordinary meteorological observing stations in 15 provinces and municipalities in the Yangtze River valley from 1971 to 2005. The error of the gridded dataset is assessed by examining cross-validation statistics. The results show that the daily Min/Max temperature interpolation not only has a high correlation coefficient (0.99) and interpolation efficiency (0.98) but also a mean bias error of 0.00 °C. For the maximum temperature, the root mean square error is 1.1 °C and the mean absolute error is 0.85 °C. For the minimum temperature, the root mean square error is 0.89 °C and the mean absolute error is 0.67 °C. Thus, the new dataset provides the distribution of Min/Max temperature over the Yangtze River valley as realistic, continuous gridded data with 0.1° × 0.1° spatial resolution at a daily temporal scale. The primary factors influencing the dataset precision are elevation and terrain complexity. In general, the gridded dataset has relatively high precision in plains and flatlands and relatively low precision in mountainous areas.
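The decomposition idea described above, interpolating the anomaly separately from the climatological mean and recombining the two fields, can be sketched as follows with a single-pass Barnes-style weighting for the anomaly. The station values, climatology, and length scale are synthetic stand-ins, and the 3D thin-plate spline used in the paper for the climatological field is not reproduced.

```python
import numpy as np

def barnes_anomaly(x_sta, y_sta, anom, x_grid, y_grid, length=1.0):
    """Single-pass Barnes-style analysis of station anomalies onto a grid."""
    gx, gy = np.meshgrid(x_grid, y_grid)
    d2 = (gx[..., None] - x_sta) ** 2 + (gy[..., None] - y_sta) ** 2
    w = np.exp(-d2 / (2.0 * length ** 2))
    return (w * anom).sum(axis=-1) / w.sum(axis=-1)

# Hypothetical stations: daily Tmin decomposed into climatology + anomaly
rng = np.random.default_rng(5)
x_sta, y_sta = rng.uniform(99, 123, 80), rng.uniform(24, 36, 80)
clim_sta = 12.0 - 0.1 * (y_sta - 24)          # stand-in station climatology
t_sta = clim_sta + rng.normal(0, 1.5, 80)     # observed daily values
anom = t_sta - clim_sta                       # anomaly component

x_grid = np.arange(99, 123.1, 0.1)
y_grid = np.arange(24, 36.1, 0.1)
anom_grid = barnes_anomaly(x_sta, y_sta, anom, x_grid, y_grid)
clim_grid = 12.0 - 0.1 * (np.meshgrid(x_grid, y_grid)[1] - 24)  # gridded climatology
tmin_grid = clim_grid + anom_grid             # recombined daily field
```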
NASA Astrophysics Data System (ADS)
Moise Famien, Adjoua; Janicot, Serge; Delfin Ochou, Abe; Vrac, Mathieu; Defrance, Dimitri; Sultan, Benjamin; Noël, Thomas
2018-03-01
The objective of this paper is to present a new dataset of bias-corrected CMIP5 global climate model (GCM) daily data over Africa. This dataset was obtained using the cumulative distribution function transform (CDF-t) method, a method that has been applied to several regions and contexts but never to Africa. Here CDF-t has been applied over the period 1950-2099, combining Historical runs and climate change scenarios, for six variables: precipitation, mean near-surface air temperature, near-surface maximum air temperature, near-surface minimum air temperature, surface downwelling shortwave radiation, and wind speed, which are critical variables for agricultural purposes. WFDEI has been used as the reference dataset to correct the GCMs. Evaluation of the results over West Africa has been carried out on a list of priority user-based metrics that were discussed and selected with stakeholders, including simulated yields from a crop model simulating maize growth. These bias-corrected GCM data have been compared with another available dataset of bias-corrected GCMs that used WATCH Forcing Data as the reference dataset. The impact of the WFD, WFDEI, and EWEMBI reference datasets has also been examined in detail. It is shown that CDF-t is very effective at removing the biases and reducing the high inter-GCM scattering. Differences with other bias-corrected GCM data are mainly due to the differences among the reference datasets. This is particularly true for surface downwelling shortwave radiation, which has a significant impact in terms of simulated maize yields. Projections of future yields over West Africa are quite different, depending on the bias-correction method used. However, all these projections show a similar relative decreasing trend over the 21st century.
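CDF-t is closely related to quantile mapping between the model and reference cumulative distribution functions. As a simplified illustration of that family of corrections (not the full CDF-t formulation, which also transfers the calibration-period relationship to future periods), the sketch below applies empirical quantile mapping with synthetic reference and model samples.

```python
import numpy as np

def quantile_map(model_hist, ref_hist, model_new):
    """Empirical quantile mapping of model values onto a reference distribution.

    Each new model value is replaced by the reference value occupying the
    same quantile in the calibration-period distributions.
    """
    quantiles = np.linspace(0.0, 1.0, 101)
    model_q = np.quantile(model_hist, quantiles)
    ref_q = np.quantile(ref_hist, quantiles)
    # Locate each new value's quantile in the model CDF, then read the reference value
    u = np.interp(model_new, model_q, quantiles)
    return np.interp(u, quantiles, ref_q)

rng = np.random.default_rng(6)
ref_hist = rng.gamma(2.0, 3.0, size=5000)        # reference-like precipitation
model_hist = rng.gamma(2.0, 4.0, size=5000)      # biased GCM over the same period
model_scen = rng.gamma(2.0, 4.5, size=5000)      # scenario values to be corrected
corrected = quantile_map(model_hist, ref_hist, model_scen)
print(round(model_scen.mean(), 2), "->", round(corrected.mean(), 2))
```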
2014-01-01
Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Using simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods. PMID:24708878
Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice
2015-01-01
The aim of this study is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) fuel datasets. The revision is based on the data quality indicators described by the ILCD Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD fuel datasets have a very good quality in general terms; nevertheless, some findings and recommendations to improve the quality of the Life-Cycle Inventories have been derived. Moreover, these results confirm the quality of the fuel-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD fuel datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall DQR of databases.
Yin, Zheng; Zhou, Xiaobo; Bakal, Chris; Li, Fuhai; Sun, Youxian; Perrimon, Norbert; Wong, Stephen TC
2008-01-01
Background The recent emergence of high-throughput automated image acquisition technologies has forever changed how cell biologists collect and analyze data. Historically, the interpretation of cellular phenotypes in different experimental conditions has been dependent upon the expert opinions of well-trained biologists. Such qualitative analysis is particularly effective in detecting subtle, but important, deviations in phenotypes. However, while the rapid and continuing development of automated microscope-based technologies now facilitates the acquisition of trillions of cells in thousands of diverse experimental conditions, such as in the context of RNA interference (RNAi) or small-molecule screens, the massive size of these datasets precludes human analysis. Thus, the development of automated methods which aim to identify novel and biologically relevant phenotypes online is one of the major challenges in high-throughput image-based screening. Ideally, phenotype discovery methods should be designed to utilize prior/existing information and tackle three challenging tasks, i.e. restoring pre-defined biologically meaningful phenotypes, differentiating novel phenotypes from known ones, and distinguishing novel phenotypes from each other. Arbitrarily extracted information causes biased analysis, while combining the complete existing datasets with each new image is intractable in high-throughput screens. Results Here we present the design and implementation of a novel and robust online phenotype discovery method with broad applicability that can be used in diverse experimental contexts, especially high-throughput RNAi screens. This method features phenotype modelling and iterative cluster merging using improved gap statistics. A Gaussian Mixture Model (GMM) is employed to estimate the distribution of each existing phenotype, and then used as a reference distribution in the gap statistics. This method is broadly applicable to a number of different types of image-based datasets derived from a wide spectrum of experimental conditions and is suitable for adaptively processing new images which are continuously added to existing datasets. Validations were carried out on different datasets, including a published RNAi screen using Drosophila embryos [Additional files 1, 2], a dataset for cell cycle phase identification using HeLa cells [Additional files 1, 3, 4] and a synthetic dataset using polygons; our method tackled the three aforementioned tasks effectively with an accuracy range of 85%-90%. When our method is implemented in the context of a Drosophila genome-scale RNAi image-based screen of cultured cells aimed at identifying the contribution of individual genes towards the regulation of cell shape, it efficiently discovers meaningful new phenotypes and provides novel biological insight. We also propose a two-step procedure to modify the novelty detection method based on one-class SVM, so that it can be used for online phenotype discovery. Under different conditions, we compared the SVM-based method with our method using various datasets, and our method consistently outperformed the SVM-based method in at least two of three tasks by 2% to 5%. These results demonstrate that our method can be used to better identify novel phenotypes in image-based datasets from a wide range of conditions and organisms. Conclusion We demonstrate that our method can detect various novel phenotypes effectively in complex datasets.
Experimental results also validate that our method performs consistently under different orders of image input, variations in starting conditions including the number and composition of existing phenotypes, and datasets from different screens. In our findings, the proposed method is suitable for online phenotype discovery in diverse high-throughput image-based genetic and chemical screens. PMID:18534020
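The core of the discovery step described above, a gap statistic whose reference distribution is a Gaussian mixture fitted to known phenotypes, can be sketched as follows on synthetic feature vectors. This is a simplified stand-in for the published pipeline; the feature extraction, iterative cluster merging, and screening context are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def gap_statistic(data, reference_model, k_range=range(1, 6), n_ref=10):
    """Gap statistic with a GMM of known phenotypes as the reference distribution.

    Instead of the usual uniform reference, reference datasets are sampled
    from `reference_model`, so clusters only count as 'new' if they separate
    more strongly than structure already explained by existing phenotypes.
    """
    gaps = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        log_w = np.log(km.inertia_)
        ref_log_w = []
        for _ in range(n_ref):
            ref, _ = reference_model.sample(len(data))
            ref_km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(ref)
            ref_log_w.append(np.log(ref_km.inertia_))
        gaps.append(np.mean(ref_log_w) - log_w)
    return list(k_range), gaps

# Hypothetical feature vectors: known phenotypes plus a shifted novel population
rng = np.random.default_rng(7)
known = rng.normal(0, 1, size=(300, 5))
novel = rng.normal(4, 1, size=(100, 5))
ref_gmm = GaussianMixture(n_components=2, random_state=0).fit(known)
ks, gaps = gap_statistic(np.vstack([known, novel]), ref_gmm)
print(dict(zip(ks, np.round(gaps, 2))))  # the gap favours k > 1 when novelty is present
```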
Grantz, Erin; Haggard, Brian; Scott, J Thad
2018-06-12
We calculated four median datasets for each of three parameters (chlorophyll a, Chl a; total phosphorus, TP; and transparency) using multiple approaches to handling censored observations, including substituting fractions of the quantification limit (QL; dataset 1 = 1QL, dataset 2 = 0.5QL) and statistical methods for censored datasets (datasets 3-4), for approximately 100 Texas, USA reservoirs. Trend analyses of differences between dataset 1 and 3 medians indicated that the percent difference increased linearly above thresholds in the percentage of censored data (%Cen). This relationship was extrapolated to estimate medians for site-parameter combinations with %Cen > 80%, which were combined with dataset 3 as dataset 4. Changepoint analysis of Chl a- and transparency-TP relationships indicated threshold differences of up to 50% between datasets. Recursive analysis identified secondary thresholds in dataset 4. Threshold differences show that information introduced via substitution, or missing due to limitations of the statistical methods, biased values, underestimated error, and inflated the strength of TP thresholds identified in datasets 1-3. Analysis of covariance identified differences in the linear regression models relating transparency to TP between datasets 1, 2, and the more statistically robust datasets 3-4. The study findings identify high-risk scenarios for biased analytical outcomes when using substitution. These include a high probability of median overestimation when %Cen > 50-60% for a single QL, or when %Cen is as low as 16% for multiple QLs. Changepoint analysis was uniquely vulnerable to substitution effects when using medians from sites with %Cen > 50%. Linear regression analysis was less sensitive to substitution and missing-data effects, but differences in model parameters for transparency cannot be discounted and could be magnified by log-transformation of the variables.
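To make the contrast between substitution and censored-data statistics concrete, the sketch below computes a median by substituting 0.5QL and by a censored maximum-likelihood fit (a lognormal model is assumed here; the study's specific statistical methods for datasets 3-4 may differ). The concentration record and quantification limit are synthetic.

```python
import numpy as np
from scipy import stats, optimize

def median_substitution(values, censored, frac=0.5):
    """Median after replacing censored observations with frac * quantification limit."""
    filled = np.where(censored, frac * values, values)   # `values` holds the QL when censored
    return float(np.median(filled))

def median_censored_mle(values, censored):
    """Median from a lognormal fit treating non-detects as left-censored at the QL."""
    def nll(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        # Normal likelihood of log-concentrations; the constant Jacobian term
        # does not affect the optimum, so it is omitted.
        ll_det = stats.norm.logpdf(np.log(values[~censored]), mu, sigma).sum()
        ll_cen = stats.norm.logcdf(np.log(values[censored]), mu, sigma).sum()
        return -(ll_det + ll_cen)
    start = [np.log(values).mean(), 0.0]
    res = optimize.minimize(nll, start, method="Nelder-Mead")
    return float(np.exp(res.x[0]))   # lognormal median = exp(mu)

# Hypothetical TP record: observations below a 10 ug/L quantification limit are censored
rng = np.random.default_rng(8)
true = rng.lognormal(mean=2.3, sigma=0.8, size=200)
censored = true < 10.0
obs = np.where(censored, 10.0, true)          # censored rows carry the QL itself
print(median_substitution(obs, censored), median_censored_mle(obs, censored))
```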
Yigzaw, Kassaye Yitbarek; Michalas, Antonis; Bellika, Johan Gustav
2017-01-03
Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N - 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.
2016-01-01
Background: Metabarcoding is becoming a common tool used to assess and compare diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process and several taxonomy assignment methods were proposed to accomplish this task. This publication evaluates the quality of reference datasets, alongside with several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on relative placements of OTUs and reference sequences on the cladogram and support that these placements receive. New information: In tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires high quality reference dataset to be used. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have detrimental effect on the resolution of cladograms used in tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives highest resolution for the particular reference dataset. Completing the above mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach. PMID:27932919
Holovachov, Oleksandr
2016-01-01
Metabarcoding is becoming a common tool used to assess and compare diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process and several taxonomy assignment methods were proposed to accomplish this task. This publication evaluates the quality of reference datasets, alongside with several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on relative placements of OTUs and reference sequences on the cladogram and support that these placements receive. In tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires high quality reference dataset to be used. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have detrimental effect on the resolution of cladograms used in tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives highest resolution for the particular reference dataset. Completing the above mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach.
Tolosa, Imma; Cassi, Roberto; Huertas, David
2018-04-11
A new marine sediment certified reference material (IAEA 459) with very low concentrations (μg kg-1) of a variety of persistent organic pollutants (POPs) listed by the Stockholm Convention, as well as other POPs and priority substances (PSs) listed in many environmental monitoring programs, was developed by the IAEA. The sediment material was collected from the Han River estuary in South Korea, and the assigned final values were derived from robust statistics on the results provided by selected laboratories which demonstrated technical and quality competence, following the guidance given in ISO Guide 35. The robust mean of the laboratory means was assigned as the certified value for those compounds where the assigned value was derived from at least five datasets and its relative expanded uncertainty was less than 40% of the assigned value (most of the values ranging from 8 to 20%). All the datasets were derived from at least two different analytical techniques, which allowed the assignment of certified concentrations for 22 polychlorinated biphenyl (PCB) congeners, 6 organochlorinated (OC) pesticides, 5 polybrominated diphenyl ethers (PBDEs), and 18 polycyclic aromatic hydrocarbons (PAHs). Mass fractions of compounds that did not fulfill the criteria for certification are considered information values, which include 29 PAHs, 11 PCBs, 16 OC pesticides, and 5 PBDEs. The extensive characterization and associated uncertainties at concentration levels close to the marine sediment quality guidelines will make CRM 459 a valuable matrix reference material for use in marine environmental monitoring programs.
Chaitanya, Lakshmi; van Oven, Mannis; Brauer, Silke; Zimmermann, Bettina; Huber, Gabriela; Xavier, Catarina; Parson, Walther; de Knijff, Peter; Kayser, Manfred
2016-03-01
The use of mitochondrial DNA (mtDNA) for maternal lineage identification often marks the last resort when investigating forensic and missing-person cases involving highly degraded biological materials. As with all comparative DNA testing, a match between evidence and reference sample requires a statistical interpretation, for which high-quality mtDNA population frequency data are crucial. Here, we determined, under high quality standards, the complete mtDNA control-region sequences of 680 individuals from across the Netherlands sampled at 54 sites, covering the entire country with 10 geographic sub-regions. The complete mtDNA control region (nucleotide positions 16,024-16,569 and 1-576) was amplified with two PCR primers and sequenced with ten different sequencing primers using the EMPOP protocol. Haplotype diversity of the entire sample set was very high at 99.63% and, accordingly, the random-match probability was 0.37%. No population substructure within the Netherlands was detected with our dataset. Phylogenetic analyses were performed to determine mtDNA haplogroups. Inclusion of these high-quality data in the EMPOP database (accession number: EMP00666) will improve its overall data content and geographic coverage in the interest of all EMPOP users worldwide. Moreover, this dataset will serve as (the start of) a national reference database for mtDNA applications in forensic and missing person casework in the Netherlands. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Reproducibility-optimized test statistic for ranking genes in microarray studies.
Elo, Laura L; Filén, Sanna; Lahesmaa, Riitta; Aittokallio, Tero
2008-01-01
A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. While previous studies on simulated or spike-in datasets do not provide practical guidance on how to choose the best method for a given real dataset, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic shows consistently good performance under various simulated conditions and on the Affymetrix spike-in dataset. Further, the feasibility of the novel statistic is confirmed in a practical research setting using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given dataset without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression only but could be extended to a wide range of other applications as well.
Time-REferenced data Kriging (TREK): mapping hydrological statistics given their time of reference
NASA Astrophysics Data System (ADS)
Porcheron, Delphine; Leblois, Etienne; Sauquet, Eric
2016-04-01
A major issue in water sciences is to predict runoff parameters at ungauged sites. Estimates can be obtained by various methods. Among them, geostatistical approaches provide interpolation methods that rely on explicit assumptions about the variable of interest. Geostatistical techniques have been applied to precipitation and temperature fields and later extended to estimate runoff features considered as basin-support variates along the river network (e.g. Gottschalk, 1993; Sauquet et al., 2000; Skoien et al., 2006; Gottschalk et al., 2011). To obtain robust estimations, the first step is to collect a relevant dataset. Sauquet et al. (2000) and Sauquet (2006) suggest including a large number of catchments with long and common observation periods to ensure both reliability and temporal consistency in runoff estimates. However, most observation networks evolve with time. Several choices are thus possible to define an optimal reference period maximizing either spatial or temporal overlap, and the constraints usually lead to discarding a significant number of stations. The Time-REferenced data Kriging method (TREK) has been developed to overcome this issue. Here we propose a method of geostatistical estimation that considers the temporal support over which a hydrological statistic has been estimated. This attenuates the loss of data previously caused by the application of a strict reference period. The time reference remains for the targeted map itself. The weights depend on the observation period of the data included in the dataset and how near it is to the target period. In this presentation, the concepts of TREK will be introduced and thereafter illustrated by mapping mean annual runoff in France. References Gottschalk, L., 1993, Correlation and covariance of runoff. Stochastic Hydrology and Hydraulics 7(2), 85-101. Sauquet, E., Gottschalk, L. and Leblois, E., 2000, Mapping average annual runoff: a hierarchical approach applying a stochastic interpolation scheme. Hydrological Sciences Journal 45(6), 799-815. Skoien, J.O., Merz, R. and Bloschl, G., 2006, Top-kriging - geostatistics on stream networks. Hydrology and Earth System Sciences 10(2), 277-287. Gottschalk, L., Leblois, E. and Skoien, J.O., 2011, Correlation and covariance of runoff revisited. Journal of Hydrology 398(1-2), 76-90. Sauquet, E., 2006, Mapping mean annual river discharges: Geostatistical developments for incorporating river network dependencies. Journal of Hydrology 331(1-2), 300-314.
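For readers unfamiliar with the geostatistical machinery, the following is a minimal ordinary-kriging sketch for estimating a runoff statistic at an ungauged site from gauged neighbours. The variogram parameters and gauge data are hypothetical, and TREK's key ingredient, weighting observations by the proximity of their observation period to the target period, is not reproduced here.

```python
import numpy as np

def ordinary_kriging(xy_obs, z_obs, xy_new, sill=1.0, rng_par=100.0, nugget=0.05):
    """Ordinary kriging with a spherical variogram model (single target point).

    xy_obs: (n, 2) gauge coordinates, z_obs: (n,) observed statistic,
    xy_new: (2,) target location.
    """
    def spherical(h):
        h = np.asarray(h, dtype=float)
        g = nugget + sill * (1.5 * h / rng_par - 0.5 * (h / rng_par) ** 3)
        return np.where(h >= rng_par, nugget + sill, np.where(h == 0, 0.0, g))

    n = len(z_obs)
    d_obs = np.linalg.norm(xy_obs[:, None, :] - xy_obs[None, :, :], axis=-1)
    d_new = np.linalg.norm(xy_obs - xy_new, axis=-1)
    # Ordinary kriging system: [Gamma 1; 1 0] [w; mu] = [gamma0; 1]
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = spherical(d_obs)
    A[n, n] = 0.0
    b = np.append(spherical(d_new), 1.0)
    w = np.linalg.solve(A, b)[:n]          # kriging weights (Lagrange multiplier dropped)
    return float(w @ z_obs)

rng = np.random.default_rng(9)
xy = rng.uniform(0, 200, size=(30, 2))                 # gauge locations, km
z = 500 + 2.0 * xy[:, 1] + rng.normal(0, 30, 30)       # mean annual runoff, mm
print(ordinary_kriging(xy, z, np.array([100.0, 100.0])))
```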
Depth calibration of the Experimental Advanced Airborne Research Lidar, EAARL-B
Wright, C. Wayne; Kranenburg, Christine J.; Troche, Rodolfo J.; Mitchell, Richard W.; Nagle, David B.
2016-05-17
The resulting calibrated EAARL-B data were then analyzed and compared with the original reference dataset, the jet-ski-based dataset from the same Fort Lauderdale site, as well as the depth-accuracy requirements of the International Hydrographic Organization (IHO). We do not claim to meet all of the IHO requirements and standards. The IHO minimum depth-accuracy requirements were used as a reference only, and we do not address other IHO requirements such as “Full Seafloor Search”. Our results show good agreement between the calibrated EAARL-B data and all reference datasets, with results that are within the 95 percent depth accuracy of the IHO Order 1 (a and b) depth-accuracy requirements.
Global retrieval of soil moisture and vegetation properties using data-driven methods
NASA Astrophysics Data System (ADS)
Rodriguez-Fernandez, Nemesio; Richaume, Philippe; Kerr, Yann
2017-04-01
Data-driven methods such as neural networks (NNs) are a powerful tool to retrieve soil moisture from multi-wavelength remote sensing observations at global scale. In this presentation we will review a number of recent results regarding the retrieval of soil moisture with the Soil Moisture and Ocean Salinity (SMOS) satellite, either using SMOS brightness temperatures as input data for the retrieval or using SMOS soil moisture retrievals as the reference dataset for the training. The presentation will discuss several possibilities for both the input datasets and the datasets to be used as reference for the supervised learning phase. Regarding the input datasets, it will be shown that NNs take advantage of the synergy of SMOS data and data from other sensors such as the Advanced Scatterometer (ASCAT, active microwaves) and MODIS (visible and infrared). NNs have also been successfully used to construct long time series of soil moisture from the Advanced Microwave Scanning Radiometer - Earth Observing System (AMSR-E) and SMOS. A NN with input data from AMSR-E observations and SMOS soil moisture as the reference for the training was used to construct a dataset sharing a similar climatology and without a significant bias with respect to SMOS soil moisture. Regarding the reference data to train the data-driven retrievals, we will show different possibilities depending on the application. Using actual in situ measurements is challenging at global scale due to the scarce distribution of sensors. In contrast, in situ measurements have been successfully used to retrieve SM at continental scale in North America, where the density of in situ measurement stations is high. Using global land surface models to train the NN constitutes an interesting alternative to implement new remote sensing surface datasets. In addition, these datasets can be used to perform data assimilation into the model used as reference for the training. This approach has recently been tested at the European Centre for Medium-Range Weather Forecasts (ECMWF). Finally, retrievals using radiative transfer models can also be used as a reference SM dataset for the training phase. This approach was used to retrieve soil moisture from AMSR-E, as mentioned above, and also to implement the official European Space Agency (ESA) SMOS soil moisture product in Near-Real-Time. We will finish with a discussion of the retrieval of vegetation parameters from SMOS observations using data-driven methods.
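A minimal supervised-learning sketch of the retrieval approach described above is given below, with synthetic multi-sensor inputs and a reference soil-moisture product as the training target; scikit-learn's MLPRegressor is used as an accessible stand-in for the operational neural networks, and all variable names and value ranges are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical co-located samples: SMOS-like brightness temperatures,
# ASCAT-like backscatter and a MODIS-like vegetation index as inputs;
# a reference soil-moisture product (model or retrieval) as the training target.
rng = np.random.default_rng(10)
n = 5000
tb = rng.normal(260, 15, size=(n, 4))          # brightness temperatures, K
sigma0 = rng.normal(-12, 3, size=(n, 1))       # backscatter, dB
ndvi = rng.uniform(0.1, 0.8, size=(n, 1))
X = np.hstack([tb, sigma0, ndvi])
sm_ref = np.clip(0.45 - 0.001 * tb.mean(axis=1) + 0.01 * sigma0[:, 0]
                 + rng.normal(0, 0.02, n), 0.02, 0.5)   # synthetic reference SM, m3/m3

X_tr, X_te, y_tr, y_te = train_test_split(X, sm_ref, test_size=0.3, random_state=0)
nn = make_pipeline(StandardScaler(),
                   MLPRegressor(hidden_layer_sizes=(20, 10), max_iter=2000, random_state=0))
nn.fit(X_tr, y_tr)
print("R^2 on held-out samples:", round(nn.score(X_te, y_te), 3))
```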
Status and trends of land change in selected U.S. ecoregions - 2000 to 2011
Sayler, Kristi L.; Acevedo, William; Taylor, Janis
2016-01-01
U.S. Geological Survey scientists developed a dataset of 2006 and 2011 land-use and land-cover (LULC) information for selected 100-km2 sample blocks within 29 U.S. Environmental Protection Agency (EPA) Level III ecoregions across the conterminous United States. The data can be used with the previously published Land Cover Trends Dataset: 1973 to 2000 to assess land-use/land-cover change across a 37-year study period. Results from analysis of these data include ecoregion-based statistical estimates of the amount of LULC change per time period, ranking of the most common types of conversions, rates of change, and percent composition. The overall estimated amount of change per ecoregion from 2001 to 2011 ranged from a low of 370 km2 in the Northern Basin and Range Ecoregion to a high of 78,782 km2 in the Southeastern Plains Ecoregion. The Southeastern Plains continues to encompass one of the most intense forest harvesting and regrowth regions in the country, with 16.6 percent of the ecoregion changing between 2001 and 2011. These LULC change statistics provide a new, valuable resource that complements other reference data and field-verified LULC data. Researchers can use this resource to independently validate other land change products or to conduct regional land change assessments.
NASA Astrophysics Data System (ADS)
Du, X.; Leinenkugel, P.; Guo, H.; Kuenzer, C.
2017-12-01
During recent decades, global coasts have been undergoing tremendous change due to accelerating socio-economic growth, which has severe effects on the functioning of global coastal systems. In view of this, accurate, timely, and area-wide global information on natural as well as anthropogenic processes in the coastal zone is of paramount importance for sustainable coastal development. A broad range of freely available satellite-derived products and open geo-datasets, as well as statistics with global coverage, exist that have not yet been fully exploited to evaluate human development patterns in coastal areas. In this study, we demonstrate the potential of freely and openly available EO and GEO datasets for characterizing and evaluating human development in coastal zones on large scales. To this end, different geo-spatial datasets such as the Global Urban Footprint (GUF), Open Street Map (OSM), time series of the Global Human Settlement Layer (GHSL) and Climate Change Initiative (CCI) Land Cover were acquired for the entire continental coast of Asia, defined as the terrestrial area within 100 km of the coastline. In order to extract indices for the coastline, a reference structure was developed allowing the integration of a 2D spatial pattern of a given parameter at a certain location along the coastline. Based on this reference structure, statistics for the coast were calculated every 5 km parallel to the coastline as well as for four different distance intervals from the coast. The results demonstrate the highly unequal distribution of coastal development with respect to urban and agricultural usage in Asia, with large differences between and within different countries. China's coasts show the highest overall levels of urban development, while countries such as Pakistan and Myanmar show comparably low levels, with nearly no development evident outside coastal metropolitan areas. Furthermore, a clear trend of decreasing urban development is evident with increasing distance from the coast. This study highlights the potential of global geo-spatial data products for deriving anthropogenic development indicators that can support the evaluation and monitoring of sustainable development of coastal zones, while also discussing the shortcomings of these datasets for such purposes.
TH-A-9A-01: Active Optical Flow Model: Predicting Voxel-Level Dose Prediction in Spine SBRT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, J; Wu, Q.J.; Yin, F
2014-06-15
Purpose: To predict voxel-level dose distribution and enable effective evaluation of cord dose sparing in spine SBRT. Methods: We present an active optical flow model (AOFM) to statistically describe cord dose variations and train a predictive model to represent correlations between AOFM and PTV contours. Thirty clinically accepted spine SBRT plans are evenly divided into training and testing datasets. The development of the predictive model consists of 1) collecting a sequence of dose maps including PTV and OAR (spinal cord) as well as a set of associated PTV contours adjacent to OAR from the training dataset, 2) classifying data into five groups based on the PTV's location relative to OAR, two “Top”s, “Left”, “Right”, and “Bottom”, 3) randomly selecting a dose map as the reference in each group and applying rigid registration and optical flow deformation to match all other maps to the reference, 4) building the AOFM by importing optical flow vectors and dose values into the principal component analysis (PCA), 5) applying another PCA to features of PTV and OAR contours to generate an active shape model (ASM), and 6) computing a linear regression model of correlations between AOFM and ASM. When predicting the dose distribution of a new case in the testing dataset, the PTV is first assigned to a group based on its contour characteristics. Contour features are then transformed into the ASM's principal coordinates of the selected group. Finally, voxel-level dose distribution is determined by mapping from the ASM space to the AOFM space using the predictive model. Results: The DVHs predicted by the AOFM-based model and those in clinical plans are comparable in training and testing datasets. At 2% volume the dose difference between predicted and clinical plans is 4.2±4.4% and 3.3±3.5% in the training and testing datasets, respectively. Conclusion: The AOFM is effective in predicting voxel-level dose distribution for spine SBRT. Partially supported by NIH/NCI under grant #R21CA161389 and a master research grant by Varian Medical Systems.
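The last steps of the training pipeline, two principal component analyses linked by a linear regression, can be sketched as follows on synthetic arrays; the rigid registration and optical-flow deformation stages that make the dose maps comparable are not reproduced, and the array sizes are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Hypothetical training data: flattened dose maps and matching PTV contour features
rng = np.random.default_rng(11)
n_plans, n_voxels, n_shape_feats = 15, 64 * 64, 40
doses = rng.random((n_plans, n_voxels))
contours = rng.random((n_plans, n_shape_feats))

pca_dose = PCA(n_components=5).fit(doses)        # stands in for the AOFM
pca_shape = PCA(n_components=5).fit(contours)    # stands in for the ASM
reg = LinearRegression().fit(pca_shape.transform(contours),
                             pca_dose.transform(doses))

# Predict the voxel-level dose map for a new case from its contour features
new_contour = rng.random((1, n_shape_feats))
dose_pred = pca_dose.inverse_transform(reg.predict(pca_shape.transform(new_contour)))
print(dose_pred.shape)   # (1, 4096) voxel-level dose values
```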
Data-driven gating in PET: Influence of respiratory signal noise on motion resolution.
Büther, Florian; Ernst, Iris; Frohwein, Lynn Johann; Pouw, Joost; Schäfers, Klaus Peter; Stegger, Lars
2018-05-21
Data-driven gating (DDG) approaches for positron emission tomography (PET) are interesting alternatives to conventional hardware-based gating methods. In DDG, the measured PET data themselves are utilized to calculate a respiratory signal that is subsequently used for gating purposes. The success of gating is then highly dependent on the statistical quality of the PET data. In this study, we investigate how this quality determines signal noise and thus motion resolution in clinical PET scans using a center-of-mass-based (COM) DDG approach, specifically with regard to motion management of target structures in future radiotherapy planning applications. PET list mode datasets acquired in one bed position of 19 different radiotherapy patients undergoing pretreatment [18F]FDG PET/CT or [18F]FDG PET/MRI were included in this retrospective study. All scans were performed over a region with organs (myocardium, kidneys) or tumor lesions of high tracer uptake and under free breathing. Aside from the original list mode data, datasets with progressively decreasing PET statistics were generated. From these, COM DDG signals were derived for subsequent amplitude-based gating of the original list mode file. The apparent respiratory shift d from end-expiration to end-inspiration was determined from the gated images and expressed as a function of the signal-to-noise ratio SNR of the determined gating signals. This relation was tested against an additional 25 [18F]FDG PET/MRI list mode datasets where high-precision MR navigator-like respiratory signals were available as reference signals for respiratory gating of PET data, and against data from a dedicated thorax phantom scan. All original 19 high-quality list mode datasets demonstrated the same behavior in terms of motion resolution when reducing the amount of list mode events for DDG signal generation. Ratios and directions of respiratory shifts between end-respiratory gates and the respective nongated image were constant over all statistic levels. Motion resolution d/d_max could be modeled as d/d_max = 1 - exp[-1.52 (SNR - 1)^0.52], with d_max as the actual respiratory shift. Determining d_max from d and SNR in the 25 test datasets and the phantom scan demonstrated no significant differences from the MR navigator-derived shift values and the predefined shift, respectively. The SNR can therefore serve as a general metric to assess the success of COM-based DDG, even across different scanners and patients. The derived formula for motion resolution can be used to estimate the actual motion extent reasonably well in cases of limited PET raw data statistics. This may be of interest for individualized radiotherapy treatment planning procedures of target structures subjected to respiratory motion. © 2018 American Association of Physicists in Medicine.
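The fitted relation between motion resolution and gating-signal SNR can be applied directly to estimate the true motion extent from an apparent shift, as in the sketch below; the handling of SNR values at or below 1 is an assumption added here, and the example numbers are hypothetical.

```python
import numpy as np

def resolved_fraction(snr):
    """Fraction of the true respiratory shift resolved by COM-based DDG,
    d/d_max = 1 - exp(-1.52 * (SNR - 1)**0.52), as fitted in the study."""
    snr = np.asarray(snr, dtype=float)
    # Below SNR = 1 the fitted expression is undefined; returning 0 is an assumption.
    return np.where(snr > 1.0, 1.0 - np.exp(-1.52 * (snr - 1.0) ** 0.52), 0.0)

def estimate_true_shift(measured_shift_mm, snr):
    """Invert the relation to estimate the actual motion extent d_max."""
    frac = resolved_fraction(snr)
    return measured_shift_mm / frac if frac > 0 else float("nan")

# Example: an 8 mm apparent shift measured from a gating signal with SNR = 2.5
print(round(estimate_true_shift(8.0, 2.5), 1), "mm estimated true shift")
```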
Caple, Jodi; Stephan, Carl N
2017-05-01
Graphic exemplars of cranial sex and ancestry are essential to forensic anthropology for standardizing casework, training analysts, and communicating group trends. To date, graphic exemplars have comprised hand-drawn sketches, or photographs of individual specimens, which risks bias/subjectivity. Here, we performed quantitative analysis of photographic data to generate new photo-realistic and objective exemplars of skull form. Standardized anterior and left lateral photographs of skulls for each sex were analyzed in the computer graphics program Psychomorph for the following groups: South African Blacks, South African Whites, American Blacks, American Whites, and Japanese. The average cranial form was calculated for each photographic view, before the color information for every individual was warped to the average form and combined to produce statistical averages. These mathematically derived exemplars-and their statistical exaggerations or extremes-retain the high-resolution detail of the original photographic dataset, making them the ideal casework and training reference standards. © 2016 American Academy of Forensic Sciences.
Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice
2015-01-01
The aim of this paper is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD electricity datasets have a very good quality in general terms; nevertheless, some findings and recommendations to improve the quality of the Life-Cycle Inventories have been derived. Moreover, these results confirm the quality of the electricity-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall Data Quality Requirements of databases.
Validating Variational Bayes Linear Regression Method With Multi-Central Datasets.
Murata, Hiroshi; Zangwill, Linda M; Fujino, Yuri; Matsuura, Masato; Miki, Atsuya; Hirasawa, Kazunori; Tanito, Masaki; Mizoue, Shiro; Mori, Kazuhiko; Suzuki, Katsuyoshi; Yamashita, Takehiro; Kashiwagi, Kenji; Shoji, Nobuyuki; Asaoka, Ryo
2018-04-01
To validate the prediction accuracy of variational Bayes linear regression (VBLR) with two datasets external to the training dataset. The training dataset consisted of 7268 eyes of 4278 subjects from the University of Tokyo Hospital. The Japanese Archive of Multicentral Databases in Glaucoma (JAMDIG) dataset consisted of 271 eyes of 177 patients, and the Diagnostic Innovations in Glaucoma Study (DIGS) dataset consisted of 248 eyes of 173 patients; both were used for validation. Prediction accuracy was compared between VBLR and ordinary least squares linear regression (OLSLR). First, OLSLR and VBLR were carried out using total deviation (TD) values at each of the 52 test points from the second to fourth visual fields (VFs) (VF2-4) up to the second to tenth VFs (VF2-10) of each patient in the JAMDIG and DIGS datasets, and the TD values of the 11th VF test were predicted each time. The predictive accuracy of each method was compared through the root mean squared error (RMSE) statistic. OLSLR RMSEs with the JAMDIG and DIGS datasets were between 31 and 4.3 dB, and between 19.5 and 3.9 dB. On the other hand, VBLR RMSEs with the JAMDIG and DIGS datasets were between 5.0 and 3.7 dB, and between 4.6 and 3.6 dB. There was a statistically significant difference between VBLR and OLSLR for both datasets at every series (VF2-4 to VF2-10) (P < 0.01 for all tests). However, there was no statistically significant difference in VBLR RMSEs between the JAMDIG and DIGS datasets at any series of VFs (VF2-2 to VF2-10) (P > 0.05). VBLR outperformed OLSLR in predicting future VF progression, and VBLR has the potential to be a helpful tool in clinical settings.
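A minimal sketch of the OLSLR baseline described above (hypothetical array shapes; the variational Bayes machinery of VBLR itself is not reproduced here): for each of the 52 test points, a straight line is fitted to the TD values of the earlier visits, extrapolated to the next visit, and the prediction error is summarized as an RMSE.

```python
import numpy as np

def olslr_predict_next(td, times):
    """td: array (n_visits, 52) of total deviation values for one eye;
    times: array (n_visits,) of visit times. Fits TD ~ time per test point with
    ordinary least squares and predicts TD at the next (equally spaced) visit."""
    X = np.column_stack([np.ones_like(times), times])
    coef, *_ = np.linalg.lstsq(X, td, rcond=None)      # intercepts and slopes, shape (2, 52)
    t_next = times[-1] + (times[-1] - times[-2])
    return coef[0] + coef[1] * t_next

def rmse(pred, obs):
    """Root mean squared error between predicted and observed TD values."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(obs)) ** 2)))
```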
The data presented in this data file are a product of a journal publication. The dataset contains DEHP air concentrations in the emission test chamber. This dataset is associated with the following publication: Wu, Y., S. Cox, Y. Xu, Y. Liang, D. Wong, X. Liu, J. Benning, P. Clausen, Y. Zhang, C. Liu, and J. Little. A Reference Method for Measuring Emissions of SVOCs in Small Chambers. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 95: 126-132, (2016).
NASA Astrophysics Data System (ADS)
Gianoli, Chiara; Kurz, Christopher; Riboldi, Marco; Bauer, Julia; Fontana, Giulia; Baroni, Guido; Debus, Jürgen; Parodi, Katia
2016-06-01
A clinical trial named PROMETHEUS is currently ongoing for inoperable hepatocellular carcinoma (HCC) at the Heidelberg Ion Beam Therapy Center (HIT, Germany). In this framework, 4D PET-CT datasets are acquired shortly after the therapeutic treatment to compare the irradiation induced PET image with a Monte Carlo PET prediction resulting from the simulation of treatment delivery. The extremely low count statistics of this measured PET image represents a major limitation of this technique, especially in presence of target motion. The purpose of the study is to investigate two different 4D PET motion compensation strategies towards the recovery of the whole count statistics for improved image quality of the 4D PET-CT datasets for PET-based treatment verification. The well-known 4D-MLEM reconstruction algorithm, embedding the motion compensation in the reconstruction process of 4D PET sinograms, was compared to a recently proposed pre-reconstruction motion compensation strategy, which operates in sinogram domain by applying the motion compensation to the 4D PET sinograms. With reference to phantom and patient datasets, advantages and drawbacks of the two 4D PET motion compensation strategies were identified. The 4D-MLEM algorithm was strongly affected by inverse inconsistency of the motion model but demonstrated the capability to mitigate the noise-break-up effects. Conversely, the pre-reconstruction warping showed less sensitivity to inverse inconsistency but also more noise in the reconstructed images. The comparison was performed by relying on quantification of PET activity and ion range difference, typically yielding similar results. The study demonstrated that treatment verification of moving targets could be accomplished by relying on the whole count statistics image quality, as obtained from the application of 4D PET motion compensation strategies. In particular, the pre-reconstruction warping was shown to represent a promising choice when combined with intra-reconstruction smoothing.
NASA Technical Reports Server (NTRS)
da Silva, Arlindo; Redder, Christopher
2010-01-01
MERRA is a NASA reanalysis for the satellite era using a major new version of the Goddard Earth Observing System Data Assimilation System Version 5 (GEOS-5). The project focuses on historical analyses of the hydrological cycle on a broad range of weather and climate time scales and places the NASA EOS suite of observations in a climate context. The characterization of uncertainty in reanalysis fields is a commonly requested feature by users of such data. While intercomparison with reference data sets is common practice for ascertaining the realism of the datasets, such studies typically are restricted to long term climatological statistics and seldom provide state dependent measures of the uncertainties involved. In principle, variational data assimilation algorithms have the ability of producing error estimates for the analysis variables (typically surface pressure, winds, temperature, moisture and ozone) consistent with the assumed background and observation error statistics. However, these "perceived error estimates" are expensive to obtain and are limited by the somewhat simplistic errors assumed in the algorithm. The observation minus forecast residuals (innovations) by-product of any assimilation system constitutes a powerful tool for estimating the systematic and random errors in the analysis fields. Unfortunately, such data is usually not readily available with reanalysis products, often requiring the tedious decoding of large datasets and not so-user friendly file formats. With MERRA we have introduced a gridded version of the observations/innovations used in the assimilation process, using the same grid and data formats as the regular datasets. Such dataset empowers the user with the ability of conveniently performing observing system related analysis and error estimates. The scope of this dataset will be briefly described. We will present a systematic analysis of MERRA innovation time series for the conventional observing system, including maximum-likelihood estimates of background and observation errors, as well as global bias estimates. Starting with the joint PDF of innovations and analysis increments at observation locations we propose a technique for diagnosing bias among the observing systems, and document how these contextual biases have evolved during the satellite era covered by MERRA.
USDA-ARS?s Scientific Manuscript database
USDA National Nutrient Database for Standard Reference Dataset for What We Eat In America, NHANES (Survey-SR) provides the nutrient data for assessing dietary intakes from the national survey What We Eat In America, National Health and Nutrition Examination Survey (WWEIA, NHANES). The current versi...
Construction and comparative evaluation of different activity detection methods in brain FDG-PET.
Buchholz, Hans-Georg; Wenzel, Fabian; Gartenschläger, Martin; Thiele, Frank; Young, Stewart; Reuss, Stefan; Schreckenberger, Mathias
2015-08-18
We constructed and evaluated reference brain FDG-PET databases for use by three software programs (Computer-aided diagnosis for dementia (CAD4D), Statistical Parametric Mapping (SPM) and NEUROSTAT), which allow user-independent detection of dementia-related hypometabolism in patients' brain FDG-PET. Thirty-seven healthy volunteers were scanned in order to construct brain FDG reference databases, which reflect the normal, age-dependent glucose consumption in the human brain, using each software package. Databases were compared to each other to assess the impact of the different stereotactic normalization algorithms used by each software package. In addition, the performance of the new reference databases in the detection of altered glucose consumption in the brains of patients was evaluated by calculating statistical maps of regional hypometabolism in FDG-PET of 20 patients with confirmed Alzheimer's dementia (AD) and of 10 non-AD patients. Extent (hypometabolic volume, referred to as cluster size) and magnitude (peak z-score) of detected hypometabolism were statistically analyzed. Differences between the reference databases built by CAD4D, SPM or NEUROSTAT were observed. Due to the different normalization methods, altered spatial FDG patterns were found. When analyzing patient data with the reference databases created using CAD4D, SPM or NEUROSTAT, similar characteristic clusters of hypometabolism were found in the same brain regions in the AD group with each software package. However, larger z-scores were observed with CAD4D and NEUROSTAT than those reported by SPM. Better concordance with CAD4D and NEUROSTAT was achieved using the spatially normalized images of SPM and an independent z-score calculation. The three software packages identified the peak z-scores in the same brain region in 11 of 20 AD cases, and there was concordance between CAD4D and SPM in 16 AD subjects. The clinical evaluation of brain FDG-PET of 20 AD patients with either CAD4D-, SPM- or NEUROSTAT-generated databases from an identical reference dataset showed similar patterns of hypometabolism in the brain regions known to be involved in AD. The extent of hypometabolism and peak z-score appeared to be influenced by the calculation method used in each software package rather than by different spatial normalization parameters.
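The per-voxel statistics behind such hypometabolism maps can be sketched as follows (a simplified illustration with hypothetical arrays; the actual packages apply spatial normalization, smoothing and intensity normalization before this step):

```python
import numpy as np

def hypometabolism_z_map(patient, reference_scans):
    """patient: 3-D array of a spatially and intensity-normalized FDG scan;
    reference_scans: 4-D array (n_controls, x, y, z) from the healthy reference
    database. Returns a z-score map; negative values indicate hypometabolism."""
    mean = reference_scans.mean(axis=0)
    std = reference_scans.std(axis=0, ddof=1)
    return (patient - mean) / np.where(std > 0, std, np.nan)

def peak_and_cluster(z_map, threshold=-2.0):
    """Peak z-score and cluster size (in voxels) below the chosen threshold."""
    mask = z_map < threshold
    peak = float(np.nanmin(z_map)) if mask.any() else float("nan")
    return peak, int(mask.sum())
```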
Dataset on daytime outdoor thermal comfort for Belo Horizonte, Brazil.
Hirashima, Simone Queiroz da Silveira; Assis, Eleonora Sad de; Nikolopoulou, Marialena
2016-12-01
This dataset describes microclimatic parameters of two urban open public spaces in the city of Belo Horizonte, Brazil; physiological equivalent temperature (PET) index values; and the related subjective responses of interviewees regarding thermal sensation perception and preference and thermal comfort evaluation. Individual and behavioral characteristics of respondents are also presented. Data were collected during daytime, in summer and winter, 2013. A statistical treatment of these data was first presented in a PhD thesis ("Percepção sonora e térmica e avaliação de conforto em espaços urbanos abertos do município de Belo Horizonte - MG, Brasil" (Hirashima, 2014) [1]), providing relevant information on thermal conditions in these locations and on thermal comfort assessment. Up to now, these data have also been explored in the article "Daytime Thermal Comfort in Urban Spaces: A Field Study in Brazil" (Hirashima et al., in press) [2]. These references are recommended for further interpretation and discussion.
Catelan, Dolores; Biggeri, Annibale
2008-11-01
In environmental epidemiology, long lists of relative risk estimates from exposed populations are compared to a reference to scrutinize the dataset for extremes. Here, inference on disease profiles for given areas, or for fixed disease population signatures, is of interest, and summaries can be obtained by averaging over areas or diseases. We have developed a multivariate hierarchical Bayesian approach to estimate posterior rank distributions, and we show how to produce league tables of ranks with credibility intervals useful for addressing the above-mentioned inferential problems. Applying the procedure to a real dataset from the report "Environment and Health in Sardinia (Italy)", we selected 18 areas characterized by high environmental pressure from industrial, mining or military activities, investigated for 29 causes of death among male residents. Ranking diseases highlighted the increased burdens of neoplastic (cancerous) and non-neoplastic respiratory diseases in the heavily polluted area of Portoscuso. The averaged ranks by disease over areas showed lung cancer among the three highest positions.
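A minimal sketch of how such league tables can be produced from posterior samples (hypothetical array layout; the full multivariate hierarchical Bayesian model of the paper is not reproduced here):

```python
import numpy as np

def rank_league_table(posterior_rr, level=0.95):
    """posterior_rr: array (n_draws, n_areas) of posterior samples of relative risks.
    Ranks are computed within each draw (1 = lowest risk); returns the median rank
    and a credibility interval per area."""
    ranks = posterior_rr.argsort(axis=1).argsort(axis=1) + 1   # rank within each draw
    lo, hi = (1 - level) / 2 * 100, (1 + level) / 2 * 100
    return {
        "median_rank": np.median(ranks, axis=0),
        "credibility_interval": np.percentile(ranks, [lo, hi], axis=0).T,
    }
```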
Wide-Area Cooperative Biometric Tagging, Tracking and Locating in a Multimodal Sensor Network
2014-12-04
[Extraction-damaged excerpt from a tracking report; only fragments are recoverable:] Table III compares tracking results on the CAVIAR dataset, listing the proposed model alongside a reference method [12]. The approach is evaluated on two widely used public single-camera pedestrian tracking datasets: the CAVIAR dataset [1] and the TownCentre dataset. The software was shared with collaborators at Progeny and is also being provided to ONR along with the datasets on which it has been tested.
Preprocessed Consortium for Neuropsychiatric Phenomics dataset.
Gorgolewski, Krzysztof J; Durnez, Joke; Poldrack, Russell A
2017-01-01
Here we present preprocessed MRI data of 265 participants from the Consortium for Neuropsychiatric Phenomics (CNP) dataset. The preprocessed dataset includes minimally preprocessed data in the native, MNI and surface spaces, accompanied by potential confound regressors, tissue probability masks, brain masks and transformations. In addition, the preprocessed dataset includes unthresholded group-level and single-subject statistical maps from all tasks included in the original dataset. We hope that the availability of this dataset will greatly accelerate research.
Statistical procedures for analyzing mental health services data.
Elhai, Jon D; Calhoun, Patrick S; Ford, Julian D
2008-08-15
In mental health services research, analyzing service utilization data often poses serious problems, given the presence of substantially skewed data distributions. This article presents a non-technical introduction to statistical methods specifically designed to handle the complexly distributed datasets that represent mental health service use, including Poisson, negative binomial, zero-inflated, and zero-truncated regression models. A flowchart is provided to assist the investigator in selecting the most appropriate method. Finally, a dataset of mental health service use reported by medical patients is described, and a comparison of results across several different statistical methods is presented. Implications of matching data analytic techniques appropriately with the often complexly distributed datasets of mental health services utilization variables are discussed.
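A brief sketch of fitting the count models mentioned above with statsmodels (a synthetic, zero-inflated service-use count and hypothetical covariate names are used here purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

# Synthetic, skewed service-use counts with excess zeros (hypothetical covariates)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"age": rng.normal(45, 12, n), "ptsd": rng.integers(0, 2, n)})
mu = np.exp(0.5 + 0.4 * df["ptsd"] + 0.01 * (df["age"] - 45))
df["n_visits"] = rng.poisson(mu) * rng.integers(0, 2, n)    # zero-inflated counts

poisson = smf.glm("n_visits ~ age + ptsd", data=df, family=sm.families.Poisson()).fit()
negbin = smf.negativebinomial("n_visits ~ age + ptsd", data=df).fit()
zinb = ZeroInflatedNegativeBinomialP(
    df["n_visits"], sm.add_constant(df[["age", "ptsd"]]),
    exog_infl=sm.add_constant(df[["age"]])
).fit()

print(poisson.aic, negbin.aic, zinb.aic)   # lower AIC = better fit to the skewed counts
```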
Ochsner, Scott A; Tsimelzon, Anna; Dong, Jianrong; Coarfa, Cristian; McKenna, Neil J
2016-08-01
The pregnane X receptor (PXR/NR1I2) and constitutive androstane receptor (CAR/NR1I3) members of the nuclear receptor (NR) superfamily of ligand-regulated transcription factors are well-characterized mediators of xenobiotic and endocrine-disrupting chemical signaling. The Nuclear Receptor Signaling Atlas maintains a growing library of transcriptomic datasets involving perturbations of NR signaling pathways, many of which involve perturbations relevant to PXR and CAR xenobiotic signaling. Here, we generated a reference transcriptome based on the frequency of differential expression of genes across 159 experiments compiled from 22 datasets involving perturbations of CAR and PXR signaling pathways. In addition to the anticipated overrepresentation in the reference transcriptome of genes encoding components of the xenobiotic stress response, the ranking of genes involved in carbohydrate metabolism and gonadotropin action sheds mechanistic light on the suspected role of xenobiotics in metabolic syndrome and reproductive disorders. Gene Set Enrichment Analysis showed that although acetaminophen, chlorpromazine, and phenobarbital impacted many similar gene sets, differences in direction of regulation were evident in a variety of processes. Strikingly, gene sets representing genes linked to Parkinson's, Huntington's, and Alzheimer's diseases were enriched in all 3 transcriptomes. The reference xenobiotic transcriptome will be supplemented with additional future datasets to provide the community with a continually updated reference transcriptomic dataset for CAR- and PXR-mediated xenobiotic signaling. Our study demonstrates how aggregating and annotating transcriptomic datasets, and making them available for routine data mining, facilitates research into the mechanisms by which xenobiotics and endocrine-disrupting chemicals subvert conventional NR signaling modalities.
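A minimal sketch of the frequency-of-differential-expression ranking described above, under the assumption that each experiment has already been reduced to a set of significantly regulated gene symbols (hypothetical input; the published resource also tracks direction of regulation and fold change):

```python
from collections import Counter
import pandas as pd

def reference_transcriptome(experiment_gene_sets):
    """experiment_gene_sets: list of sets, one per experiment, each containing the
    genes called differentially expressed in that experiment. Genes are ranked by
    the fraction of experiments in which they are differentially expressed."""
    counts = Counter(g for genes in experiment_gene_sets for g in genes)
    table = pd.DataFrame({"gene": list(counts), "n_experiments": list(counts.values())})
    table["frequency"] = table["n_experiments"] / len(experiment_gene_sets)
    return table.sort_values("frequency", ascending=False).reset_index(drop=True)

# Example with three toy experiments
print(reference_transcriptome([{"CYP3A4", "CYP2B6"}, {"CYP3A4"}, {"CYP3A4", "GCK"}]))
```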
A global dataset of crowdsourced land cover and land use reference data.
Fritz, Steffen; See, Linda; Perger, Christoph; McCallum, Ian; Schill, Christian; Schepaschenko, Dmitry; Duerauer, Martina; Karner, Mathias; Dresel, Christopher; Laso-Bayas, Juan-Carlos; Lesiv, Myroslava; Moorthy, Inian; Salk, Carl F; Danylo, Olha; Sturn, Tobias; Albrecht, Franziska; You, Liangzhi; Kraxner, Florian; Obersteiner, Michael
2017-06-13
Global land cover is an essential climate variable and a key biophysical driver for earth system models. While remote sensing technology, particularly satellites, has played a key role in providing land cover datasets, large discrepancies have been noted among the available products. Global land use is typically more difficult to map and in many cases cannot be remotely sensed. In-situ or ground-based data and high-resolution imagery are thus an important requirement for producing accurate land cover and land use datasets, and this is precisely what is lacking. Here we describe the global land cover and land use reference data derived from the Geo-Wiki crowdsourcing platform via four campaigns. These global datasets provide information on human impact, land cover disagreement, wilderness, and land cover and land use. Hence, they are relevant for the scientific community that requires reference data for global satellite-derived products, as well as those interested in monitoring global terrestrial ecosystems in general.
Abatzoglou, John T; Dobrowski, Solomon Z; Parks, Sean A; Hegewisch, Katherine C
2018-01-09
We present TerraClimate, a dataset of high-spatial resolution (1/24°, ~4-km) monthly climate and climatic water balance for global terrestrial surfaces from 1958-2015. TerraClimate uses climatically aided interpolation, combining high-spatial resolution climatological normals from the WorldClim dataset, with coarser resolution time varying (i.e., monthly) data from other sources to produce a monthly dataset of precipitation, maximum and minimum temperature, wind speed, vapor pressure, and solar radiation. TerraClimate additionally produces monthly surface water balance datasets using a water balance model that incorporates reference evapotranspiration, precipitation, temperature, and interpolated plant extractable soil water capacity. These data provide important inputs for ecological and hydrological studies at global scales that require high spatial resolution and time varying climate and climatic water balance data. We validated spatiotemporal aspects of TerraClimate using annual temperature, precipitation, and calculated reference evapotranspiration from station data, as well as annual runoff from streamflow gauges. TerraClimate datasets showed noted improvement in overall mean absolute error and increased spatial realism relative to coarser resolution gridded datasets.
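A schematic sketch of the climatically aided interpolation idea (toy arrays and a crude nearest-neighbour upsampling; the real TerraClimate workflow interpolates station anomalies and treats each variable differently): a coarse-resolution monthly anomaly is added onto a high-resolution climatological normal.

```python
import numpy as np

def climatically_aided_interpolation(coarse_monthly, coarse_normal, fine_normal, zoom):
    """coarse_monthly, coarse_normal: 2-D arrays on the coarse grid for one month;
    fine_normal: high-resolution climatological normal (e.g., WorldClim-like);
    zoom: integer resolution factor between the two grids."""
    anomaly = coarse_monthly - coarse_normal
    anomaly_fine = np.kron(anomaly, np.ones((zoom, zoom)))   # block-replicate the anomaly
    return fine_normal + anomaly_fine

# Toy example: 2x2 coarse grid, 4x4 fine grid (zoom factor 2)
coarse = np.array([[10.0, 11.0], [12.0, 13.0]])
normal_c = np.full((2, 2), 11.0)
normal_f = np.linspace(9, 13, 16).reshape(4, 4)
print(climatically_aided_interpolation(coarse, normal_c, normal_f, zoom=2))
```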
Detecting and Quantifying Forest Change: The Potential of Existing C- and X-Band Radar Datasets.
Tanase, Mihai A; Ismail, Ismail; Lowell, Kim; Karyanto, Oka; Santoro, Maurizio
2015-01-01
This paper evaluates the opportunity provided by global interferometric radar datasets for monitoring deforestation, degradation and forest regrowth in tropical and semi-arid environments. The paper describes an easy-to-implement method for detecting forest spatial changes and estimating their magnitude. The datasets were acquired within space-borne, high-spatial-resolution radar missions at near-global scales, and are thus significant for monitoring systems developed under the United Nations Framework Convention on Climate Change (UNFCCC). The approach presented in this paper was tested in two areas located in Indonesia and Australia. Forest change estimation was based on differences between a reference dataset acquired in February 2000 by the Shuttle Radar Topography Mission (SRTM) and TanDEM-X mission (TDM) datasets acquired in 2011 and 2013. The synergy between the SRTM and TDM datasets allowed not only identifying changes in forest extent but also estimating their magnitude with respect to the reference through variations in forest height.
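A minimal sketch of the height-differencing idea (hypothetical rasters assumed to be co-registered; the published method additionally accounts for radar penetration and terrain effects): subtracting the SRTM reference surface from a later TanDEM-X surface highlights losses and gains in forest height.

```python
import numpy as np

def forest_change(srtm_2000, tdm_recent, loss_threshold=-10.0, gain_threshold=5.0):
    """srtm_2000, tdm_recent: co-registered 2-D arrays of surface heights (m).
    Returns the height change plus simple loss/gain masks (thresholds in metres
    are illustrative only)."""
    dh = tdm_recent - srtm_2000
    return dh, dh < loss_threshold, dh > gain_threshold

# Toy 3x3 example: one cleared pixel, one regrowth pixel
srtm = np.array([[25.0, 24.0, 26.0], [25.0, 25.0, 24.0], [26.0, 25.0, 25.0]])
tdm = np.array([[24.0, 23.0, 12.0], [25.0, 31.0, 24.0], [26.0, 25.0, 25.0]])
dh, loss, gain = forest_change(srtm, tdm)
print(dh, loss.sum(), gain.sum())
```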
2010-01-01
Background The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log2- ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data but many data have yet to be analysed as existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR) algorithm and software package which uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets. We show BABAR transforms real and simulated datasets to allow for the correct interpretation of these data, and is the ideal tool to facilitate the identification of differentially expressed genes or network inference analysis from transcriptomic datasets. PMID:20128918
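The cyclic loess idea used by BABAR can be sketched as below (a generic illustration, not BABAR's actual R code or API; statsmodels' lowess is used as the smoother and the array layout is hypothetical): for each pair of arrays an intensity-dependent trend in the log-ratio differences is estimated and half of the correction is applied to each array.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def cyclic_loess(log_ratios, n_iter=2, frac=0.4):
    """log_ratios: array (n_genes, n_arrays) of log2-ratios, one column per microarray.
    For every pair of arrays an MA trend is fitted with lowess and split between the
    two arrays; the full pass over all pairs is repeated n_iter times."""
    x = log_ratios.astype(float).copy()
    n_arrays = x.shape[1]
    for _ in range(n_iter):
        for i in range(n_arrays):
            for j in range(i + 1, n_arrays):
                m = x[:, i] - x[:, j]              # difference between arrays
                a = 0.5 * (x[:, i] + x[:, j])      # average intensity
                trend = lowess(m, a, frac=frac, return_sorted=False)
                x[:, i] -= trend / 2.0
                x[:, j] += trend / 2.0
    return x
```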
Fazio, Simone; Garraín, Daniel; Mathieux, Fabrice; De la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda
2015-01-01
Under the framework of the European Platform on Life Cycle Assessment, the European Reference Life-Cycle Database (ELCD), developed by the Joint Research Centre of the European Commission, provides core Life Cycle Inventory (LCI) data from front-running EU-level business associations and other sources. The ELCD contains energy-related data on power and fuels. This study describes the methods to be used for the quality analysis of energy data for European markets (available in third-party LC databases and from authoritative sources) that are, or could be, used in the context of the ELCD. The methodology was developed and tested on the energy datasets most relevant for the EU context, derived from GaBi (the reference database used to derive datasets for the ELCD), Ecoinvent, E3 and Gemis. The criteria for the database selection were based on the availability of EU-related data, the inclusion of comprehensive datasets on energy products and services, and the general approval of the LCA community. The proposed approach was based on the quality indicators developed within the International Reference Life Cycle Data System (ILCD) Handbook, further refined to facilitate their use in the analysis of energy systems. The overall Data Quality Rating (DQR) of an energy dataset can be calculated by summing up the quality ratings (ranging from 1 to 5, where 1 represents very good and 5 very poor quality) of each of the quality criteria indicators and dividing by the total number of indicators considered. The quality of each dataset can be estimated for each indicator and then compared across the different databases/sources. The results can be used to highlight the weaknesses of each dataset and to guide further improvements to enhance the data quality with regard to the established criteria. This paper describes the application of the methodology to two exemplary datasets, in order to show the potential of the methodological approach. The analysis helps LCA practitioners to evaluate the usefulness of the ELCD datasets for their purposes, and dataset developers and reviewers to derive information that will help improve the overall DQR of databases.
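The overall DQR calculation described above reduces to an average of indicator ratings; a minimal sketch (indicator names abbreviated from the ILCD criteria, ratings purely illustrative):

```python
def data_quality_rating(ratings):
    """ratings: dict of quality-indicator scores on the ILCD 1 (very good) to
    5 (very poor) scale. The overall DQR is the mean of the indicator scores."""
    return sum(ratings.values()) / len(ratings)

example = {
    "technological_representativeness": 2,
    "geographical_representativeness": 1,
    "time_representativeness": 3,
    "completeness": 2,
    "precision": 2,
    "methodological_appropriateness": 1,
}
print(round(data_quality_rating(example), 2))   # 1.83
```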
Global Data Spatially Interrelate System for Scientific Big Data Spatial-Seamless Sharing
NASA Astrophysics Data System (ADS)
Yu, J.; Wu, L.; Yang, Y.; Lei, X.; He, W.
2014-04-01
A good data sharing system with spatially seamless services will spare scientists the tedious, boring, and time-consuming work of spatial transformation, and hence encourage the usage of scientific data and increase scientific innovation. Having been adopted as a framework for Earth datasets by the Group on Earth Observations (GEO), the Earth System Spatial Grid (ESSG) has the potential to become the spatial reference for Earth datasets. Based on SDOG-ESSG, an implementation of the ESSG, a data sharing system named the Global Data Spatially Interrelate System (GASE) was designed to make data sharing spatially seamless. The architecture of GASE is introduced, the implementation of the two key components, V-Pools and the interrelating engine, is described, and a prototype is presented. Any dataset is first resampled into SDOG-ESSG, divided into small blocks, and then mapped into the hierarchical system of the distributed file system in V-Pools, which together allows the data to be served at a uniform spatial reference and with high efficiency. In addition, the datasets from different data centres are interrelated by the interrelating engine at the uniform spatial reference of SDOG-ESSG, which enables the system to share the open datasets on the internet in a spatially seamless manner.
New Statistics for Testing Differential Expression of Pathways from Microarray Data
NASA Astrophysics Data System (ADS)
Siu, Hoicheong; Dong, Hua; Jin, Li; Xiong, Momiao
Exploring biological meaning from microarray data is very important but remains a great challenge. Here, we developed three new statistics: linear combination test, quadratic test and de-correlation test to identify differentially expressed pathways from gene expression profile. We apply our statistics to two rheumatoid arthritis datasets. Notably, our results reveal three significant pathways and 275 genes in common in two datasets. The pathways we found are meaningful to uncover the disease mechanisms of rheumatoid arthritis, which implies that our statistics are a powerful tool in functional analysis of gene expression data.
Context-Aware Generative Adversarial Privacy
NASA Astrophysics Data System (ADS)
Huang, Chong; Kairouz, Peter; Chen, Xiao; Sankar, Lalitha; Rajagopal, Ram
2017-12-01
Preserving the utility of published datasets while simultaneously providing provable privacy guarantees is a well-known challenge. On the one hand, context-free privacy solutions, such as differential privacy, provide strong privacy guarantees, but often lead to a significant reduction in utility. On the other hand, context-aware privacy solutions, such as information theoretic privacy, achieve an improved privacy-utility tradeoff, but assume that the data holder has access to dataset statistics. We circumvent these limitations by introducing a novel context-aware privacy framework called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to allow the data holder to learn privatization schemes from the dataset itself. Under GAP, learning the privacy mechanism is formulated as a constrained minimax game between two players: a privatizer that sanitizes the dataset in a way that limits the risk of inference attacks on the individuals' private variables, and an adversary that tries to infer the private variables from the sanitized dataset. To evaluate GAP's performance, we investigate two simple (yet canonical) statistical dataset models: (a) the binary data model, and (b) the binary Gaussian mixture model. For both models, we derive game-theoretically optimal minimax privacy mechanisms, and show that the privacy mechanisms learned from data (in a generative adversarial fashion) match the theoretically optimal ones. This demonstrates that our framework can be easily applied in practice, even in the absence of dataset statistics.
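Schematically (our notation, not the paper's exact formulation), with private variables X, released data g(Y) produced by the privatizer, an adversary h that tries to recover X, an inference loss l and a distortion budget D, the constrained minimax game solved by GAP can be written as:

```latex
\max_{g}\ \min_{h}\ \mathbb{E}\!\left[\,\ell\big(h(g(Y)),\,X\big)\,\right]
\qquad \text{subject to} \qquad
\mathbb{E}\!\left[\,d\big(g(Y),\,Y\big)\,\right] \le D .
```

The privatizer chooses a sanitization that makes the best adversary's expected inference loss as large as possible while keeping the expected distortion of the released data below D; in the GAN-style implementation both players are parameterized by neural networks and trained alternately.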
Masso, Majid
2018-09-14
Scientific breakthroughs in recent decades have uncovered the capability of RNA molecules to fulfill a wide array of structural, functional, and regulatory roles in living cells, leading to a concomitantly significant increase in both the number and diversity of experimentally determined RNA three-dimensional (3D) structures. Atomic coordinates from a representative training set of solved RNA structures, displaying low sequence and structure similarity, facilitate derivation of knowledge-based energy functions. Here we develop an all-atom four-body statistical potential and evaluate its capacity to distinguish native RNA 3D structures from nonnative folds based on calculated free energy scores. Atomic four-body nearest-neighbors are objectively identified by their occurrence as tetrahedral vertices in the Delaunay tessellations of RNA structures, and rates of atomic quadruplet interactions expected by chance are obtained from a multinomial reference distribution. Our four-body energy function, referred to as RAMP (ribonucleic acids multibody potential), is subsequently derived by applying the inverted Boltzmann principle to the frequency data, yielding an energy score for each type of atomic quadruplet interaction. Several well-known benchmark datasets reveal that RAMP is comparable with, and often outperforms, existing knowledge- and physics-based energy functions. To the best of our knowledge, this is the first study detailing an RNA tertiary structure-based multibody statistical potential and its comparative evaluation. Copyright © 2018 Elsevier Ltd. All rights reserved.
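A compact sketch of the ingredients described above (deliberately simplified: atoms are typed coarsely, the reference distribution is a plain multinomial over single-atom-type frequencies, and no distance cut-offs are applied, unlike the published RAMP potential):

```python
import math
from collections import Counter
import numpy as np
from scipy.spatial import Delaunay

def four_body_scores(coords, atom_types):
    """coords: (n_atoms, 3) array of atomic coordinates; atom_types: list of n_atoms
    type labels. Quadruplets are read off the Delaunay tessellation, their observed
    frequencies are compared with a multinomial reference, and scores follow the
    inverted Boltzmann principle s(quad) = -ln(f_obs / f_ref)."""
    simplices = Delaunay(np.asarray(coords, dtype=float)).simplices
    quads = Counter(tuple(sorted(atom_types[i] for i in s)) for s in simplices)
    n_quads = sum(quads.values())
    p = {t: c / len(atom_types) for t, c in Counter(atom_types).items()}

    def reference_prob(quad):
        # Multinomial probability of drawing this unordered combination of types
        coef = math.factorial(4)
        for m in Counter(quad).values():
            coef //= math.factorial(m)
        prob = float(coef)
        for t in quad:
            prob *= p[t]
        return prob

    return {q: -math.log((c / n_quads) / reference_prob(q)) for q, c in quads.items()}
```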
Xu, H; Li, C; Zeng, Q; Agrawal, I; Zhu, X; Gong, Z
2016-06-01
In this study, to systematically identify the most stably expressed genes for internal reference in zebrafish Danio rerio investigations, 37 D. rerio transcriptomic datasets (both RNA sequencing and microarray data) were collected from gene expression omnibus (GEO) database and unpublished data, and gene expression variations were analysed under three experimental conditions: tissue types, developmental stages and chemical treatments. Forty-four putative candidate genes were identified with the c.v. <0·2 from all datasets. Following clustering into different functional groups, 21 genes, in addition to four conventional housekeeping genes (eef1a1l1, b2m, hrpt1l and actb1), were selected from different functional groups for further quantitative real-time (qrt-)PCR validation using 25 RNA samples from different adult tissues, developmental stages and chemical treatments. The qrt-PCR data were then analysed using the statistical algorithm refFinder for gene expression stability. Several new candidate genes showed better expression stability than the conventional housekeeping genes in all three categories. It was found that sep15 and metap1 were the top two stable genes for tissue types, ube2a and tmem50a the top two for different developmental stages, and rpl13a and rp1p0 the top two for chemical treatments. Thus, based on the extensive transcriptomic analyses and qrt-PCR validation, these new reference genes are recommended for normalization of D. rerio qrt-PCR data respectively for the three different experimental conditions. © 2016 The Fisheries Society of the British Isles.
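The first screening step described above (coefficient of variation below 0.2 across datasets) can be sketched as follows (hypothetical expression matrix and toy values; the subsequent refFinder and qrt-PCR validation steps are not reproduced):

```python
import pandas as pd

def stable_gene_candidates(expr, cv_threshold=0.2):
    """expr: DataFrame of normalized expression values, genes as rows and
    samples/datasets as columns. Returns genes whose coefficient of variation
    (std / mean) across columns is below the threshold, most stable first."""
    cv = expr.std(axis=1, ddof=1) / expr.mean(axis=1)
    return cv[cv < cv_threshold].sort_values()

# Toy example with two genes named in the abstract and one hypothetical unstable gene
expr = pd.DataFrame(
    {"tissue": [10.0, 5.0, 8.0], "larva": [11.0, 9.0, 8.2], "treated": [10.5, 2.0, 7.9]},
    index=["sep15", "unstable_gene", "metap1"],
)
print(stable_gene_candidates(expr))
```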
2013-01-01
Background Coffee production in Africa represents a significant share of the total export revenues and influences the lives of millions of people, yet severe socio-economic repercussions are felt annually as a result of the overall losses caused by the coffee berry disease (CBD). This quarantine disease is caused by the fungus Colletotrichum kahawae Waller and Bridge, which remains one of the most devastating threats to Coffea arabica production in Africa at high altitude, and its dispersal to Latin America and Asia represents a serious concern. Understanding the molecular genetic basis of coffee resistance to this disease is of high priority to support breeding strategies. Selection and validation of suitable reference genes presenting stable expression in the system studied is the first step in gene expression profiling studies. Results In this study, a set of ten genes (S24, 14-3-3, RPL7, GAPDH, UBQ9, VATP16, SAND, UQCC, IDE and β-Tub9) was evaluated to identify reference genes during the first hours of interaction (12, 48 and 72 hpi) between resistant and susceptible coffee genotypes and C. kahawae. Three analyses were performed for the selection of these genes, considering the entire dataset and the two genotypes (resistant and susceptible) separately. The three statistical methods applied (GeNorm, NormFinder, and BestKeeper) identified IDE as one of the most stable genes for all datasets analysed and, in contrast, GAPDH and UBQ9 as the least stable ones. In addition, the expression of two defense-related transcripts, encoding a receptor-like kinase and a pathogenesis-related protein 10, was used to validate the reference genes selected. Conclusion Taken together, our results provide guidelines for reference gene selection towards a more accurate and widespread use of qPCR to study the interaction between Coffea spp. and C. kahawae. PMID:24073624
Neural Network for Nanoscience Scanning Electron Microscope Image Recognition.
Modarres, Mohammad Hadi; Aversa, Rossella; Cozzini, Stefano; Ciancio, Regina; Leto, Angelo; Brandino, Giuseppe Piero
2017-10-16
In this paper we applied transfer learning techniques for image recognition, automatic categorization, and labeling of nanoscience images obtained by scanning electron microscope (SEM). Roughly 20,000 SEM images were manually classified into 10 categories to form a labeled training set, which can be used as a reference set for future applications of deep learning enhanced algorithms in the nanoscience domain. The categories chosen spanned the range of 0-Dimensional (0D) objects such as particles, 1D nanowires and fibres, 2D films and coated surfaces, and 3D patterned surfaces such as pillars. The training set was used to retrain on the SEM dataset and to compare many convolutional neural network models (Inception-v3, Inception-v4, ResNet). We obtained compatible results by performing a feature extraction of the different models on the same dataset. We performed additional analysis of the classifier on a second test set to further investigate the results both on particular cases and from a statistical point of view. Our algorithm was able to successfully classify around 90% of a test dataset consisting of SEM images, while reduced accuracy was found in the case of images at the boundary between two categories or containing elements of multiple categories. In these cases, the image classification did not identify a predominant category with a high score. We used the statistical outcomes from testing to deploy a semi-automatic workflow able to classify and label images generated by the SEM. Finally, a separate training was performed to determine the volume fraction of coherently aligned nanowires in SEM images. The results were compared with what was obtained using the Local Gradient Orientation method. This example demonstrates the versatility and the potential of transfer learning to address specific tasks of interest in nanoscience applications.
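A minimal sketch of the transfer-learning setup (written with PyTorch/torchvision rather than the Inception models named in the paper; the directory layout and the 10-category SEM taxonomy are assumptions): a pretrained backbone is used as a fixed feature extractor and only a new classification head is trained.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Assumed layout: sem_images/train/<category>/*.png with the 10 SEM categories
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("sem_images/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for p in model.parameters():            # freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # new head

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
model.train()
for epoch in range(5):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```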
The Minneapolis-St. Paul, MN EnviroAtlas Meter-scale Urban Land Cover (MULC) data were generated from four-band (red, green, blue, and near infrared) aerial photography provided by the United States Department of Agriculture (USDA) National Agricultural Imagery Program (NAIP). The NAIP imagery for the state of Minnesota was collected during the summer and fall of 2010. Lidar data and relevant ancillary datasets contributed to the classification. Eight land cover types were classified: water, impervious surface, soil and barren land, trees and forest, grass and herbaceous, agriculture, woody wetland, and emergent wetland. An accuracy assessment of 644 completely random and 62 stratified random photointerpreted reference points yielded an overall User's Accuracy of 83 percent. The boundary of this data layer is delineated by the US Census Bureau's 2010 Urban Statistical Area for Minneapolis-St. Paul, MN plus a 1-km buffer. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Danielson, Patrick; Yang, Limin; Jin, Suming; Homer, Collin G.; Napton, Darrell
2016-01-01
We developed a method that analyzes the quality of the cultivated cropland class mapped in the USA National Land Cover Database (NLCD) 2006. The method integrates multiple geospatial datasets and a Multi Index Integrated Change Analysis (MIICA) change detection method that captures spectral changes to identify the spatial distribution and magnitude of potential commission and omission errors for the cultivated cropland class in NLCD 2006. The majority of the commission and omission errors in NLCD 2006 are in areas where cultivated cropland is not the most dominant land cover type. The errors are primarily attributed to the less accurate training dataset derived from the National Agricultural Statistics Service Cropland Data Layer dataset. In contrast, error rates are low in areas where cultivated cropland is the dominant land cover. Agreement between model-identified commission errors and independently interpreted reference data was high (79%). Agreement was low (40%) for omission error comparison. The majority of the commission errors in the NLCD 2006 cultivated crops were confused with low-intensity developed classes, while the majority of omission errors were from herbaceous and shrub classes. Some errors were caused by inaccurate land cover change from misclassification in NLCD 2001 and the subsequent land cover post-classification process.
EnviroAtlas -- Austin, TX -- One Meter Resolution Urban Land Cover Data (2010)
The Austin, TX EnviroAtlas One Meter-scale Urban Land Cover (MULC) Data were generated from United States Department of Agriculture (USDA) National Agricultural Imagery Program (NAIP) four band (red, green, blue, and near infrared) aerial photography at 1 m spatial resolution from multiple dates in May, 2010. Six land cover classes were mapped: water, impervious surfaces, soil and barren land, trees, grass-herbaceous non-woody vegetation, and agriculture. An accuracy assessment of 600 completely random and 55 stratified random photo interpreted reference points yielded an overall User's fuzzy accuracy of 87 percent. The area mapped is the US Census Bureau's 2010 Urban Statistical Area for Austin, TX plus a 1 km buffer. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Data-adaptive harmonic analysis and prediction of sea level change in North Atlantic region
NASA Astrophysics Data System (ADS)
Kondrashov, D. A.; Chekroun, M.
2017-12-01
This study aims to characterize North Atlantic sea level variability across the temporal and spatial scales. We apply recently developed data-adaptive Harmonic Decomposition (DAH) and Multilayer Stuart-Landau Models (MSLM) stochastic modeling techniques [Chekroun and Kondrashov, 2017] to monthly 1993-2017 dataset of Combined TOPEX/Poseidon, Jason-1 and Jason-2/OSTM altimetry fields over North Atlantic region. The key numerical feature of the DAH relies on the eigendecomposition of a matrix constructed from time-lagged spatial cross-correlations. In particular, eigenmodes form an orthogonal set of oscillating data-adaptive harmonic modes (DAHMs) that come in pairs and in exact phase quadrature for a given temporal frequency. Furthermore, the pairs of data-adaptive harmonic coefficients (DAHCs), obtained by projecting the dataset onto associated DAHMs, can be very efficiently modeled by a universal parametric family of simple nonlinear stochastic models - coupled Stuart-Landau oscillators stacked per frequency, and synchronized across different frequencies by the stochastic forcing. Despite the short record of altimetry dataset, developed DAH-MSLM model provides for skillful prediction of key dynamical and statistical features of sea level variability. References M. D. Chekroun and D. Kondrashov, Data-adaptive harmonic spectra and multilayer Stuart-Landau models. HAL preprint, 2017, https://hal.archives-ouvertes.fr/hal-01537797
Analysis of 3d Building Models Accuracy Based on the Airborne Laser Scanning Point Clouds
NASA Astrophysics Data System (ADS)
Ostrowski, W.; Pilarska, M.; Charyton, J.; Bakuła, K.
2018-05-01
Creating 3D building models at large scale is becoming more popular and finds many applications. Nowadays, the broad term "3D building models" can be applied to several types of products: the well-known CityGML solid models (available at a few Levels of Detail), which are mainly generated from Airborne Laser Scanning (ALS) data, as well as 3D mesh models that can be created from both nadir and oblique aerial images. City authorities and national mapping agencies are interested in obtaining 3D building models. Apart from the completeness of the models, the accuracy aspect is also important. The final accuracy of a building model depends on various factors (accuracy of the source data, complexity of the roof shapes, etc.). In this paper, a methodology for the inspection of a dataset containing 3D models is presented. The proposed approach checks every building in the dataset against ALS point clouds, testing both accuracy and level of detail. The analysis of statistical parameters of normal heights for the reference point cloud and the tested planes, together with segmentation of the point cloud, provides a tool that can indicate which buildings and which roof planes do not fulfil the requirements of model accuracy and detail correctness. The proposed method was tested on two datasets: a solid model and a mesh model.
Hofman, Abe D.; Visser, Ingmar; Jansen, Brenda R. J.; van der Maas, Han L. J.
2015-01-01
We propose and test three statistical models for the analysis of children’s responses to the balance scale task, a seminal task to study proportional reasoning. We use a latent class modelling approach to formulate a rule-based latent class model (RB LCM) following from a rule-based perspective on proportional reasoning and a new statistical model, the Weighted Sum Model, following from an information-integration approach. Moreover, a hybrid LCM using item covariates is proposed, combining aspects of both a rule-based and information-integration perspective. These models are applied to two different datasets, a standard paper-and-pencil test dataset (N = 779), and a dataset collected within an online learning environment that included direct feedback, time-pressure, and a reward system (N = 808). For the paper-and-pencil dataset the RB LCM resulted in the best fit, whereas for the online dataset the hybrid LCM provided the best fit. The standard paper-and-pencil dataset yielded more evidence for distinct solution rules than the online data set in which quantitative item characteristics are more prominent in determining responses. These results shed new light on the discussion on sequential rule-based and information-integration perspectives of cognitive development. PMID:26505905
PARRoT- a homology-based strategy to quantify and compare RNA-sequencing from non-model organisms.
Gan, Ruei-Chi; Chen, Ting-Wen; Wu, Timothy H; Huang, Po-Jung; Lee, Chi-Ching; Yeh, Yuan-Ming; Chiu, Cheng-Hsun; Huang, Hsien-Da; Tang, Petrus
2016-12-22
Next-generation sequencing promises de novo genomic and transcriptomic analysis of samples of interest. However, only a few organisms have reference genomic sequences, and even fewer have well-defined or curated annotations. For transcriptome studies focusing on organisms lacking proper reference genomes, the common strategy is de novo assembly followed by functional annotation. However, things become even more complicated when multiple transcriptomes are compared. Here, we propose a new analysis strategy and quantification methods for expression levels that not only generate a virtual reference from the sequencing data, but also provide comparisons between transcriptomes. First, all reads from the transcriptome datasets are pooled together for de novo assembly. The assembled contigs are searched against the NCBI NR database to find potential homolog sequences. Based on the search results, a set of virtual transcripts is generated and serves as a reference transcriptome. By using the same reference, normalized quantification values including RC (read counts), eRPKM (estimated RPKM) and eTPM (estimated TPM) can be obtained that are comparable across transcriptome datasets. In order to demonstrate the feasibility of our strategy, we implement it in the web service PARRoT. PARRoT stands for Pipeline for Analyzing RNA Reads of Transcriptomes. It analyzes gene expression profiles for two transcriptome sequencing datasets. For a better understanding of the biological meaning from the comparison among transcriptomes, PARRoT further provides linkage between these virtual transcripts and their potential function by showing best hits in the SwissProt and NR databases and assigning GO terms. Our demo datasets showed that PARRoT can analyze two paired-end transcriptomic datasets of approximately 100 million reads within just three hours. In this study, we proposed and implemented a strategy to analyze transcriptomes from non-reference organisms which offers the opportunity to quantify and compare transcriptome profiles through a homolog-based virtual transcriptome reference. By using the homolog-based reference, our strategy effectively avoids the problems that may arise from inconsistencies among transcriptomes. This strategy will shed light on the field of comparative genomics for non-model organisms. We have implemented PARRoT as a web service which is freely available at http://parrot.cgu.edu.tw .
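The normalized quantities mentioned above follow the standard RPKM/TPM definitions; a small sketch (hypothetical read counts and transcript lengths) of converting RC into RPKM- and TPM-style values against the virtual reference:

```python
import numpy as np

def rpkm(read_counts, lengths_bp):
    """read_counts: per-transcript read counts; lengths_bp: transcript lengths (bp).
    RPKM = reads / (length in kb * total mapped reads in millions)."""
    counts = np.asarray(read_counts, dtype=float)
    kb = np.asarray(lengths_bp, dtype=float) / 1e3
    return counts / (kb * counts.sum() / 1e6)

def tpm(read_counts, lengths_bp):
    """TPM: length-normalized read rates rescaled so they sum to one million."""
    rate = np.asarray(read_counts, dtype=float) / (np.asarray(lengths_bp, dtype=float) / 1e3)
    return rate / rate.sum() * 1e6

counts, lengths = [120, 30, 600], [1500, 900, 3000]
print(rpkm(counts, lengths), tpm(counts, lengths))
```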
Comparing Data Sets: Implicit Summaries of the Statistical Properties of Number Sets
ERIC Educational Resources Information Center
Morris, Bradley J.; Masnick, Amy M.
2015-01-01
Comparing datasets, that is, sets of numbers in context, is a critical skill in higher order cognition. Although much is known about how people compare single numbers, little is known about how number sets are represented and compared. We investigated how subjects compared datasets that varied in their statistical properties, including ratio of…
GSuite HyperBrowser: integrative analysis of dataset collections across the genome and epigenome.
Simovski, Boris; Vodák, Daniel; Gundersen, Sveinung; Domanska, Diana; Azab, Abdulrahman; Holden, Lars; Holden, Marit; Grytten, Ivar; Rand, Knut; Drabløs, Finn; Johansen, Morten; Mora, Antonio; Lund-Andersen, Christin; Fromm, Bastian; Eskeland, Ragnhild; Gabrielsen, Odd Stokke; Ferkingstad, Egil; Nakken, Sigve; Bengtsen, Mads; Nederbragt, Alexander Johan; Thorarensen, Hildur Sif; Akse, Johannes Andreas; Glad, Ingrid; Hovig, Eivind; Sandve, Geir Kjetil
2017-07-01
Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. The software is available at: https://hyperbrowser.uio.no. © The Author 2017. Published by Oxford University Press.
Zhang, H Y; Shi, W H; Zhang, M; Yin, L; Pang, C; Feng, T P; Zhang, L; Ren, Y C; Wang, B Y; Yang, X Y; Zhou, J M; Han, C Y; Zhao, Y; Zhao, J Z; Hu, D S
2016-05-01
To provide a noninvasive type 2 diabetes mellitus (T2DM) prediction model for a rural Chinese population. From July to August, 2007 and July to August, 2008, a total of 20 194 participants aged ≥18 years were selected by a cluster sampling technique from a rural population in two townships of Henan province, China. Data were collected by questionnaire interview, anthropometric measurement, and fasting plasma glucose and lipid profile examination. A total of 17 265 participants were followed up from July to August, 2013 and July to October, 2014. Finally, 12 285 participants were selected for analysis. Data for these participants were randomly divided into a derivation group (derivation dataset, n=6 143) and a validation group (validation dataset, n=6 142) at a 1:1 ratio. Randomization was carried out by the use of computer-generated random numbers. A Cox regression model was used to analyze risk factors of T2DM in the derivation dataset. A T2DM prediction model was established by multiplying β by 10 for each significant variable. After the total score was calculated by the model, analysis of the receiver operating characteristic (ROC) curve was performed. The area under the ROC curve (AUC) was used for evaluating model predictability. Furthermore, the model's predictability was validated in the validation dataset and compared with the Finnish Diabetes Risk Score (FINDRISC) model. A total of 779 of the 12 285 participants developed T2DM during the 6-year study period. The incidence rate was 6.12% in the derivation dataset (n=376) and 6.56% in the validation dataset (n=403); the difference was not statistically significant (χ²=1.00, P=0.316). A total of four noninvasive T2DM prediction models were established using the Cox regression model. The ROC curves of the risk scores calculated by the prediction models indicated that the AUCs of these models were similar (0.67-0.70). The AUC and Youden index of model 4 were the highest. The optimal cut-off value, sensitivity, specificity, and Youden index were a score of 25, 65.96%, 66.47%, and 0.32, respectively. Age, sleep time, BMI, waist circumference, and hypertension were selected as predictive variables. Using age <30 years as reference, β values were 1.07, 1.58, and 1.67 and assigned scores were 11, 16, and 17 for age groups 30-44, 45-59, and ≥60 years, respectively. Using sleep time <8.0 h/d as reference, the β value and assigned score were 0.27 and 3, respectively, for sleep time ≥10.0 h/d. Using BMI 18.5-23.9 kg/m² as reference, β values were 0.53 and 1.00 and assigned scores were 5 and 10, respectively, for BMI 24.0-27.9 kg/m² and ≥28.0 kg/m². Using waist circumference <85 cm for males/<80 cm for females as reference, β values were 0.44 and 0.65 and assigned scores were 4 and 7, respectively, for 85 cm ≤ waist circumference <90 cm for males/80 cm ≤ waist circumference <85 cm for females, and waist circumference ≥90 cm for males/≥85 cm for females. Using nonhypertension as reference, the respective β value and assigned score were 0.34 and 3 for hypertension. The AUCs of this model and the FINDRISC model were 0.66 and 0.64 (P=0.135), respectively, in the validation dataset. Based on this cohort study, a noninvasive prediction model that included age, sleep time, BMI, waist circumference, and hypertension was established, which is equivalent to the FINDRISC model and applicable to a rural Chinese population.
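Using the point scores reported above (model 4), a minimal sketch of computing one participant's total risk score and comparing it with the optimal cut-off of 25 could look like the following; the function name and the assumption that the cut-off flags scores of 25 or higher are illustrative, not taken from the paper.

```python
def t2dm_risk_score(age, sleep_h_per_day, bmi, waist_cm, sex, hypertension):
    """Sum the published point scores for the noninvasive T2DM model (model 4)."""
    score = 0
    # age (reference: < 30 years)
    if 30 <= age <= 44:
        score += 11
    elif 45 <= age <= 59:
        score += 16
    elif age >= 60:
        score += 17
    # sleep time (reference: < 8.0 h/d; only >= 10.0 h/d carries points)
    if sleep_h_per_day >= 10.0:
        score += 3
    # BMI (reference: 18.5-23.9 kg/m^2)
    if 24.0 <= bmi <= 27.9:
        score += 5
    elif bmi >= 28.0:
        score += 10
    # waist circumference (sex-specific cut points)
    lo, hi = (85, 90) if sex == "male" else (80, 85)
    if lo <= waist_cm < hi:
        score += 4
    elif waist_cm >= hi:
        score += 7
    # hypertension (reference: no hypertension)
    if hypertension:
        score += 3
    return score, score >= 25   # assumed: scores at or above the cut-off flag high risk

print(t2dm_risk_score(52, 9.0, 26.1, 88, "male", True))   # -> (28, True)
```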
NASA Astrophysics Data System (ADS)
Ma, Yaping; Xiao, Yegui; Wei, Guo; Sun, Jinwei
2016-01-01
In this paper, a multichannel nonlinear adaptive noise canceller (ANC) based on the generalized functional link artificial neural network (FLANN, GFLANN) is proposed for fetal electrocardiogram (FECG) extraction. A FIR filter and a GFLANN are equipped in parallel in each reference channel to respectively approximate the linearity and nonlinearity between the maternal ECG (MECG) and the composite abdominal ECG (AECG). A fast scheme is also introduced to reduce the computational cost of the FLANN and the GFLANN. Two (2) sets of ECG time sequences, one synthetic and one real, are utilized to demonstrate the improved effectiveness of the proposed nonlinear ANC. The real dataset is derived from the Physionet non-invasive FECG database (PNIFECGDB) including 55 multichannel recordings taken from a pregnant woman. It contains two subdatasets that consist of 14 and 8 recordings, respectively, with each recording being 90 s long. Simulation results based on these two datasets reveal, on the whole, that the proposed ANC does enjoy higher capability to deal with nonlinearity between MECG and AECG as compared with previous ANCs in terms of fetal QRS (FQRS)-related statistics and morphology of the extracted FECG waveforms. In particular, for the second real subdataset, the F1-measure results produced by the PCA-based template subtraction (TSpca) technique and six (6) single-reference channel ANCs using LMS- and RLS-based FIR filters, Volterra filter, FLANN, GFLANN, and adaptive echo state neural network (ESNa) are 92.47%, 93.70%, 94.07%, 94.22%, 94.90%, 94.90%, and 95.46%, respectively. The same F1-measure statistical results from five (5) multi-reference channel ANCs (LMS- and RLS-based FIR filters, Volterra filter, FLANN, and GFLANN) for the second real subdataset turn out to be 94.08%, 94.29%, 94.68%, 94.91%, and 94.96%, respectively. These results indicate that the ESNa and GFLANN perform best, with the ESNa being slightly better than the GFLANN but about four times more computationally expensive than the GFLANN, which makes the GFLANN a good alternative for NI-FECG extraction.
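For readers unfamiliar with adaptive noise cancellation, a minimal single-reference-channel linear ANC (an LMS-adaptive FIR filter, i.e. only the linear branch of the structure described above, without the FLANN/GFLANN nonlinear branch) can be sketched as follows; the filter length and step size are illustrative choices, not values from the paper.

```python
import numpy as np

def lms_anc(abdominal, maternal_ref, n_taps=16, mu=0.01):
    """Linear adaptive noise canceller: estimate the MECG component of the
    abdominal ECG from the maternal reference channel and return the residual,
    which serves as the FECG estimate. mu must be small relative to input power."""
    abdominal = np.asarray(abdominal, dtype=float)
    maternal_ref = np.asarray(maternal_ref, dtype=float)
    w = np.zeros(n_taps)                      # adaptive FIR weights
    fecg_est = np.zeros(len(abdominal))
    for n in range(n_taps, len(abdominal)):
        x = maternal_ref[n - n_taps:n][::-1]  # most recent reference samples
        y = w @ x                             # estimated maternal interference
        e = abdominal[n] - y                  # residual = fetal ECG estimate
        w += 2 * mu * e * x                   # LMS weight update
        fecg_est[n] = e
    return fecg_est
```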
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ma, R; Zhu, X; Li, S
Purpose: High Dose Rate (HDR) brachytherapy forward planning is principally an iterative process; hence, plan quality is affected by planners' experience and limited planning time, which may lead to sporadic errors and inconsistencies in planning. A statistical tool based on previously approved clinical treatment plans would help to maintain the consistency of planning quality and improve the efficiency of second checking. Methods: An independent dose calculation tool was developed from commercial software. Thirty-three previously approved cervical HDR plans with the same prescription dose (550 cGy), applicator type, and treatment protocol were examined, and ICRU-defined reference point doses (bladder, vaginal mucosa, rectum, and points A/B) along with dwell times were collected. The dose calculation tool then calculated an appropriate range with a 95% confidence interval for each parameter obtained, which would be used as the benchmark for evaluation of those parameters in future HDR treatment plans. Model quality was verified using five randomly selected approved plans from the same dataset. Results: Dose variations appear to be larger at the bladder and mucosa reference points than at the rectum. Most reference point doses from the verification plans fell within the predicted range, except the doses of two points of the rectum and two points of reference position A (owing to rectal anatomical variations and clinical adjustment of prescription points, respectively). Similar results were obtained for tandem and ring dwell times despite relatively larger uncertainties. Conclusion: This statistical tool provides insight into the clinically acceptable range of cervical HDR plans, which could be useful in plan checking and identifying potential planning errors, thus improving the consistency of plan quality.
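A minimal sketch of how such benchmark ranges might be built from previously approved plans, assuming the 95% interval comes from a normal approximation for each parameter (the abstract does not state the exact interval construction, and the dose values below are made up):

```python
import numpy as np
from scipy import stats

def reference_range(values, level=0.95):
    """Normal-approximation 95% range for one plan parameter, e.g. a bladder point dose."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std(ddof=1)
    z = stats.norm.ppf(0.5 + level / 2)
    return mean - z * sd, mean + z * sd

approved_bladder_doses = [312, 298, 305, 330, 290, 315, 301, 322]   # illustrative cGy values
lo, hi = reference_range(approved_bladder_doses)
new_plan_dose = 345
print(round(lo, 1), round(hi, 1), lo <= new_plan_dose <= hi)        # flag doses outside the range
```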
Vision-based gait impairment analysis for aided diagnosis.
Ortells, Javier; Herrero-Ezquerro, María Trinidad; Mollineda, Ramón A
2018-02-12
Gait is a firsthand reflection of health condition. This belief has inspired recent research efforts to automate the analysis of pathological gait, in order to assist physicians in decision-making. However, most of these efforts rely on gait descriptions which are difficult to understand by humans, or on sensing technologies hardly available in ambulatory services. This paper proposes a number of semantic and normalized gait features computed from a single video acquired by a low-cost sensor. Far from being conventional spatio-temporal descriptors, features are aimed at quantifying gait impairment, such as gait asymmetry from several perspectives or falling risk. They were designed to be invariant to frame rate and image size, allowing cross-platform comparisons. Experiments were formulated in terms of two databases. A well-known general-purpose gait dataset is used to establish normal references for features, while a new database, introduced in this work, provides samples under eight different walking styles: one normal and seven impaired patterns. A number of statistical studies were carried out to prove the sensitivity of features at measuring the expected pathologies, providing enough evidence about their accuracy. Graphical Abstract: at the top, a robust, semantic and easy-to-interpret feature set to describe impaired gait patterns; at the bottom, a new dataset consisting of video-recordings of a number of volunteers simulating different patterns of pathological gait, where features were statistically assessed.
Statistical Compression for Climate Model Output
NASA Astrophysics Data System (ADS)
Hammerling, D.; Guinness, J.; Soh, Y. J.
2017-12-01
Numerical climate model simulations run at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important to develop methods for representing the full datasets by smaller compressed versions. We propose a statistical compression and decompression algorithm based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full dataset given the summary statistics. We decompress the data by computing conditional expectations and conditional simulations from the model given the summary statistics. Conditional expectations represent our best estimate of the original data but are subject to oversmoothing in space and time. Conditional simulations introduce realistic small-scale noise so that the decompressed fields are neither too smooth nor too rough compared with the original data. Considerable attention is paid to accurately modeling the original dataset (one year of daily mean temperature data), particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured, while allowing for fast decompression and conditional emulation on modest computers.
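As a heavily simplified illustration of the compress/decompress idea (the paper models space-time dependence and nonstationarity, which this sketch ignores entirely), one could store per-tile means and standard deviations as the summary statistics and reconstruct either the conditional expectation or a conditional simulation:

```python
import numpy as np

def compress(field, block=8):
    """Summary statistics per (block x block) spatial tile: mean and std over all times."""
    t, ny, nx = field.shape
    summary = {}
    for j in range(0, ny, block):
        for i in range(0, nx, block):
            tile = field[:, j:j + block, i:i + block]
            summary[(j, i)] = (tile.mean(), tile.std())
    return summary

def decompress(summary, shape, block=8, simulate=True, seed=0):
    """simulate=False gives the conditional expectation (smooth); True adds noise."""
    rng = np.random.default_rng(seed)
    t, ny, nx = shape
    out = np.empty(shape)
    for (j, i), (m, s) in summary.items():
        hj, wi = min(block, ny - j), min(block, nx - i)
        noise = rng.normal(0.0, s, (t, hj, wi)) if simulate else 0.0
        out[:, j:j + hj, i:i + wi] = m + noise
    return out

field = np.random.default_rng(1).normal(280, 5, (365, 32, 64))   # toy daily temperature cube
reconstructed = decompress(compress(field), field.shape)
```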
Application of multivariate statistical techniques in microbial ecology
Paliy, O.; Shankar, V.
2016-01-01
Recent advances in high-throughput methods of molecular analyses have led to an explosion of studies generating large scale ecological datasets. Especially noticeable effect has been attained in the field of microbial ecology, where new experimental approaches provided in-depth assessments of the composition, functions, and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces large amounts of data, powerful statistical techniques of multivariate analysis are well suited to analyze and interpret these datasets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular dataset. In this review we describe and compare the most widely used multivariate statistical techniques including exploratory, interpretive, and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and dataset structure. PMID:26786791
An integrated pan-tropical biomass map using multiple reference datasets.
Avitabile, Valerio; Herold, Martin; Heuvelink, Gerard B M; Lewis, Simon L; Phillips, Oliver L; Asner, Gregory P; Armston, John; Ashton, Peter S; Banin, Lindsay; Bayol, Nicolas; Berry, Nicholas J; Boeckx, Pascal; de Jong, Bernardus H J; DeVries, Ben; Girardin, Cecile A J; Kearsley, Elizabeth; Lindsell, Jeremy A; Lopez-Gonzalez, Gabriela; Lucas, Richard; Malhi, Yadvinder; Morel, Alexandra; Mitchard, Edward T A; Nagy, Laszlo; Qie, Lan; Quinones, Marcela J; Ryan, Casey M; Ferry, Slik J W; Sunderland, Terry; Laurin, Gaia Vaglio; Gatti, Roberto Cazzolla; Valentini, Riccardo; Verbeeck, Hans; Wijaya, Arief; Willcock, Simon
2016-04-01
We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of field observations and locally calibrated high-resolution biomass maps, harmonized and upscaled to 14 477 1-km AGB estimates. Our data fusion approach uses bias removal and weighted linear averaging that incorporates and spatializes the biomass patterns indicated by the reference data. The method was applied independently in areas (strata) with homogeneous error patterns of the input (Saatchi and Baccini) maps, which were estimated from the reference data and additional covariates. Based on the fused map, we estimated AGB stock for the tropics (23.4°N-23.4°S) of 375 Pg dry mass, 9-18% lower than the Saatchi and Baccini estimates. The fused map also showed differing spatial patterns of AGB over large areas, with higher AGB density in the dense forest areas in the Congo basin, Eastern Amazon and South-East Asia, and lower values in Central America and in most dry vegetation areas of Africa than either of the input maps. The validation exercise, based on 2118 estimates from the reference dataset not used in the fusion process, showed that the fused map had a RMSE 15-21% lower than that of the input maps and, most importantly, nearly unbiased estimates (mean bias 5 Mg dry mass ha⁻¹ vs. 21 and 28 Mg ha⁻¹ for the input maps). The fusion method can be applied at any scale including the policy-relevant national level, where it can provide improved biomass estimates by integrating existing regional biomass maps as input maps and additional, country-specific reference datasets. © 2015 John Wiley & Sons Ltd.
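A greatly simplified sketch of the bias-removal and weighted-averaging step for two input maps, assuming a single stratum and weights inversely proportional to each map's residual variance against co-located reference AGB values (the actual method works per stratum and also spatializes the reference patterns with covariates):

```python
import numpy as np

def fuse_agb_maps(map_a, map_b, ref, ref_a, ref_b):
    """Fuse two AGB maps (Mg/ha): remove each map's mean bias against reference
    plots, then average with inverse-variance weights."""
    bias_a, bias_b = np.mean(ref_a - ref), np.mean(ref_b - ref)
    adj_a, adj_b = map_a - bias_a, map_b - bias_b
    w_a = 1.0 / np.var(ref_a - bias_a - ref)
    w_b = 1.0 / np.var(ref_b - bias_b - ref)
    return (w_a * adj_a + w_b * adj_b) / (w_a + w_b)

ref   = np.array([250.0, 180.0, 310.0])    # field-plot AGB at validation locations
ref_a = np.array([270.0, 200.0, 340.0])    # map A sampled at those locations
ref_b = np.array([240.0, 160.0, 300.0])    # map B sampled at those locations
print(fuse_agb_maps(np.array([265.0]), np.array([235.0]), ref, ref_a, ref_b))
```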
DeltaSA tool for source apportionment benchmarking, description and sensitivity analysis
NASA Astrophysics Data System (ADS)
Pernigotti, D.; Belis, C. A.
2018-05-01
DeltaSA is an R-package and a Java on-line tool developed at the EC-Joint Research Centre to assist and benchmark source apportionment applications. Its key functionalities support two critical tasks in this kind of study: the assignment of a factor to a source in factor analytical models (source identification) and the model performance evaluation. The source identification is based on the similarity between a given factor and source chemical profiles from public databases. The model performance evaluation is based on statistical indicators used to compare model output with reference values generated in intercomparison exercises. The reference values are calculated as the ensemble average of the results reported by participants that have passed a set of testing criteria based on chemical profiles and time series similarity. In this study, a sensitivity analysis of the model performance criteria is accomplished using the results of a synthetic dataset where "a priori" references are available. The consensus-modulated standard deviation p_unc provides the best choice for the model performance evaluation when a conservative approach is adopted.
Evaluation of Global Observations-Based Evapotranspiration Datasets and IPCC AR4 Simulations
NASA Technical Reports Server (NTRS)
Mueller, B.; Seneviratne, S. I.; Jimenez, C.; Corti, T.; Hirschi, M.; Balsamo, G.; Ciais, P.; Dirmeyer, P.; Fisher, J. B.; Guo, Z.;
2011-01-01
Quantification of global land evapotranspiration (ET) has long been associated with large uncertainties due to the lack of reference observations. Several recently developed products now provide the capacity to estimate ET at global scales. These products, partly based on observational data, include satellite-based products, land surface model (LSM) simulations, atmospheric reanalysis output, estimates based on empirical upscaling of eddy-covariance flux measurements, and atmospheric water balance datasets. The LandFlux-EVAL project aims to evaluate and compare these newly developed datasets. Additionally, an evaluation of IPCC AR4 global climate model (GCM) simulations is presented, providing an assessment of their capacity to reproduce flux behavior relative to the observations-based products. Though differently constrained with observations, the analyzed reference datasets display similar large-scale ET patterns. ET from the IPCC AR4 simulations was significantly smaller than that from the other products for India (up to 1 mm/d) and parts of eastern South America, and larger in the western USA, Australia and China. The inter-product variance is lower across the IPCC AR4 simulations than across the reference datasets in several regions, which indicates that uncertainties may be underestimated in the IPCC AR4 models due to shared biases of these simulations.
Chung, Dongjun; Kim, Hang J; Zhao, Hongyu
2017-02-01
Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with hundreds of phenotypes and diseases, which have provided clinical and medical benefits to patients with novel biomarkers and therapeutic targets. However, identification of risk variants associated with complex diseases remains challenging as they are often affected by many genetic variants with small or moderate effects. There has been accumulating evidence suggesting that different complex traits share a common risk basis, namely pleiotropy. Recently, several statistical methods have been developed to improve statistical power to identify risk variants for complex traits through a joint analysis of multiple GWAS datasets by leveraging pleiotropy. While these methods were shown to improve statistical power for association mapping compared to separate analyses, they are still limited in the number of phenotypes that can be integrated. In order to address this challenge, in this paper, we propose a novel statistical framework, graph-GPA, to integrate a large number of GWAS datasets for multiple phenotypes using a hidden Markov random field approach. Application of graph-GPA to a joint analysis of GWAS datasets for 12 phenotypes shows that graph-GPA improves statistical power to identify risk variants compared to statistical methods based on a smaller number of GWAS datasets. In addition, graph-GPA also promotes better understanding of genetic mechanisms shared among phenotypes, which can potentially be useful for the development of improved diagnosis and therapeutics. The R implementation of graph-GPA is currently available at https://dongjunchung.github.io/GGPA/.
Global daily reference evapotranspiration modeling and evaluation
Senay, G.B.; Verdin, J.P.; Lietzow, R.; Melesse, Assefa M.
2008-01-01
Accurate and reliable evapotranspiration (ET) datasets are crucial in regional water and energy balance studies. Due to the complex instrumentation requirements, actual ET values are generally estimated from reference ET values by adjustment factors using coefficients for water stress and vegetation conditions, commonly referred to as crop coefficients. Until recently, the modeling of reference ET has been solely based on important weather variables collected from weather stations that are generally located in selected agro-climatic locations. Since 2001, the National Oceanic and Atmospheric Administration’s Global Data Assimilation System (GDAS) has been producing six-hourly climate parameter datasets that are used to calculate daily reference ET for the whole globe at 1-degree spatial resolution. The U.S. Geological Survey Center for Earth Resources Observation and Science has been producing daily reference ET (ETo) since 2001, and it has been used on a variety of operational hydrological models for drought and streamflow monitoring all over the world. With the increasing availability of local station-based reference ET estimates, we evaluated the GDAS-based reference ET estimates using data from the California Irrigation Management Information System (CIMIS). Daily CIMIS reference ET estimates from 85 stations were compared with GDAS-based reference ET at different spatial and temporal scales using five-year daily data from 2002 through 2006. Despite the large difference in spatial scale (point vs. ∼100 km grid cell) between the two datasets, the correlations between station-based ET and GDAS-ET were very high, exceeding 0.97 on a daily basis to more than 0.99 on time scales of more than 10 days. Both the temporal and spatial correspondences in trend/pattern and magnitudes between the two datasets were satisfactory, suggesting the reliability of using GDAS parameter-based reference ET for regional water and energy balance studies in many parts of the world. While the study revealed the potential of GDAS ETo for large-scale hydrological applications, site-specific use of GDAS ETo in complex hydro-climatic regions such as coastal areas and rugged terrain may require the application of bias correction and/or disaggregation of the GDAS ETo using downscaling techniques.
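The abstract does not restate the reference ET equation; the GDAS and CIMIS ETo products are typically computed with Penman-Monteith-type formulations, but as a compact illustration of turning daily weather variables into reference ET, the simpler Hargreaves-Samani equation is sketched below (the input values are illustrative only):

```python
def hargreaves_eto(tmin_c, tmax_c, ra_mm_per_day):
    """Daily reference ET (mm/day) from the Hargreaves-Samani (1985) equation.

    tmin_c, tmax_c : daily minimum / maximum air temperature (deg C)
    ra_mm_per_day  : extraterrestrial radiation expressed as mm/day of evaporation
    """
    tmean = (tmin_c + tmax_c) / 2.0
    return 0.0023 * ra_mm_per_day * (tmean + 17.8) * (tmax_c - tmin_c) ** 0.5

print(hargreaves_eto(12.0, 30.0, 15.0))   # roughly 5.7 mm/day for a warm, clear day
```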
Re-Construction of Reference Population and Generating Weights by Decision Tree
2017-07-21
Claflin University, Orangeburg, SC 29115; Defense Equal Opportunity Management Institute Research, Development, and Strategic Initiatives, 2017. Figures: Flow and Components of Project; Decision Tree; Effects of Weight… The dataset of this project has the reference population at the unit level for group and gender (shown in a red-dotted box in the figure), against which sample data can be compared.
The International Human Epigenome Consortium Data Portal.
Bujold, David; Morais, David Anderson de Lima; Gauthier, Carol; Côté, Catherine; Caron, Maxime; Kwan, Tony; Chen, Kuang Chung; Laperle, Jonathan; Markovits, Alexei Nordell; Pastinen, Tomi; Caron, Bryan; Veilleux, Alain; Jacques, Pierre-Étienne; Bourque, Guillaume
2016-11-23
The International Human Epigenome Consortium (IHEC) coordinates the production of reference epigenome maps through the characterization of the regulome, methylome, and transcriptome from a wide range of tissues and cell types. To define conventions ensuring the compatibility of datasets and establish an infrastructure enabling data integration, analysis, and sharing, we developed the IHEC Data Portal (http://epigenomesportal.ca/ihec). The portal provides access to >7,000 reference epigenomic datasets, generated from >600 tissues, which have been contributed by seven international consortia: ENCODE, NIH Roadmap, CEEHRC, Blueprint, DEEP, AMED-CREST, and KNIH. The portal enhances the utility of these reference maps by facilitating the discovery, visualization, analysis, download, and sharing of epigenomics data. The IHEC Data Portal is the official source to navigate through IHEC datasets and represents a strategy for unifying the distributed data produced by international research consortia. Crown Copyright © 2016. Published by Elsevier Inc. All rights reserved.
Validation of the Hospital Episode Statistics Outpatient Dataset in England.
Thorn, Joanna C; Turner, Emma; Hounsome, Luke; Walsh, Eleanor; Donovan, Jenny L; Verne, Julia; Neal, David E; Hamdy, Freddie C; Martin, Richard M; Noble, Sian M
2016-02-01
The Hospital Episode Statistics (HES) dataset is a source of administrative 'big data' with potential for costing purposes in economic evaluations alongside clinical trials. This study assesses the validity of coverage in the HES outpatient dataset. Men who died of, or with, prostate cancer were selected from a prostate-cancer screening trial (CAP, Cluster randomised triAl of PSA testing for Prostate cancer). Details of visits that took place after 1/4/2003 to hospital outpatient departments for conditions related to prostate cancer were extracted from medical records (MR); these appointments were sought in the HES outpatient dataset based on date. The matching procedure was repeated for periods before and after 1/4/2008, when the HES outpatient dataset was accredited as a national statistic. 4922 outpatient appointments were extracted from MR for 370 men. 4088 appointments recorded in MR were identified in the HES outpatient dataset (83.1%; 95% confidence interval [CI] 82.0-84.1). For appointments occurring prior to 1/4/2008, 2195/2755 (79.7%; 95% CI 78.2-81.2) matches were observed, while 1893/2167 (87.4%; 95% CI 86.0-88.9) appointments occurring after 1/4/2008 were identified (p for difference <0.001). 215/370 men (58.1%) had at least one appointment in the MR review that was unmatched in HES, 155 men (41.9%) had all their appointments identified, and 20 men (5.4%) had no appointments identified in HES. The HES outpatient dataset appears reasonably valid for research, particularly following accreditation. The dataset may be a suitable alternative to collecting MR data from hospital notes within a trial, although caution should be exercised with data collected prior to accreditation.
Challenges in Collating Spirometry Reference Data for South-Asian Children: An Observational Study
Lum, Sooky; Bountziouka, Vassiliki; Quanjer, Philip; Sonnappa, Samatha; Wade, Angela; Beardsmore, Caroline; Chhabra, Sunil K.; Chudasama, Rajesh K.; Cook, Derek G.; Harding, Seeromanie; Kuehni, Claudia E.; Prasad, K. V. V.; Whincup, Peter H.; Lee, Simon; Stocks, Janet
2016-01-01
Availability of sophisticated statistical modelling for developing robust reference equations has improved interpretation of lung function results. In 2012, the Global Lung function Initiative (GLI) published the first global all-age, multi-ethnic reference equations for spirometry but these lacked equations for those originating from the Indian subcontinent (South-Asians). The aims of this study were to assess the extent to which existing GLI-ethnic adjustments might fit South-Asian paediatric spirometry data, assess any similarities and discrepancies between South-Asian datasets and explore the feasibility of deriving a suitable South-Asian GLI-adjustment. Methods: Spirometry datasets from South-Asian children were collated from four centres in India and five within the UK. Records with transcription errors, missing values for height or spirometry, and implausible values were excluded (n = 110). Results: Following exclusions, cross-sectional data were available from 8,124 children (56.3% male; 5–17 years). When compared with GLI-predicted values from White Europeans, forced expired volume in 1 s (FEV1) and forced vital capacity (FVC) in South-Asian children were on average 15% lower, ranging from 4–19% between centres. By contrast, proportional reductions in FEV1 and FVC within all but two datasets meant that the FEV1/FVC ratio remained independent of ethnicity. The ‘GLI-Other’ equation fitted data from North India reasonably well while ‘GLI-Black’ equations provided a better approximation for South-Asian data than the ‘GLI-White’ equation. However, marked discrepancies in the mean lung function z-scores between centres especially when examined according to socio-economic conditions precluded derivation of a single South-Asian GLI-adjustment. Conclusion: Until improved and more robust prediction equations can be derived, we recommend the use of ‘GLI-Black’ equations for interpreting most South-Asian data, although ‘GLI-Other’ may be more appropriate for North Indian data. Prospective data collection using standardised protocols to explore potential sources of variation due to socio-economic circumstances, secular changes in growth/predictors of lung function and ethnicities within the South-Asian classification are urgently required. PMID:27119342
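The GLI-2012 equations express predicted spirometry through the LMS (lambda-mu-sigma) method, so converting a measured value to a z-score, given the GLI-predicted L, M and S for a child of a given sex, age and height, follows Cole's LMS transformation; the numbers below are illustrative, not real GLI coefficients.

```python
from math import log

def lms_zscore(measured, L, M, S):
    """Z-score under the LMS method used by the GLI-2012 reference equations."""
    if abs(L) < 1e-9:                       # limiting case as L approaches 0
        return log(measured / M) / S
    return ((measured / M) ** L - 1.0) / (L * S)

# illustrative values only; real L, M, S come from the GLI coefficient tables/splines
print(round(lms_zscore(measured=1.85, L=0.9, M=2.20, S=0.12), 2))   # about -1.3
```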
Reducing Information Overload in Large Seismic Data Sets
DOE Office of Scientific and Technical Information (OSTI.GOV)
HAMPTON,JEFFERY W.; YOUNG,CHRISTOPHER J.; MERCHANT,BION J.
2000-08-02
Event catalogs for seismic data can become very large. Furthermore, as researchers collect multiple catalogs and reconcile them into a single catalog that is stored in a relational database, the reconciled set becomes even larger. The sheer number of these events makes searching for relevant events to compare with events of interest problematic. Information overload in this form can lead to the data sets being under-utilized and/or used incorrectly or inconsistently. Thus, efforts have been initiated to research techniques and strategies for helping researchers to make better use of large data sets. In this paper, the authors present their efforts to do so in two ways: (1) the Event Search Engine, which is a waveform correlation tool, and (2) some content analysis tools, which are a combination of custom-built and commercial off-the-shelf tools for accessing, managing, and querying seismic data stored in a relational database. The current Event Search Engine is based on a hierarchical clustering tool known as the dendrogram tool, which is written as a MatSeis graphical user interface. The dendrogram tool allows the user to build dendrogram diagrams for a set of waveforms by controlling phase windowing, down-sampling, filtering, enveloping, and the clustering method (e.g. single linkage, complete linkage, flexible method). It also allows the clustering to be based on two or more stations simultaneously, which is important to bridge gaps in the sparsely recorded event sets anticipated in such a large reconciled event set. Current efforts are focusing on tools to help the researcher winnow the clusters defined using the dendrogram tool down to the minimum optimal identification set. This will become critical as the number of reference events in the reconciled event set continually grows. The dendrogram tool is part of the MatSeis analysis package, which is available on the Nuclear Explosion Monitoring Research and Engineering Program Web Site. As part of the research into how to winnow the reference events in these large reconciled event sets, additional database query approaches have been developed to provide windows into these datasets. These custom-built content analysis tools help identify dataset characteristics that can potentially aid in providing a basis for comparing similar reference events in these large reconciled event sets. Once these characteristics can be identified, algorithms can be developed to create and add to the reduced set of events used by the Event Search Engine. These content analysis tools have already been useful in providing information on station coverage of the referenced events and basic statistical information on events in the research datasets. The tools can also provide researchers with a quick way to find interesting and useful events within the research datasets. The tools could also be used as a means to review reference event datasets as part of a dataset delivery verification process. There has also been an effort to explore the usefulness of commercially available web-based software to help with this problem. The advantages of using off-the-shelf software applications, such as Oracle's WebDB, to manipulate, customize and manage research data are being investigated. These types of applications are being examined to provide access to large integrated data sets for regional seismic research in Asia.
All of these software tools would provide the researcher with unprecedented power without having to learn the intricacies and complexities of relational database systems.
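A minimal sketch of clustering events by waveform similarity in the spirit of the dendrogram tool, using correlation distance and complete linkage (the preprocessing, distance measure and linkage choices here are illustrative; the actual tool is a MatSeis GUI with windowing, filtering and enveloping options):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# toy event set: six waveforms from two underlying source types, plus noise
base_a = np.sin(np.linspace(0, 20, 500))
base_b = np.sign(np.sin(np.linspace(0, 12, 500)))
waveforms = np.array([b + 0.3 * rng.normal(size=500)
                      for b in (base_a, base_a, base_a, base_b, base_b, base_b)])

dist = pdist(waveforms, metric="correlation")    # 1 - correlation between waveform pairs
tree = linkage(dist, method="complete")          # complete-linkage hierarchical clustering
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)                                    # events grouped by similarity, e.g. [1 1 1 2 2 2]
```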
Probabilistic atlas and geometric variability estimation to drive tissue segmentation.
Xu, Hao; Thirion, Bertrand; Allassonnière, Stéphanie
2014-09-10
Computerized anatomical atlases play an important role in medical image analysis. While an atlas usually refers to a standard or mean image also called template, which presumably represents well a given population, it is not enough to characterize the observed population in detail. A template image should be learned jointly with the geometric variability of the shapes represented in the observations. These two quantities will in the sequel form the atlas of the corresponding population. The geometric variability is modeled as deformations of the template image so that it fits the observations. In this paper, we provide a detailed analysis of a new generative statistical model based on dense deformable templates that represents several tissue types observed in medical images. Our atlas contains both an estimation of probability maps of each tissue (called class) and the deformation metric. We use a stochastic algorithm for the estimation of the probabilistic atlas given a dataset. This atlas is then used for atlas-based segmentation method to segment the new images. Experiments are shown on brain T1 MRI datasets. Copyright © 2014 John Wiley & Sons, Ltd.
An Intelligent Polar Cyberinfrastructure to Support Spatiotemporal Decision Making
NASA Astrophysics Data System (ADS)
Song, M.; Li, W.; Zhou, X.
2014-12-01
In the era of big data, polar sciences have already faced an urgent demand of utilizing intelligent approaches to support precise and effective spatiotemporal decision-making. Service-oriented cyberinfrastructure has advantages of seamlessly integrating distributed computing resources, and aggregating a variety of geospatial data derived from Earth observation network. This paper focuses on building a smart service-oriented cyberinfrastructure to support intelligent question answering related to polar datasets. The innovation of this polar cyberinfrastructure includes: (1) a problem-solving environment that parses geospatial question in natural language, builds geoprocessing rules, composites atomic processing services and executes the entire workflow; (2) a self-adaptive spatiotemporal filter that is capable of refining query constraints through semantic analysis; (3) a dynamic visualization strategy to support results animation and statistics in multiple spatial reference systems; and (4) a user-friendly online portal to support collaborative decision-making. By means of this polar cyberinfrastructure, we intend to facilitate integration of distributed and heterogeneous Arctic datasets and comprehensive analysis of multiple environmental elements (e.g. snow, ice, permafrost) to provide a better understanding of the environmental variation in circumpolar regions.
FMAP: Functional Mapping and Analysis Pipeline for metagenomics and metatranscriptomics studies.
Kim, Jiwoong; Kim, Min Soo; Koh, Andrew Y; Xie, Yang; Zhan, Xiaowei
2016-10-10
Given the lack of a complete and comprehensive library of microbial reference genomes, determining the functional profile of diverse microbial communities is challenging. The available functional analysis pipelines lack several key features: (i) an integrated alignment tool, (ii) operon-level analysis, and (iii) the ability to process large datasets. Here we introduce our open-sourced, stand-alone functional analysis pipeline for analyzing whole metagenomic and metatranscriptomic sequencing data, FMAP (Functional Mapping and Analysis Pipeline). FMAP performs alignment, gene family abundance calculations, and statistical analysis (three levels of analyses are provided: differentially-abundant genes, operons and pathways). The resulting output can be easily visualized with heatmaps and functional pathway diagrams. FMAP functional predictions are consistent with currently available functional analysis pipelines. FMAP is a comprehensive tool for providing functional analysis of metagenomic/metatranscriptomic sequencing data. With the added features of integrated alignment, operon-level analysis, and the ability to process large datasets, FMAP will be a valuable addition to the currently available functional analysis toolbox. We believe that this software will be of great value to the wider biology and bioinformatics communities.
Kim, Yusung; Tomé, Wolfgang A
2007-11-01
To investigate the effects of distorted head-and-neck (H&N) intensity-modulated radiation therapy (IMRT) dose distributions (hot and cold spots) on normal tissue complication probability (NTCP) and tumor control probability (TCP) due to dental-metal artifacts. Five patients' IMRT treatment plans were analyzed, employing five different planning image data-sets: (a) uncorrected (UC); (b) homogeneous uncorrected (HUC); (c) sinogram completion corrected (SCC); (d) minimum-value-corrected (MVC); and (e) streak-artifact-reduction including minimum-value-correction (SAR-MVC), which was taken as the reference data-set. The effects on NTCP and TCP were evaluated using the Lyman-NTCP model and the Logistic-TCP model, respectively. When compared to the predicted NTCP obtained using the reference data-set, the treatment plan based on the original CT data-set (UC) yielded an increase in NTCP of 3.2% and 2.0% for the spared parotid gland and the spinal cord, respectively, while for the treatment plans based on the MVC CT data-set the NTCP increased by 1.1% and 0.1% for the spared parotid glands and the spinal cord, respectively. In addition, the MVC correction method showed a reduction in TCP for target volumes (MVC: ΔTCP = -0.6% vs. UC: ΔTCP = -1.9%) with respect to that of the reference CT data-set. Our results indicate that the presence of dental-metal artifacts in H&N planning CT data-sets has an impact on the estimates of TCP and NTCP. In particular, dental-metal artifacts lead to an increase in NTCP for the spared parotid glands and a slight decrease in TCP for target volumes.
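For reference, the Lyman NTCP model and a logistic TCP model can be written compactly as below; the parameter values in the example are generic illustrations, not the ones used in the cited study.

```python
from math import erf, sqrt

def lyman_ntcp(eud, td50, m):
    """Lyman NTCP: standard-normal CDF of t = (EUD - TD50) / (m * TD50)."""
    t = (eud - td50) / (m * td50)
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def logistic_tcp(dose, d50, gamma50):
    """Logistic TCP with 50%-control dose d50 and normalized slope gamma50."""
    return 1.0 / (1.0 + (d50 / dose) ** (4.0 * gamma50))

print(round(lyman_ntcp(eud=30.0, td50=28.4, m=0.18), 3))       # illustrative parotid-like case
print(round(logistic_tcp(dose=70.0, d50=60.0, gamma50=2.0), 3))
```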
Illuminating the Depths of the MagIC (Magnetics Information Consortium) Database
NASA Astrophysics Data System (ADS)
Koppers, A. A. P.; Minnett, R.; Jarboe, N.; Jonestrask, L.; Tauxe, L.; Constable, C.
2015-12-01
The Magnetics Information Consortium (http://earthref.org/MagIC/) is a grass-roots cyberinfrastructure effort envisioned by the paleo-, geo-, and rock magnetic scientific community. Its mission is to archive their wealth of peer-reviewed raw data and interpretations from magnetics studies on natural and synthetic samples. Many of these valuable data are legacy datasets that were never published in their entirety, some resided in other databases that are no longer maintained, and others were never digitized from the field notebooks and lab work. Due to the volume of data collected, most studies, modern and legacy, only publish the interpreted results and, occasionally, a subset of the raw data. MagIC is making an extraordinary effort to archive these data in a single data model, including the raw instrument measurements if possible. This facilitates the reproducibility of the interpretations, the re-interpretation of the raw data as the community introduces new techniques, and the compilation of heterogeneous datasets that are otherwise distributed across multiple formats and physical locations. MagIC has developed tools to assist the scientific community in many stages of their workflow. Contributors easily share studies (in a private mode if so desired) in the MagIC Database with colleagues and reviewers prior to publication, publish the data online after the study is peer reviewed, and visualize their data in the context of the rest of the contributions to the MagIC Database. From organizing their data in the MagIC Data Model with an online editable spreadsheet, to validating the integrity of the dataset with automated plots and statistics, MagIC is continually lowering the barriers to transforming dark data into transparent and reproducible datasets. Additionally, this web application generalizes to other databases in MagIC's umbrella website (EarthRef.org) so that the Geochemical Earth Reference Model (http://earthref.org/GERM/) portal, Seamount Biogeosciences Network (http://earthref.org/SBN/), EarthRef Digital Archive (http://earthref.org/ERDA/) and EarthRef Reference Database (http://earthref.org/ERR/) benefit from its development.
Comparing physiographic maps with different categorisations
NASA Astrophysics Data System (ADS)
Zawadzka, J.; Mayr, T.; Bellamy, P.; Corstanje, R.
2015-02-01
This paper addresses the need for a robust map comparison method suitable for finding similarities between thematic maps with different forms of categorisation. In our case, the requirement was to establish the information content of newly derived physiographic maps with regard to a set of reference maps for a study area in England and Wales. Physiographic maps were derived from the 90 m resolution SRTM DEM, using a suite of existing and new digital landform mapping methods with the overarching purpose of enhancing the physiographic unit component of the Soil and Terrain database (SOTER). Reference maps were seven soil and landscape datasets mapped at scales ranging from 1:200,000 to 1:5,000,000. A review of commonly used statistical methods for categorical comparisons was performed and, of these, the Cramer's V statistic was identified as the most appropriate for comparison of maps with different legends. Interpretation of multiple Cramer's V values resulting from one-by-one comparisons of the physiographic and baseline maps was facilitated by multi-dimensional scaling and calculation of average distances between the maps. The method allowed for finding similarities and dissimilarities amongst physiographic maps and baseline maps and informed the recommendation of the most suitable methodology for terrain analysis in the context of soil mapping.
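Cramer's V for two co-registered categorical maps is the chi-square statistic of their cross-tabulation normalized by sample size and table dimension; a minimal sketch (class labels are toy values):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(map_a, map_b):
    """Cramer's V between two categorical maps given as 1-D arrays of class labels."""
    classes_a, classes_b = np.unique(map_a), np.unique(map_b)
    table = np.zeros((classes_a.size, classes_b.size))
    for i, a in enumerate(classes_a):
        for j, b in enumerate(classes_b):
            table[i, j] = np.sum((map_a == a) & (map_b == b))
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

a = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 2])   # toy 3-class map
b = np.array([0, 0, 1, 1, 2, 1, 0, 1, 2, 2])   # a different categorisation of the same pixels
print(round(cramers_v(a, b), 2))               # close to 1 for strongly associated legends
```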
DOE Office of Scientific and Technical Information (OSTI.GOV)
Papadimitroulas, P; Kagadis, GC; Loudos, G
Purpose: Our purpose is to evaluate the administered absorbed dose in pediatric nuclear imaging studies. Monte Carlo simulations with the incorporation of pediatric computational models can serve as a reference for the accurate determination of absorbed dose. The procedure for calculating the dosimetric factors is described, and a dataset of reference doses is created. Methods: Realistic simulations were executed using the GATE toolkit and a series of pediatric computational models developed by the “IT'IS Foundation”. The series of phantoms used in our work includes 6 models in the range of 5–14 years old (3 boys and 3 girls). Pre-processing techniques were applied to the images to incorporate the phantoms in GATE simulations. The resolution of the phantoms was set to 2 mm³. The most important organ densities were simulated according to the GATE “Materials Database”. Several radiopharmaceuticals used in SPECT and PET applications are being tested, following the EANM pediatric dosage protocol. The biodistributions of the several isotopes used as activity maps in the simulations were derived from the literature. Results: Initial results of absorbed dose per organ (mGy) are presented for a 5-year-old girl from whole-body exposure to 99mTc-SestaMIBI, 30 minutes after administration. Heart, kidney, liver, ovary, pancreas and brain are the most critical organs, in which the S-factors are calculated. The statistical uncertainty in the simulation procedure was kept lower than 5%. The S-factors for each target organ are calculated in Gy/(MBq*sec), with the highest dose being absorbed in kidneys and pancreas (9.29×10^10 and 0.15×10^10, respectively). Conclusion: An approach for accurate dosimetry on pediatric models is presented, creating a reference dosage dataset for several radionuclides in pediatric computational models with the advantages of MC techniques. Our study is ongoing, extending our investigation to other reference models and evaluating the results against clinically estimated doses.
Options for accessing datasets for incidence, mortality, county populations, standard populations, expected survival, and SEER-linked and specialized data. Plus variable definitions, documentation for reporting and using datasets, statistical software (SEER*Stat), and observational research resources.
EnviroAtlas -- Austin, TX -- One Meter Resolution Urban Land Cover Data (2010) Web Service
This EnviroAtlas web service supports research and online mapping activities related to EnviroAtlas (https://www.epa.gov/enviroatlas ). The Austin, TX EnviroAtlas One Meter-scale Urban Land Cover (MULC) Data were generated from United States Department of Agriculture (USDA) National Agricultural Imagery Program (NAIP) four band (red, green, blue, and near infrared) aerial photography at 1 m spatial resolution from multiple dates in May, 2010. Six land cover classes were mapped: water, impervious surfaces, soil and barren land, trees, grass-herbaceous non-woody vegetation, and agriculture. An accuracy assessment of 600 completely random and 55 stratified random photo interpreted reference points yielded an overall User's fuzzy accuracy of 87 percent. The area mapped is the US Census Bureau's 2010 Urban Statistical Area for Austin, TX plus a 1 km buffer. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas
Development of a risk index for prediction of abnormal pap test results in Serbia.
Vukovic, Dejana; Antic, Ljiljana; Vasiljevic, Mladenko; Antic, Dragan; Matejic, Bojana
2015-01-01
Serbia is one of the countries with the highest incidence and mortality rates for cervical cancer in Central and South Eastern Europe. Introducing a risk index could provide a powerful means for targeting groups with a high likelihood of having an abnormal cervical smear and increase the efficiency of screening. The aim of the present study was to create and assess the validity of an index for prediction of an abnormal Pap test result. The study population was drawn from patients attending Departments for Women's Health in two primary health care centers in Serbia. Out of 525 respondents, 350 were randomly selected and data obtained from them were used as the index-creation dataset. Data obtained from the remaining 175 were used as the index-validation dataset. Age at first intercourse under 18, more than 4 sexual partners, history of STD and multiparity were assigned statistical weights of 16, 15, 14 and 13, respectively. The distribution of index scores in the index-creation dataset showed that most respondents had a score of 0 (54.9%). In the index-creation dataset the mean index score was 10.3 (SD=13.8), and in the validation dataset the mean was 9.1 (SD=13.2).
Gururaj, Anupama E.; Chen, Xiaoling; Pournejati, Saeid; Alter, George; Hersh, William R.; Demner-Fushman, Dina; Ohno-Machado, Lucila
2017-01-01
Abstract The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval. Database URL: https://biocaddie.org/benchmark-data PMID:29220453
A reference human genome dataset of the BGISEQ-500 sequencer.
Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian
2017-05-01
BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared it to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). However, for insertions and deletions (indels), we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50, respectively; sensitivity = 88.52% and 70.93%) than for the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as the reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.
Jia, Erik; Chen, Tianlu
2018-01-01
Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered as missing not at random (MNAR). Improper data processing procedures for missing values will cause adverse impacts on subsequent statistical analyses. However, few imputation methods have been developed and applied to the situation of MNAR in the field of metabolomics. Thus, a practical left-censored missing value imputation method is urgently needed. We developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with other three imputation methods on two real-world targeted metabolomics datasets and one simulation dataset using our imputation evaluation pipeline. The results show that GSimp outperforms other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for dealing with large scale metabolomics datasets. The R code for GSimp, evaluation pipeline, tutorial, real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp. PMID:29385130
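GSimp itself wraps an iterative Gibbs sampler around prediction models; as a much-reduced illustration of the underlying idea of left-censored imputation, the sketch below fills missing values of a single metabolite with draws from a normal distribution truncated above at the detection limit (this is not the GSimp algorithm, and estimating the mean and SD from the observed values alone biases both upward):

```python
import numpy as np
from scipy.stats import truncnorm

def impute_left_censored(x, detection_limit, seed=0):
    """Replace NaNs (below-detection-limit values) for one metabolite with draws
    from N(mu, sigma) truncated to (-inf, detection_limit]."""
    x = np.array(x, dtype=float)
    observed = x[~np.isnan(x)]
    mu, sigma = observed.mean(), observed.std(ddof=1)
    b = (detection_limit - mu) / sigma                    # upper bound in standard units
    n_missing = int(np.isnan(x).sum())
    x[np.isnan(x)] = truncnorm.rvs(-np.inf, b, loc=mu, scale=sigma,
                                   size=n_missing, random_state=seed)
    return x

data = [5.2, np.nan, 4.8, 6.1, np.nan, 5.5]   # NaN marks values below the detection limit of 4.0
print(impute_left_censored(data, detection_limit=4.0))
```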
Gan, Lin; Denecke, Bernd
2013-06-24
It came to our attention that a paper has recently been published concerning one of the GEO datasets (GSE34413) we cited in our published paper [1]. The original reference (reference 27) cited for this dataset leads to a paper about a similar study from the same research group [2]. In order to provide readers with exact citation information, we would like to update reference 27 in our previous paper to the new published paper concerning GSE34413 [3]. The authors apologize for this inconvenience. [...].
Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle
Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. In conclusion, the algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org.
Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance
Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle
2014-09-29
Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. In conclusion, the algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org.
A multitemporal (1979-2009) land-use/land-cover dataset of the binational Santa Cruz Watershed
2011-01-01
Trends derived from multitemporal land-cover data can be used to make informed land management decisions and to help managers model future change scenarios. We developed a multitemporal land-use/land-cover dataset for the binational Santa Cruz watershed of southern Arizona, United States, and northern Sonora, Mexico by creating a series of land-cover maps at decadal intervals (1979, 1989, 1999, and 2009) using Landsat Multispectral Scanner and Thematic Mapper data and a classification and regression tree classifier. The classification model exploited phenological changes of different land-cover spectral signatures through the use of biseasonal imagery collected during the (dry) early summer and (wet) late summer following rains from the North American monsoon. Landsat images were corrected to remove atmospheric influences, and the data were converted from raw digital numbers to surface reflectance values. The 14-class land-cover classification scheme is based on the 2001 National Land Cover Database with a focus on "Developed" land-use classes and riverine "Forest" and "Wetlands" cover classes required for specific watershed models. The classification procedure included the creation of several image-derived and topographic variables, including digital elevation model derivatives, image variance, and multitemporal Kauth-Thomas transformations. The accuracy of the land-cover maps was assessed using a random-stratified sampling design, reference aerial photography, and digital imagery. This showed high accuracy results, with kappa values (the statistical measure of agreement between map and reference data) ranging from 0.80 to 0.85.
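The kappa reported above is typically Cohen's kappa computed from the accuracy-assessment confusion matrix (map classes versus reference labels); a minimal sketch with an illustrative three-class matrix:

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's kappa from a confusion matrix (rows = map classes, cols = reference classes)."""
    confusion = np.asarray(confusion, dtype=float)
    n = confusion.sum()
    p_observed = np.trace(confusion) / n
    p_chance = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / n ** 2
    return (p_observed - p_chance) / (1.0 - p_chance)

cm = [[50, 2, 1],    # e.g. water / impervious / vegetation; counts are illustrative
      [3, 45, 4],
      [1, 5, 60]]
print(round(cohens_kappa(cm), 3))   # about 0.86, i.e. strong map-reference agreement
```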
Performance of the CORDEX regional climate models in simulating offshore wind and wind potential
NASA Astrophysics Data System (ADS)
Kulkarni, Sumeet; Deo, M. C.; Ghosh, Subimal
2018-03-01
This study is oriented towards quantification of the skill added by regional climate models (RCMs) relative to their parent general circulation models (GCMs) while simulating wind speed and wind potential, with particular reference to the Indian offshore region. To arrive at a suitable reference dataset, the performance of wind outputs from three different reanalysis datasets is evaluated. The comparison across the RCMs and their corresponding parent GCMs is done on the basis of annual/seasonal wind statistics, intermodel bias, wind climatology, and classes of wind potential. It was observed that while the RCMs could simulate the spatial variability of winds well for certain subregions, they generally failed to replicate the overall spatial pattern, especially in monsoon and winter. Various causes of biases in RCMs were determined by assessing corresponding maps of wind vectors, surface temperature, and sea-level pressure. The results highlight the necessity to carefully assess the RCM-yielded winds before using them for sensitive applications such as coastal vulnerability and hazard assessment. A supplementary outcome of this study is a wind potential atlas based on the spatial distribution of wind classes. This could be beneficial in suitably identifying viable subregions for developing offshore wind farms by intercomparing both the RCM and GCM outcomes. It is encouraging that most of the RCMs and GCMs indicate that around 70% of the Indian offshore locations in monsoon would experience mean wind potential greater than 200 W/m2.
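The 200 W/m2 threshold above refers to mean wind power density, conventionally taken as 0.5·ρ·v³ averaged over the wind-speed record. A minimal sketch under that assumption (standard sea-level air density of 1.225 kg/m³ and made-up wind speeds, not CORDEX output) follows.

```python
import numpy as np

RHO_AIR = 1.225  # kg/m^3, assumed standard sea-level air density

def wind_power_density(speeds_ms):
    """Mean wind power density (W/m^2): average of 0.5 * rho * v^3 over the record."""
    v = np.asarray(speeds_ms, dtype=float)
    return 0.5 * RHO_AIR * np.mean(v ** 3)

# Illustrative monsoon-season wind speeds (m/s); values are invented.
sample = [6.5, 7.2, 8.0, 9.1, 7.8, 6.9]
wpd = wind_power_density(sample)
print(round(wpd, 1), "W/m^2, exceeds 200 W/m^2:", wpd > 200.0)
```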
NASA Astrophysics Data System (ADS)
Dungan, J. L.; Wang, W.; Hashimoto, H.; Michaelis, A.; Milesi, C.; Ichii, K.; Nemani, R. R.
2009-12-01
In support of NACP, we are conducting an ensemble modeling exercise using the Terrestrial Observation and Prediction System (TOPS) to evaluate uncertainties among ecosystem models, satellite datasets, and in-situ measurements. The models used in the experiment include public-domain versions of Biome-BGC, LPJ, TOPS-BGC, and CASA, driven by a consistent set of climate fields for North America at 8km resolution and daily/monthly time steps over the period of 1982-2006. The reference datasets include MODIS Gross Primary Production (GPP) and Net Primary Production (NPP) products, Fluxnet measurements, and other observational data. The simulation results and the reference datasets are consistently processed and systematically compared in the climate (temperature-precipitation) space; in particular, an alternative to the Taylor diagram is developed to facilitate model-data intercomparisons in multi-dimensional space. The key findings of this study indicate that: the simulated GPP/NPP fluxes are in general agreement with observations over forests, but are biased low (underestimated) over non-forest types; large uncertainties of biomass and soil carbon stocks are found among the models (and reference datasets), often induced by seemingly “small” differences in model parameters and implementation details; the simulated Net Ecosystem Production (NEP) mainly responds to non-respiratory disturbances (e.g. fire) in the models and therefore is difficult to compare with flux data; and the seasonality and interannual variability of NEP varies significantly among models and reference datasets. These findings highlight the problem inherent in relying on only one modeling approach to map surface carbon fluxes and emphasize the pressing necessity of expanded and enhanced monitoring systems to narrow critical structural and parametrical uncertainties among ecosystem models.
The 3D Reference Earth Model (REM-3D): Update and Outlook
NASA Astrophysics Data System (ADS)
Lekic, V.; Moulik, P.; Romanowicz, B. A.; Dziewonski, A. M.
2016-12-01
Elastic properties of the Earth's interior (e.g. density, rigidity, compressibility, anisotropy) vary spatially due to changes in temperature, pressure, composition, and flow. In the 20th century, seismologists have constructed reference models of how these quantities vary with depth, notably the PREM model of Dziewonski and Anderson (1981). These 1D reference earth models have proven indispensable in earthquake location, imaging of interior structure, understanding material properties under extreme conditions, and as a reference in other fields, such as particle physics and astronomy. Over the past three decades, more sophisticated efforts by seismologists have yielded several generations of models of how properties vary not only with depth, but also laterally. Yet, though these three-dimensional (3D) models exhibit compelling similarities at large scales, differences in the methodology, representation of structure, and dataset upon which they are based, have prevented the creation of 3D community reference models. We propose to overcome these challenges by compiling, reconciling, and distributing a long period (>15 s) reference seismic dataset, from which we will construct a 3D seismic reference model (REM-3D) for the Earth's mantle, which will come in two flavors: a long wavelength smoothly parameterized model and a set of regional profiles. Here, we summarize progress made in the construction of the reference long period dataset, and present preliminary versions of the REM-3D in order to illustrate the two flavors of REM-3D and their relative advantages and disadvantages. As a community reference model and with fully quantified uncertainties and tradeoffs, REM-3D will facilitate Earth imaging studies, earthquake characterization, inferences on temperature and composition in the deep interior, and be of improved utility to emerging scientific endeavors, such as neutrino geoscience. In this presentation, we outline the outlook for setting up advisory community working groups and the community workshop that would assess progress, evaluate model and dataset performance, identify avenues for improvement, and recommend strategies for maximizing model adoption in and utility for the deep Earth community.
NASA Astrophysics Data System (ADS)
Rowlands, G.; Kiyani, K. H.; Chapman, S. C.; Watkins, N. W.
2009-12-01
Quantitative analyses of solar wind fluctuations are often performed in the context of intermittent turbulence and center around methods to quantify statistical scaling, such as power spectra and structure functions, which assume a stationary process. The solar wind exhibits large-scale secular changes, and so the question arises as to whether the time series of the fluctuations is non-stationary. One approach is to seek local stationarity by parsing the time interval over which statistical analysis is performed. As a result, natural systems such as the solar wind unavoidably provide observations over restricted intervals. Consequently, due to a reduction of sample size leading to poorer estimates, a stationary stochastic process (time series) can yield anomalous time variation in the scaling exponents, suggestive of nonstationarity. The variance in the estimates of scaling exponents computed from an interval of N observations is known for finite variance processes to vary as ~1/N as N becomes large for certain statistical estimators; however, the convergence to this behavior will depend on the details of the process, and may be slow. We study the variation in the scaling of second-order moments of the time-series increments with N for a variety of synthetic and “real world” time series, and we find that, in particular for heavy-tailed processes, for realizable N one is far from this ~1/N limiting behavior. We propose a semiempirical estimate for the minimum N needed to make a meaningful estimate of the scaling exponents for model stochastic processes and compare these with some “real world” time series from the solar wind. With fewer datapoints the stationary time series becomes indistinguishable from a nonstationary process, and we illustrate this with nonstationary synthetic datasets. Reference article: K. H. Kiyani, S. C. Chapman and N. W. Watkins, Phys. Rev. E 79, 036109 (2009).
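As a rough illustration of the sample-size dependence discussed above (a generic sketch, not the authors' estimator), the code below estimates the second-order structure-function scaling exponent of synthetic Brownian motion, whose true exponent is 1, and shows how the spread of the estimates shrinks as the record length N grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def scaling_exponent(x, lags):
    """Slope of log S2(tau) vs log tau, where S2(tau) = <(x(t+tau) - x(t))^2>."""
    s2 = [np.mean((x[lag:] - x[:-lag]) ** 2) for lag in lags]
    slope, _ = np.polyfit(np.log(lags), np.log(s2), 1)
    return slope

lags = np.arange(1, 20)
for n in (10**3, 10**4, 10**5):
    # Ordinary Brownian motion: second-order scaling exponent should be 1.
    estimates = [scaling_exponent(np.cumsum(rng.standard_normal(n)), lags)
                 for _ in range(50)]
    print(n, round(float(np.var(estimates)), 5))   # spread shrinks roughly like 1/N
```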
Collaborative derivation of reference intervals for major clinical laboratory tests in Japan.
Ichihara, Kiyoshi; Yomamoto, Yoshikazu; Hotta, Taeko; Hosogaya, Shigemi; Miyachi, Hayato; Itoh, Yoshihisa; Ishibashi, Midori; Kang, Dongchon
2016-05-01
Three multicentre studies of reference intervals were conducted recently in Japan. The Committee on Common Reference Intervals of the Japan Society of Clinical Chemistry sought to establish common reference intervals for 40 laboratory tests which were measured in common in the three studies and regarded as well harmonized in Japan. The study protocols were comparable with recruitment mostly from hospital workers with body mass index ≤28 and no medications. Age and sex distributions were made equal to obtain a final data size of 6345 individuals. Between-subgroup differences were expressed as the SD ratio (between-subgroup SD divided by SD representing the reference interval). Between-study differences were all within acceptable levels, and thus the three datasets were merged. By adopting SD ratio ≥0.50 as a guide, sex-specific reference intervals were necessary for 12 assays. Age-specific reference intervals for females partitioned at age 45 were required for five analytes. The reference intervals derived by the parametric method resulted in appreciable narrowing of the ranges by applying the latent abnormal values exclusion method in 10 items which were closely associated with prevalent disorders among healthy individuals. Sex- and age-related profiles of reference values, derived from individuals with no abnormal results in major tests, showed peculiar patterns specific to each analyte. Common reference intervals for nationwide use were developed for 40 major tests, based on three multicentre studies by advanced statistical methods. Sex- and age-related profiles of reference values are of great relevance not only for interpreting test results, but for applying clinical decision limits specified in various clinical guidelines. © The Author(s) 2015.
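A minimal sketch of the partitioning criterion described above, assuming the SD representing the reference interval is taken as the interval width divided by 3.92 (the study's exact ANOVA-based definition may differ); the analyte values, group labels, and the 10-50 unit interval are invented for illustration.

```python
import numpy as np

def sd_ratio(values, groups, lower, upper):
    """Between-subgroup SD divided by the SD implied by the reference interval
    (width/3.92 for a central 95% interval); >= 0.50 suggests partitioned intervals."""
    values, groups = np.asarray(values, dtype=float), np.asarray(groups)
    group_means = [values[groups == g].mean() for g in np.unique(groups)]
    sd_between = np.std(group_means, ddof=1)
    sd_ri = (upper - lower) / 3.92
    return sd_between / sd_ri

# Hypothetical analyte with a sex difference and a common 10-50 unit interval.
vals = np.concatenate([np.random.default_rng(0).normal(35, 8, 200),
                       np.random.default_rng(1).normal(25, 8, 200)])
grps = np.array(["M"] * 200 + ["F"] * 200)
ratio = sd_ratio(vals, grps, lower=10, upper=50)
print(round(ratio, 2), "partition by sex:", ratio >= 0.50)
```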
Fast and global authenticity screening of honey using ¹H-NMR profiling.
Spiteri, Marc; Jamin, Eric; Thomas, Freddy; Rebours, Agathe; Lees, Michèle; Rogers, Karyne M; Rutledge, Douglas N
2015-12-15
An innovative analytical approach was developed to tackle the most common adulterations and quality deviations in honey. Using proton-NMR profiling coupled to suitable quantification procedures and statistical models, analytical criteria were defined to check the authenticity of both mono- and multi-floral honey. The reference data set used was a worldwide collection of more than 800 honeys, covering most of the economically significant botanical and geographical origins. Typical plant nectar markers can be used to check monofloral honey labeling. Spectral patterns and natural variability were established for multifloral honeys, and marker signals for sugar syrups were identified by statistical comparison with a commercial dataset of ca. 200 honeys. Although the results are qualitative, spiking experiments have confirmed the ability of the method to detect sugar addition down to 10% levels in favorable cases. Within the same NMR experiments, quantification of glucose, fructose, sucrose and 5-HMF (regulated parameters) was performed. Finally markers showing the onset of fermentation are described. Copyright © 2014 Elsevier Ltd. All rights reserved.
Data Resource Profile: The European Union Statistics on Income and Living Conditions (EU-SILC).
Arora, Vishal S; Karanikolos, Marina; Clair, Amy; Reeves, Aaron; Stuckler, David; McKee, Martin
2015-04-01
Social and economic policies are inextricably linked with population health outcomes in Europe, yet few datasets are able to fully explore and compare this relationship across European countries. The European Union Statistics on Income and Living Conditions (EU-SILC) survey aims to address this gap using microdata on income, living conditions and health. EU-SILC contains both cross-sectional and longitudinal elements, with nationally representative samples of individuals 16 years and older in 28 European Union member states as well as Iceland, Norway and Switzerland. Data collection began in 2003 in Belgium, Denmark, Ireland, Greece, Luxembourg and Austria, with subsequent expansion across Europe. By 2011, all 28 EU member states, plus three others, were included in the dataset. Although EU-SILC is administered by Eurostat, the data are output-harmonized so that countries are required to collect specified data items but are free to determine sampling strategies for data collection purposes. EU-SILC covers approximately 500,000 European residents for its cross-sectional survey annually. Whereas aggregated data from EU-SILC are publicly available [http://ec.europa.eu/eurostat/web/income-and-living-conditions/data/main-tables], microdata are only available to research organizations subject to approval by Eurostat. Please refer to [http://epp.eurostat.ec.europa.eu/portal/page/portal/microdata/eu_silc] for further information regarding microdata access. © The Author 2015; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association.
Statistical analysis of QC data and estimation of fuel rod behaviour
NASA Astrophysics Data System (ADS)
Heins, L.; Groß, H.; Nissen, K.; Wunderlich, F.
1991-02-01
The behaviour of fuel rods while in reactor is influenced by many parameters. As far as fabrication is concerned, fuel pellet diameter and density, and inner cladding diameter are important examples. Statistical analyses of quality control data show a scatter of these parameters within the specified tolerances. At present it is common practice to use a combination of superimposed unfavorable tolerance limits (worst case dataset) in fuel rod design calculations. Distributions are not considered. The results obtained in this way are very conservative but the degree of conservatism is difficult to quantify. Probabilistic calculations based on distributions allow the replacement of the worst case dataset by a dataset leading to results with known, defined conservatism. This is achieved by response surface methods and Monte Carlo calculations on the basis of statistical distributions of the important input parameters. The procedure is illustrated by means of two examples.
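The contrast between superimposed worst-case tolerances and a probabilistic treatment can be illustrated with a toy Monte Carlo calculation. The diameters, tolerances, and the simple gap "response" below are invented stand-ins, not actual fuel rod design values or the response-surface models used in practice.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000

# Assumed within-specification scatter of two fabrication parameters (values invented).
pellet_diameter = rng.normal(8.19, 0.005, N)    # mm
clad_inner_diam = rng.normal(8.36, 0.008, N)    # mm
gap = clad_inner_diam - pellet_diameter         # toy response: diametral gap

# Worst case: superimposed unfavorable tolerance limits (here taken at +/- 3 sigma).
worst_case = (8.36 - 3 * 0.008) - (8.19 + 3 * 0.005)
print("worst-case gap:", round(worst_case, 4), "mm")
# Probabilistic result with a defined (0.1%) exceedance level is less conservative.
print("0.1% quantile of gap:", round(float(np.quantile(gap, 0.001)), 4), "mm")
```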
Carroll, Adam J; Badger, Murray R; Harvey Millar, A
2010-07-14
Standardization of analytical approaches and reporting methods via community-wide collaboration can work synergistically with web-tool development to result in rapid community-driven expansion of online data repositories suitable for data mining and meta-analysis. In metabolomics, the inter-laboratory reproducibility of gas-chromatography/mass-spectrometry (GC/MS) makes it an obvious target for such development. While a number of web-tools offer access to datasets and/or tools for raw data processing and statistical analysis, none of these systems are currently set up to act as a public repository by easily accepting, processing and presenting publicly submitted GC/MS metabolomics datasets for public re-analysis. Here, we present MetabolomeExpress, a new File Transfer Protocol (FTP) server and web-tool for the online storage, processing, visualisation and statistical re-analysis of publicly submitted GC/MS metabolomics datasets. Users may search a quality-controlled database of metabolite response statistics from publicly submitted datasets by a number of parameters (e.g. metabolite, species, organ/biofluid, etc.). Users may also perform meta-analysis comparisons of multiple independent experiments or re-analyse public primary datasets via user-friendly tools for t-test, principal components analysis, hierarchical cluster analysis and correlation analysis. They may interact with chromatograms, mass spectra and peak detection results via an integrated raw data viewer. Researchers who register for a free account may upload (via FTP) their own data to the server for online processing via a novel raw data processing pipeline. MetabolomeExpress https://www.metabolome-express.org provides a new opportunity for the general metabolomics community to transparently present online the raw and processed GC/MS data underlying their metabolomics publications. Transparent sharing of these data will allow researchers to assess data quality and draw their own insights from published metabolomics datasets.
A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video
2011-06-01
... orders of magnitude larger than existing datasets such as CAVIAR [7]. The TRECVID 2008 airport dataset [16] contains 100 hours of video, but it provides only ... the entire human figure (e.g., above shoulder) ... (some statistics are approximate, obtained from the CAVIAR 1st scene) ... and diversity in both collection sites and viewpoints. In comparison to surveillance datasets such as CAVIAR [7] and TRECVID [16] shown in Fig. 3 ...
Long-term coastal measurements for large-scale climate trends characterization
NASA Astrophysics Data System (ADS)
Pomaro, Angela; Cavaleri, Luigi; Lionello, Piero
2017-04-01
Multi-decadal time series of observational wave data beginning in the late 1970s are relatively rare. The present study refers to the analysis of the 37-year long directional wave time series recorded between 1979 and 2015 at the CNR-ISMAR (Institute of Marine Sciences of the Italian National Research Council) "Acqua Alta" oceanographic research tower, located in the Northern Adriatic Sea, 15 km offshore of the Venice lagoon, in 16 m water depth. The extent of the time series allows us to exploit its content not only for modelling purposes or short-term statistical analyses, but also at the climatological scale, thanks to the peculiar meteorological and oceanographic aspects of the coastal area where this relevant infrastructure has been installed. We explore the dataset both to characterize the local average climate and its variability, and to detect the possible long-term trends that might be suggestive of, or emphasize, large-scale circulation patterns and trends. Measured data are essential for the assessment, and often for the calibration, of model data and, if long enough, also serve as the reference for climate studies. By applying this analysis to an area well characterized from the meteorological point of view, we first assess the changes in time based on measured data, and then we compare them to the ones derived from the ERA-Interim regional simulation over the same area, thus showing the strong improvement that is still needed to get reliable climate model projections on coastal areas and the Mediterranean Region as a whole. Moreover, long-term hindcasts aimed at climatic studies are well known for 1) underestimating, if their resolution is not high enough, the actual wave heights, as well as for 2) being strongly affected by different conditions over time that are likely to introduce spurious trends of variable magnitude. In particular, the amount of data assimilated by the hindcast models changes over time and directly and indirectly affects the results, making it difficult, if not impossible, to distinguish the imposed effects from the climate signal itself, as demonstrated by Aarnes et al. (2015). From this point of view the problem is that long-term measured datasets are relatively unique, due to the cost and technical difficulty of maintaining fixed instrumental equipment over time, as well as of assuring the homogeneity and availability of the entire dataset. For this reason we are furthermore working on the publication of the quality-controlled dataset to make it widely available for open-access research purposes. The analysis and homogenization of the original dataset has actually required a substantial part of the time spent on the study, because of the strong impact that the quality of the data may have on the final result. We consider this particularly relevant, especially when referring to coastal areas, where the lack of reliable satellite data makes it difficult to improve the model capability to resolve the local peculiar oceanographic processes. We describe in detail every step and procedure used in producing the data, including full descriptions of the experimental design, data acquisition assays, and any computational processing needed to support the technical quality of the dataset.
Links to sources of cancer-related statistics, including the Surveillance, Epidemiology and End Results (SEER) Program, SEER-Medicare datasets, cancer survivor prevalence data, and the Cancer Trends Progress Report.
Progeny Clustering: A Method to Identify Biological Phenotypes
Hu, Chenyue W.; Kornblau, Steven M.; Slater, John H.; Qutub, Amina A.
2015-01-01
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient computationally, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess the clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown to be successful and robust when applied to two synthetic datasets (a two-dimensional dataset and a ten-dimensional dataset containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and the Rat CNS dataset), and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
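Progeny Clustering itself relies on a specific Progeny Sampling and co-occurrence construction. As a generic illustration of the stability-based idea of choosing the number of clusters (not the authors' algorithm), the sketch below scores each candidate k by how well cluster labels fitted on random subsamples agree with labels fitted on the full Iris dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = load_iris().data

def stability(X, k, n_rounds=30, frac=0.7):
    """Mean agreement (ARI) between labels from random subsamples and full-data labels."""
    reference = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    n = X.shape[0]
    scores = []
    for r in range(n_rounds):
        idx = rng.choice(n, int(frac * n), replace=False)
        sub = KMeans(n_clusters=k, n_init=10, random_state=r).fit_predict(X[idx])
        scores.append(adjusted_rand_score(sub, reference[idx]))
    return float(np.mean(scores))

for k in range(2, 7):
    print(k, round(stability(X, k), 3))   # pick the k with the most stable labels
```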
Twin Data That Made a Big Difference, and That Deserve to Be Better-Known and Used in Teaching
ERIC Educational Resources Information Center
Campbell, Harlan; Hanley, James A.
2017-01-01
Because of their efficiency and ability to keep many other factors constant, twin studies have a special appeal for investigators. Just as with any teaching dataset, a "matched-sets" dataset used to illustrate a statistical model should be compelling, still relevant, and valid. Indeed, such a "model dataset" should meet the…
Polygenic scores via penalized regression on summary statistics.
Mak, Timothy Shin Heng; Porsch, Robert Milan; Choi, Shing Wan; Zhou, Xueya; Sham, Pak Chung
2017-09-01
Polygenic scores (PGS) summarize the genetic contribution of a person's genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating PGS have been proposed, and recently there is much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can use LD information available elsewhere to supplement such analyses. To answer this question, we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method for choosing the value of the tuning parameter in the absence of validation data. In our simulations, we showed that pseudovalidation often resulted in prediction accuracy that is comparable to using a dataset with validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and P-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred. © 2017 WILEY PERIODICALS, INC.
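For context, the "P-value thresholding" baseline that lassosum is compared against can be written in a few lines. The sketch below omits LD clumping and uses simulated genotypes and summary statistics, so it illustrates the baseline only, not lassosum's penalized regression on a reference panel.

```python
import numpy as np

def simple_pgs(genotypes, betas, pvalues, threshold=5e-8):
    """Polygenic score by P-value thresholding: risk-allele counts weighted by
    GWAS effect sizes, restricted to SNPs passing the P-value threshold."""
    keep = pvalues < threshold
    return genotypes[:, keep] @ betas[keep]

rng = np.random.default_rng(3)
G = rng.integers(0, 3, size=(5, 100))     # 5 individuals x 100 SNPs (0/1/2 allele counts)
beta = rng.normal(0, 0.05, 100)           # hypothetical summary-statistic effect sizes
pval = rng.uniform(0, 1, 100)             # hypothetical P-values
print(simple_pgs(G, beta, pval, threshold=0.05))
```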
Assessing the reproducibility of discriminant function analyses
Andrew, Rose L.; Albert, Arianne Y.K.; Renaut, Sebastien; Rennison, Diana J.; Bock, Dan G.
2015-01-01
Data are the foundation of empirical research, yet all too often the datasets underlying published papers are unavailable, incorrect, or poorly curated. This is a serious issue, because future researchers are then unable to validate published results or reuse data to explore new ideas and hypotheses. Even if data files are securely stored and accessible, they must also be accompanied by accurate labels and identifiers. To assess how often problems with metadata or data curation affect the reproducibility of published results, we attempted to reproduce Discriminant Function Analyses (DFAs) from the field of organismal biology. DFA is a commonly used statistical analysis that has changed little since its inception almost eight decades ago, and therefore provides an opportunity to test reproducibility among datasets of varying ages. Out of 100 papers we initially surveyed, fourteen were excluded because they did not present the common types of quantitative result from their DFA or gave insufficient details of their DFA. Of the remaining 86 datasets, there were 15 cases for which we were unable to confidently relate the dataset we received to the one used in the published analysis. The reasons included incomprehensible or absent variable labels, the DFA being performed on an unspecified subset of the data, and the dataset we received being incomplete. We focused on reproducing three common summary statistics from DFAs: the percent variance explained, the percentage correctly assigned, and the largest discriminant function coefficient. The reproducibility of the first two was fairly high (20 of 26, and 44 of 60 datasets, respectively), whereas our success rate with the discriminant function coefficients was lower (15 of 26 datasets). When considering all three summary statistics, we were able to completely reproduce 46 (65%) of 71 datasets. While our results show that a majority of studies are reproducible, they highlight the fact that many studies still are not the carefully curated research that the scientific community and public expect. PMID:26290793
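The three summary statistics targeted in this reproduction exercise can be read directly off a fitted discriminant function analysis. A minimal sketch using scikit-learn's LDA on the Iris data is given below; it uses resubstitution accuracy rather than any particular paper's validation scheme, and the variable names are ours.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

# Percent variance explained by the first discriminant function
pct_var = 100 * lda.explained_variance_ratio_[0]
# Percentage correctly assigned (resubstitution)
pct_correct = 100 * np.mean(lda.predict(X) == y)
# Largest absolute coefficient of the first discriminant function
largest_coef = np.max(np.abs(lda.scalings_[:, 0]))

print(round(pct_var, 1), round(pct_correct, 1), round(largest_coef, 2))
```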
Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S; Langs, Georg; Simader, Christian; Waldstein, Sebastian M; Schmidt-Erfurth, Ursula M
2016-01-01
Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited by the scarce availability of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies, making them difficult to compare. Thus we present and evaluate a multiple expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used, which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset composed of 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows a good correlation, thus making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge.
Genders, Tessa S S; Steyerberg, Ewout W; Nieman, Koen; Galema, Tjebbe W; Mollet, Nico R; de Feyter, Pim J; Krestin, Gabriel P; Alkadhi, Hatem; Leschka, Sebastian; Desbiolles, Lotus; Meijs, Matthijs F L; Cramer, Maarten J; Knuuti, Juhani; Kajander, Sami; Bogaert, Jan; Goetschalckx, Kaatje; Cademartiri, Filippo; Maffei, Erica; Martini, Chiara; Seitun, Sara; Aldrovandi, Annachiara; Wildermuth, Simon; Stinn, Björn; Fornaro, Jürgen; Feuchtner, Gudrun; De Zordo, Tobias; Auer, Thomas; Plank, Fabian; Friedrich, Guy; Pugliese, Francesca; Petersen, Steffen E; Davies, L Ceri; Schoepf, U Joseph; Rowe, Garrett W; van Mieghem, Carlos A G; van Driessche, Luc; Sinitsyn, Valentin; Gopalan, Deepa; Nikolaou, Konstantin; Bamberg, Fabian; Cury, Ricardo C; Battle, Juan; Maurovich-Horvat, Pál; Bartykowszki, Andrea; Merkely, Bela; Becker, Dávid; Hadamitzky, Martin; Hausleiter, Jörg; Dewey, Marc; Zimmermann, Elke; Laule, Michael
2012-01-01
Objectives To develop prediction models that better estimate the pretest probability of coronary artery disease in low prevalence populations. Design Retrospective pooled analysis of individual patient data. Setting 18 hospitals in Europe and the United States. Participants Patients with stable chest pain without evidence for previous coronary artery disease, if they were referred for computed tomography (CT) based coronary angiography or catheter based coronary angiography (indicated as low and high prevalence settings, respectively). Main outcome measures Obstructive coronary artery disease (≥50% diameter stenosis in at least one vessel found on catheter based coronary angiography). Multiple imputation accounted for missing predictors and outcomes, exploiting strong correlation between the two angiography procedures. Predictive models included a basic model (age, sex, symptoms, and setting), clinical model (basic model factors and diabetes, hypertension, dyslipidaemia, and smoking), and extended model (clinical model factors and use of the CT based coronary calcium score). We assessed discrimination (c statistic), calibration, and continuous net reclassification improvement by cross validation for the four largest low prevalence datasets separately and the smaller remaining low prevalence datasets combined. Results We included 5677 patients (3283 men, 2394 women), of whom 1634 had obstructive coronary artery disease found on catheter based coronary angiography. All potential predictors were significantly associated with the presence of disease in univariable and multivariable analyses. The clinical model improved the prediction, compared with the basic model (cross validated c statistic improvement from 0.77 to 0.79, net reclassification improvement 35%); the coronary calcium score in the extended model was a major predictor (0.79 to 0.88, 102%). Calibration for low prevalence datasets was satisfactory. Conclusions Updated prediction models including age, sex, symptoms, and cardiovascular risk factors allow for accurate estimation of the pretest probability of coronary artery disease in low prevalence populations. Addition of coronary calcium scores to the prediction models improves the estimates. PMID:22692650
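The c statistic used above to compare the basic, clinical, and extended models is the area under the ROC curve of the predicted pretest probabilities. A toy sketch with simulated data (not the study's data), using only basic-model-style predictors with invented coefficients, is shown below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 500

# Hypothetical basic-model predictors: age, sex, typical chest pain.
age = rng.normal(60, 10, n)
sex = rng.integers(0, 2, n)
typical = rng.integers(0, 2, n)
logit = -9 + 0.10 * age + 0.7 * sex + 1.2 * typical      # invented "true" model
disease = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, sex, typical])
model = LogisticRegression(max_iter=1000).fit(X, disease)
prob = model.predict_proba(X)[:, 1]                       # estimated pretest probability
print("c statistic:", round(roc_auc_score(disease, prob), 2))
```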
Cheyney, Melissa; Bovbjerg, Marit; Everson, Courtney; Gordon, Wendy; Hannibal, Darcy; Vedam, Saraswathi
2014-01-01
In 2004, the Midwives Alliance of North America's (MANA's) Division of Research developed a Web-based data collection system to gather information on the practices and outcomes associated with midwife-led births in the United States. This system, called the MANA Statistics Project (MANA Stats), grew out of a widely acknowledged need for more reliable data on outcomes by intended place of birth. This article describes the history and development of the MANA Stats birth registry and provides an analysis of the 2.0 dataset's content, strengths, and limitations. Data collection and review procedures for the MANA Stats 2.0 dataset are described, along with methods for the assessment of data accuracy. We calculated descriptive statistics for client demographics and contributing midwife credentials, and assessed the quality of data by calculating point estimates, 95% confidence intervals, and kappa statistics for key outcomes on pre- and postreview samples of records. The MANA Stats 2.0 dataset (2004-2009) contains 24,848 courses of care, 20,893 of which are for women who planned a home or birth center birth at the onset of labor. The majority of these records were planned home births (81%). Births were attended primarily by certified professional midwives (73%), and clients were largely white (92%), married (87%), and college-educated (49%). Data quality analyses of 9932 records revealed no differences between pre- and postreviewed samples for 7 key benchmarking variables (kappa, 0.98-1.00). The MANA Stats 2.0 data were accurately entered by participants; any errors in this dataset are likely random and not systematic. The primary limitation of the 2.0 dataset is that the sample was captured through voluntary participation; thus, it may not accurately reflect population-based outcomes. The dataset's primary strength is that it will allow for the examination of research questions on normal physiologic birth and midwife-led birth outcomes by intended place of birth. © 2014 by the American College of Nurse-Midwives.
AbdelRahman, Samir E; Zhang, Mingyuan; Bray, Bruce E; Kawamoto, Kensaku
2014-05-27
The aim of this study was to propose an analytical approach to develop high-performing predictive models for congestive heart failure (CHF) readmission using an operational dataset with incomplete records and changing data over time. Our analytical approach involves three steps: pre-processing, systematic model development, and risk factor analysis. For pre-processing, variables that were absent in >50% of records were removed. Moreover, the dataset was divided into a validation dataset and derivation datasets, which were separated into three temporal subsets based on changes to the data over time. For systematic model development, using the different temporal datasets and the remaining explanatory variables, the models were developed by combining the use of various (i) statistical analyses to explore the relationships between the validation and the derivation datasets; (ii) adjustment methods for handling missing values; (iii) classifiers; (iv) feature selection methods; and (v) discretization methods. We then selected the best derivation dataset and the models with the highest predictive performance. For risk factor analysis, factors in the highest-performing predictive models were analyzed and ranked using (i) statistical analyses of the best derivation dataset, (ii) feature rankers, and (iii) a newly developed algorithm to categorize risk factors as being strong, regular, or weak. The analysis dataset consisted of 2,787 CHF hospitalizations at University of Utah Health Care from January 2003 to June 2013. In this study, we used the complete-case analysis and mean-based imputation adjustment methods; the wrapper subset feature selection method; and four ranking strategies based on information gain, gain ratio, symmetrical uncertainty, and wrapper subset feature evaluators. The best-performing models resulted from the use of a complete-case analysis derivation dataset combined with the Class-Attribute Contingency Coefficient discretization method and a voting classifier that averaged the results of multinomial logistic regression and voting feature intervals classifiers. Of 42 final model risk factors, discharge disposition, discretized age, and indicators of anemia were the most significant. This model achieved a c-statistic of 86.8%. The proposed three-step analytical approach enhanced predictive model performance for CHF readmissions. It could potentially be leveraged to improve predictive model performance in other areas of clinical medicine.
Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping.
Pinho, Ana Luísa; Amadon, Alexis; Ruest, Torsten; Fabre, Murielle; Dohmatob, Elvis; Denghien, Isabelle; Ginisty, Chantal; Becuwe-Desmidt, Séverine; Roger, Séverine; Laurier, Laurence; Joly-Testault, Véronique; Médiouni-Cloarec, Gaëlle; Doublé, Christine; Martins, Bernadette; Pinel, Philippe; Eger, Evelyn; Varoquaux, Gaël; Pallier, Christophe; Dehaene, Stanislas; Hertz-Pannier, Lucie; Thirion, Bertrand
2018-06-12
Functional Magnetic Resonance Imaging (fMRI) has furthered brain mapping of perceptual, motor, and higher-level cognitive functions. However, to date, no data collection has systematically addressed the functional mapping of cognitive mechanisms at a fine spatial scale. The Individual Brain Charting (IBC) project is a high-resolution multi-task fMRI dataset that aims to provide the objective basis toward a comprehensive functional atlas of the human brain. The data refer to a cohort of 12 participants performing many different tasks. The large amount of task-fMRI data on the same subjects yields a precise mapping of the underlying functions, free from both inter-subject and inter-site variability. The present article gives a detailed description of the first release of the IBC dataset. It comprises a dozen tasks addressing both low- and high-level cognitive functions. This openly available dataset is thus intended to become a reference for cognitive brain mapping.
Accurate mass measurement: terminology and treatment of data.
Brenton, A Gareth; Godfrey, A Ruth
2010-11-01
High-resolution mass spectrometry has become ever more accessible with improvements in instrumentation, such as modern FT-ICR and Orbitrap mass spectrometers. This has resulted in an increase in the number of articles submitted for publication quoting accurate mass data. There is a plethora of terms related to accurate mass analysis that are in current usage, many employed incorrectly or inconsistently. This article is based on a set of notes prepared by the authors for research students and staff in our laboratories as a guide to the correct terminology and basic statistical procedures to apply in relation to mass measurement, particularly for accurate mass measurement. It elaborates on the editorial by Gross in 1994 regarding the use of accurate masses for structure confirmation. We have presented and defined the main terms in use with reference to the International Union of Pure and Applied Chemistry (IUPAC) recommendations for nomenclature and symbolism for mass spectrometry. The correct use of statistics and treatment of data is illustrated as a guide to new and existing mass spectrometry users with a series of examples as well as statistical methods to compare different experimental methods and datasets. Copyright © 2010. Published by Elsevier Inc.
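A common convention referred to in such guidance is to express mass measurement error in parts per million relative to the theoretical (calculated) mass; a minimal worked example with an invented measurement follows.

```python
# Mass measurement error in parts per million (ppm):
#   error_ppm = (measured - theoretical) / theoretical * 1e6
theoretical = 556.2771      # hypothetical calculated monoisotopic mass, Da
measured = 556.2785         # hypothetical measured mass, Da

error_ppm = (measured - theoretical) / theoretical * 1e6
print(round(error_ppm, 2), "ppm")   # about 2.5 ppm for these example values
```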
Giambartolomei, Claudia; Vukcevic, Damjan; Schadt, Eric E; Franke, Lude; Hingorani, Aroon D; Wallace, Chris; Plagnol, Vincent
2014-05-01
Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
NASA Astrophysics Data System (ADS)
Lin, G.; Stephan, E.; Elsethagen, T.; Meng, D.; Riihimaki, L. D.; McFarlane, S. A.
2012-12-01
Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in applications. It determines how likely certain outcomes are if some aspects of the system are not exactly known. UQ studies, such as those on atmospheric datasets, have greatly increased in size and complexity because they now comprise additional complex iterative steps, involve numerous simulation runs, and can include additional analytical products such as charts, reports, and visualizations to explain levels of uncertainty. These new requirements greatly expand the need for metadata support beyond the NetCDF convention and vocabulary, and as a result an additional formal data provenance ontology is required to provide a historical explanation of the origin of the dataset, including references between the explanations and components within the dataset. This work shares a climate observation data UQ science use case and illustrates how to reduce climate observation data uncertainty and use a linked science application called Provenance Environment (ProvEn) to enable and facilitate scientific teams to publish, share, link, and discover knowledge about the UQ research results. UQ results include terascale datasets that are published to an Earth Systems Grid Federation (ESGF) repository. Uncertainty exists in observation datasets due to sensor data processing (such as time averaging), sensor failure in extreme weather conditions, sensor manufacturing error, etc. To reduce the uncertainty in the observation datasets, a method based on Principal Component Analysis (PCA) was proposed to recover the missing values in observation data. Several large principal components (PCs) of data with missing values are computed based on available values using an iterative method. The computed PCs can approximate the true PCs with high accuracy provided a condition on the missing values is met; the iterative method greatly improves the computational efficiency of computing the PCs. Moreover, noise removal is done at the same time as the computation of missing values by using only several large PCs. The uncertainty quantification is done through statistical analysis of the distribution of different PCs. To record the above UQ process, and to provide an explanation of the uncertainty before and after the UQ process on the observation datasets, an additional data provenance ontology, such as ProvEn, is necessary. In this study, we demonstrate how to reduce observation data uncertainty on climate model-observation test beds and use ProvEn to record the UQ process on ESGF. ProvEn demonstrates how a scientific team conducting UQ studies can discover dataset links using its domain knowledgebase, allowing them to better understand and convey the UQ study research objectives, the experimental protocol used, the resulting dataset lineage, related analytical findings, and ancillary literature citations, along with the social network of scientists associated with the study. Climate scientists will not only benefit from understanding a particular dataset within a knowledge context, but also benefit from the cross reference of knowledge among the numerous UQ studies being stored in ESGF.
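A simplified sketch of the iterative PCA-based gap-filling idea described above is given below; the original method's details, convergence criteria, and uncertainty analysis are not reproduced, and the synthetic multichannel record is invented for illustration.

```python
import numpy as np

def iterative_pca_fill(X, n_components=2, n_iter=50):
    """Fill NaN gaps by alternating a truncated PCA reconstruction with re-insertion
    of the observed values (a simple variant of iterative PCA-based recovery)."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)     # start from column means
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        approx = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean
        filled[missing] = approx[missing]                     # update only the gaps
    return filled

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
data = np.column_stack([np.sin(t), np.cos(t), np.sin(t) + 0.1 * rng.standard_normal(200)])
data[rng.random(data.shape) < 0.1] = np.nan                   # knock out 10% of values
print(np.isnan(iterative_pca_fill(data)).any())               # -> False, gaps filled
```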
Yoshida, Hiroyuki; Shibata, Hiroko; Izutsu, Ken-Ichi; Goda, Yukihiro
2017-01-01
The current Japanese Ministry of Health, Labour and Welfare (MHLW) Guideline for Bioequivalence Studies of Generic Products uses averaged dissolution rates for the assessment of dissolution similarity between test and reference formulations. This study clarifies how the application of the model-independent multivariate confidence region procedure (Method B), described in the European Medicines Agency and U.S. Food and Drug Administration guidelines, affects similarity outcomes obtained empirically from dissolution profiles with large variations in individual dissolution rates. Sixty-one datasets of dissolution profiles for immediate release, oral generic, and corresponding innovator products that showed large variation in individual dissolution rates in generic products were assessed on their similarity by using the f2 statistics defined in the MHLW guidelines (MHLW f2 method) and two different Method B procedures, including a bootstrap method applied with f2 statistics (BS method) and a multivariate analysis method using the Mahalanobis distance (MV method). The MHLW f2 and BS methods provided similar dissolution similarities between reference and generic products. Although a small difference in the similarity assessment may be due to the decrease in the lower bound of the confidence interval for expected f2 values derived from the large variation in individual dissolution rates, the MV method provided results different from those obtained through the MHLW f2 and BS methods. Analysis of actual dissolution data for products with large individual variations would provide valuable information towards an enhanced understanding of these methods and their possible incorporation in the MHLW guidelines.
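For reference, the f2 similarity factor and one simple variant of a bootstrap procedure around it can be sketched as follows. The dissolution profiles are invented, and the exact resampling scheme and acceptance rule in the cited guidelines may differ from this sketch.

```python
import numpy as np

def f2(reference, test):
    """Similarity factor f2 between two mean dissolution profiles at the same time points:
    f2 = 50 * log10(100 / sqrt(1 + mean squared difference))."""
    r, t = np.asarray(reference, dtype=float), np.asarray(test, dtype=float)
    return 50.0 * np.log10(100.0 / np.sqrt(1.0 + np.mean((r - t) ** 2)))

def bootstrap_f2_lower(ref_units, test_units, n_boot=2000, seed=0):
    """Lower 5th percentile of f2 over bootstrap resamples of individual dosage units."""
    rng = np.random.default_rng(seed)
    ref_units, test_units = np.asarray(ref_units, float), np.asarray(test_units, float)
    stats = []
    for _ in range(n_boot):
        r = ref_units[rng.integers(0, len(ref_units), len(ref_units))].mean(axis=0)
        t = test_units[rng.integers(0, len(test_units), len(test_units))].mean(axis=0)
        stats.append(f2(r, t))
    return np.percentile(stats, 5)

# Hypothetical % dissolved for 12 units at 4 time points (illustrative values only).
rng = np.random.default_rng(1)
ref = np.clip(np.array([40, 65, 85, 95]) + rng.normal(0, 3, (12, 4)), 0, 100)
tst = np.clip(np.array([35, 60, 82, 94]) + rng.normal(0, 8, (12, 4)), 0, 100)
print("f2 of mean profiles:", round(f2(ref.mean(axis=0), tst.mean(axis=0)), 1))
print("bootstrap 5th percentile:", round(bootstrap_f2_lower(ref, tst), 1))
```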
Pharmacists subjected to disciplinary action: characteristics and risk factors.
Phipps, Denham L; Noyce, Peter R; Walshe, Kieran; Parker, Dianne; Ashcroft, Darren M
2011-10-01
OBJECTIVE To establish whether there are any characteristics of pharmacists that predict their likelihood of being subjected to disciplinary action. METHODS The setting was the Royal Pharmaceutical Society of Great Britain's Disciplinary Committee. One hundred and seventeen pharmacists, all of whom had been referred to the Disciplinary Committee, were matched with a quota sample of 580 pharmacists who had not been subjected to disciplinary action but that matched the disciplined pharmacists on a set of demographic factors (gender, country of residence, year of registration). Frequency analysis and regression analysis were used to compare the two groups of pharmacists in terms of sector of work, ethnicity, age and country of training. Descriptive statistics were also obtained from the disciplined pharmacists to further explore characteristics of disciplinary cases and those pharmacists who undergo them. KEY FINDINGS While a number of characteristics appeared to increase the likelihood of a pharmacist being referred to the disciplinary committee, only one of these - working in a community pharmacy - was statistically significant. Professional misconduct accounted for a greater proportion of referrals than did clinical malpractice, and approximately one-fifth of pharmacists who went before the Disciplinary Committee had previously been disciplined by the Society. CONCLUSIONS This study provides initial evidence of pharmacist characteristics that are associated with an increased risk of being disciplined, based upon the data currently available. It is recommended that follow-up work is carried out using a more extensive dataset in order to confirm the statistical trends identified here. © 2011 The Authors. IJPP © 2011 Royal Pharmaceutical Society.
Statistical tests and identifiability conditions for pooling and analyzing multisite datasets
Zhou, Hao Henry; Singh, Vikas; Johnson, Sterling C.; Wahba, Grace
2018-01-01
When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer’s disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies. PMID:29386387
WIND Toolkit Offshore Summary Dataset
DOE Office of Scientific and Technical Information (OSTI.GOV)
Draxl, Caroline; Musial, Walt; Scott, George
This dataset contains summary statistics for offshore wind resources for the continental United States derived from the Wind Integration National Dataset (WIND) Toolkit. These data are available in two formats: GDB - Compressed geodatabases containing statistical summaries aligned with lease blocks (aliquots) stored in a GIS format. These data are partitioned into Pacific, Atlantic, and Gulf resource regions. HDF5 - Statistical summaries of all points in the Pacific, Atlantic, and Gulf offshore regions. These data are located on the original WIND Toolkit grid and have not been reassigned or downsampled to lease blocks. These data were developed under contract by NREL for the Bureau of Ocean Energy Management (BOEM).
Statistical characterization of short wind waves from stereo images of the sea surface
NASA Astrophysics Data System (ADS)
Mironov, Alexey; Yurovskaya, Maria; Dulov, Vladimir; Hauser, Danièle; Guérin, Charles-Antoine
2013-04-01
We propose a methodology to extract short-scale statistical characteristics of the sea surface topography by means of stereo image reconstruction. The possibilities and limitations of the technique are discussed and tested on a data set acquired from an oceanographic platform at the Black Sea. The analysis shows that reconstruction of the topography based on the stereo method is an efficient way to derive non-trivial statistical properties of surface short- and intermediate-waves (say from 1 centimeter to 1 meter). Most technical issues pertaining to this type of dataset (limited range of scales, lacunarity of data or irregular sampling) can be partially overcome by appropriate processing of the available points. The proposed technique also allows one to avoid linear interpolation, which dramatically corrupts properties of retrieved surfaces. The processing technique requires that the field of elevation be polynomially detrended, which has the effect of filtering out the large scales. Hence the statistical analysis can only address the small-scale components of the sea surface. The precise cut-off wavelength, which is approximately half the patch size, can be obtained by applying a high-pass frequency filter on the reference gauge time records. The results obtained for the one- and two-point statistics of small-scale elevations are shown to be consistent, at least in order of magnitude, with the corresponding gauge measurements as well as other experimental measurements available in the literature. The calculation of the structure functions provides a powerful tool to investigate spectral and statistical properties of the field of elevations. Experimental parametrization of the third-order structure function, the so-called skewness function, is one of the most important and original outcomes of this study. This function is of primary importance in analytical scattering models from the sea surface and was up to now unavailable in field conditions. Due to the lack of precise reference measurements for the small-scale wave field, we could not quantify exactly the accuracy of the retrieval technique. However, it appeared clearly that the obtained accuracy is good enough for the estimation of second-order statistical quantities (such as the correlation function), acceptable for third-order quantities (such as the skewness function), and insufficient for fourth-order quantities (such as kurtosis). Therefore, the stereo technique in the present stage should not be thought of as a self-contained universal tool to characterize the surface statistics. Instead, it should be used in conjunction with other well-calibrated but sparse reference measurements (such as wave gauges) for cross-validation and calibration. It then completes the statistical analysis inasmuch as it provides a snapshot of the three-dimensional field and allows for the evaluation of higher-order spatial statistics.
Zhu, Yun; Fan, Ruzong; Xiong, Momiao
2017-01-01
Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information to achieve deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (number of phenotypes and genetic variants jointly analyzed at the same time) and depth (hierarchical structure of phenotype and genotypes). A key issue for high dimensional pleiotropic analysis is to effectively extract informative internal representation and features from high dimensional genotype and phenotype data. To explore correlation information of genetic variants, effectively reduce data dimensions, and overcome critical barriers in advancing the development of novel statistical methods and computational algorithms for genetic pleiotropic analysis, we proposed a new statistic method referred to as a quadratically regularized functional CCA (QRFCCA) for association analysis which combines three approaches: (1) quadratically regularized matrix factorization, (2) functional data analysis and (3) canonical correlation analysis (CCA). Large-scale simulations show that the QRFCCA has a much higher power than that of the ten competing statistics while retaining the appropriate type 1 errors. To further evaluate performance, the QRFCCA and ten other statistics are applied to the whole genome sequencing dataset from the TwinsUK study. We identify a total of 79 genes with rare variants and 67 genes with common variants significantly associated with the 46 traits using QRFCCA. The results show that the QRFCCA substantially outperforms the ten other statistics. PMID:29040274
Maes, Dirk; Vanreusel, Wouter; Herremans, Marc; Vantieghem, Pieter; Brosens, Dimitri; Gielen, Karin; Beck, Olivier; Van Dyck, Hans; Desmet, Peter; Natuurpunt, Vlinderwerkgroep
2016-01-01
In this data paper, we describe two datasets derived from two sources, which collectively represent the most complete overview of butterflies in Flanders and the Brussels Capital Region (northern Belgium). The first dataset (further referred to as the INBO dataset – http://doi.org/10.15468/njgbmh) contains 761,660 records of 70 species and is compiled by the Research Institute for Nature and Forest (INBO) in cooperation with the Butterfly working group of Natuurpunt (Vlinderwerkgroep). It is derived from the database Vlinderdatabank at the INBO, which consists of (historical) collection and literature data (1830-2001), for which all butterfly specimens in institutional and available personal collections were digitized and all entomological and other relevant publications were checked for butterfly distribution data. It also contains observations and monitoring data for the period 1991-2014. The latter type were collected by a (small) butterfly monitoring network where butterflies were recorded using a standardized protocol. The second dataset (further referred to as the Natuurpunt dataset – http://doi.org/10.15468/ezfbee) contains 612,934 records of 63 species and is derived from the database http://waarnemingen.be, hosted at the nature conservation NGO Natuurpunt in collaboration with Stichting Natuurinformatie. This dataset contains butterfly observations by volunteers (citizen scientists), mainly since 2008. Together, these datasets currently contain a total of 1,374,594 records, which are georeferenced using the centroid of their respective 5 × 5 km² Universal Transverse Mercator (UTM) grid cell. Both datasets are published as open data and are available through the Global Biodiversity Information Facility (GBIF). PMID:27199606
Capturing Fine Details Involving Low-Cost Sensors - A Comparative Study
NASA Astrophysics Data System (ADS)
Rehany, N.; Barsi, A.; Lovas, T.
2017-11-01
Capturing the fine details on the surface of small objects is a real challenge for many conventional surveying methods. Our paper discusses the investigation of several data acquisition technologies, such as arm scanner, structured light scanner, terrestrial laser scanner, object line-scanner, DSLR camera, and mobile phone camera. A palm-sized embossed sculpture reproduction was used as a test object; it was surveyed with all the instruments. The resulting point clouds and meshes were then analyzed, using the arm scanner's dataset as reference. In addition to general statistics, the results have been evaluated based both on 3D deviation maps and 2D deviation graphs; the latter allow an even more accurate analysis of the characteristics of the different data acquisition approaches. Additionally, custom-developed local minimum maps were created that clearly visualize the potential level of detail provided by the applied technologies. Besides the usual geometric assessment, the paper discusses the different resource needs (cost, time, expertise) of the discussed techniques. Our results proved that even amateur sensors operated by amateur users can provide high-quality datasets that enable engineering analysis. Based on the results, the paper contains an outlook on potential future investigations in this field.
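A hedged sketch of the kind of cloud-to-reference comparison behind such 3D deviation maps: per-point deviations of a test point cloud from a reference cloud via nearest-neighbour distances, assuming the best-fit alignment step has already been done; the point clouds and noise level below are synthetic.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
reference = rng.uniform(size=(5000, 3))                               # reference scan (e.g. arm scanner)
test = reference + rng.normal(scale=0.002, size=reference.shape)      # noisier low-cost sensor scan

tree = cKDTree(reference)
deviation, _ = tree.query(test)                                       # nearest-neighbour distance per point
print(f"mean deviation {deviation.mean():.4f}, RMS {np.sqrt((deviation ** 2).mean()):.4f}")
```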
The role of image registration in brain mapping
Toga, A.W.; Thompson, P.M.
2008-01-01
Image registration is a key step in a great variety of biomedical imaging applications. It provides the ability to geometrically align one dataset with another, and is a prerequisite for all imaging applications that compare datasets across subjects, imaging modalities, or across time. Registration algorithms also enable the pooling and comparison of experimental findings across laboratories, the construction of population-based brain atlases, and the creation of systems to detect group patterns in structural and functional imaging data. We review the major types of registration approaches used in brain imaging today. We focus on their conceptual basis, the underlying mathematics, and their strengths and weaknesses in different contexts. We describe the major goals of registration, including data fusion, quantification of change, automated image segmentation and labeling, shape measurement, and pathology detection. We indicate that registration algorithms have great potential when used in conjunction with a digital brain atlas, which acts as a reference system in which brain images can be compared for statistical analysis. The resulting armory of registration approaches is fundamental to medical image analysis, and in a brain mapping context provides a means to elucidate clinical, demographic, or functional trends in the anatomy or physiology of the brain. PMID:19890483
Local multiplicity adjustment for the spatial scan statistic using the Gumbel distribution.
Gangnon, Ronald E
2012-03-01
The spatial scan statistic is an important and widely used tool for cluster detection. It is based on the simultaneous evaluation of the statistical significance of the maximum likelihood ratio test statistic over a large collection of potential clusters. In most cluster detection problems, there is variation in the extent of local multiplicity across the study region. For example, using a fixed maximum geographic radius for clusters, urban areas typically have many overlapping potential clusters, whereas rural areas have relatively few. The spatial scan statistic does not account for local multiplicity variation. We describe a previously proposed local multiplicity adjustment based on a nested Bonferroni correction and propose a novel adjustment based on a Gumbel distribution approximation to the distribution of a local scan statistic. We compare the performance of all three statistics in terms of power and a novel unbiased cluster detection criterion. These methods are then applied to the well-known New York leukemia dataset and a Wisconsin breast cancer incidence dataset. © 2011, The International Biometric Society.
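A minimal sketch of the Gumbel idea described above: fit a Gumbel distribution to Monte Carlo replicates of a local scan statistic and read off a locally adjusted p-value; the simulated null maxima and the observed value are made-up placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
null_local_max = rng.gumbel(loc=2.0, scale=0.8, size=999)   # stand-in for Monte Carlo local maxima
observed = 5.4                                               # observed local likelihood-ratio statistic

loc, scale = stats.gumbel_r.fit(null_local_max)              # Gumbel approximation to the null distribution
p_value = stats.gumbel_r.sf(observed, loc=loc, scale=scale)  # locally adjusted p-value for this cluster
print(p_value)
```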
Transport Statistics - Transport - UNECE
Traffic Census 2015 available. Two new datasets have been added to the transport statistics database: bus and coach statistics.
NASA Astrophysics Data System (ADS)
Sanchez-Roman, Antonio; Ruiz, Simón; Pascual, Ananda; Guinehut, Stéphanie; Mourre, Baptiste
2016-04-01
The existing Argo network provides essential data in near real time to constrain monitoring and forecasting centers and strongly complements the observations of the ocean surface from space. The comparison of Sea Level Anomalies (SLA) provided by satellite altimeters with in-situ Dynamic Height Anomalies (DHA) derived from the temperature and salinity profiles of Argo floats contributes to better characterizing the error budget associated with the altimeter observations. In this work, performed in the framework of the E-AIMS FP7 European project, we focus on the Argo observing system in the Mediterranean Sea and its impact on SLA fields provided by satellite altimetry measurements in the basin. Namely, we focus on the sensitivity of specific SLA gridded merged products provided by AVISO in the Mediterranean to the reference depth (400 or 900 dbar) selected in the computation of the Argo Dynamic Height (DH) as an integration of the Argo T/S profiles through the water column. This reference depth has an impact on the number of valid Argo profiles and therefore on their temporal sampling and the coverage by the network used for comparison with altimeter data. To compare both datasets, altimeter grids and the synthetic climatologies used to compute DHA were spatially and temporally interpolated at the position and time of each in-situ Argo profile by a mapping method based on an optimal interpolation scheme. The analysis was conducted over the entire Mediterranean Sea and different sub-regions of the basin. The second part of this work is devoted to investigating which configuration, in terms of spatial sampling of the Argo array in the Mediterranean, properly reproduces the mesoscale dynamics in this basin, which is comprehensively captured by new standards of specific altimeter products for this region. To do that, several Observing System Simulation Experiments (OSSEs) were conducted, taking altimetry data computed from the AVISO reanalysis gridded merged product for the Mediterranean as the "true" field. The choice of the reference depth of Argo profiles impacts the number of valid profiles used to compute DHA and therefore the spatial coverage by the network. Results show that the impact of the reference level in the computation of Argo DH is statistically significant, since the standard deviations of the differences between DH computed from altimetry and Argo data referred to reference depths of 400 dbar and 900 dbar are quite different (4.85 and 5.11 cm, respectively). Therefore, 400 dbar should be taken as the reference depth to compute DHA from Argo data in the Mediterranean. In contrast, similar scores are obtained when shallow floats are not included in the computation (4.85 cm against 4.87 cm). In any case, we must highlight that all the studies show significant correlations (at the 95% level) higher than 0.70 between altimetry and Argo data, with a standard deviation of the differences between both datasets of around 4.90 cm. Furthermore, the sub-basin study shows improved statistics for the eastern sub-basin for DHA referred to 400 dbar, while minimum values are obtained for the western sub-basin when computing DHA referred to 900 dbar. On the other hand, the OSSE results suggest that, with an array of Argo floats spaced at about 100×100 km, the variance of the large-scale signal and most of the mesoscale features of the SLA fields are recovered. Therefore, the network coverage should be enlarged in the Mediterranean in order to achieve at least this spatial resolution.
An Adaptive Prediction-Based Approach to Lossless Compression of Floating-Point Volume Data.
Fout, N; Ma, Kwan-Liu
2012-12-01
In this work, we address the problem of lossless compression of scientific and medical floating-point volume data. We propose two prediction-based compression methods that share a common framework, which consists of a switched prediction scheme wherein the best predictor out of a preset group of linear predictors is selected. Such a scheme is able to adapt to different datasets as well as to varying statistics within the data. The first method, called APE (Adaptive Polynomial Encoder), uses a family of structured interpolating polynomials for prediction, while the second method, which we refer to as ACE (Adaptive Combined Encoder), combines predictors from previous work with the polynomial predictors to yield a more flexible, powerful encoder that is able to effectively decorrelate a wide range of data. In addition, in order to facilitate efficient visualization of compressed data, our scheme provides an option to partition floating-point values in such a way as to provide a progressive representation. We compare our two compressors to existing state-of-the-art lossless floating-point compressors for scientific data, with our data suite including both computer simulations and observational measurements. The results demonstrate that our polynomial predictor, APE, is comparable to previous approaches in terms of speed but achieves better compression rates on average. ACE, our combined predictor, while somewhat slower, is able to achieve the best compression rate on all datasets, with significantly better rates on most of the datasets.
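A simplified sketch of a switched linear-prediction scheme in the spirit of the encoder described above: for each block the predictor with the smallest residual magnitude is chosen, and in a real codec only the residuals and the predictor index would be entropy-coded. The two predictors, block size and data are illustrative, not the APE/ACE implementation.

```python
import numpy as np

# Two simple linear predictors: "repeat previous sample" and a degree-1 extrapolation.
PREDICTORS = {
    "previous": lambda x: np.concatenate(([0.0], x[:-1])),
    "linear":   lambda x: np.concatenate(([0.0, 0.0], 2 * x[1:-1] - x[:-2])),
}

def switched_residuals(block):
    """Pick the predictor with the smallest absolute residual sum for this block."""
    best_name, best_res = None, None
    for name, predict in PREDICTORS.items():
        res = block - predict(block)
        if best_res is None or np.abs(res).sum() < np.abs(best_res).sum():
            best_name, best_res = name, res
    return best_name, best_res

rng = np.random.default_rng(4)
data = np.cumsum(rng.normal(size=1024)).astype(np.float32)   # toy smoothly varying field
for start in range(0, data.size, 256):
    name, residuals = switched_residuals(data[start:start + 256])
    print(start, name, np.abs(residuals).sum())
```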
Lee, L.; Helsel, D.
2005-01-01
Trace contaminants in water, including metals and organics, often are measured at sufficiently low concentrations to be reported only as values below the instrument detection limit. Interpretation of these "less thans" is complicated when multiple detection limits occur. Statistical methods for multiply censored, or multiple-detection limit, datasets have been developed for medical and industrial statistics, and can be employed to estimate summary statistics or model the distributions of trace-level environmental data. We describe S-language-based software tools that perform robust linear regression on order statistics (ROS). The ROS method has been evaluated as one of the most reliable procedures for developing summary statistics of multiply censored data. It is applicable to any dataset that has 0 to 80% of its values censored. These tools are a part of a software library, or add-on package, for the R environment for statistical computing. This library can be used to generate ROS models and associated summary statistics, plot modeled distributions, and predict exceedance probabilities of water-quality standards. © 2005 Elsevier Ltd. All rights reserved.
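A deliberately simplified Python sketch of regression on order statistics with a single detection limit (the R tools described above handle multiple limits); the plotting positions and log-normal assumption follow common ROS practice, and the data are invented.

```python
import numpy as np
from scipy import stats

obs = np.array([1.2, 1.8, 2.5, 3.1, 4.0, 6.5, 9.2])    # detected concentrations
n_censored = 5                                           # values reported as "<1.0"
n = obs.size + n_censored

# Plotting positions: censored values occupy the lowest ranks, detected values the rest.
ranks = np.arange(n_censored + 1, n + 1)
pp = (ranks - 0.375) / (n + 0.25)                        # Blom plotting positions
slope, intercept, *_ = stats.linregress(stats.norm.ppf(pp), np.log(obs))

# Impute the censored observations from the fitted line, then summarize all n values.
pp_cens = (np.arange(1, n_censored + 1) - 0.375) / (n + 0.25)
imputed = np.exp(intercept + slope * stats.norm.ppf(pp_cens))
mean_estimate = np.concatenate([imputed, obs]).mean()
print(mean_estimate)
```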
Estimation of parameters of dose volume models and their confidence limits
NASA Astrophysics Data System (ADS)
van Luijk, P.; Delvigne, T. C.; Schilstra, C.; Schippers, J. M.
2003-07-01
Predictions of the normal-tissue complication probability (NTCP) for the ranking of treatment plans are based on fits of dose-volume models to clinical and/or experimental data. In the literature several different fit methods are used. In this work, frequently used methods and techniques to fit NTCP models to dose-response data for establishing dose-volume effects are discussed. The techniques are tested for their usability with dose-volume data and NTCP models. Different methods to estimate the confidence intervals of the model parameters are part of this study. From a critical-volume (CV) model with biologically realistic parameters a primary dataset was generated, serving as the reference for this study and describable by the NTCP model. The CV model was fitted to this dataset. From the resulting parameters and the CV model, 1000 secondary datasets were generated by Monte Carlo simulation. All secondary datasets were fitted to obtain 1000 parameter sets of the CV model. Thus the 'real' spread in fit results due to statistical spreading in the data is obtained and has been compared with estimates of the confidence intervals obtained by different methods applied to the primary dataset. The confidence limits of the parameters of one dataset were estimated using the methods, employing the covariance matrix, the jackknife method and directly from the likelihood landscape. These results were compared with the spread of the parameters, obtained from the secondary parameter sets. For the estimation of confidence intervals on NTCP predictions, three methods were tested. Firstly, propagation of errors using the covariance matrix was used. Secondly, the meaning of the width of a bundle of curves that resulted from parameters that were within the one standard deviation region in the likelihood space was investigated. Thirdly, many parameter sets and their likelihood were used to create a likelihood-weighted probability distribution of the NTCP. It is concluded that for the type of dose-response data used here, only a full likelihood analysis will produce reliable results. The often-used approximations, such as the usage of the covariance matrix, produce inconsistent confidence limits on both the parameter sets and the resulting NTCP values.
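A hedged sketch of the Monte Carlo refitting idea: fit a simple logistic dose-response curve (a stand-in for the critical-volume NTCP model), simulate secondary datasets from the fitted parameters, and refit each one to obtain the parameter spread; the model form, doses and sample sizes are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def ntcp(dose, d50, gamma):
    """Toy sigmoidal NTCP curve (not the CV model used in the study)."""
    return 1.0 / (1.0 + np.exp(-4.0 * gamma * (dose / d50 - 1.0)))

rng = np.random.default_rng(5)
dose = np.linspace(20, 80, 15)
n_per_dose = 30
observed = rng.binomial(n_per_dose, ntcp(dose, 55.0, 1.5)) / n_per_dose   # primary dataset

popt, _ = curve_fit(ntcp, dose, observed, p0=[50.0, 1.0], maxfev=5000)

# Generate secondary datasets from the fitted model and refit each one.
refit = np.array([curve_fit(ntcp, dose,
                            rng.binomial(n_per_dose, ntcp(dose, *popt)) / n_per_dose,
                            p0=popt, maxfev=5000)[0]
                  for _ in range(1000)])
d50_spread, gamma_spread = refit.std(axis=0)        # 'real' spread of the fit parameters
print(d50_spread, gamma_spread)
```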
Wang, Zhi-wei; Wu, Xiao-dong; Yue, Guang-yang; Zhao, Lin; Wang, Qian; Nan, Zhuo-tong; Qin, Yu; Wu, Tong-hua; Shi, Jian-zong; Zou, De-fu
2016-02-01
Considerable research has recently focused on monitoring vegetation changes because of their important role in regulating the terrestrial carbon cycle and the climate system. The Qinghai-Tibet Plateau (QTP), often referred to as the third pole of the world, contains the largest high-altitude areas on Earth, and vegetation in this region is highly sensitive to global warming. Meanwhile, NDVI datasets are among the most useful tools to monitor vegetation activity at high spatial and temporal resolution; NDVI is a normalized transform of the near-infrared (NIR) to red reflectance ratio. Therefore, an extended GIMMS NDVI dataset, from 1982-2006 to 1982-2014, was produced using a unary linear regression with the MODIS dataset from 2000 to 2014 over the QTP. Compared with previous studies, the accuracy of the extended NDVI dataset was further improved by taking into account the residuals derived from scale transformation, so the model used to extend the NDVI dataset could serve as a new method to integrate different NDVI products. With the extended NDVI dataset, we found a statistically significant increase in the growing season (0.0004 yr⁻¹, r² = 0.5859, p < 0.001) over the QTP from 1982 to 2014. During the study period, the NDVI trends increased significantly in spring (0.0005 yr⁻¹, r² = 0.2954, p = 0.001), summer (0.0003 yr⁻¹, r² = 0.1053, p = 0.065) and autumn (0.0006 yr⁻¹, r² = 0.4367, p < 0.001), respectively. Owing to the increased vegetation activity in the Qinghai-Tibet Plateau from 1982 to 2014, the carbon sink in this region also accumulated over the same period. Temperature and precipitation data were then used to explore the causes of the vegetation changes. Although both show increasing trends, the correlation between NDVI and temperature is higher than that with precipitation in the growing season, spring, summer and autumn. Furthermore, there is significant spatial heterogeneity in the changing trends of NDVI, temperature and precipitation at the Qinghai-Tibet Plateau scale.
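A toy sketch of extending one NDVI series with another via linear regression over their overlap period and then estimating a trend, in the spirit of the approach above; all series are synthetic, and the residual/scale-transformation correction is not included.

```python
import numpy as np
from scipy import stats

years = np.arange(1982, 2015)
rng = np.random.default_rng(6)
modis = 0.30 + 0.0004 * (years - 1982) + rng.normal(0, 0.01, years.size)   # stand-in, used 2000-2014
gimms = 0.02 + 0.95 * modis + rng.normal(0, 0.01, years.size)               # stand-in, used 1982-2006

# Unary (simple) linear regression on the overlap period, then extend GIMMS beyond 2006.
overlap = (years >= 2000) & (years <= 2006)
slope, intercept, *_ = stats.linregress(modis[overlap], gimms[overlap])
extended = gimms.copy()
extended[years > 2006] = intercept + slope * modis[years > 2006]

trend = stats.linregress(years, extended)
print(f"trend {trend.slope:.5f} yr^-1, r^2 {trend.rvalue ** 2:.3f}, p {trend.pvalue:.3g}")
```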
SU-E-J-85: Leave-One-Out Perturbation (LOOP) Fitting Algorithm for Absolute Dose Film Calibration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chu, A; Ahmad, M; Chen, Z
2014-06-01
Purpose: To introduce an outlier-recognition fitting routine for film dosimetry. It can not only be used flexibly with any linear or non-linear regression but can also provide information on the minimal number of sampling points, critical sampling distributions and the evaluation of analytical functions for absolute film-dose calibration. Methods: The technique, leave-one-out (LOO) cross validation, is often used for statistical analyses of model performance. We used LOO analyses with perturbed bootstrap fitting, called leave-one-out perturbation (LOOP), for film-dose calibration. Given a threshold, the LOO process detects unfit points ("outliers") compared to other cohorts, and a bootstrap fitting process follows to seek any possibilities of using perturbations for further improvement. After that, outliers were reconfirmed by traditional t-test statistics and eliminated, and another LOOP feedback produced the final result. An over-sampled film-dose-calibration dataset was collected as a reference (dose range: 0-800 cGy), and various simulated conditions for outliers and sampling distributions were derived from the reference. Comparisons over the various conditions were made, and the performance of the fitting functions, polynomial and rational, was evaluated. Results: (1) LOOP demonstrates sensitive outlier recognition through the statistical correlation with an exceptionally better goodness of fit when outliers are left out. (2) With sufficient statistical information, LOOP can correct outliers under some low-sampling conditions where other "robust fits", e.g. Least Absolute Residuals, cannot. (3) Complete cross-validated analyses of LOOP indicate that the rational-type function demonstrates much superior performance compared to the polynomial. Even with 5 data points including one outlier, using LOOP with a rational function can restore more than 95% of the value back to its reference, while the polynomial fitting completely failed under the same conditions. Conclusion: LOOP can cooperate with any fitting routine functioning as a "robust fit". In addition, it can be set as a benchmark for film-dose calibration fitting performance.
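A hedged sketch of the leave-one-out screening step only: each calibration point is omitted in turn, the curve is refit, and the point whose omission improves the fit the most is flagged; the rational function, doses and optical densities are illustrative, and the perturbed-bootstrap and t-test stages of LOOP are not shown.

```python
import numpy as np
from scipy.optimize import curve_fit

def rational(x, a, b, c):
    """Simple rational calibration function (illustrative form)."""
    return (a + b * x) / (1.0 + c * x)

dose = np.array([0, 100, 200, 300, 400, 500, 600, 700, 800], dtype=float)   # cGy
od = np.array([0.05, 0.21, 0.35, 0.46, 0.75, 0.62, 0.68, 0.73, 0.78])       # 400 cGy point is off

def sse_without(i):
    """Refit with point i left out and return the residual sum of squares."""
    mask = np.arange(dose.size) != i
    popt, _ = curve_fit(rational, dose[mask], od[mask], p0=[0.05, 1e-3, 1e-3], maxfev=10000)
    return np.sum((od[mask] - rational(dose[mask], *popt)) ** 2)

sse = np.array([sse_without(i) for i in range(dose.size)])
flagged = int(np.argmin(sse))    # leaving this point out improves the fit the most
print("candidate outlier at dose", dose[flagged])
```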
GenomeGraphs: integrated genomic data visualization with R.
Durinck, Steffen; Bullard, James; Spellman, Paul T; Dudoit, Sandrine
2009-01-06
Biological studies involve a growing number of distinct high-throughput experiments to characterize samples of interest. There is a lack of methods to visualize these different genomic datasets in a versatile manner. In addition, genomic data analysis requires integrated visualization of experimental data along with constantly changing genomic annotation and statistical analyses. We developed GenomeGraphs, as an add-on software package for the statistical programming environment R, to facilitate integrated visualization of genomic datasets. GenomeGraphs uses the biomaRt package to perform on-line annotation queries to Ensembl and translates these to gene/transcript structures in viewports of the grid graphics package. This allows genomic annotation to be plotted together with experimental data. GenomeGraphs can also be used to plot custom annotation tracks in combination with different experimental data types together in one plot using the same genomic coordinate system. GenomeGraphs is a flexible and extensible software package which can be used to visualize a multitude of genomic datasets within the statistical programming environment R.
De Hertogh, Benoît; De Meulder, Bertrand; Berger, Fabrice; Pierre, Michael; Bareke, Eric; Gaigneaux, Anthoula; Depiereux, Eric
2010-01-11
Recent reanalysis of spike-in datasets underscored the need for new and more accurate benchmark datasets for statistical microarray analysis. We present here a fresh method using biologically-relevant data to evaluate the performance of statistical methods. Our novel method ranks the probesets from a dataset composed of publicly-available biological microarray data and extracts subset matrices with precise information/noise ratios. Our method can be used to determine the capability of different methods to better estimate variance for a given number of replicates. The mean-variance and mean-fold change relationships of the matrices revealed a closer approximation of biological reality. Performance analysis refined the results from benchmarks published previously. We show that the Shrinkage t test (close to Limma) was the best of the methods tested, except when two replicates were examined, where the Regularized t test and the Window t test performed slightly better. The R scripts used for the analysis are available at http://urbm-cluster.urbm.fundp.ac.be/~bdemeulder/.
Accuracy of five intraoral scanners compared to indirect digitalization.
Güth, Jan-Frederik; Runkel, Cornelius; Beuer, Florian; Stimmelmayr, Michael; Edelhoff, Daniel; Keul, Christine
2017-06-01
Direct and indirect digitalization offer two options for computer-aided design (CAD)/computer-aided manufacturing (CAM)-generated restorations. The aim of this study was to evaluate the accuracy of different intraoral scanners and compare them to the process of indirect digitalization. A titanium testing model was directly digitized 12 times with each intraoral scanner: (1) CS 3500 (CS), (2) Zfx Intrascan (ZFX), (3) CEREC AC Bluecam (BLU), (4) CEREC AC Omnicam (OC) and (5) True Definition (TD). As control, 12 polyether impressions were taken and the corresponding plaster casts were digitized indirectly with the D-810 laboratory scanner (CON). The accuracy (trueness/precision) of the datasets was evaluated with analysis software (Geomagic Qualify 12.1) using a "best fit alignment" of the datasets with a highly accurate reference dataset of the testing model obtained from industrial computed tomography. Direct digitalization using the TD showed the significantly highest overall "trueness", followed by CS. Both performed better than CON. BLU, ZFX and OC showed higher differences from the reference dataset than CON. Regarding the overall "precision", the CS 3500 intraoral scanner and the True Definition showed the best performance. CON, BLU and OC resulted in significantly higher precision than ZFX did. Within the limitations of this in vitro study, the accuracy of the acquired datasets depended on the scanning system. Direct digitalization was not superior to indirect digitalization for all tested systems. Regarding accuracy, all tested intraoral scanning technologies seem to be able to reproduce a single quadrant within clinically acceptable accuracy. However, differences were detected between the tested systems.
Jędrkiewicz, Renata; Tsakovski, Stefan; Lavenu, Aurore; Namieśnik, Jacek; Tobiszewski, Marek
2018-02-01
A novel methodology for grouping and ranking with the application of self-organizing maps and multicriteria decision analysis is presented. The dataset consists of 22 objects that are analytical procedures applied to furan determination in food samples. They are described by 10 variables related to their analytical performance and environmental and economic aspects. Multivariate statistical analysis makes it possible to limit the amount of input data for the ranking analysis. The assessment results show that the most beneficial procedures are based on microextraction techniques with GC-MS final determination. We show how the information obtained from the two tools complements each other. The applicability of the combination of grouping and ranking is also discussed. Copyright © 2017 Elsevier B.V. All rights reserved.
Using statistical text classification to identify health information technology incidents
Chai, Kevin E K; Anthony, Stephen; Coiera, Enrico; Magrabi, Farah
2013-01-01
Objective To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. Design We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. Text classifiers using regularized logistic regression were evaluated with both ‘balanced’ (50% HIT) and ‘stratified’ (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. Measurements κ statistic, F1 score, precision and recall. Results Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size to 79% while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Conclusions Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data analysis of patient safety incidents and requires further investigation. PMID:23666777
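A minimal sketch of a regularized logistic-regression text classifier for incident reports, using TF-IDF features in scikit-learn; the toy report strings, labels and settings are placeholders rather than the MAUDE pipeline described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

reports = [
    "software froze and the order entry screen displayed the wrong patient",
    "interface error between monitor and EHR caused missing records",
    "infusion pump tubing cracked during use",
    "catheter balloon ruptured on insertion",
] * 50
labels = [1, 1, 0, 0] * 50            # 1 = HIT incident, 0 = other device incident

X_train, X_test, y_train, y_test = train_test_split(reports, labels, random_state=0)
clf = make_pipeline(TfidfVectorizer(stop_words="english"),   # bag-of-words features
                    LogisticRegression(C=1.0))               # L2-regularized classifier
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```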
NASA Technical Reports Server (NTRS)
Gardner, Adrian
2010-01-01
National Aeronautics and Space Administration (NASA) weather and atmospheric environmental organizations are insatiable consumers of geophysical, hydrometeorological and solar weather statistics. The expanding array of internetworked sensors producing targeted physical measurements has generated an almost factorial explosion of near real-time inputs to topical statistical datasets. Normalizing and value-based parsing of such statistical datasets in support of time-constrained weather and environmental alerts and warnings is essential, even with dedicated high-performance computational capabilities. What are the optimal indicators for advanced decision making? How do we recognize the line between sufficient statistical sampling and excessive, mission-destructive sampling? How do we assure that the normalization and parsing process, when interpolated through numerical models, yields accurate and actionable alerts and warnings? This presentation will address the integrated means and methods to achieve desired outputs for NASA and consumers of its data.
Kim, Stephanie; Eliot, Melissa; Koestler, Devin C; Houseman, Eugene A; Wetmur, James G; Wiencke, John K; Kelsey, Karl T
2016-09-01
We examined whether variation in blood-based epigenome-wide association studies could be more completely explained by augmenting existing reference DNA methylation libraries. We compared existing and enhanced libraries in predicting variability in three publicly available 450K methylation datasets that collected whole-blood samples. Models were fit separately to each CpG site and used to estimate the additional variability explained when adjustments for cell composition were made with each library. Calculation of the mean difference in the CpG-specific residual sums of squares between models for an arthritis, an aging and a metabolic syndrome dataset indicated that the enhanced library explained significantly more variation across all three datasets (p < 10⁻³). Pathologically important immune cell subtypes can explain important variability in epigenome-wide association studies done in blood.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Greenblatt, Jeffery B.; Yang, Hung-Chia; Desroches, Louis-Benoit
2013-04-01
We present two post-stratification weighting methods to validate survey data collected using Amazon Mechanical Turk (AMT). Two surveys focused on appliance and consumer electronics devices were administered in the spring and summer of 2012, each to approximately 3,000 U.S. households. Specifically, the surveys asked questions about residential refrigeration products, televisions (TVs) and set-top boxes (STBs). Filtered data were assigned weights using each of two weighting methods, termed "sequential" and "simultaneous," by examining up to eight demographic variables (income, education, gender, race, Hispanic origin, number of occupants, ages of occupants, and geographic region) in comparison to reference U.S. demographic data from the 2009 Residential Energy Consumption Survey (RECS). Five key questions from the surveys (number of refrigerators, number of freezers, number of TVs, number of STBs and primary service provider) were evaluated with a set of statistical tests to determine whether either method improved the agreement of AMT with reference data, and if so, which method was better. The statistical tests used were: differences in proportions, distributions of proportions (using Pearson's chi-squared test), and differences in average numbers of devices as functions of all demographic variables. The results indicated that both methods generally improved the agreement between AMT and reference data, sometimes greatly, but that the simultaneous method was usually superior to the sequential method. Some differences in sample populations were found between the AMT and reference data. Differences in the proportion of STBs reflected large changes in the STB market since the time our reference data were acquired in 2009. Differences in the proportions of some primary service providers suggested real sample bias, with the possible explanation that AMT users are more likely to subscribe to providers who also provide home internet service. Differences in other variables, while statistically significant in some cases, were nonetheless considered to be minor. Depending on the intended purpose of the data collected using AMT, these biases may or may not be important; to correct them, additional questions and/or further post-survey adjustments could be employed. In general, based on the analysis methods and the sample datasets used in this study, AMT surveys appeared to provide useful data on appliance and consumer electronics devices.
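A hedged sketch of post-stratification on a single demographic variable: weights are the ratio of reference population shares to sample shares and are then used for weighted estimates; the sequential and simultaneous schemes described above combine several variables, and the shares below are invented.

```python
import pandas as pd

sample = pd.DataFrame({"income": ["low", "low", "mid", "high", "high", "high"],
                       "n_tvs":  [1, 2, 2, 3, 2, 4]})
reference_share = {"low": 0.40, "mid": 0.35, "high": 0.25}   # e.g. population shares from RECS

# Weight = reference share of the cell divided by its share in the (biased) sample.
sample_share = sample["income"].value_counts(normalize=True)
sample["weight"] = sample["income"].map(lambda g: reference_share[g] / sample_share[g])

weighted_mean_tvs = (sample["weight"] * sample["n_tvs"]).sum() / sample["weight"].sum()
print(weighted_mean_tvs)
```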
Analysing Relationships Between Urban Land Use Fragmentation Metrics and Socio-Economic Variables
NASA Astrophysics Data System (ADS)
Sapena, M.; Ruiz, L. A.; Goerlich, F. J.
2016-06-01
Analysing urban regions is essential for their correct monitoring and planning. This is mainly due to the sharp increase in the number of people living in urban areas and, consequently, the need to manage them. At the same time there has been a rise in the use of spatial and statistical datasets, such as the Urban Atlas, which offers high-resolution urban land use maps obtained from satellite imagery, and the Urban Audit, which provides statistics of European cities and their surroundings. In this study, we analyse the relations between urban fragmentation metrics derived from Land Use and Land Cover (LULC) data from the Urban Atlas dataset and socio-economic data from the Urban Audit for the reference years 2006 and 2012. We conducted the analysis on a sample of sixty-eight Functional Urban Areas (FUAs). One-date and two-date based fragmentation indices were computed for each FUA, land use class and date. Correlation tests and principal component analysis were then applied to select the most representative indices. Finally, multiple regression models were tested to explore the prediction of socio-economic variables, using different combinations of land use metrics as explanatory variables, both at a given date and in a dynamic context. The outcomes show that demography, living conditions, labour, and transportation variables have a clear relation with the morphology of the FUAs. This methodology allows us to compare European FUAs in terms of the spatial distribution of the land use classes, their complexity, and their structural changes, as well as to preview and model different growth patterns and socio-economic indicators.
ToxMiner Software Interface for Visualizing and Analyzing ToxCast Data
The ToxCast dataset represents a collection of assays and endpoints that will require both standard statistical approaches as well as customized data analysis workflows. To analyze this unique dataset, we have developed an integrated database with a Java-based interface called ToxMi...
Efficient and Flexible Climate Analysis with Python in a Cloud-Based Distributed Computing Framework
NASA Astrophysics Data System (ADS)
Gannon, C.
2017-12-01
As climate models become progressively more advanced, and spatial resolution is further improved through various downscaling projects, climate projections at a local level are increasingly insightful and valuable. However, the raw size of climate datasets presents numerous hurdles for analysts wishing to develop customized climate risk metrics or perform site-specific statistical analysis. Four Twenty Seven, a climate risk consultancy, has implemented a Python-based distributed framework to analyze large climate datasets in the cloud. With the freedom afforded by efficiently processing these datasets, we are able to customize and continually develop new climate risk metrics using the most up-to-date data. Here we outline our process for using Python packages such as XArray and Dask to evaluate netCDF files in a distributed framework, StarCluster to operate in a cluster-computing environment, and cloud computing services to access publicly hosted datasets, and we describe how this setup is particularly valuable for generating climate change indicators and performing localized statistical analysis.
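A minimal sketch of the xarray/Dask pattern described above: open a netCDF file lazily in chunks and compute an annual indicator; the file name, variable name and threshold are hypothetical.

```python
import xarray as xr

# Hypothetical downscaled daily maximum temperature file; chunking enables lazy, parallel evaluation.
ds = xr.open_dataset("tasmax_downscaled.nc", chunks={"time": 365})

# Example climate indicator: number of days per year above 35 °C (308.15 K), averaged over the domain.
hot_days = (ds["tasmax"] > 308.15).groupby("time.year").sum("time")
result = hot_days.mean(["lat", "lon"]).compute()   # .compute() triggers the Dask execution
print(result)
```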
SPICE: exploration and analysis of post-cytometric complex multivariate datasets.
Roederer, Mario; Nozzi, Joshua L; Nason, Martha C
2011-02-01
Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.
Generation of synthetic flood hydrographs by hydrological donors (SHYDONHY method)
NASA Astrophysics Data System (ADS)
Paquet, Emmanuel
2017-04-01
For the design of hydraulic infrastructures such as dams, a design hydrograph is required in most cases. Some of its features (e.g. peak value, duration, volume) corresponding to a given return period are computed using a wide range of methods: historical records, mono- or multivariate statistical analysis, stochastic simulation, etc. Various methods have then been proposed to construct design hydrographs with such characteristics, ranging from the traditional unit hydrograph to statistical methods (Yue et al., 2002). A new method to build design hydrographs (or more generally synthetic hydrographs) is introduced here, named SHYDONHY, the French acronym for "Synthèse d'HYdrogrammes par DONneurs HYdrologiques". It is based on an extensive database of 100 000 flood hydrographs recorded at hourly time step at 1300 gauging stations in France and Switzerland, covering a wide range of catchment sizes and climatologies. For each station, an average of two hydrographs per year of record has been selected by a peak-over-threshold (POT) method with independence criteria (Lang et al., 1999). This sampling ensures that only hydrographs of intense floods are gathered in the dataset. For a given catchment, where few or no hydrographs are available at the outlet, a sub-set of 10 "donor stations" is selected within the complete dataset, considering several criteria: proximity, size, mean annual values and regimes for both total runoff and POT-selected floods. This sub-set of stations (and their corresponding flood hydrographs) allows one to: • Estimate a characteristic duration of flood hydrographs (e.g. the duration for which the discharge is above 50% of the peak value). • For a given duration (e.g. one day), estimate the average peak-to-volume ratio of floods. • For a given duration and peak-to-volume ratio, generate a synthetic reference hydrograph by combining appropriate hydrographs of the sub-set. • For a given daily discharge sequence, whether observed or generated for extreme flood estimation, generate a suitable synthetic hydrograph, also by combining selected hydrographs of the sub-set. The reliability of the method is assessed by performing a jackknife validation on the whole dataset of stations, in particular by reconstructing the hydrograph of the biggest flood of each station and comparing it to the actual one. Some applications are presented, e.g. the coupling of SHYDONHY with the SCHADEX method (Paquet et al., 2013) for the stochastic simulation of extreme reservoir levels at dams. References: Lang, M., Ouarda, T. B. M. J., & Bobée, B. (1999). Towards operational guidelines for over-threshold modeling. Journal of Hydrology, 225(3), 103-117. Paquet, E., Garavaglia, F., Garçon, R., & Gailhard, J. (2013). The SCHADEX method: A semi-continuous rainfall-runoff simulation for extreme flood estimation. Journal of Hydrology, 495, 23-37. Yue, S., Ouarda, T. B., Bobée, B., Legendre, P., & Bruneau, P. (2002). Approach for describing statistical properties of flood hydrograph. Journal of Hydrologic Engineering, 7(2), 147-153.
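A hedged sketch of peak-over-threshold sampling with a simple independence rule (a minimum separation between retained peaks); the threshold, separation and synthetic discharge series are illustrative and do not reproduce the criteria of Lang et al. (1999).

```python
import numpy as np

rng = np.random.default_rng(7)
q = rng.gamma(2.0, 20.0, size=24 * 365 * 3)      # three years of toy hourly discharge (m3/s)

threshold = np.quantile(q, 0.995)                # illustrative POT threshold
min_separation = 72                              # hours between peaks treated as independent

peaks = []
for t in np.argsort(q)[::-1]:                    # candidates from largest to smallest
    if q[t] < threshold:
        break
    if all(abs(t - p) >= min_separation for p in peaks):
        peaks.append(t)

print(len(peaks), "independent flood peaks retained")
```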
Highland, Steven; James, R R
2016-04-01
Honey bee (Apis mellifera L., Hymenoptera: Apidae) colonies have experienced profound fluctuations, especially declines, in the past few decades. Long-term datasets on honey bees are needed to identify the most important environmental and cultural factors associated with these changes. While a few such datasets exist, scientists have been hesitant to use some of these due to perceived shortcomings in the data. We compared data and trends for three datasets. Two come from the US Department of Agriculture's National Agricultural Statistics Service (NASS), Agricultural Statistics Board: one is the annual survey of honey-producing colonies from the Annual Bee and Honey program (ABH), and the other is colony counts from the Census of Agriculture conducted every five years. The third dataset we developed from the number of colonies registered annually by some states. We compared the long-term patterns of change in colony numbers among the datasets on a state-by-state basis. The three datasets often showed similar hive numbers and trends varied by state, with differences between datasets being greatest for those states receiving a large number of migratory colonies. Dataset comparisons provide a method to estimate the number of colonies in a state used for pollination versus honey production. Some states also had separate data for local and migratory colonies, allowing one to determine whether the migratory colonies were typically used for pollination or honey production. The Census of Agriculture should provide the most accurate long-term data on colony numbers, but only every five years. © The Authors 2016. Published by Oxford University Press on behalf of Entomological Society of America. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Modeling extreme sea levels due to tropical and extra-tropical cyclones at the global-scale
NASA Astrophysics Data System (ADS)
Muis, S.; Lin, N.; Verlaan, M.; Winsemius, H.; Ward, P.; Aerts, J.
2017-12-01
Extreme sea levels, a combination of storm surges and astronomical tides, can cause catastrophic floods. Due to their intense wind speeds and low pressure, tropical cyclones (TCs) typically cause higher storm surges than extra-tropical cyclones (ETCs), but ETCs may still contribute significantly to the overall flood risk. In this contribution, we show a novel approach to model extreme sea levels due to both tropical and extra-tropical cyclones at the global scale. Using a global hydrodynamic model we have developed the Global Tide and Surge Reanalysis (GTSR) dataset (Muis et al., 2016), which provides daily maximum time series of storm tide from 1979 to 2014. GTSR is based on wind and pressure fields from the ERA-Interim climate reanalysis (Dee et al., 2011). A severe limitation of the GTSR dataset is the underrepresentation of TCs. This is due to the relatively coarse grid resolution of ERA-Interim, which means that the strong intensities of TCs are not fully included. Furthermore, the length of ERA-Interim is too short to estimate the probabilities of extreme TCs in a reliable way. We will discuss potential ways to address this limitation, and demonstrate how to improve the global GTSR framework. We will apply the improved framework to the east coast of the United States. First, we improve our meteorological forcing by applying a parametric hurricane model (Holland, 1980), and we improve the tide and surge reanalysis dataset (Muis et al., 2016) by explicitly modeling the historical TCs in the Extended Best Track dataset (Demuth et al., 2006). Second, we improve our sampling by statistically extending the observed TC record to many thousands of years (Emanuel et al., 2006). The improved framework allows for the mapping of probabilities of extreme sea levels, including extreme TC events, for the east coast of the United States. References: Dee et al. (2011). The ERA-Interim reanalysis: configuration and performance of the data assimilation system. Q. J. R. Meteorol. Soc. 137, 553-597. Emanuel et al. (2006). A Statistical Deterministic Approach to Hurricane Risk Assessment. Bull. Am. Meteorol. Soc. 87, 299-314. Holland (1980). An analytic model of the wind and pressure profiles in hurricanes. Mon. Weather Rev. 108, 1212-1218. Muis et al. (2016). A global reanalysis of storm surge and extreme sea levels. Nat. Commun. 7, 1-11.
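A sketch of the Holland (1980) parametric gradient-wind profile that such TC forcing typically relies on; the storm parameters below are illustrative.

```python
import numpy as np

def holland_wind(r_km, p_centre=950e2, p_env=1010e2, r_max_km=40.0, B=1.5,
                 rho=1.15, lat_deg=30.0):
    """Gradient wind speed (m/s) at radius r_km from the storm centre, Holland (1980)."""
    f = 2 * 7.292e-5 * np.sin(np.radians(lat_deg))   # Coriolis parameter
    r = np.asarray(r_km, dtype=float) * 1e3
    r_max = r_max_km * 1e3
    a = (r_max / r) ** B
    dp = p_env - p_centre                            # central pressure deficit (Pa)
    return np.sqrt(a * B * dp * np.exp(-a) / rho + (r * f / 2) ** 2) - r * f / 2

radii = np.array([10.0, 40.0, 100.0, 300.0])         # km from the centre
print(holland_wind(radii))                            # wind profile, maximum near r_max
```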
Improving the discoverability, accessibility, and citability of omics datasets: a case report.
Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J
2017-03-01
Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Analysis models for the estimation of oceanic fields
NASA Technical Reports Server (NTRS)
Carter, E. F.; Robinson, A. R.
1987-01-01
A general model for statistically optimal estimates is presented for dealing with scalar, vector and multivariate datasets. The method deals with anisotropic fields and treats space and time dependence equivalently. Problems addressed include the analysis, or the production of synoptic time series of regularly gridded fields from irregular and gappy datasets, and the estimate of fields by compositing observations from several different instruments and sampling schemes. Technical issues are discussed, including the convergence of statistical estimates, the choice of representation of the correlations, the influential domain of an observation, and the efficiency of numerical computations.
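A hedged sketch of a Gauss-Markov (optimal interpolation) estimate at a single target point from irregular observations, assuming an isotropic Gaussian covariance and a fixed noise variance; the locations, covariance scales and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(8)
obs_xy = rng.uniform(0, 100, size=(30, 2))                  # irregular observation locations
obs = np.sin(obs_xy[:, 0] / 20.0) + 0.1 * rng.normal(size=30)
target = np.array([50.0, 50.0])                              # grid point to be estimated

def cov(d, length=25.0, variance=1.0):
    """Assumed isotropic Gaussian covariance of the field as a function of distance."""
    return variance * np.exp(-(d / length) ** 2)

d_obs_obs = np.linalg.norm(obs_xy[:, None, :] - obs_xy[None, :, :], axis=-1)
d_target_obs = np.linalg.norm(obs_xy - target, axis=-1)

C = cov(d_obs_obs) + 0.01 * np.eye(obs.size)                 # observation covariance + noise
weights = np.linalg.solve(C, cov(d_target_obs))              # Gauss-Markov weights
estimate = weights @ (obs - obs.mean()) + obs.mean()
print(estimate)
```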
Statistical tests and identifiability conditions for pooling and analyzing multisite datasets.
Zhou, Hao Henry; Singh, Vikas; Johnson, Sterling C; Wahba, Grace
2018-02-13
When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer's disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies. Copyright © 2018 the Author(s). Published by PNAS.
NASA Astrophysics Data System (ADS)
Ribeiro Fontoura, Jessica; Allasia, Daniel; Herbstrith Froemming, Gabriel; Freitas Ferreira, Pedro; Tassi, Rutineia
2016-04-01
Evapotranspiration is a key process of the hydrological cycle and the sole term that links the land surface water balance and the land surface energy balance. Due to the higher information requirements of the Penman-Monteith method and the existing data uncertainty, simplified empirical methods for calculating potential and actual evapotranspiration are widely used in hydrological models. This is especially important in Brazil, where the monitoring of meteorological data is precarious. In this study, different methods for estimating evapotranspiration were compared for Rio Grande do Sul, the southernmost state of Brazil, aiming to suggest alternatives to the recommended method (Penman-Monteith FAO-56) for estimating daily reference evapotranspiration (ETo) when meteorological data are missing or not available. The input dataset included daily and hourly observed data from conventional and automatic weather stations, respectively, maintained by the National Weather Institute of Brazil (INMET) for the period of 1 January 2007 to 31 January 2010. The dataset included maximum temperature (Tmax, °C), minimum temperature (Tmin, °C), mean relative humidity (%), wind speed at 2 m height (u2, m s-1), daily solar radiation (Rs, MJ m-2) and atmospheric pressure (kPa), which were grouped at a daily time step. The Food and Agriculture Organization of the United Nations (FAO) Penman-Monteith method (PM) was tested in its full form against PM with several variables, not normally available in Brazil, assumed missing, in order to calculate daily reference ETo. Missing variables were estimated as suggested in the FAO-56 publication or from climatological means. Furthermore, PM was also compared against the following simplified empirical methods: Hargreaves-Samani, Priestley-Taylor, McCloud, McGuinness-Bordne, Romanenko, Radiation-Temperature and Tanner-Pelton. The statistical analysis indicates that even if only Tmin and Tmax are available, it is better to use PM with the missing variables estimated from synthetic data than the simplified empirical methods evaluated, except for Tanner-Pelton and Priestley-Taylor.
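For reference, a sketch of the Hargreaves-Samani temperature-based ETo estimate, one of the simplified alternatives compared above; Ra is the extraterrestrial radiation expressed as an evaporation equivalent (mm/day), and the input values are illustrative.

```python
def eto_hargreaves_samani(t_max, t_min, ra_mm):
    """Daily reference evapotranspiration (mm/day) from the Hargreaves-Samani equation:
    ETo = 0.0023 * Ra * (Tmean + 17.8) * sqrt(Tmax - Tmin)."""
    t_mean = (t_max + t_min) / 2.0
    return 0.0023 * ra_mm * (t_mean + 17.8) * (t_max - t_min) ** 0.5

# Illustrative summer day in southern Brazil (values are placeholders).
print(eto_hargreaves_samani(t_max=30.0, t_min=18.0, ra_mm=15.0))
```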
Statistical analysis of large simulated yield datasets for studying climate effects
USDA-ARS?s Scientific Manuscript database
Ensembles of process-based crop models are now commonly used to simulate crop growth and development for climate scenarios of temperature and/or precipitation changes corresponding to different projections of atmospheric CO2 concentrations. This approach generates large datasets with thousands of de...
Finding the Maine Story in Huge Cumbersome National Monitoring Datasets
What’s a manager, analyst, or concerned citizen to do with the complex datasets generated by State and Federal monitoring efforts? Is it possible to use such information to address Maine’s environmental issues without having a degree in informatics and statistics? This presentati...
Network Intrusion Dataset Assessment
2013-03-01
Reference fragments: Security, 6(1):173-180, October 2009. abs/0911.0787. • Jungsuk Song, Hiroki Takakura, Yasuo Okabe, and Koji Nakao. "Toward a more practical..." • ...Inoue, and Koji Nakao. "Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation". BADGERS '11: Proceedings of...
Optimal SVM parameter selection for non-separable and unbalanced datasets.
Jiang, Peng; Missoum, Samy; Chen, Zhao
2014-10-01
This article presents a study of three validation metrics used for the selection of optimal parameters of a support vector machine (SVM) classifier in the case of non-separable and unbalanced datasets. This situation is often encountered when the data is obtained experimentally or clinically. The three metrics selected in this work are the area under the ROC curve (AUC), accuracy, and balanced accuracy. These validation metrics are tested using computational data only, which enables the creation of fully separable sets of data. This way, non-separable datasets, representative of a real-world problem, can be created by projection onto a lower dimensional sub-space. The knowledge of the separable dataset, unknown in real-world problems, provides a reference to compare the three validation metrics using a quantity referred to as the "weighted likelihood". As an application example, the study investigates a classification model for hip fracture prediction. The data is obtained from a parameterized finite element model of a femur. The performance of the various validation metrics is studied for several levels of separability, ratios of unbalance, and training set sizes.
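A minimal sketch of hyperparameter selection for an SVM on an unbalanced dataset under different validation metrics in scikit-learn; the synthetic data and parameter grid are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Unbalanced synthetic problem (about 10% positives).
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# The selected parameters can differ depending on the validation metric used.
for metric in ("roc_auc", "balanced_accuracy", "accuracy"):
    search = GridSearchCV(SVC(), grid, scoring=metric, cv=5).fit(X, y)
    print(metric, search.best_params_, round(search.best_score_, 3))
```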
NASA Astrophysics Data System (ADS)
Horton, Pascal; Weingartner, Rolf; Brönnimann, Stefan
2017-04-01
The analogue method is a statistical downscaling method for precipitation prediction. It uses similarity in terms of synoptic-scale predictors with situations in the past in order to provide a probabilistic prediction for the day of interest. It has been used for decades in a context of weather or flood forecasting, and is more recently also applied to climate studies, whether for the reconstruction of past weather conditions or for future climate impact studies. In order to evaluate the relationship between synoptic-scale predictors and the local weather variable of interest, e.g. precipitation, reanalysis datasets are necessary. Nowadays, the number of available reanalysis datasets is increasing. These are generated by different atmospheric models with different assimilation techniques and offer various spatial and temporal resolutions. A major difference between these datasets is also the length of the archive they provide. While some datasets start at the beginning of the satellite era (1980) and assimilate these data, others aim at homogeneity over a longer period (e.g. the 20th century) and only assimilate conventional observations. The context of the application of analogue methods might drive the choice of an appropriate dataset, for example when the archive length is a leading criterion. However, in many studies, a reanalysis dataset is subjectively chosen, according to the user's preferences or ease of access. The impact of this choice on the results of the downscaling procedure is rarely considered, and no comprehensive comparison has been undertaken so far. In order to fill this gap and to advise on the choice of appropriate datasets, nine different global reanalysis datasets were compared in seven distinct versions of analogue methods, over 300 precipitation stations in Switzerland. Significant differences in terms of prediction performance were identified. Although the impact of the reanalysis dataset on the skill score varies according to the chosen predictor, be it atmospheric circulation or thermodynamic variables, some hierarchy between the datasets is often preserved. This work can thus help in choosing an appropriate dataset for the analogue method, or raise awareness of the consequences of using a certain dataset.
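A toy sketch of the analogue principle: past days are ranked by the similarity of a synoptic predictor field to the target day, and the precipitation of the closest analogues forms the probabilistic prediction; the fields below are random stand-ins for reanalysis predictors.

```python
import numpy as np

rng = np.random.default_rng(9)
archive_fields = rng.normal(size=(5000, 8, 8))      # e.g. daily geopotential anomaly patches
archive_precip = rng.gamma(1.2, 3.0, size=5000)     # precipitation observed on those days
target_field = rng.normal(size=(8, 8))               # predictor field for the day of interest

# Rank archived days by RMSE of the predictor field, keep the 30 closest analogues.
rmse = np.sqrt(((archive_fields - target_field) ** 2).mean(axis=(1, 2)))
analogues = np.argsort(rmse)[:30]

# The analogue precipitation distribution gives the probabilistic prediction.
prediction_quantiles = np.quantile(archive_precip[analogues], [0.2, 0.5, 0.9])
print(prediction_quantiles)
```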
A quality assured surface wind database in Eastern Canada
NASA Astrophysics Data System (ADS)
Lucio-Eceiza, E. E.; González-Rouco, J. F.; Navarro, J.; Beltrami, H.; Jiménez, P. A.; García-Bustamante, E.; Hidalgo, A.
2012-04-01
This work summarizes the results of a Quality Assurance (QA) procedure applied to wind data centred over a wide area in Eastern Canada. The region includes the provinces of Quebec, Prince Edward Island, New Brunswick, Nova Scotia, Newfoundland, Labrador and parts of the north-eastern U.S. (Maine, New Hampshire, Massachusetts, New York and Vermont). The data set consists of 527 stations compiled from three different sources: 344 land sites from Environment Canada (EC; 1940-2009), 40 buoys distributed over the East Coast and the Canadian Great Lakes provided by the Department of Fisheries and Oceans (DFO; 1988-2008), and 143 land sites over both eastern Canada and north-eastern U.S. provided by the National Center of Atmospheric Research (NCAR; 1975-2007). The complexity of the QA process is enhanced in this case by the variety of institutional observational protocols that lead to different temporal resolutions (hourly, 3-h and 6-h), unit systems (km/h in EC; m/s in DFO and knots in NCAR), time references (e.g. UTC, UTC+1, UTC-5, UTC-4), etc. Initial corrections comprised the establishment of common reference systems for time (UTC) and units (MKS). The QA applied on the resulting dataset is structured in three steps that involve the detection and correction of: manipulation errors (i.e. repetitions); unrealistic values and ranges in wind module and direction; abnormally low (e.g. long constant periods) and high variations (e.g. extreme values and inhomogeneities). Results from the first step indicate 22 sites (8 EC; 14 DFO) showing temporal patterns that are unrealistically repeated along the stations. After the QA is applied, the dataset will be subject to statistical and dynamical downscaling studies. The statistical approaches will allow for an understanding of the wind field variability related to changes in the large scale atmospheric circulation as well as their dependence on local/regional features like topography, land-sea contrasts, snow/ice presence, etc. The dynamical downscaling will allow for process understanding assessments by performing high spatial resolution simulations with the WRF model. Finally, model validation will be targeted through the comparison with observations.
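A sketch of three elementary checks of the kind listed above: unit harmonisation to m/s, a physical range check on wind speed, and detection of abnormally long constant runs. The thresholds, column handling and function names are assumptions chosen for illustration, not the operational QA procedure.

```python
# Illustrative QA checks for a wind-speed series (thresholds are assumptions).
import numpy as np
import pandas as pd

KMH_TO_MS = 1.0 / 3.6
KNOT_TO_MS = 0.514444

def harmonise_units(speed, unit):
    factor = {"m/s": 1.0, "km/h": KMH_TO_MS, "knot": KNOT_TO_MS}[unit]
    return speed * factor

def qa_flags(speed_ms, max_speed=75.0, max_constant_records=24):
    flags = pd.DataFrame(index=speed_ms.index)
    # unrealistic values and ranges
    flags["range"] = (speed_ms < 0) | (speed_ms > max_speed)
    # abnormally low variation: long runs of identical non-zero values
    run_id = (speed_ms != speed_ms.shift()).cumsum()
    run_len = speed_ms.groupby(run_id).transform("size")
    flags["constant"] = (run_len >= max_constant_records) & (speed_ms > 0)
    return flags

idx = pd.date_range("2000-01-01", periods=100, freq="H")
speed_kmh = pd.Series(np.r_[np.full(30, 18.0), np.random.rand(70) * 40], index=idx)
flags = qa_flags(harmonise_units(speed_kmh, "km/h"))
print(flags.sum())
```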
QualityML: a dictionary for quality metadata encoding
NASA Astrophysics Data System (ADS)
Ninyerola, Miquel; Sevillano, Eva; Serral, Ivette; Pons, Xavier; Zabala, Alaitz; Bastin, Lucy; Masó, Joan
2014-05-01
The scenario of rapidly growing geodata catalogues requires tools that help users choose fit-for-purpose products. Having quality fields populated in metadata allows users to rank and then select the best fit-for-purpose products. In this direction, we have developed the QualityML (http://qualityml.geoviqua.org), a dictionary that contains hierarchically structured concepts to precisely define and relate quality levels: from quality classes to quality measurements. Generically, a quality element is the path that goes from the highest level (quality class) to the lowest levels (statistics or quality metrics). This path is used to encode the quality of datasets in the corresponding metadata schemas. The benefits of having encoded quality, in the case of data producers, are related to improvements in product discovery and better transmission of product characteristics. In the case of data users, particularly decision-makers, quality and uncertainty measures support better decisions as well as dataset intercomparison. It also allows other components (such as visualization, discovery, or comparison tools) to be quality-aware and interoperable. On one hand, the QualityML is a profile of the ISO geospatial metadata standards providing a set of rules for precisely documenting quality indicator parameters that is structured in 6 levels. On the other hand, QualityML includes semantics and vocabularies for the quality concepts. Whenever possible, it uses statistical expressions from the UncertML dictionary (http://www.uncertml.org) encoding. However, it also extends UncertML to provide a list of alternative metrics that are commonly used to quantify quality. A specific example, based on a temperature dataset, is shown below. The annual mean temperature map has been validated with independent in-situ measurements to obtain a global error of 0.5 °C. Level 0: Quality class (e.g., Thematic accuracy); Level 1: Quality indicator (e.g., Quantitative attribute correctness); Level 2: Measurement field (e.g., DifferentialErrors1D); Level 3: Statistic or Metric (e.g., Half-lengthConfidenceInterval); Level 4: Units (e.g., Celsius degrees); Level 5: Value (e.g., 0.5); Level 6: Specifications, i.e. additional information on how the measurement took place, citation of the reference data, the traceability of the process and a publication describing the validation process, encoded using new 19157 elements or the GeoViQua (http://www.geoviqua.org) Quality Model (PQM-UQM) extensions to the ISO models. Finally, keep in mind that QualityML is not just suitable for encoding quality at the dataset level but also considers pixel- and object-level uncertainties. This is done by linking the metadata quality descriptions to layers representing not just the data but also the uncertainty values associated with each geospatial element.
An Improved Incremental Learning Approach for KPI Prognosis of Dynamic Fuel Cell System.
Yin, Shen; Xie, Xiaochen; Lam, James; Cheung, Kie Chung; Gao, Huijun
2016-12-01
The key performance indicator (KPI) has an important practical value with respect to the product quality and economic benefits for modern industry. To cope with the KPI prognosis issue under nonlinear conditions, this paper presents an improved incremental learning approach based on available process measurements. The proposed approach takes advantage of the algorithm overlapping of locally weighted projection regression (LWPR) and partial least squares (PLS), implementing the PLS-based prognosis in each locally linear model produced by the incremental learning process of LWPR. The global prognosis results including KPI prediction and process monitoring are obtained from the corresponding normalized weighted means of all the local models. The statistical indicators for prognosis are enhanced as well by the design of novel KPI-related and KPI-unrelated statistics with suitable control limits for non-Gaussian data. For application-oriented purpose, the process measurements from real datasets of a proton exchange membrane fuel cell system are employed to demonstrate the effectiveness of KPI prognosis. The proposed approach is finally extended to a long-term voltage prediction for potential reference of further fuel cell applications.
Topographic ERP analyses: a step-by-step tutorial review.
Murray, Micah M; Brunet, Denis; Michel, Christoph M
2008-06-01
In this tutorial review, we detail both the rationale for as well as the implementation of a set of analyses of surface-recorded event-related potentials (ERPs) that uses the reference-free spatial (i.e. topographic) information available from high-density electrode montages to render statistical information concerning modulations in response strength, latency, and topography both between and within experimental conditions. In these and other ways these topographic analysis methods allow the experimenter to glean additional information and neurophysiologic interpretability beyond what is available from canonical waveform analyses. In this tutorial we present the example of somatosensory evoked potentials (SEPs) in response to stimulation of each hand to illustrate these points. For each step of these analyses, we provide the reader with both a conceptual and mathematical description of how the analysis is carried out, what it yields, and how to interpret its statistical outcome. We show that these topographic analysis methods are intuitive and easy-to-use approaches that can remove much of the guesswork often confronting ERP researchers and also assist in identifying the information contained within high-density ERP datasets.
DISSCO: direct imputation of summary statistics allowing covariates
Xu, Zheng; Duan, Qing; Yan, Song; Chen, Wei; Li, Mingyao; Lange, Ethan; Li, Yun
2015-01-01
Background: Imputation of individual level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual level genotype data are not available. Two methods (DIST and ImpG-Summary/LD), that assume a multivariate Gaussian distribution for the association summary statistics, have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates. Methods: We analytically show that in the absence of covariates, correlation among association summary statistics is indeed the same as that among the corresponding genotypes, thus serving as a theoretical justification for the recently proposed methods. We continue to prove that in the presence of covariates, correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO). Results: We consider two real-life scenarios where the correlation and partial correlation likely make practical difference: (i) association studies in admixed populations; (ii) association studies in presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9–15.2% for variants with minor allele frequency <5%. Availability and implementation: http://www.unc.edu/∼yunmli/DISSCO. Contact: yunli@med.unc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25810429
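A minimal sketch of the multivariate-Gaussian step that DIST/ImpG-type methods share and that DISSCO extends: impute the z-scores at untyped markers as the conditional mean given the typed markers and a correlation matrix (in DISSCO, a partial correlation matrix controlling for covariates). Matrix names, the ridge term and the toy numbers are assumptions for illustration, not the DISSCO software.

```python
# Illustrative summary-statistic imputation: conditional mean of a
# multivariate normal, z_untyped = R_ut R_tt^{-1} z_typed.
import numpy as np

def impute_z(z_typed, R_ut, R_tt, ridge=1e-3):
    """z_typed: (t,) observed z-scores at typed SNPs.
    R_ut: (u, t) correlations between untyped and typed SNPs.
    R_tt: (t, t) correlations among typed SNPs."""
    R_tt_reg = R_tt + ridge * np.eye(R_tt.shape[0])   # regularise for stability
    W = R_ut @ np.linalg.inv(R_tt_reg)                # conditional-mean weights
    z_imputed = W @ z_typed
    r2 = np.einsum("ij,ij->i", W, R_ut)               # crude imputation-quality measure
    return z_imputed, r2

R_tt = np.array([[1.0, 0.6], [0.6, 1.0]])
R_ut = np.array([[0.8, 0.5]])
print(impute_z(np.array([3.2, 2.1]), R_ut, R_tt))
```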
DISSCO: direct imputation of summary statistics allowing covariates.
Xu, Zheng; Duan, Qing; Yan, Song; Chen, Wei; Li, Mingyao; Lange, Ethan; Li, Yun
2015-08-01
Imputation of individual level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual level genotype data are not available. Two methods (DIST and ImpG-Summary/LD), that assume a multivariate Gaussian distribution for the association summary statistics, have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates. We analytically show that in the absence of covariates, correlation among association summary statistics is indeed the same as that among the corresponding genotypes, thus serving as a theoretical justification for the recently proposed methods. We continue to prove that in the presence of covariates, correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO). We consider two real-life scenarios where the correlation and partial correlation likely make practical difference: (i) association studies in admixed populations; (ii) association studies in presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9-15.2% for variants with minor allele frequency <5%. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Wide-Open: Accelerating public data release by automating detection of overdue datasets
Poon, Hoifung; Howe, Bill
2017-01-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
Wide-Open: Accelerating public data release by automating detection of overdue datasets.
Grechkin, Maxim; Poon, Hoifung; Howe, Bill
2017-06-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
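A sketch of the first step described above, extracting candidate dataset accessions from article text with regular expressions. The patterns and example text are assumptions; the follow-up step of querying the repository to check whether each accession is still private is omitted here.

```python
# Illustrative extraction of GEO series (GSE...) and SRA study (SRP/ERP/DRP...)
# accessions from article text.
import re

GEO_RE = re.compile(r"\bGSE\d{3,7}\b")
SRA_RE = re.compile(r"\b[SED]RP\d{5,7}\b")

def find_accessions(text):
    return {"GEO": sorted(set(GEO_RE.findall(text))),
            "SRA": sorted(set(SRA_RE.findall(text)))}

example = ("Raw data have been deposited in GEO under accession GSE12345 "
           "and in the SRA under SRP098765.")
print(find_accessions(example))
```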
Khan, Nazeer; Siddiqui, Junaid S; Baig-Ansari, Naila
2018-01-01
Background Growth charts are essential tools used by pediatricians as well as public health researchers in assessing and monitoring the well-being of pediatric populations. Development of these growth charts, especially for children above five years of age, is challenging and requires current anthropometric data and advanced statistical analysis. These growth charts are generally presented as a series of smooth centile curves. A number of modeling approaches are available for generating growth charts, and applying these to national datasets is important for generating country-specific reference growth charts. Objective To demonstrate that quantile regression (QR) is a viable statistical approach for constructing growth reference charts and to assess the applicability of the World Health Organization (WHO) 2007 growth standards to a large Pakistani population of school-going children. Methodology This is a secondary data analysis using anthropometric data of 9,515 students from a Pakistani survey conducted between 2007 and 2014 in four cities of Pakistan. Growth reference charts were created using QR as well as the LMS (Box-Cox transformation (L), the median (M), and the generalized coefficient of variation (S)) method and then compared with WHO 2007 growth standards. Results Centile values estimated by the LMS method and the QR procedure had few differences. The centile values attained from the QR procedure for BMI-for-age, weight-for-age, and height-for-age of Pakistani children were lower than the standard WHO 2007 centiles. Conclusion QR should be considered as an alternative method for developing growth charts because of its simplicity and the lack of any need to transform data. WHO 2007 standards are not suitable for Pakistani children. PMID:29632748
Iftikhar, Sundus; Khan, Nazeer; Siddiqui, Junaid S; Baig-Ansari, Naila
2018-02-02
Background Growth charts are essential tools used by pediatricians as well as public health researchers in assessing and monitoring the well-being of pediatric populations. Development of these growth charts, especially for children above five years of age, is challenging and requires current anthropometric data and advanced statistical analysis. These growth charts are generally presented as a series of smooth centile curves. A number of modeling approaches are available for generating growth charts, and applying these to national datasets is important for generating country-specific reference growth charts. Objective To demonstrate that quantile regression (QR) is a viable statistical approach for constructing growth reference charts and to assess the applicability of the World Health Organization (WHO) 2007 growth standards to a large Pakistani population of school-going children. Methodology This is a secondary data analysis using anthropometric data of 9,515 students from a Pakistani survey conducted between 2007 and 2014 in four cities of Pakistan. Growth reference charts were created using QR as well as the LMS (Box-Cox transformation (L), the median (M), and the generalized coefficient of variation (S)) method and then compared with WHO 2007 growth standards. Results Centile values estimated by the LMS method and the QR procedure had few differences. The centile values attained from the QR procedure for BMI-for-age, weight-for-age, and height-for-age of Pakistani children were lower than the standard WHO 2007 centiles. Conclusion QR should be considered as an alternative method for developing growth charts because of its simplicity and the lack of any need to transform data. WHO 2007 standards are not suitable for Pakistani children.
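A minimal sketch of centile estimation by quantile regression with statsmodels' QuantReg, the general approach advocated above. The cubic-polynomial model in age and the simulated anthropometry are assumptions for illustration, not the paper's exact specification.

```python
# Illustrative QR centile curves: fit the 3rd, 50th and 97th height-for-age
# centiles with quantile regression on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
age = rng.uniform(5, 15, 3000)
height = 80 + 6 * age + rng.normal(0, 4 + 0.3 * age)   # fake anthropometric data
df = pd.DataFrame({"age": age, "height": height})

model = smf.quantreg("height ~ age + I(age**2) + I(age**3)", df)
grid = pd.DataFrame({"age": np.linspace(5, 15, 11)})
centiles = {}
for q in (0.03, 0.50, 0.97):
    fit = model.fit(q=q)
    centiles[q] = np.asarray(fit.predict(grid))

print(pd.DataFrame(centiles, index=grid["age"]).round(1))
```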
Muenzing, Sascha E A; van Ginneken, Bram; Viergever, Max A; Pluim, Josien P W
2014-04-01
We introduce a boosting algorithm to improve on existing methods for deformable image registration (DIR). The proposed DIRBoost algorithm is inspired by the theory on hypothesis boosting, well known in the field of machine learning. DIRBoost utilizes a method for automatic registration error detection to obtain estimates of local registration quality. All areas detected as erroneously registered are subjected to boosting, i.e. undergo iterative registrations by employing boosting masks on both the fixed and moving image. We validated the DIRBoost algorithm on three different DIR methods (ANTS gSyn, NiftyReg, and DROP) on three independent reference datasets of pulmonary image scan pairs. DIRBoost reduced registration errors significantly and consistently on all reference datasets for each DIR algorithm, yielding an improvement of the registration accuracy by 5-34% depending on the dataset and the registration algorithm employed. Copyright © 2014 Elsevier B.V. All rights reserved.
Boulesteix, Anne-Laure; Wilson, Rory; Hapfelmeier, Alexander
2017-09-09
The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly "evidence-based". Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of "evidence-based" statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. We suggest that benchmark studies-a method of assessment of statistical methods using real-world datasets-might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.
Wynants, Laure; Timmerman, Dirk; Verbakel, Jan Y; Testa, Antonia; Savelli, Luca; Fischerova, Daniela; Franchi, Dorella; Van Holsbeke, Caroline; Epstein, Elisabeth; Froyman, Wouter; Guerriero, Stefano; Rossi, Alberto; Fruscio, Robert; Leone, Francesco Pg; Bourne, Tom; Valentin, Lil; Van Calster, Ben
2017-09-01
Purpose: To evaluate the utility of preoperative diagnostic models for ovarian cancer based on ultrasound and/or biomarkers for referring patients to specialized oncology care. The investigated models were RMI, ROMA, and 3 models from the International Ovarian Tumor Analysis (IOTA) group [LR2, ADNEX, and the Simple Rules risk score (SRRisk)]. Experimental Design: A secondary analysis of prospectively collected data from 2 cross-sectional cohort studies was performed to externally validate diagnostic models. A total of 2,763 patients (2,403 in dataset 1 and 360 in dataset 2) from 18 centers (11 oncology centers and 7 nononcology hospitals) in 6 countries participated. Excised tissue was histologically classified as benign or malignant. The clinical utility of the preoperative diagnostic models was assessed with net benefit (NB) at a range of risk thresholds (5%-50% risk of malignancy) to refer patients to specialized oncology care. We visualized results with decision curves and generated bootstrap confidence intervals. Results: The prevalence of malignancy was 41% in dataset 1 and 40% in dataset 2. For thresholds up to 10% to 15%, RMI and ROMA had a lower NB than referring all patients. SRRisks and ADNEX demonstrated the highest NB. At a threshold of 20%, the NBs of ADNEX, SRrisks, and RMI were 0.348, 0.350, and 0.270, respectively. Results by menopausal status and type of center (oncology vs. nononcology) were similar. Conclusions: All tested IOTA methods, especially ADNEX and SRRisks, are clinically more useful than RMI and ROMA to select patients with adnexal masses for specialized oncology care. Clin Cancer Res; 23(17); 5082-90. ©2017 AACR . ©2017 American Association for Cancer Research.
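A sketch of the net-benefit calculation underlying the decision curves mentioned above: NB = TP/n - FP/n * pt/(1 - pt) at a risk threshold pt, compared with a referring-all strategy. The simulated risks and prevalence are assumptions used only to make the snippet runnable.

```python
# Illustrative decision-curve quantities: net benefit of a risk model at a
# threshold versus the net benefit of referring every patient.
import numpy as np

def net_benefit(y_true, risk, threshold):
    y_true = np.asarray(y_true, dtype=bool)
    refer = np.asarray(risk) >= threshold
    n = y_true.size
    tp = np.sum(refer & y_true)
    fp = np.sum(refer & ~y_true)
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(y_true, threshold):
    prevalence = np.mean(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

rng = np.random.default_rng(0)
y = rng.random(2000) < 0.41                   # ~41% malignancy, as in dataset 1
risk = np.clip(0.41 + 0.35 * (y - 0.41) + rng.normal(0, 0.2, y.size), 0.01, 0.99)
for pt in (0.10, 0.20, 0.30):
    print(pt, round(net_benefit(y, risk, pt), 3), round(net_benefit_treat_all(y, pt), 3))
```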
SAFE: SPARQL Federation over RDF Data Cubes with Access Control.
Khan, Yasar; Saleem, Muhammad; Mehdi, Muntazir; Hogan, Aidan; Mehmood, Qaiser; Rebholz-Schuhmann, Dietrich; Sahay, Ratnesh
2017-02-01
Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains, resources are sensitive and access to these resources is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain real-world datasets contain sensitive statistical information: strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisers, etc. Therefore, the legal and ethical concerns on (i) preserving the anonymity of patients (or clinical subjects); and (ii) respecting data ownership through access control; are key challenges faced by the data analytics community working within the HCLS domain. Likewise statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has looked to optimise federated queries over such statistical data. We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE's indexing system does not hold any data instances-it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change. We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces the source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.
A downscaling method for the assessment of local climate change
NASA Astrophysics Data System (ADS)
Bruno, E.; Portoghese, I.; Vurro, M.
2009-04-01
The use of complementary models is necessary to study the impact of climate change scenarios on the hydrological response at different space-time scales. However, the structure of GCMs is such that their spatial resolution (hundreds of kilometres) is too coarse and not adequate to describe the variability of extreme events at basin scale (Burlando and Rosso, 2002). Bridging the space-time gap between the climate scenarios and the usual scale of the inputs for hydrological prediction models is a fundamental requisite for the evaluation of climate change impacts on water resources. Since models operate a simplification of a complex reality, their results cannot be expected to fit with climate observations. Identifying local climate scenarios for impact analysis implies the definition of more detailed local scenarios by downscaling GCM or RCM results. Among the output correction methods we consider the statistical approach by Déqué (2007), referred to as a 'variable correction method', in which the correction of model outputs is obtained by a function built from the observation dataset and operating a quantile-quantile transformation (Q-Q transform). However, in the case of daily precipitation fields the Q-Q transform is not able to correct the temporal property of the model output concerning the dry-wet lacunarity process. An alternative correction method is proposed based on a stochastic description of the arrival-duration-intensity processes in coherence with the Poissonian Rectangular Pulse scheme (PRP) (Eagleson, 1972). In this proposed approach, the Q-Q transform is applied to the PRP variables derived from the daily rainfall datasets. Consequently, the corrected PRP parameters are used for the synthetic generation of statistically homogeneous rainfall time series that mimic the persistency of daily observations for the reference period. Then the PRP parameters are forced through the GCM scenarios to generate local scale rainfall records for the 21st century. The statistical parameters characterizing daily storm occurrence, storm intensity and duration needed to apply the PRP scheme are selected from the STARDEX collection of extreme indices.
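A minimal sketch of the empirical Q-Q transform that the correction builds on: map each model value onto the observed distribution by matching empirical quantiles over a reference period. In the approach described above the transform would be applied to the PRP variables (storm intensity, duration, arrival rate) rather than directly to daily precipitation; the gamma-distributed toy data are assumptions.

```python
# Illustrative empirical quantile-quantile (Q-Q) correction.
import numpy as np

def qq_transform(model_values, model_ref, obs_ref, n_quantiles=100):
    """Correct model_values using reference-period model (model_ref) and
    observed (obs_ref) samples."""
    probs = np.linspace(0.005, 0.995, n_quantiles)
    q_model = np.quantile(model_ref, probs)
    q_obs = np.quantile(obs_ref, probs)
    # map each model value from the model quantiles onto the observed quantiles
    return np.interp(model_values, q_model, q_obs)

rng = np.random.default_rng(0)
obs = rng.gamma(0.7, 6.0, 5000)            # "observed" storm intensities
mod = rng.gamma(0.9, 4.0, 5000)            # biased "model" intensities, same period
future = rng.gamma(0.9, 4.5, 1000)         # model values to be corrected
corrected = qq_transform(future, mod, obs)
print(round(obs.mean(), 2), round(future.mean(), 2), round(corrected.mean(), 2))
```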
Data-driven probability concentration and sampling on manifold
DOE Office of Scientific and Technical Information (OSTI.GOV)
Soize, C., E-mail: christian.soize@univ-paris-est.fr; Ghanem, R., E-mail: ghanem@usc.edu
2016-09-15
A new methodology is proposed for generating realizations of a random vector with values in a finite-dimensional Euclidean space that are statistically consistent with a dataset of observations of this vector. The probability distribution of this random vector, while a priori not known, is presumed to be concentrated on an unknown subset of the Euclidean space. A random matrix is introduced whose columns are independent copies of the random vector and for which the number of columns is the number of data points in the dataset. The approach is based on the use of (i) the multidimensional kernel-density estimation method for estimating the probability distribution of the random matrix, (ii) a MCMC method for generating realizations for the random matrix, (iii) the diffusion-maps approach for discovering and characterizing the geometry and the structure of the dataset, and (iv) a reduced-order representation of the random matrix, which is constructed using the diffusion-maps vectors associated with the first eigenvalues of the transition matrix relative to the given dataset. The convergence aspects of the proposed methodology are analyzed and a numerical validation is explored through three applications of increasing complexity. The proposed method is found to be robust to noise levels and data complexity as well as to the intrinsic dimension of data and the size of experimental datasets. Both the methodology and the underlying mathematical framework presented in this paper contribute new capabilities and perspectives at the interface of uncertainty quantification, statistical data analysis, stochastic modeling and associated statistical inverse problems.
An empirical understanding of triple collocation evaluation measure
NASA Astrophysics Data System (ADS)
Scipal, Klaus; Doubkova, Marcela; Hegyova, Alena; Dorigo, Wouter; Wagner, Wolfgang
2013-04-01
The triple collocation method is an advanced evaluation method that has been used in the soil moisture field for only about half a decade. The method requires three datasets with an independent error structure that represent an identical phenomenon. The main advantages of the method are that it (a) does not require a reference dataset that has to be considered to represent the truth, (b) limits the effect of random and systematic errors of the other two datasets, and (c) simultaneously assesses the error of all three datasets. The objective of this presentation is to assess the triple collocation error (Tc) of the ASAR Global Mode Surface Soil Moisture (GM SSM) 1 km dataset and highlight problems of the method related to its ability to cancel the effect of error of ancillary datasets. In particular, the goal is (a) to investigate trends in Tc related to the change in spatial resolution from 5 to 25 km, (b) to investigate trends in Tc related to the choice of a hydrological model, and (c) to study the relationship between Tc and other absolute evaluation methods (namely RMSE and Error Propagation EP). The triple collocation method is implemented using ASAR GM, AMSR-E, and a model (either AWRA-L, GLDAS-NOAH, or ERA-Interim). First, the significance of the relationship between the three soil moisture datasets was tested, which is a prerequisite for the triple collocation method. Second, the trends in Tc related to the choice of the third reference dataset and scale were assessed. For this purpose the triple collocation is repeated replacing AWRA-L with two different globally available model reanalysis datasets operating at different spatial resolutions (ERA-Interim and GLDAS-NOAH). Finally, the retrieved results were compared to the results of the RMSE and EP evaluation measures. Our results demonstrate that the Tc method does not eliminate the random and time-variant systematic errors of the second and the third dataset used in the Tc. The possible reasons include (a) that the TC method could not fully function with datasets acting at very different spatial resolutions, or (b) that the errors were not fully independent as initially assumed.
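A minimal sketch of the covariance-based triple collocation estimator: given three collocated series with mutually independent errors, each dataset's error variance follows from pairwise covariances. The synthetic series below stand in for ASAR GM, AMSR-E and a model; in practice the datasets are first rescaled to a common climatology, which is omitted here.

```python
# Illustrative triple collocation: estimate the RMS error of three collocated
# soil-moisture datasets assuming independent errors.
import numpy as np

def triple_collocation(x, y, z):
    c = np.cov(np.vstack([x, y, z]))
    err_x = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
    err_y = c[1, 1] - c[0, 1] * c[1, 2] / c[0, 2]
    err_z = c[2, 2] - c[0, 2] * c[1, 2] / c[0, 1]
    return np.sqrt([err_x, err_y, err_z])      # RMS errors, units of the inputs

rng = np.random.default_rng(0)
truth = rng.normal(0.25, 0.08, 3000)                    # unknown "true" soil moisture
satellite_a = truth + rng.normal(0, 0.04, truth.size)   # e.g. ASAR GM
satellite_b = truth + rng.normal(0, 0.03, truth.size)   # e.g. AMSR-E
model = truth + rng.normal(0, 0.02, truth.size)         # e.g. reanalysis model
print(triple_collocation(satellite_a, satellite_b, model))
```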
Kelder, Johannes C; Cowie, Martin R; McDonagh, Theresa A; Hardman, Suzanna M C; Grobbee, Diederick E; Cost, Bernard; Hoes, Arno W
2011-06-01
Diagnosing early stages of heart failure with mild symptoms is difficult. B-type natriuretic peptide (BNP) has promising biochemical test characteristics, but its diagnostic yield on top of readily available diagnostic knowledge has not been sufficiently quantified in early stages of heart failure. To quantify the added diagnostic value of BNP for the diagnosis of heart failure in a population relevant to GPs and validate the findings in an independent primary care patient population. Individual patient data meta-analysis followed by external validation. The additional diagnostic yield of BNP above standard clinical information was compared with ECG and chest x-ray results. Derivation was performed on two existing datasets from Hillingdon (n=127) and Rotterdam (n=149) while the UK Natriuretic Peptide Study (n=306) served as validation dataset. Included were patients with suspected heart failure referred to a rapid-access diagnostic outpatient clinic. Case definition was according to the ESC guideline. Logistic regression was used to assess discrimination (with the c-statistic) and calibration. Of the 276 patients in the derivation set, 30.8% had heart failure. The clinical model (encompassing age, gender, known coronary artery disease, diabetes, orthopnoea, elevated jugular venous pressure, crackles, pitting oedema and S3 gallop) had a c-statistic of 0.79. Adding, respectively, chest x-ray results, ECG results or BNP to the clinical model increased the c-statistic to 0.84, 0.85 and 0.92. Neither ECG nor chest x-ray added significantly to the 'clinical plus BNP' model. All models had adequate calibration. The 'clinical plus BNP' diagnostic model performed well in an independent cohort with comparable inclusion criteria (c-statistic=0.91 and adequate calibration). Using separate cut-off values for 'ruling in' (typically implying referral for echocardiography) and for 'ruling out' heart failure--creating a grey zone--resulted in insufficient proportions of patients with a correct diagnosis. BNP has considerable diagnostic value in addition to signs and symptoms in patients suspected of heart failure in primary care. However, using BNP alone with the currently recommended cut-off levels is not sufficient to make a reliable diagnosis of heart failure.
NASA Astrophysics Data System (ADS)
Melville, Bethany; Lucieer, Arko; Aryal, Jagannath
2018-04-01
This paper presents a random forest classification approach for identifying and mapping three types of lowland native grassland communities found in the Tasmanian Midlands region. Due to the high conservation priority assigned to these communities, there has been an increasing need to identify appropriate datasets that can be used to derive accurate and frequently updateable maps of community extent. Therefore, this paper proposes a method employing repeat classification and statistical significance testing as a means of identifying the most appropriate dataset for mapping these communities. Two datasets were acquired and analysed: a Landsat ETM+ scene and a WorldView-2 scene, both from 2010. Training and validation data were randomly subset using a k-fold (k = 50) approach from a pre-existing field dataset. Poa labillardierei, Themeda triandra and lowland native grassland complex communities were identified in addition to dry woodland and agriculture. For each subset of randomly allocated points, a random forest model was trained based on each dataset, and then used to classify the corresponding imagery. Validation was performed using the reciprocal points from the independent subset that had not been used to train the model. Final training and classification accuracies were reported as per class means for each satellite dataset. Analysis of Variance (ANOVA) was undertaken to determine whether classification accuracy differed between the two datasets, as well as between classifications. Results showed mean class accuracies between 54% and 87%. Class accuracy only differed significantly between datasets for the dry woodland and Themeda grassland classes, with the WorldView-2 dataset showing higher mean classification accuracies. The results of this study indicate that remote sensing is a viable method for the identification of lowland native grassland communities in the Tasmanian Midlands, and that repeat classification and statistical significance testing can be used to identify optimal datasets for vegetation community mapping.
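A sketch of the repeat-classification comparison described above: train a random forest on many random train/validation splits for each candidate dataset and test whether the mean accuracies differ with a one-way ANOVA. Synthetic features stand in for the Landsat and WorldView-2 bands, and all sizes and parameters are assumptions.

```python
# Illustrative repeat classification with a random forest, followed by ANOVA
# on the per-repeat accuracies of two candidate datasets.
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def repeated_accuracies(X, y, k=50):
    accs = []
    for seed in range(k):
        X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                                   stratify=y, random_state=seed)
        rf = RandomForestClassifier(n_estimators=100, random_state=seed)
        accs.append(accuracy_score(y_va, rf.fit(X_tr, y_tr).predict(X_va)))
    return np.array(accs)

X_a, y_a = make_classification(n_samples=800, n_features=6, n_classes=3,
                               n_informative=4, class_sep=1.0, random_state=0)
X_b, y_b = make_classification(n_samples=800, n_features=8, n_classes=3,
                               n_informative=6, class_sep=1.4, random_state=1)
acc_a, acc_b = repeated_accuracies(X_a, y_a), repeated_accuracies(X_b, y_b)
print(acc_a.mean(), acc_b.mean(), f_oneway(acc_a, acc_b))
```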
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Elger, Kirsten; Bertelmann, Roland; Klump, Jens
2016-04-01
With the foundation of DataCite in 2009 and the technical infrastructure installed in the last six years, it has become very easy to create citable dataset DOIs. Nowadays, dataset DOIs are increasingly accepted and required by journals in reference lists of manuscripts. In addition, DataCite provides usage statistics [1] of assigned DOIs and offers a public search API to make research data count. By linking related information to the data, they become more useful for future generations of scientists. For this purpose, several identifier systems, such as ISBN for books, ISSN for journals, DOI for articles or related data, Orcid for authors, and IGSN for physical samples, can be attached to DOIs using the DataCite metadata schema [2]. While these are good preconditions to publish data, free and open solutions that help with the curation of data, the publication of research data, and the assignment of DOIs in a single software package seem to be rare. At GFZ Potsdam we built a modular software stack that is made of several free and open software solutions and we established 'GFZ Data Services'. 'GFZ Data Services' provides storage, a metadata editor for publication and a facility to moderate minted DOIs. All software solutions are connected through web APIs, which makes it possible to reuse and integrate established software. The core component of 'GFZ Data Services' is an eSciDoc [3] middleware that is used as central storage, and has been designed along the OAIS reference model for digital preservation. Thus, data are stored in self-contained packages that are made of binary file-based data and XML-based metadata. The eSciDoc infrastructure provides access control to data and it is able to handle half-open datasets, which is useful in embargo situations when a subset of the research data are released after an adequate period. The data exchange platform panMetaDocs [4] makes use of eSciDoc's REST API to upload file-based data into eSciDoc and uses a metadata editor [5] to annotate the files with metadata. The metadata editor has a user-friendly interface with nominal lists, extensive explanations, and an interactive mapping tool to provide assistance to scientists describing the data. It is possible to deposit metadata templates to fill certain fields with default values. The metadata editor generates metadata in the schemas ISO19139, NASA GCMD DIF, and DataCite and could be extended for other schemas. panMetaDocs is able to mint dataset DOIs through DOIDB, which is our component to moderate dataset DOIs issued through 'GFZ Data Services'. DOIDB accepts metadata in the schemas ISO19139, DIF, and DataCite. In addition, DOIDB provides an OAI-PMH interface to disseminate all deposited metadata to data portals. The presentation of datasets on DOI landing pages is done through XSLT stylesheet transformation of the XML-based metadata. The landing pages have been designed to meet the needs of scientists. We are able to render the metadata to different layouts. Furthermore, additional information about datasets and publications is assembled into the webpage by querying public databases on the internet. The work presented here will focus on technical details of the software stack. [1] http://stats.datacite.org [2] http://www.dlib.org/dlib/january11/starr/01starr.html [3] http://www.escidoc.org [4] http://panmetadocs.sf.net [5] http://github.com/ulbricht
Jong, Victor L; Novianti, Putri W; Roes, Kit C B; Eijkemans, Marinus J C
2014-12-01
The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of two filtering methods, detection call filtering and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using Box's M statistic on permuted samples. We found that correlation structures significantly differ between datasets of the same and/or different etiological disease categories and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.
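A sketch of Box's M test for equality of covariance (here correlation) matrices across groups, with its standard chi-square approximation; the study additionally judges significance on permuted samples, which is omitted here, and the toy groups are assumptions.

```python
# Illustrative Box's M test for homogeneity of covariance matrices.
import numpy as np
from scipy.stats import chi2

def box_m_test(groups):
    """groups: list of (n_i, p) data matrices."""
    k = len(groups)
    p = groups[0].shape[1]
    ns = np.array([g.shape[0] for g in groups])
    covs = [np.cov(g, rowvar=False) for g in groups]
    pooled = sum((n - 1) * s for n, s in zip(ns, covs)) / (ns.sum() - k)
    M = (ns.sum() - k) * np.log(np.linalg.det(pooled)) \
        - sum((n - 1) * np.log(np.linalg.det(s)) for n, s in zip(ns, covs))
    c = ((2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (k - 1))) * \
        (np.sum(1.0 / (ns - 1)) - 1.0 / (ns.sum() - k))
    stat = M * (1 - c)                                  # chi-square approximation
    df = 0.5 * p * (p + 1) * (k - 1)
    return stat, chi2.sf(stat, df)

rng = np.random.default_rng(0)
g1 = rng.multivariate_normal([0, 0, 0], np.eye(3), 200)
g2 = rng.multivariate_normal([0, 0, 0], 2 * np.eye(3), 200)
print(box_m_test([g1, g2]))
```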
Sensitivity analysis of reference evapotranspiration to sensor accuracy
USDA-ARS's Scientific Manuscript database
Meteorological sensor networks are often used across agricultural regions to calculate the ASCE Standardized Reference ET Equation, and inaccuracies in individual sensors can lead to inaccuracies in ET estimates. Multiyear datasets from the semi-arid Colorado Agricultural Meteorological (CoAgMet) an...
Jiang, Wei; Yu, Weichuan
2017-02-15
In genome-wide association studies (GWASs) of common diseases/traits, we often analyze multiple GWASs with the same phenotype together to discover associated genetic variants with higher power. Since it is difficult to access data with detailed individual measurements, summary-statistics-based meta-analysis methods have become popular to jointly analyze datasets from multiple GWASs. In this paper, we propose a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). We prove that our method is the most powerful summary-statistics-based joint analysis method when controlling the false discovery rate at a certain level. In particular, the Jlfdr-based method achieves higher power than commonly used meta-analysis methods when analyzing heterogeneous datasets from multiple GWASs. Simulation experiments demonstrate the superior power of our method over meta-analysis methods. Also, our method discovers more associations than meta-analysis methods from empirical datasets of four phenotypes. The R-package is available at: http://bioinformatics.ust.hk/Jlfdr.html . eeyu@ust.hk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
The Southampton-York Natural Scenes (SYNS) dataset: Statistics of surface attitude
Adams, Wendy J.; Elder, James H.; Graf, Erich W.; Leyland, Julian; Lugtigheid, Arthur J.; Muryy, Alexander
2016-01-01
Recovering 3D scenes from 2D images is an under-constrained task; optimal estimation depends upon knowledge of the underlying scene statistics. Here we introduce the Southampton-York Natural Scenes dataset (SYNS: https://syns.soton.ac.uk), which provides comprehensive scene statistics useful for understanding biological vision and for improving machine vision systems. In order to capture the diversity of environments that humans encounter, scenes were surveyed at random locations within 25 indoor and outdoor categories. Each survey includes (i) spherical LiDAR range data (ii) high-dynamic range spherical imagery and (iii) a panorama of stereo image pairs. We envisage many uses for the dataset and present one example: an analysis of surface attitude statistics, conditioned on scene category and viewing elevation. Surface normals were estimated using a novel adaptive scale selection algorithm. Across categories, surface attitude below the horizon is dominated by the ground plane (0° tilt). Near the horizon, probability density is elevated at 90°/270° tilt due to vertical surfaces (trees, walls). Above the horizon, probability density is elevated near 0° slant due to overhead structure such as ceilings and leaf canopies. These structural regularities represent potentially useful prior assumptions for human and machine observers, and may predict human biases in perceived surface attitude. PMID:27782103
RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes.
Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa
2017-08-29
Gene expression data are exponentially accumulating; thus, the functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression pattern of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by the gene name, various types of IDs, chromosomal regions in genetic maps, gene family based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/.
RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes
Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa
2017-01-01
Gene expression data are exponentially accumulating; thus, the functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression pattern of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by the gene name, various types of IDs, chromosomal regions in genetic maps, gene family based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/. PMID:28850115
Large scale atmospheric drivers for heat waves in the Mediterranean Basin
NASA Astrophysics Data System (ADS)
Pasqui, Massimiliano; Di Giuseppe, Edmondo
2016-04-01
The West African Heat Low (WAHL) is one of the prominent dynamical components of the West African Monsoon (WAM) system and plays a key role in the summer atmospheric circulation over the Mediterranean as well. It is characterized by a semi-permanent low pressure system generated and maintained by surface heating over the western part of the Saharan desert in summer, and a divergent flux pattern above the atmospheric boundary layer. In this study we analyse the formation and occurrence of heat waves in the Mediterranean Basin connected to the WAHL regimes in combination with the subtropical anticyclone regimes over the North Atlantic basin (the "Azores High"). In this work, heat waves are defined as periods of more than 6 consecutive days with daily temperatures above the corresponding 90th-percentile threshold. We use 1971-2000 as the reference period for threshold calculation, based on two datasets: a) the European Climate Assessment & Dataset (ECAD/E-OBS) data; b) the Berkeley-Earth Project data; the analysis periods cover March-September from 1951 to 2015 and from 1951 to 2011, respectively. The WAHL index is calculated following the method proposed by Chauvin et al. (2010) and based on the NCAR/NCEP Reanalysis dataset, while the variability of the Azores High pressure system regimes is computed as in Davis et al. (1997). We show that a statistical relationship exists between heat waves in the Western and Central Mediterranean Basin and the WAHL mechanism, the latter being a prominent causal factor. The relationships and causal connections between the WAHL and Azores High atmospheric systems are also analysed to highlight potential implications for heat wave outlooks and early warning systems.
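A minimal sketch of the heat-wave definition given above: runs of more than 6 consecutive days with daily temperature above the 90th-percentile threshold of a reference period. For brevity a single overall percentile is used instead of calendar-day-specific thresholds, and the synthetic series is an assumption.

```python
# Illustrative heat-wave detection: flag runs of at least 7 consecutive days
# above the reference-period 90th percentile.
import numpy as np

def heat_wave_events(temps, ref_temps, min_length=7):
    threshold = np.percentile(ref_temps, 90)
    hot = temps > threshold
    events, start = [], None
    for i, flag in enumerate(np.append(hot, False)):   # sentinel closes the last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                events.append((start, i - 1))          # (first day, last day) indices
            start = None
    return threshold, events

rng = np.random.default_rng(0)
reference = rng.normal(24, 4, 30 * 180)        # reference-period summer temperatures
series = rng.normal(24, 4, 180)
series[60:70] += 12                            # implant a 10-day hot spell
print(heat_wave_events(series, reference))
```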
Reference datasets for bioequivalence trials in a two-group parallel design.
Fuglsang, Anders; Schütz, Helmut; Labes, Detlew
2015-03-01
In order to help companies qualify and validate the software used to evaluate bioequivalence trials with two parallel treatment groups, this work aims to define datasets with known results. This paper puts a total 11 datasets into the public domain along with proposed consensus obtained via evaluations from six different software packages (R, SAS, WinNonlin, OpenOffice Calc, Kinetica, EquivTest). Insofar as possible, datasets were evaluated with and without the assumption of equal variances for the construction of a 90% confidence interval. Not all software packages provide functionality for the assumption of unequal variances (EquivTest, Kinetica), and not all packages can handle datasets with more than 1000 subjects per group (WinNonlin). Where results could be obtained across all packages, one showed questionable results when datasets contained unequal group sizes (Kinetica). A proposal is made for the results that should be used as validation targets.
P-MartCancer-Interactive Online Software to Enable Analysis of Shotgun Cancer Proteomic Datasets.
Webb-Robertson, Bobbie-Jo M; Bramer, Lisa M; Jensen, Jeffrey L; Kobold, Markus A; Stratton, Kelly G; White, Amanda M; Rodland, Karin D
2017-11-01
P-MartCancer is an interactive web-based software environment that enables statistical analyses of peptide or protein data, quantitated from mass spectrometry-based global proteomics experiments, without requiring in-depth knowledge of statistical programming. P-MartCancer offers a series of statistical modules associated with quality assessment, peptide and protein statistics, protein quantification, and exploratory data analyses driven by the user via customized workflows and interactive visualization. Currently, P-MartCancer offers access and the capability to analyze multiple cancer proteomic datasets generated through the Clinical Proteomics Tumor Analysis Consortium at the peptide, gene, and protein levels. P-MartCancer is deployed as a web service (https://pmart.labworks.org/cptac.html), alternatively available via Docker Hub (https://hub.docker.com/r/pnnl/pmart-web/). Cancer Res; 77(21); e47-50. ©2017 AACR . ©2017 American Association for Cancer Research.
A Synergy Cropland of China by Fusing Multiple Existing Maps and Statistics.
Lu, Miao; Wu, Wenbin; You, Liangzhi; Chen, Di; Zhang, Li; Yang, Peng; Tang, Huajun
2017-07-12
Accurate information on cropland extent is critical for scientific research and resource management. Several cropland products from remotely sensed datasets are available. Nevertheless, significant inconsistency exists among these products and the cropland areas estimated from these products differ considerably from statistics. In this study, we propose a hierarchical optimization synergy approach (HOSA) to develop a hybrid cropland map of China, circa 2010, by fusing five existing cropland products, i.e., GlobeLand30, Climate Change Initiative Land Cover (CCI-LC), GlobCover 2009, MODIS Collection 5 (MODIS C5), and MODIS Cropland, and sub-national statistics of cropland area. HOSA simplifies the widely used method of score assignment into two steps, including determination of optimal agreement level and identification of the best product combination. The accuracy assessment indicates that the synergy map has higher accuracy of spatial locations and better consistency with statistics than the five existing datasets individually. This suggests that the synergy approach can improve the accuracy of cropland mapping and enhance consistency with statistics.
Spatiotemporal dataset on Chinese population distribution and its driving factors from 1949 to 2013.
Wang, Lizhe; Chen, Lajiao
2016-07-05
Spatio-temporal data on human population and its driving factors is critical to understanding and responding to population problems. Unfortunately, such spatio-temporal data on a large scale and over the long term are often difficult to obtain. Here, we present a dataset on Chinese population distribution and its driving factors over a remarkably long period, from 1949 to 2013. Driving factors of population distribution were selected according to the push-pull migration laws, which were summarized into four categories: natural environment, natural resources, economic factors and social factors. Natural environment and natural resources indicators were calculated using Geographic Information System (GIS) and Remote Sensing (RS) techniques, whereas economic and social factors from 1949 to 2013 were collected from the China Statistical Yearbook and China Compendium of Statistics from 1949 to 2008. All of the data were quality controlled and unified into an identical dataset with the same spatial scope and time period. The dataset is expected to be useful for understanding how population responds to and impacts environmental change.
Spatiotemporal dataset on Chinese population distribution and its driving factors from 1949 to 2013
NASA Astrophysics Data System (ADS)
Wang, Lizhe; Chen, Lajiao
2016-07-01
Spatio-temporal data on human population and its driving factors is critical to understanding and responding to population problems. Unfortunately, such spatio-temporal data on a large scale and over the long term are often difficult to obtain. Here, we present a dataset on Chinese population distribution and its driving factors over a remarkably long period, from 1949 to 2013. Driving factors of population distribution were selected according to the push-pull migration laws, which were summarized into four categories: natural environment, natural resources, economic factors and social factors. Natural environment and natural resources indicators were calculated using Geographic Information System (GIS) and Remote Sensing (RS) techniques, whereas economic and social factors from 1949 to 2013 were collected from the China Statistical Yearbook and China Compendium of Statistics from 1949 to 2008. All of the data were quality controlled and unified into an identical dataset with the same spatial scope and time period. The dataset is expected to be useful for understanding how population responds to and impacts environmental change.
A Comparative Analysis of Five Cropland Datasets in Africa
NASA Astrophysics Data System (ADS)
Wei, Y.; Lu, M.; Wu, W.
2018-04-01
Food security, particularly in Africa, is a challenge that remains to be resolved. Cropland area and spatial distribution obtained from remote sensing imagery are vital information. In this paper, we compare five global cropland datasets for Africa circa 2010 (CCI Land Cover, GlobCover, MODIS Collection 5, GlobeLand30 and Unified Cropland) in terms of cropland area and spatial location. The accuracy of the cropland area calculated from the five datasets was analyzed against statistical data. Based on validation samples, the accuracies of spatial location for the five cropland products were assessed with an error matrix. The results show that GlobeLand30 agrees best with the statistics, followed by MODIS Collection 5 and Unified Cropland, while GlobCover and CCI Land Cover have lower accuracies. For the accuracy of the spatial location of cropland, GlobeLand30 reaches the highest accuracy, followed by Unified Cropland, MODIS Collection 5 and GlobCover, while CCI Land Cover has the lowest accuracy. The spatial location accuracy of the five datasets is generally higher in the Csa zone, which has suitable farming conditions, than in the Bsk zone.
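A sketch of the error-matrix assessment mentioned above: build a confusion matrix from validation samples and compute overall, producer's and user's accuracy. Class labels and sample counts are assumptions for illustration only.

```python
# Illustrative error-matrix (confusion-matrix) assessment for a cropland map.
import numpy as np
from sklearn.metrics import confusion_matrix

def error_matrix_report(reference, mapped, labels=("cropland", "non-cropland")):
    cm = confusion_matrix(reference, mapped, labels=list(labels))  # rows: reference
    overall = np.trace(cm) / cm.sum()
    producers = np.diag(cm) / cm.sum(axis=1)   # complement of omission error
    users = np.diag(cm) / cm.sum(axis=0)       # complement of commission error
    return cm, overall, dict(zip(labels, producers)), dict(zip(labels, users))

reference = ["cropland"] * 60 + ["non-cropland"] * 140
mapped = (["cropland"] * 48 + ["non-cropland"] * 12 +
          ["cropland"] * 20 + ["non-cropland"] * 120)
print(error_matrix_report(reference, mapped))
```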
Removal of BCG artefact from concurrent fMRI-EEG recordings based on EMD and PCA.
Javed, Ehtasham; Faye, Ibrahima; Malik, Aamir Saeed; Abdullah, Jafri Malin
2017-11-01
Simultaneous electroencephalography (EEG) and functional magnetic resonance image (fMRI) acquisitions provide better insight into brain dynamics. Some artefacts due to simultaneous acquisition pose a threat to the quality of the data. One such problematic artefact is the ballistocardiogram (BCG) artefact. We developed a hybrid algorithm that combines features of empirical mode decomposition (EMD) with principal component analysis (PCA) to reduce the BCG artefact. The algorithm does not require extra electrocardiogram (ECG) or electrooculogram (EOG) recordings to extract the BCG artefact. The method was tested with both simulated and real EEG data of 11 participants. From the simulated data, the similarity index between the extracted BCG and the simulated BCG showed the effectiveness of the proposed method in BCG removal. On the other hand, real data were recorded with two conditions, i.e. resting state (eyes closed dataset) and task influenced (event-related potentials (ERPs) dataset). Using qualitative (visual inspection) and quantitative (similarity index, improved normalized power spectrum (INPS) ratio, power spectrum, sample entropy (SE)) evaluation parameters, the assessment results showed that the proposed method can efficiently reduce the BCG artefact while preserving the neuronal signals. Compared with conventional methods, namely, average artefact subtraction (AAS), optimal basis set (OBS) and combined independent component analysis and principal component analysis (ICA-PCA), the statistical analyses of the results showed that the proposed method has better performance, and the differences were significant for all quantitative parameters except for the power and sample entropy. The proposed method does not require any reference signal, prior information or assumption to extract the BCG artefact. It will be very useful in circumstances where the reference signal is not available. Copyright © 2017 Elsevier B.V. All rights reserved.
Billing code algorithms to identify cases of peripheral artery disease from administrative data
Fan, Jin; Arruda-Olson, Adelaide M; Leibson, Cynthia L; Smith, Carin; Liu, Guanghui; Bailey, Kent R; Kullo, Iftikhar J
2013-01-01
Objective To construct and validate billing code algorithms for identifying patients with peripheral arterial disease (PAD). Methods We extracted all encounters and line item details including PAD-related billing codes at Mayo Clinic Rochester, Minnesota, between July 1, 1997 and June 30, 2008; 22 712 patients evaluated in the vascular laboratory were divided into training and validation sets. Multiple logistic regression analysis was used to create an integer code score from the training dataset, and this was tested in the validation set. We applied a model-based code algorithm to patients evaluated in the vascular laboratory and compared this with a simpler algorithm (presence of at least one of the ICD-9 PAD codes 440.20–440.29). We also applied both algorithms to a community-based sample (n=4420), followed by a manual review. Results The logistic regression model performed well in both training and validation datasets (c statistic=0.91). In patients evaluated in the vascular laboratory, the model-based code algorithm provided better negative predictive value. The simpler algorithm was reasonably accurate for identification of PAD status, with lesser sensitivity and greater specificity. In the community-based sample, the sensitivity (38.7% vs 68.0%) of the simpler algorithm was much lower, whereas the specificity (92.0% vs 87.6%) was higher than the model-based algorithm. Conclusions A model-based billing code algorithm had reasonable accuracy in identifying PAD cases from the community, and in patients referred to the non-invasive vascular laboratory. The simpler algorithm had reasonable accuracy for identification of PAD in patients referred to the vascular laboratory but was significantly less sensitive in a community-based sample. PMID:24166724
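As a rough illustration of the model-based code-score approach described above, the following sketch fits a logistic regression to synthetic billing-code counts and evaluates it with the c statistic (ROC AUC), alongside a "simple rule" comparator. The features, simulated outcomes and thresholds are invented placeholders, not the study's actual codes or model.

```python
# Hypothetical sketch of a billing-code classifier evaluated by the c statistic
# (ROC AUC), in the spirit of the PAD algorithm described above. Feature names,
# data, and thresholds are illustrative, not the authors' actual model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
# X[:, j] = count of billing code j (e.g. an ICD-9 440.2x-style family) per patient
X = rng.poisson(lam=[0.2, 0.5, 0.1, 0.3], size=(n, 4))
# Simulated PAD status loosely driven by the code counts
logit = -2.0 + 1.2 * X[:, 0] + 0.8 * X[:, 1] + 0.3 * X[:, 3]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# c statistic on the validation set (the study reports c = 0.91 for its model)
c_stat = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

# A simpler rule-based algorithm: "at least one PAD code present"
simple_pred = (X_va[:, 0] > 0).astype(int)
print(f"model c statistic: {c_stat:.2f}")
print(f"simple-rule sensitivity: {simple_pred[y_va == 1].mean():.2f}")
```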
Lakha, F; Gorman, D R; Mateos, P
2011-10-01
Health inequalities between ethnic minorities and the general population are persistent. Addressing them is hampered by the inability to classify individuals' ethnicity accurately. This is addressed by a new name-based ethnicity classification methodology called 'Onomap'. This paper evaluates the diagnostic accuracy of Onomap in identifying population groups by ethnicity, and discusses applications to public health practice. Onomap was applied to three independent reference datasets (birth registration, pupil census and register of Polish health professionals) collected in Britain and Poland at individual level (n = 260,748). Results were compared with the reference database ethnicity 'gold standard'. Outcome measures included sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). Ninety-five percent confidence intervals and Chi-squared tests were used. Onomap identified the majority of those in the British participant group with high sensitivity and PPV (>95%), and low misclassification (<5%), although specificity and NPV were lowest in this group (56-87%). Outcome measures for all other non-British groupings were high for specificity and NPV (>98%), but variable for sensitivity and PPV (17-89%). Differences in misclassification by gender were statistically significant. Using maiden name rather than married name in women improved classification outcomes for those born in the British Isles (0.53%, 95% confidence interval 0.26-0.8%; P < 0.001) but not for South Asian or Polish groups. Onomap offers an effective methodology for identifying population groups in both health-related and educational datasets, categorizing populations into a variety of ethnic groups. This evaluation suggests that it can successfully assist health researchers, planners and policy makers in identifying and addressing health inequalities. Copyright © 2011 The Royal Society for Public Health. Published by Elsevier Ltd. All rights reserved.
Atlas-guided prostate intensity modulated radiation therapy (IMRT) planning.
Sheng, Yang; Li, Taoran; Zhang, You; Lee, W Robert; Yin, Fang-Fang; Ge, Yaorong; Wu, Q Jackie
2015-09-21
An atlas-based IMRT planning technique for prostate cancer was developed and evaluated. A multi-dose atlas was built based on the anatomy patterns of the patients, more specifically, the percent distance to the prostate and the concaveness angle formed by the seminal vesicles relative to the anterior-posterior axis. A 70-case dataset was classified using a k-medoids clustering analysis to recognize anatomy pattern variations in the dataset. The best classification, defined by the number of classes or medoids, was determined by the largest value of the average silhouette width. Reference plans from each class formed a multi-dose atlas. The atlas-guided planning (AGP) technique started with matching the new case anatomy pattern to one of the reference cases in the atlas; then a deformable registration between the atlas and new case anatomies transferred the dose from the atlas to the new case to guide inverse planning with full automation. Twenty additional clinical cases were re-planned to evaluate the AGP technique, and the dosimetric properties of the AGP and clinical plans were compared. The classification analysis determined that a 5-case atlas would best represent anatomy patterns for the patient cohort. AGP took approximately 1 min on average (corresponding to 70 iterations of optimization) for all cases. When dosimetric parameters were compared, the differences between AGP and clinical plans were less than 3.5%, although statistical significance was reached for some parameters: homogeneity index (p > 0.05), conformity index (p < 0.01), bladder gEUD (p < 0.01), and rectum gEUD (p = 0.02). Atlas-guided treatment planning is feasible and efficient. Atlas-predicted dose can effectively guide the optimizer to achieve plan quality comparable to that of clinical plans.
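The following sketch illustrates the model-selection step described above: choosing the number of classes by the largest average silhouette width. The synthetic two-dimensional "anatomy features" stand in for percent distance to prostate and concaveness angle, and KMeans stands in for the k-medoids clustering used in the study (scikit-learn has no built-in k-medoids), so this is an assumption-laden sketch rather than the authors' pipeline.

```python
# Illustrative sketch of picking the best number of anatomy classes by the
# largest average silhouette width. Data and clustering algorithm are stand-ins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# three loose synthetic clusters of 70 "cases"
centers = np.array([[0.2, 10], [0.5, 25], [0.8, 40]])
features = np.vstack([c + rng.normal(scale=[0.05, 3], size=(24, 2)) for c in centers])[:70]

best_k, best_width = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    width = silhouette_score(features, labels)
    if width > best_width:
        best_k, best_width = k, width

print(f"best number of classes: {best_k} (average silhouette width {best_width:.2f})")
```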
Wu, Abraham J; Bosch, Walter R; Chang, Daniel T; Hong, Theodore S; Jabbour, Salma K; Kleinberg, Lawrence R; Mamon, Harvey J; Thomas, Charles R; Goodman, Karyn A
2015-07-15
Current guidelines for esophageal cancer contouring are derived from traditional 2-dimensional fields based on bony landmarks, and they do not provide sufficient anatomic detail to ensure consistent contouring for more conformal radiation therapy techniques such as intensity modulated radiation therapy (IMRT). Therefore, we convened an expert panel with the specific aim to derive contouring guidelines and generate an atlas for the clinical target volume (CTV) in esophageal or gastroesophageal junction (GEJ) cancer. Eight expert academically based gastrointestinal radiation oncologists participated. Three sample cases were chosen: a GEJ cancer, a distal esophageal cancer, and a mid-upper esophageal cancer. Uniform computed tomographic (CT) simulation datasets and accompanying diagnostic positron emission tomographic/CT images were distributed to each expert, and the expert was instructed to generate gross tumor volume (GTV) and CTV contours for each case. All contours were aggregated and subjected to quantitative analysis to assess the degree of concordance between experts and to generate draft consensus contours. The panel then refined these contours to generate the contouring atlas. The κ statistics indicated substantial agreement between panelists for each of the 3 test cases. A consensus CTV atlas was generated for the 3 test cases, each representing common anatomic presentations of esophageal cancer. The panel agreed on guidelines and principles to facilitate the generalizability of the atlas to individual cases. This expert panel successfully reached agreement on contouring guidelines for esophageal and GEJ IMRT and generated a reference CTV atlas. This atlas will serve as a reference for IMRT contours for clinical practice and prospective trial design. Subsequent patterns of failure analyses of clinical datasets using these guidelines may require modification in the future. Copyright © 2015 Elsevier Inc. All rights reserved.
Expert consensus contouring guidelines for IMRT in esophageal and gastroesophageal junction cancer
Wu, Abraham J.; Bosch, Walter R.; Chang, Daniel T.; Hong, Theodore S.; Jabbour, Salma K.; Kleinberg, Lawrence R.; Mamon, Harvey J.; Thomas, Charles R.; Goodman, Karyn A.
2015-01-01
Purpose/Objective(s) Current guidelines for esophageal cancer contouring are derived from traditional two-dimensional fields based on bony landmarks, and do not provide sufficient anatomical detail to ensure consistent contouring for more conformal radiotherapy techniques such as intensity-modulated radiation therapy (IMRT). Therefore, we convened an expert panel with the specific aim to derive contouring guidelines and generate an atlas for the clinical target volume (CTV) in esophageal or gastroesophageal junction (GEJ) cancer. Methods and Materials Eight expert academically-based gastrointestinal radiation oncologists participated. Three sample cases were chosen: a GEJ cancer, a distal esophageal cancer, and a mid-upper esophageal cancer. Uniform CT simulation datasets and an accompanying diagnostic PET-CT were distributed to each expert, and he/she was instructed to generate gross tumor volume (GTV) and CTV contours for each case. All contours were aggregated and subjected to quantitative analysis to assess the degree of concordance between experts and generate draft consensus contours. The panel then refined these contours to generate the contouring atlas. Results Kappa statistics indicated substantial agreement between panelists for each of the three test cases. A consensus CTV atlas was generated for the three test cases, each representing common anatomic presentations of esophageal cancer. The panel agreed on guidelines and principles to facilitate the generalizability of the atlas to individual cases. Conclusions This expert panel successfully reached agreement on contouring guidelines for esophageal and GEJ IMRT and generated a reference CTV atlas. This atlas will serve as a reference for IMRT contours for clinical practice and prospective trial design. Subsequent patterns of failure analyses of clinical datasets utilizing these guidelines may require modification in the future. PMID:26104943
EmailTime: visual analytics and statistics for temporal email
NASA Astrophysics Data System (ADS)
Erfani Joorabchi, Minoo; Yim, Ji-Dong; Shaw, Christopher D.
2011-01-01
Although the discovery and analysis of communication patterns in large and complex email datasets are difficult tasks, they can be a valuable source of information. We present EmailTime, a visual analysis tool for email correspondence patterns over the course of time that interactively portrays personal and interpersonal networks using the correspondence in the email dataset. Our approach is to treat time as the primary variable of interest and to plot emails along a time line. EmailTime helps email dataset explorers interpret archived messages by providing zooming, panning, filtering and highlighting. To support analysis, it also measures and visualizes histograms, graph centrality and frequency on the communication graph that can be induced from the email collection. This paper describes EmailTime's capabilities, along with a large case study of the Enron email dataset that explores the behaviors of email users in different organizational positions from January 2000 to December 2001. We defined email behavior as the email activity level of people according to a series of measured metrics, e.g. sent and received emails and numbers of email addresses. These metrics were calculated through EmailTime. Results showed specific patterns in the use of email within different organizational positions. We suggest that integrating statistics and visualizations to display information about email datasets may simplify their evaluation.
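A minimal sketch of the kind of statistics EmailTime computes on the communication graph is shown below: per-user send/receive counts, monthly activity, and degree centrality. The toy email records are invented; a real analysis would parse an archive such as the Enron corpus.

```python
# Per-user activity counts and centrality on the sender -> recipient graph.
from collections import Counter
import networkx as nx

emails = [  # (sender, recipient, month) -- made-up records for illustration
    ("alice", "bob", "2000-01"),
    ("alice", "carol", "2000-01"),
    ("bob", "alice", "2000-02"),
    ("carol", "bob", "2000-02"),
    ("carol", "alice", "2000-03"),
]

sent = Counter(s for s, _, _ in emails)
received = Counter(r for _, r, _ in emails)
per_month = Counter(m for _, _, m in emails)

G = nx.DiGraph()
G.add_edges_from((s, r) for s, r, _ in emails)
centrality = nx.degree_centrality(G)

print("sent:", dict(sent))
print("received:", dict(received))
print("emails per month:", dict(per_month))
print("degree centrality:", centrality)
```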
On the comparison of the strength of morphological integration across morphometric datasets.
Adams, Dean C; Collyer, Michael L
2016-11-01
Evolutionary morphologists frequently wish to understand the extent to which organisms are integrated, and whether the strength of morphological integration among subsets of phenotypic variables differ among taxa or other groups. However, comparisons of the strength of integration across datasets are difficult, in part because the summary measures that characterize these patterns (RV coefficient and rPLS) are dependent both on sample size and on the number of variables. As a solution to this issue, we propose a standardized test statistic (a z-score) for measuring the degree of morphological integration between sets of variables. The approach is based on a partial least squares analysis of trait covariation, and its permutation-based sampling distribution. Under the null hypothesis of a random association of variables, the method displays a constant expected value and confidence intervals for datasets of differing sample sizes and variable number, thereby providing a consistent measure of integration suitable for comparisons across datasets. A two-sample test is also proposed to statistically determine whether levels of integration differ between datasets, and an empirical example examining cranial shape integration in Mediterranean wall lizards illustrates its use. Some extensions of the procedure are also discussed. © 2016 The Author(s). Evolution © 2016 The Society for the Study of Evolution.
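The sketch below illustrates the core idea of a permutation-based z-score for between-block association. The statistic used here (first singular value of the cross-covariance matrix, as a simple rPLS analogue) and the simulated trait blocks are assumptions for illustration, not the authors' exact implementation.

```python
# Illustrative permutation z-score for the association between two trait blocks.
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 40, 6, 5
X = rng.normal(size=(n, p))
Y = 0.5 * X[:, :q] + rng.normal(scale=1.0, size=(n, q))  # correlated blocks

def pls_stat(X, Y):
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    cov = Xc.T @ Yc / (len(X) - 1)
    return np.linalg.svd(cov, compute_uv=False)[0]  # first singular value

observed = pls_stat(X, Y)
perms = np.array([pls_stat(X, Y[rng.permutation(n)]) for _ in range(999)])

# z-score of the observed statistic against its permutation distribution
z = (observed - perms.mean()) / perms.std(ddof=1)
p_value = (np.sum(perms >= observed) + 1) / (len(perms) + 1)
print(f"observed = {observed:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```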
Nesvizhskii, Alexey I.
2013-01-01
Analysis of protein interaction networks and protein complexes using affinity purification and mass spectrometry (AP/MS) is among the most commonly used and successful applications of proteomics technologies. One of the foremost challenges of AP/MS data is the large number of false-positive protein interactions present in unfiltered datasets. Here we review computational and informatics strategies for detecting specific protein interaction partners in AP/MS experiments, with a focus on incomplete (as opposed to genome-wide) interactome mapping studies. These strategies range from standard statistical approaches, to empirical scoring schemes optimized for a particular type of data, to advanced computational frameworks. The common denominator among these methods is the use of label-free quantitative information such as spectral counts or integrated peptide intensities that can be extracted from AP/MS data. We also discuss related issues such as combining multiple biological or technical replicates, and dealing with data generated using different tagging strategies. Computational approaches for benchmarking of scoring methods are discussed, and the need for generation of reference AP/MS datasets is highlighted. Finally, we discuss the possibility of more extended modeling of experimental AP/MS data, including integration with external information such as protein interaction predictions based on functional genomics data. PMID:22611043
Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples
White, James Robert; Nagarajan, Niranjan; Pop, Mihai
2009-01-01
Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries is the availability of computational tools that can rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/. PMID:19360128
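A toy sketch of the two ingredients mentioned in the abstract, Fisher's exact test for sparsely sampled features and false discovery rate control across features, is given below. The counts are invented and this is not the Metastats implementation itself.

```python
# Fisher's exact test per feature plus Benjamini-Hochberg FDR control.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# feature -> (count in group A, total reads A, count in group B, total reads B)
features = {
    "OTU_1": (3, 1000, 15, 1000),
    "OTU_2": (0, 1000, 1, 1000),
    "OTU_3": (40, 1000, 38, 1000),
}

pvals = []
for name, (a, na, b, nb) in features.items():
    table = [[a, na - a], [b, nb - b]]
    _, p = fisher_exact(table)
    pvals.append(p)

reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for (name, _), p, q, r in zip(features.items(), pvals, qvals, reject):
    print(f"{name}: p = {p:.4f}, q = {q:.4f}, differentially abundant: {r}")
```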
NASA Astrophysics Data System (ADS)
Ekenes, K.
2017-12-01
This presentation will outline the process of creating a web application for exploring large amounts of scientific geospatial data using modern automated cartographic techniques. Traditional cartographic methods, including data classification, may inadvertently hide geospatial and statistical patterns in the underlying data. This presentation demonstrates how to use smart web APIs that quickly analyze the data when they load and suggest the most appropriate visualizations based on the statistics of the data. Since there are only a few ways to visualize any given dataset well, and since many users never go beyond default values, it is imperative to provide smart default color schemes tailored to the dataset rather than static defaults. Multiple functions for automating visualizations are available in the Smart APIs, along with UI elements that let users create more than one visualization for a dataset, since there is no single best way to visualize a given dataset. Because bivariate and multivariate visualizations are particularly difficult to create effectively, this automated approach takes the guesswork out of the process and provides a number of ways to generate multivariate visualizations for the same variables, allowing the user to choose which visualization is most appropriate for their presentation. The methods used in these APIs and the renderers generated by them are not available elsewhere. The presentation will show how statistics can be used as the basis for automating default visualizations of data along continuous ramps, creating more refined visualizations while revealing the spread and outliers of the data. Adding interactive components to instantaneously alter visualizations allows users to unearth spatial patterns previously unknown among one or more variables. These applications may focus on a single dataset that is frequently updated, or be configurable for a variety of datasets from multiple sources.
NASA Astrophysics Data System (ADS)
Fernández, M. D.; López, J. C.; Baeza, E.; Céspedes, A.; Meca, D. E.; Bailey, B.
2015-08-01
A typical meteorological year (TMY) represents the typical meteorological conditions over many years but still contains the short term fluctuations which are absent from long-term averaged data. Meteorological data were measured at the Experimental Station of Cajamar `Las Palmerillas' (Cajamar Foundation) in Almeria, Spain, over 19 years at the meteorological station and in a reference greenhouse which is typical of those used in the region. The two sets of measurements were subjected to quality control analysis and then used to create TMY datasets using three different methodologies proposed in the literature. Three TMY datasets were generated for the external conditions and two for the greenhouse. They were assessed by using each as input to seven horticultural models and comparing the model results with those obtained by experiment in practical trials. In addition, the models were used with the meteorological data recorded during the trials. A scoring system was used to identify the best performing TMY in each application and then rank them in overall performance. The best methodology was that of Argiriou for both greenhouse and external conditions. The average relative errors between the seasonal values estimated using the 19-year dataset and those using the Argiriou greenhouse TMY were 2.2 % (reference evapotranspiration), -0.45 % (pepper crop transpiration), 3.4 % (pepper crop nitrogen uptake) and 0.8 % (green bean yield). The values obtained using the Argiriou external TMY were 1.8 % (greenhouse reference evapotranspiration), 0.6 % (external reference evapotranspiration), 4.7 % (greenhouse heat requirement) and 0.9 % (loquat harvest date). Using the models with the 19 individual years in the historical dataset showed that the year to year weather variability gave results which differed from the average values by ± 15 %. By comparison with results from other greenhouses it was shown that the greenhouse TMY is applicable to greenhouses which have a solar radiation transmission of approximately 65 % and rely on manual control of ventilation which constitute the majority in the south-east of Spain and in most Mediterranean greenhouse areas.
Large-scale seismic waveform quality metric calculation using Hadoop
NASA Astrophysics Data System (ADS)
Magana-Zook, S.; Gaylord, J. M.; Knapp, D. R.; Dodge, D. A.; Ruppert, S. D.
2016-09-01
In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of 0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. These experiments were conducted multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.
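The comparison above rests on expressing per-waveform quality metrics as independent map operations. A toy PySpark sketch of that pattern follows; the synthetic segments and the simple RMS/amplitude "metrics" are placeholders for the much richer metrics and I/O pipeline used in the study, and a running Spark installation is assumed.

```python
# Toy PySpark sketch: distribute waveform segments and compute per-segment
# quality metrics in parallel. Data and metrics are invented stand-ins.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("waveform-quality-metrics").getOrCreate()
sc = spark.sparkContext

# pretend each element is one waveform segment pulled from the archive
segments = [np.random.default_rng(i).normal(size=2000) for i in range(100)]

def quality_metrics(waveform):
    return {
        "n_samples": int(waveform.size),
        "rms": float(np.sqrt(np.mean(waveform ** 2))),
        "max_abs": float(np.max(np.abs(waveform))),
    }

metrics = sc.parallelize(segments, numSlices=10).map(quality_metrics)
print(metrics.take(3))
spark.stop()
```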
Similarity of markers identified from cancer gene expression studies: observations from GEO.
Shi, Xingjie; Shen, Shihao; Liu, Jin; Huang, Jian; Zhou, Yong; Ma, Shuangge
2014-09-01
Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
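One simple way to quantify how much marker lists from different datasets agree, in the spirit of the comparison above, is pairwise overlap counts and Jaccard indices. The gene lists below are invented for illustration; the paper also reviews model-based joint-analysis approaches.

```python
# Pairwise overlap of marker gene lists identified from different datasets.
from itertools import combinations

markers = {
    "GSE_A": {"TP53", "BRCA1", "EGFR", "MYC"},
    "GSE_B": {"TP53", "KRAS", "MYC"},
    "GSE_C": {"PTEN", "RB1"},
}

for (name1, g1), (name2, g2) in combinations(markers.items(), 2):
    shared = g1 & g2
    jaccard = len(shared) / len(g1 | g2)
    print(f"{name1} vs {name2}: shared = {sorted(shared)}, Jaccard = {jaccard:.2f}")
```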
Information Visualization Techniques for Effective Cross-Discipline Communication
NASA Astrophysics Data System (ADS)
Fisher, Ward
2013-04-01
Collaboration between research groups in different fields is a common occurrence, but it can often be frustrating due to the absence of a common vocabulary. This lack of a shared context can make expressing important concepts and discussing results difficult. This problem may be further exacerbated when communicating to an audience of laypeople. Without a clear frame of reference, simple concepts are often rendered difficult-to-understand at best, and unintelligible at worst. An easy way to alleviate this confusion is with the use of clear, well-designed visualizations to illustrate an idea, process or conclusion. There exist a number of well-described machine-learning and statistical techniques which can be used to illuminate the information present within complex high-dimensional datasets. Once the information has been separated from the data, clear communication becomes a matter of selecting an appropriate visualization. Ideally, the visualization is information-rich but data-scarce. Anything from a simple bar chart, to a line chart with confidence intervals, to an animated set of 3D point-clouds can be used to render a complex idea as an easily understood image. Several case studies will be presented in this work. In the first study, we will examine how a complex statistical analysis was applied to a high-dimensional dataset, and how the results were succinctly communicated to an audience of microbiologists and chemical engineers. Next, we will examine a technique used to illustrate the concept of the singular value decomposition, as used in the field of computer vision, to a lay audience of undergraduate students from mixed majors. We will then examine a case where a simple animated line plot was used to communicate an approach to signal decomposition, and will finish with a discussion of the tools available to create these visualizations.
Automatic adjustment of astrochronologic correlations
NASA Astrophysics Data System (ADS)
Zeeden, Christian; Kaboth, Stefanie; Hilgen, Frederik; Laskar, Jacques
2017-04-01
Here we present an algorithm for the automated adjustment and optimisation of correlations between proxy data and an orbital tuning target (or similar datasets, e.g. ice models) for the R environment (R Development Core Team 2008), building on the 'astrochron' package (Meyers et al. 2014). The basis of this approach is an initial tuning on the orbital (precession, obliquity, eccentricity) scale. We use filters of orbital frequency ranges of the data related to e.g. precession, obliquity or eccentricity and compare these filters to an ensemble of target data, which may consist of e.g. different combinations of obliquity and precession, different phases of precession and obliquity, a mix of orbital and other data (e.g. ice models), or different orbital solutions. This approach allows for the identification of an ideal mix of precession and obliquity to be used as the tuning target. In addition, the uncertainty related to different tuning tie points (and also the precession and obliquity contributions of the tuning target) can easily be assessed. Our message is to suggest an initial tuning and then obtain a reproducible tuned time scale, avoiding arbitrarily chosen tie points and replacing these by automatically chosen ones representing filter maxima (or minima). We present and discuss the above-outlined approach and apply it to artificial and geological data. Artificial data are assessed to find optimal filter settings; real datasets are used to demonstrate the possibilities of such an approach. References: Meyers, S.R. (2014). Astrochron: An R Package for Astrochronology. http://cran.r-project.org/package=astrochron; R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Generation of High Resolution Global DSM from ALOS PRISM
NASA Astrophysics Data System (ADS)
Takaku, J.; Tadono, T.; Tsutsui, K.
2014-04-01
Panchromatic Remote-sensing Instrument for Stereo Mapping (PRISM), one of onboard sensors carried on the Advanced Land Observing Satellite (ALOS), was designed to generate worldwide topographic data with its optical stereoscopic observation. The sensor consists of three independent panchromatic radiometers for viewing forward, nadir, and backward in 2.5 m ground resolution producing a triplet stereoscopic image along its track. The sensor had observed huge amount of stereo images all over the world during the mission life of the satellite from 2006 through 2011. We have semi-automatically processed Digital Surface Model (DSM) data with the image archives in some limited areas. The height accuracy of the dataset was estimated at less than 5 m (rms) from the evaluation with ground control points (GCPs) or reference DSMs derived from the Light Detection and Ranging (LiDAR). Then, we decided to process the global DSM datasets from all available archives of PRISM stereo images by the end of March 2016. This paper briefly reports on the latest processing algorithms for the global DSM datasets as well as their preliminary results on some test sites. The accuracies and error characteristics of datasets are analyzed and discussed on various fields by the comparison with existing global datasets such as Ice, Cloud, and land Elevation Satellite (ICESat) data and Shuttle Radar Topography Mission (SRTM) data, as well as the GCPs and the reference airborne LiDAR/DSM.
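A minimal sketch of the kind of accuracy check described above, mean and RMS height differences between a DSM and reference elevations (GCPs or LiDAR), is shown below. The elevation values are invented.

```python
# RMS and mean height error of a DSM against reference elevations.
import numpy as np

dsm_heights = np.array([102.4, 250.1, 98.7, 312.9, 176.3])   # PRISM DSM (m), made up
ref_heights = np.array([101.8, 251.6, 99.5, 310.5, 175.2])   # reference (m), made up

diff = dsm_heights - ref_heights
rmse = np.sqrt(np.mean(diff ** 2))
print(f"mean error = {diff.mean():.2f} m, RMSE = {rmse:.2f} m")
```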
Scaling of global input-output networks
NASA Astrophysics Data System (ADS)
Liang, Sai; Qi, Zhengling; Qu, Shen; Zhu, Ji; Chiu, Anthony S. F.; Jia, Xiaoping; Xu, Ming
2016-06-01
Examining scaling patterns of networks can help understand how structural features relate to the behavior of the networks. Input-output networks consist of industries as nodes and inter-industrial exchanges of products as links. Previous studies consider limited measures for node strengths and link weights, and also ignore the impact of dataset choice. We consider a comprehensive set of indicators in this study that are important in economic analysis, and also examine the impact of dataset choice, by studying input-output networks in individual countries and the entire world. Results show that Burr, Log-Logistic, Log-normal, and Weibull distributions can better describe scaling patterns of global input-output networks. We also find that dataset choice has limited impacts on the observed scaling patterns. Our findings can help examine the quality of economic statistics, estimate missing data in economic statistics, and identify key nodes and links in input-output networks to support economic policymaking.
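The following hedged sketch shows one way to compare candidate scaling distributions for node strengths by maximum likelihood and AIC, in the spirit of the distribution comparison above (Burr and Log-Logistic are omitted for brevity). The strengths are simulated, not real input-output tables.

```python
# Fit candidate distributions to synthetic node strengths and compare by AIC.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
strengths = rng.lognormal(mean=2.0, sigma=1.0, size=400)  # synthetic node strengths

candidates = {"lognorm": stats.lognorm, "weibull": stats.weibull_min}
for name, dist in candidates.items():
    params = dist.fit(strengths, floc=0)          # fix location at zero
    loglik = dist.logpdf(strengths, *params).sum()
    aic = 2 * (len(params) - 1) - 2 * loglik      # loc is fixed, not estimated
    print(f"{name}: AIC = {aic:.1f}")
```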
Detecting opinion spams through supervised boosting approach.
Hazim, Mohamad; Anuar, Nor Badrul; Ab Razak, Mohd Faizal; Abdullah, Nor Aniza
2018-01-01
Product reviews are individuals' opinions, judgements or beliefs about a certain product or service provided by a company. Such reviews serve as guides for these companies to plan and monitor their business ventures in terms of increasing productivity or enhancing their product/service quality. Product reviews can also increase business profits by convincing future customers about the products in which they have an interest. In a mobile application marketplace such as Google Playstore, reviews and star ratings are used as indicators of application quality. However, among all these reviews, also known here as opinions, spam exists that disrupts the online business balance. Previous studies used time series and neural network approaches (which require a lot of computational power) to detect these opinion spams. However, their detection accuracy can be limited because they focus only on basic, discrete, document-level features, thereby capturing few statistical relationships. Aiming to improve the detection of opinion spams in the mobile application marketplace, this study proposes statistical features that are modelled through supervised boosting approaches, namely the Extreme Gradient Boost (XGBoost) and the Generalized Boosted Regression Model (GBM), to evaluate two multilingual datasets (i.e. English and Malay). The evaluation found that XGBoost is most suitable for detecting opinion spams in the English dataset, while the GBM Gaussian is most suitable for the Malay dataset. The comparative analysis also indicates that the proposed statistical features achieved a detection accuracy of 87.43 per cent on the English dataset and 86.13 per cent on the Malay dataset.
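As a hedged illustration of training a boosted classifier on simple statistical review features, the sketch below uses invented features (review length, rating deviation, reviewer activity) and simulated labels; scikit-learn's GradientBoostingClassifier stands in for the XGBoost/GBM models used in the paper.

```python
# Boosted classifier on toy statistical features of reviews.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 500
review_len = rng.integers(5, 300, size=n)
rating_dev = np.abs(rng.normal(0, 1.5, size=n))          # |rating - app mean|
reviews_per_user = rng.integers(1, 50, size=n)
X = np.column_stack([review_len, rating_dev, reviews_per_user])

# Simulated labels: short, extreme, high-volume reviews are more likely spam
score = -0.01 * review_len + 1.0 * rating_dev + 0.05 * reviews_per_user
y = (score + rng.normal(0, 1, size=n) > 1.5).astype(int)

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
print(f"cross-validated accuracy: {acc:.3f}")
```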
Abar, Orhan; Charnigo, Richard J.; Rayapati, Abner
2017-01-01
Association rule mining has received significant attention from both the data mining and machine learning communities. While data mining researchers focus more on designing efficient algorithms to mine rules from large datasets, the learning community has explored applications of rule mining to classification. A major problem with rule mining algorithms is the explosion of rules even for moderate sized datasets making it very difficult for end users to identify both statistically significant and potentially novel rules that could lead to interesting new insights and hypotheses. Researchers have proposed many domain independent interestingness measures using which, one can rank the rules and potentially glean useful rules from the top ranked ones. However, these measures have not been fully explored for rule mining in clinical datasets owing to the relatively large sizes of the datasets often encountered in healthcare and also due to limited access to domain experts for review/analysis. In this paper, using an electronic medical record (EMR) dataset of diagnoses and medications from over three million patient visits to the University of Kentucky medical center and affiliated clinics, we conduct a thorough evaluation of dozens of interestingness measures proposed in data mining literature, including some new composite measures. Using cumulative relevance metrics from information retrieval, we compare these interestingness measures against human judgments obtained from a practicing psychiatrist for association rules involving the depressive disorders class as the consequent. Our results not only surface new interesting associations for depressive disorders but also indicate classes of interestingness measures that weight rule novelty and statistical strength in contrasting ways, offering new insights for end users in identifying interesting rules. PMID:28736771
Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo
2015-01-01
Background The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics. Methods We propose a new combinatorial optimization problem that addresses the core issue of biomarker detection in integrated datasets. Optimal solutions for this model deliver a feature selection from a panel of prospective biomarkers. The model we propose is a generalised version of the (α,β)-k-Feature Set problem. We illustrate the performance of this new methodology via a challenging meta-analysis task involving six prostate cancer microarray datasets. The results are then compared to the popular RankProd meta-analysis tool and to what can be obtained by analysing the individual datasets by statistical and combinatorial methods alone. Results Application of the integrated method resulted in a more informative signature than the rank-based meta-analysis or individual dataset results, and overcomes problems arising from real world datasets. The set of genes identified is highly significant in the context of prostate cancer. The method used does not rely on homogenisation or transformation of values to a common scale, and at the same time is able to capture markers associated with subgroups of the disease. PMID:26106884
Accurate continuous geographic assignment from low- to high-density SNP data.
Guillot, Gilles; Jónsson, Hákon; Hinge, Antoine; Manchih, Nabil; Orlando, Ludovic
2016-04-01
Large-scale genotype datasets can help track the dispersal patterns of epidemiological outbreaks and predict the geographic origins of individuals. Such genetically-based geographic assignments also show a range of possible applications in forensics for profiling both victims and criminals, and in wildlife management, where poaching hotspot areas can be located. They, however, require fast and accurate statistical methods to handle the growing amount of genetic information made available from genotype arrays and next-generation sequencing technologies. We introduce a novel statistical method for geopositioning individuals of unknown origin from genotypes. Our method is based on a geostatistical model trained with a dataset of georeferenced genotypes. Statistical inference under this model can be implemented within the theoretical framework of Integrated Nested Laplace Approximation, which represents one of the major recent breakthroughs in statistics, as it does not require Monte Carlo simulations. We compare the performance of our method and an alternative method for geospatial inference, SPA in a simulation framework. We highlight the accuracy and limits of continuous spatial assignment methods at various scales by analyzing genotype datasets from a diversity of species, including Florida Scrub-jay birds Aphelocoma coerulescens, Arabidopsis thaliana and humans, representing 41-197,146 SNPs. Our method appears to be best suited for the analysis of medium-sized datasets (a few tens of thousands of loci), such as reduced-representation sequencing data that become increasingly available in ecology. http://www2.imm.dtu.dk/∼gigu/Spasiba/ gilles.b.guillot@gmail.com Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Statistical link between external climate forcings and modes of ocean variability
NASA Astrophysics Data System (ADS)
Malik, Abdul; Brönnimann, Stefan; Perona, Paolo
2017-07-01
In this study we investigate the statistical link between external climate forcings and modes of ocean variability on inter-annual (3-year) to centennial (100-year) timescales using a de-trended semi-partial cross-correlation analysis technique. To investigate this link we employ observations (AD 1854-1999), climate proxies (AD 1600-1999), and coupled Atmosphere-Ocean-Chemistry Climate Model simulations with SOCOL-MPIOM (AD 1600-1999). We find robust statistical evidence that the Atlantic multi-decadal oscillation (AMO) has an intrinsic positive correlation with solar activity in all datasets employed. The strength of the relationship between AMO and solar activity is modulated by volcanic eruptions and by complex interaction among modes of ocean variability. The observational dataset reveals that the El Niño southern oscillation (ENSO) has a statistically significant negative intrinsic correlation with solar activity on decadal to multi-decadal timescales (16-27-year), whereas there is no evidence of a link on a typical ENSO timescale (2-7-year). In the observational dataset, volcanic eruptions do not have a link with AMO on a typical AMO timescale (55-80-year); however, the long-term datasets (proxies and SOCOL-MPIOM output) show that volcanic eruptions have an intrinsic negative correlation with AMO on inter-annual to multi-decadal timescales. The Pacific decadal oscillation has no link with solar activity; however, it has a positive intrinsic correlation with volcanic eruptions on multi-decadal timescales (47-54-year) in the reconstruction and on decadal to multi-decadal timescales (16-32-year) in the climate model simulations. We also find evidence of a link between volcanic eruptions and ENSO; however, the sign of the relationship is not consistent between observations/proxies and climate model simulations.
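The sketch below shows the basic de-trended semi-partial correlation idea: both series are linearly de-trended, and an AMO-like series is correlated with a solar series after a volcanic signal has been regressed out of the solar series. All series are synthetic, and the study's actual analysis additionally works across many timescales and assesses significance.

```python
# De-trended semi-partial correlation on synthetic forcing/ocean-mode series.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
t = np.arange(400)
solar = np.sin(2 * np.pi * t / 11) + 0.002 * t + rng.normal(0, 0.2, t.size)
volcanic = (rng.random(t.size) < 0.05) * rng.uniform(1, 3, t.size)
amo = 0.4 * solar - 0.3 * volcanic + rng.normal(0, 0.5, t.size)

def detrend(x):
    slope, intercept, *_ = stats.linregress(t, x)
    return x - (slope * t + intercept)

amo_d, solar_d, volc_d = detrend(amo), detrend(solar), detrend(volcanic)

# regress the volcanic signal out of the solar series, then correlate
beta = np.polyfit(volc_d, solar_d, 1)
solar_resid = solar_d - np.polyval(beta, volc_d)
r, p = stats.pearsonr(amo_d, solar_resid)
print(f"semi-partial correlation = {r:.2f} (p = {p:.3g})")
```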
NASA Astrophysics Data System (ADS)
Griffiths, Thomas; Habler, Gerlinde; Schantl, Philip; Abart, Rainer
2017-04-01
Crystallographic orientation relationships (CORs) between crystalline inclusions and their hosts are commonly used to support particular inclusion origins, but often interpretations are based on a small fraction of all inclusions in a system. The electron backscatter diffraction (EBSD) method allows collection of large COR datasets more quickly than other methods while maintaining high spatial resolution. Large datasets allow analysis of the relative frequencies of different CORs, and identification of 'statistical CORs', where certain limited degrees of freedom exist in the orientation relationship between two neighbour crystals (Griffiths et al. 2016). Statistical CORs exist in addition to completely fixed 'specific' CORs (previously the only type of COR considered). We present a comparison of three EBSD single point datasets (all N > 200 inclusions) of rutile inclusions in garnet hosts, covering three rock systems, each with a different geological history: 1) magmatic garnet in pegmatite from the Koralpe complex, Eastern Alps, formed at temperatures > 600°C and low pressures; 2) granulite facies garnet rims on ultra-high-pressure garnets from the Kimi complex, Rhodope Massif; and 3) a Moldanubian granulite from the southeastern Bohemian Massif, equilibrated at peak conditions of 1050°C and 1.6 GPa. The present study is unique because all datasets have been analysed using the same catalogue of potential CORs, therefore relative frequencies and other COR properties can be meaningfully compared. In every dataset > 94% of the inclusions analysed exhibit one of the CORs tested for. Certain CORs are consistently among the most common in all datasets. However, the relative abundances of these common CORs show large variations between datasets (varying from 8 to 42 % relative abundance in one case). Other CORs are consistently uncommon but nonetheless present in every dataset. Lastly, there are some CORs that are common in one of the datasets and rare in the remainder. These patterns suggest competing influences on relative COR frequencies. Certain CORs seem consistently favourable, perhaps pointing to very stable low energy configurations, whereas some CORs are favoured in only one system, perhaps due to particulars of the formation mechanism, kinetics or conditions. Variations in COR frequencies between datasets seem to correlate with the conditions of host-inclusion system evolution. The two datasets from granulite-facies metamorphic samples show more similarities to each other than to the pegmatite dataset, and the sample inferred to have experienced the highest temperatures (Moldanubian granulite) shows the lowest diversity of CORs, low frequencies of statistical CORs and the highest frequency of specific CORs. These results provide evidence that petrological information is being encoded in COR distributions. They make a strong case for further studies of the factors influencing COR development and for measurements of COR distributions in other systems and between different phases. Griffiths, T.A., Habler, G., Abart, R. (2016): Crystallographic orientation relationships in host-inclusion systems: New insights from large EBSD data sets. Amer. Miner., 101, 690-705.
The multiple imputation method: a case study involving secondary data analysis.
Walani, Salimah R; Cleland, Charles M
2015-05-01
To illustrate, using a secondary data analysis study as an example, the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses was used. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiply imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiply imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to apply this technique to large datasets. The authors recommend that nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
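A minimal sketch of chained-equation multiple imputation followed by a pooled regression estimate is shown below. The variables and data are invented, scikit-learn's IterativeImputer (with sample_posterior=True) stands in for the chained equation procedure described above, and the pooling shown is a simplified version of Rubin's rules (point estimates only).

```python
# Five imputed datasets via an iterative (chained-equation-style) imputer,
# a regression fit on each, and pooled coefficients.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 300
educ = rng.normal(14, 2, n)
exper = rng.normal(10, 5, n)
log_wage = 1.0 + 0.08 * educ + 0.02 * exper + rng.normal(0, 0.3, n)
df = pd.DataFrame({"educ": educ, "exper": exper, "log_wage": log_wage})
df.loc[rng.random(n) < 0.2, "educ"] = np.nan   # inject 20% missing values

coefs = []
for m in range(5):  # five imputed datasets, as in the study described above
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    fit = LinearRegression().fit(completed[["educ", "exper"]], completed["log_wage"])
    coefs.append(fit.coef_)

pooled = np.mean(coefs, axis=0)  # point estimates pooled across imputations
print("pooled coefficients (educ, exper):", np.round(pooled, 3))
```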
Twitter Conversation Patterns Related to Research Papers
ERIC Educational Resources Information Center
Nelhans, Gustaf; Lorentzen, David Gunnarsson
2016-01-01
Introduction: This paper deals with what academic texts and datasets are referred to and discussed on Twitter. We used digital object identifiers as references to these items. Method: We streamed tweets from the Twitter application programming interface including the strings "dx" and "doi" while simultaneously streaming tweets…
Signal detection in global mean temperatures after "Paris": an uncertainty and sensitivity analysis
NASA Astrophysics Data System (ADS)
Visser, Hans; Dangendorf, Sönke; van Vuuren, Detlef P.; Bregman, Bram; Petersen, Arthur C.
2018-02-01
In December 2015, 195 countries agreed in Paris to hold the increase in global mean surface temperature (GMST) well below 2.0 °C above pre-industrial levels and to pursue efforts to limit the temperature increase to 1.5 °C. Since large financial flows will be needed to keep GMSTs below these targets, it is important to know how GMST has progressed since pre-industrial times. However, the Paris Agreement is not conclusive as regards methods to calculate it. Should trend progression be deduced from GCM simulations or from instrumental records by (statistical) trend methods? Which simulations or GMST datasets should be chosen, and which trend models? What is 'pre-industrial' and, finally, are the Paris targets formulated for total warming, originating from both natural and anthropogenic forcing, or do they refer to anthropogenic warming only? To find answers to these questions we performed an uncertainty and sensitivity analysis in which datasets and model choices were varied. For all cases we evaluated trend progression along with uncertainty information. To do so, we analysed four trend approaches and applied these to the five leading observational GMST products. We find GMST progression to be largely independent of the various trend model approaches. However, GMST progression is significantly influenced by the choice of GMST dataset. Uncertainties due to natural variability are largest in size. As a parallel path, we calculated GMST progression from an ensemble of 42 GCM simulations. Mean progression derived from GCM-based GMSTs appears to lie in the range of trend-dataset combinations. A difference between both approaches appears to be the width of uncertainty bands: GCM simulations show a much wider spread. Finally, we discuss various choices for pre-industrial baselines and the role of warming definitions. Based on these findings we propose an estimate for signal progression in GMSTs since pre-industrial times.
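As a hedged illustration of one simple way to estimate GMST trend progression and its uncertainty, the sketch below fits an ordinary least squares trend to a synthetic annual anomaly series. The data are simulated, and this plain OLS fit is only a stand-in for the four more sophisticated trend approaches compared in the study.

```python
# OLS trend and confidence interval for a synthetic GMST anomaly series.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
years = np.arange(1950, 2018)
# synthetic anomalies: ~0.012 K/yr warming plus AR(1)-like natural variability
noise = np.zeros(len(years))
for t in range(1, len(years)):
    noise[t] = 0.6 * noise[t - 1] + rng.normal(0, 0.08)
anomaly = 0.012 * (years - years[0]) + noise

X = sm.add_constant(years - years[0])
fit = sm.OLS(anomaly, X).fit()
slope = fit.params[1]
lower, upper = fit.conf_int()[1]
print(f"trend: {slope*10:.3f} K/decade (95% CI {lower*10:.3f} to {upper*10:.3f})")
```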
Geoseq: a tool for dissecting deep-sequencing datasets.
Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi
2010-10-12
Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
NASA Astrophysics Data System (ADS)
Schwartz, M. Christian
2017-08-01
This paper addresses two straightforward questions. First, how similar are the statistics of cirrus particle size distribution (PSD) datasets collected using the Two-Dimensional Stereo (2D-S) probe to cirrus PSD datasets collected using older Particle Measuring Systems (PMS) 2-D Cloud (2DC) and 2-D Precipitation (2DP) probes? Second, how similar are the datasets when shatter-correcting post-processing is applied to the 2DC datasets? To answer these questions, a database of measured and parameterized cirrus PSDs - constructed from measurements taken during the Small Particles in Cirrus (SPARTICUS); Mid-latitude Airborne Cirrus Properties Experiment (MACPEX); and Tropical Composition, Cloud, and Climate Coupling (TC4) flight campaigns - is used. Bulk cloud quantities are computed from the 2D-S database in three ways: first, directly from the 2D-S data; second, by applying the 2D-S data to ice PSD parameterizations developed using sets of cirrus measurements collected using the older PMS probes; and third, by applying the 2D-S data to a similar parameterization developed using the 2D-S data themselves. This is done so that measurements of the same cloud volumes by parameterized versions of the 2DC and 2D-S can be compared with one another. It is thereby seen - given the same cloud field and given the same assumptions concerning ice crystal cross-sectional area, density, and radar cross section - that the parameterized 2D-S and the parameterized 2DC predict similar distributions of inferred shortwave extinction coefficient, ice water content, and 94 GHz radar reflectivity. However, the parameterization of the 2DC based on uncorrected data predicts a statistically significantly higher number of total ice crystals and a larger ratio of small ice crystals to large ice crystals than does the parameterized 2D-S. The 2DC parameterization based on shatter-corrected data also predicts statistically different numbers of ice crystals than does the parameterized 2D-S, but the comparison between the two is nevertheless more favorable. It is concluded that the older datasets continue to be useful for scientific purposes, with certain caveats, and that continuing field investigations of cirrus with more modern probes are desirable.
References for Haplotype Imputation in the Big Data Era
Li, Wenzhi; Xu, Wei; Li, Qiling; Ma, Li; Song, Qing
2016-01-01
Imputation is a powerful in silico approach to fill in missing values in big datasets. This process requires a reference panel, which is a collection of big data from which the missing information can be extracted and imputed. Haplotype imputation requires ethnicity-matched references; a mismatched reference panel will significantly reduce the quality of imputation. However, currently existing big datasets cover only a small number of ethnicities, and the lack of ethnicity-matched references for many ethnic populations in the world has hampered the data imputation of haplotypes and its downstream applications. To solve this issue, several approaches have been proposed and explored, including the mixed reference panel, the internal reference panel and the genotype-converted reference panel. This review article provides information on and a comparison between these approaches. Increasing evidence has shown that gene activity and function are dictated not by just one or two genetic elements but by cis-interactions of multiple elements. Cis-interactions require the interacting elements to be on the same chromosome molecule; therefore, haplotype analysis is essential for investigating cis-interactions among multiple genetic variants at different loci, and appears to be especially important for studying common diseases. It will be valuable in a wide spectrum of applications, from academic research to clinical diagnosis, prevention, treatment, and the pharmaceutical industry. PMID:27274952
Freiman, Moti; Nickisch, Hannes; Prevrhal, Sven; Schmitt, Holger; Vembar, Mani; Maurovich-Horvat, Pál; Donnelly, Patrick; Goshen, Liran
2017-03-01
The goal of this study was to assess the potential added benefit of accounting for partial volume effects (PVE) in an automatic coronary lumen segmentation algorithm that is used to determine the hemodynamic significance of a coronary artery stenosis from coronary computed tomography angiography (CCTA). Two sets of data were used in our work: (a) multivendor CCTA datasets of 18 subjects from the MICCAI 2012 challenge with automatically generated centerlines and 3 reference segmentations of 78 coronary segments and (b) additional CCTA datasets of 97 subjects with 132 coronary lesions that had invasive reference standard FFR measurements. We extracted the coronary artery centerlines for the 97 datasets by an automated software program followed by manual correction if required. An automatic machine-learning-based algorithm segmented the coronary tree with and without accounting for the PVE. We obtained CCTA-based FFR measurements using a flow simulation in the coronary trees that were generated by the automatic algorithm with and without accounting for PVE. We assessed the potential added value of PVE integration as a part of the automatic coronary lumen segmentation algorithm by means of segmentation accuracy using the MICCAI 2012 challenge framework and by means of flow simulation overall accuracy, sensitivity, specificity, negative and positive predictive values, and the receiver operating characteristic (ROC) area under the curve. We also evaluated the potential benefit of accounting for PVE in automatic segmentation for flow simulation for lesions that were diagnosed as obstructive based on CCTA, which could have indicated a need for an invasive exam and revascularization. Our segmentation algorithm improves the maximal surface distance error by ~39% compared to a previously published method on the 18 datasets from the MICCAI 2012 challenge, with comparable Dice and mean surface distance. Results with and without accounting for PVE were comparable. In contrast, integrating PVE analysis into an automatic coronary lumen segmentation algorithm improved the flow simulation specificity from 0.6 to 0.68 with the same sensitivity of 0.83. Also, accounting for PVE improved the area under the ROC curve for detecting hemodynamically significant CAD from 0.76 to 0.8 compared to automatic segmentation without PVE analysis, with an invasive FFR threshold of 0.8 as the reference standard. Accounting for PVE in flow simulation to support the detection of hemodynamically significant disease in CCTA-based obstructive lesions improved specificity from 0.51 to 0.73 with the same sensitivity of 0.83 and the area under the curve from 0.69 to 0.79. The improvement in the AUC was statistically significant (N = 76, DeLong's test, P = 0.012). Accounting for the partial volume effects in automatic coronary lumen segmentation algorithms has the potential to improve the accuracy of CCTA-based hemodynamic assessment of coronary artery lesions. © 2017 American Association of Physicists in Medicine.
Kawata, Masaaki; Sato, Chikara
2007-06-01
In determining the three-dimensional (3D) structure of macromolecular assemblies in single particle analysis, a large representative dataset of two-dimensional (2D) average images derived from a huge number of raw images is key to achieving high resolution. Because alignments prior to averaging are computationally intensive, currently available multireference alignment (MRA) software does not survey every possible alignment. This leads to misaligned images, creating blurred averages and reducing the quality of the final 3D reconstruction. We present a new method, in which multireference alignment is harmonized with classification (multireference multiple alignment: MRMA). This method enables a statistical comparison of multiple alignment peaks, reflecting the similarities between each raw image and a set of reference images. Among the selected alignment candidates for each raw image, misaligned images are statistically excluded, based on the principle that aligned raw images of similar projections have a dense distribution around the correctly aligned coordinates in image space. This newly developed method was examined for accuracy and speed using model image sets with various signal-to-noise ratios, and with electron microscope images of the Transient Receptor Potential C3 and the sodium channel. In every dataset, the newly developed method outperformed conventional methods in robustness against noise and in speed, creating 2D average images of higher quality. This statistically harmonized alignment-classification combination should greatly improve the quality of single particle analysis.
Booth, Brian G; Keijsers, Noël L W; Sijbers, Jan; Huysmans, Toon
2018-05-03
Pedobarography produces large sets of plantar pressure samples that are routinely subsampled (e.g. using regions of interest) or aggregated (e.g. center of pressure trajectories, peak pressure images) in order to simplify statistical analysis and provide intuitive clinical measures. We hypothesize that these data reductions discard gait information that can be used to differentiate between groups or conditions. To test the hypothesis of null information loss, we created an implementation of statistical parametric mapping (SPM) for dynamic plantar pressure datasets (i.e. plantar pressure videos). Our SPM software framework brings all plantar pressure videos into anatomical and temporal correspondence, then performs statistical tests at each sampling location in space and time. As a novel feature, we introduce non-linear temporal registration into the framework in order to normalize for timing differences within the stance phase. We refer to our software framework as STAPP: spatiotemporal analysis of plantar pressure measurements. Using STAPP, we tested our hypothesis on plantar pressure videos from 33 healthy subjects walking at different speeds. As walking speed increased, STAPP was able to identify significant decreases in plantar pressure at mid-stance from the heel through the lateral forefoot. The extent of these plantar pressure decreases has not previously been observed using existing plantar pressure analysis techniques. We therefore conclude that the subsampling of plantar pressure videos - a task which led to the discarding of gait information in our study - can be avoided using STAPP. Copyright © 2018 Elsevier B.V. All rights reserved.
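As a conceptual sketch of the pointwise testing idea (not the STAPP framework itself, which additionally handles anatomical correspondence, non-linear temporal registration and random-field-theory corrections), the following performs a paired test at every spatiotemporal sample of already-registered pressure videos, using synthetic arrays.

```python
# Minimal sketch of pixel/frame-wise testing on registered plantar pressure videos.
# Not the STAPP framework: registration and random-field-theory corrections are omitted.
# Arrays are synthetic and assumed already in anatomical and temporal correspondence.
import numpy as np
from scipy import stats

# shape: (subjects, frames, rows, cols)
rng = np.random.default_rng(0)
slow = rng.gamma(2.0, 50.0, size=(33, 40, 32, 16))
fast = slow * 0.9 + rng.normal(0, 5, size=slow.shape)     # pretend pressures drop at higher speed

t, p = stats.ttest_rel(fast, slow, axis=0)                # paired test at every (frame, row, col)
significant = p < (0.05 / p.size)                         # crude Bonferroni stand-in for RFT
print("significant samples:", int(significant.sum()))
```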
NASA Astrophysics Data System (ADS)
Sun, L. Qing; Feng, Feng X.
2014-11-01
In this study, we first built and compared two different climate datasets for the Wuling mountainous area in 2010: one that considered topographical effects during the ANUSPLIN interpolation, referred to as the terrain-based climate dataset, and one that did not, called the ordinary climate dataset. Then, we quantified the topographical effects of climatic inputs on NPP estimation by inputting the two different climate datasets to the same ecosystem model, the Boreal Ecosystem Productivity Simulator (BEPS), to evaluate the importance of considering relief when estimating NPP. Finally, we identified the primary variables contributing to the topographical effects through a series of experiments, given an overall accuracy of the model output for NPP. The results showed that: (1) The terrain-based climate dataset presented more reliable topographic information and had closer agreement with the station dataset than the ordinary climate dataset over the successive 365-day time series in terms of daily mean values. (2) On average, the ordinary climate dataset underestimated NPP by 12.5% compared with the terrain-based climate dataset over the whole study area. (3) The primary climate variables contributing to the topographical effects of climatic inputs for the Wuling mountainous area were temperatures, which suggests that it is necessary to correct temperature differences for estimating NPP accurately in such complex terrain.
3D shape recovery from image focus using gray level co-occurrence matrix
NASA Astrophysics Data System (ADS)
Mahmood, Fahad; Munir, Umair; Mehmood, Fahad; Iqbal, Javaid
2018-04-01
Recovering a precise and accurate 3-D shape of a target object using a robust 3-D shape recovery algorithm is a long-standing objective of the computer vision community. The focus measure algorithm plays an important role in this architecture, converting the color values of each pixel of the acquired 2-D image dataset into corresponding focus values. After convolving the focus measure filter with the input 2-D image dataset, a 3-D shape recovery approach is applied to recover the depth map. In this paper, we propose the Gray Level Co-occurrence Matrix, along with its statistical features, for computing the focus information of the image dataset. The Gray Level Co-occurrence Matrix quantifies the texture present in the image using statistical features and then applies the joint probability distribution function of the gray-level pairs of the input image. Finally, we quantify the focus value of the input image using a Gaussian Mixture Model. Owing to its low computational complexity, sharp focus measure curve, robustness to random noise sources, and accuracy, it is considered a superior alternative to most recently proposed 3-D shape recovery approaches. The algorithm is investigated in depth on real image sequences and a synthetic image dataset. The efficiency of the proposed scheme is also compared with state-of-the-art 3-D shape recovery approaches. Finally, by means of two global statistical measures, root mean square error and correlation, we show that this approach, in spite of its simplicity, generates accurate results.
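As a rough illustration of the underlying focus measure, the sketch below computes a GLCM-based sharpness score for a single image window with scikit-image. It is only a simplified stand-in for the proposed method: the per-pixel windowing, the joint-probability treatment and the Gaussian Mixture Model fusion described above are not reproduced, and the contrast/homogeneity ratio used here is an assumed, illustrative choice.

```python
# Simplified sketch of a GLCM-based focus score for one image window (scikit-image).
# The paper's full pipeline (per-pixel windows, joint-probability features, GMM fusion)
# is not reproduced; the contrast/homogeneity ratio is an assumed, illustrative choice.
import numpy as np
from scipy.ndimage import uniform_filter
from skimage.feature import graycomatrix, graycoprops

def glcm_focus(window: np.ndarray) -> float:
    """Sharper texture -> higher GLCM contrast and lower homogeneity."""
    glcm = graycomatrix(window, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    contrast = graycoprops(glcm, "contrast").mean()
    homogeneity = graycoprops(glcm, "homogeneity").mean()
    return float(contrast / (homogeneity + 1e-9))

# toy check: an in-focus (high-frequency) window should score higher than a blurred copy
rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, (64, 64)).astype(np.uint8)
blurred = uniform_filter(sharp.astype(float), size=5).astype(np.uint8)
print(glcm_focus(sharp), glcm_focus(blurred))
```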
A novel statistical method for quantitative comparison of multiple ChIP-seq datasets.
Chen, Li; Wang, Chi; Qin, Zhaohui S; Wu, Hao
2015-06-15
ChIP-seq is a powerful technology to measure protein binding or histone modification strength on the whole-genome scale. Although there are a number of methods available for single ChIP-seq data analysis (e.g. 'peak detection'), rigorous statistical methods for quantitative comparison of multiple ChIP-seq datasets that account for data from control experiments, signal-to-noise ratios, biological variation and multiple-factor experimental designs remain under-developed. In this work, we develop a statistical method to perform quantitative comparison of multiple ChIP-seq datasets and detect genomic regions showing differential protein binding or histone modification. We first detect peaks from all datasets and then take their union to form a single set of candidate regions. The read counts from the IP experiment at the candidate regions are assumed to follow a Poisson distribution. The underlying Poisson rates are modeled as an experiment-specific function of artifacts and biological signals. We then obtain the estimated biological signals and compare them through a hypothesis testing procedure in a linear model framework. Simulations and real data analyses demonstrate that the proposed method provides more accurate and robust results compared with existing ones. An R software package, ChIPComp, is freely available at http://web1.sph.emory.edu/users/hwu30/software/ChIPComp.html. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
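To make the modelling idea concrete, here is a minimal Python illustration of a Poisson regression on IP read counts at a single candidate region, with library-size offsets and a condition covariate. It is not the ChIPComp model (the published method is an R package and additionally adjusts for control experiments and biological variation); the counts and library sizes below are toy numbers.

```python
# Minimal illustration of the core idea (Poisson GLM on region counts), not the ChIPComp model:
# control-experiment adjustment and biological-variance handling are omitted. Toy numbers only.
import numpy as np
import statsmodels.api as sm

# IP read counts at one candidate region for 3 replicates per condition
counts = np.array([52, 61, 48, 95, 110, 88])
condition = np.array([0, 0, 0, 1, 1, 1])               # 0 = condition A, 1 = condition B
lib_size = np.array([2.1e7, 1.9e7, 2.3e7, 2.0e7, 2.2e7, 1.8e7])

X = sm.add_constant(condition)
fit = sm.GLM(counts, X, family=sm.families.Poisson(),
             offset=np.log(lib_size)).fit()

# Wald test on the condition coefficient: is binding strength different between conditions?
print("log fold-change:", fit.params[1], "p-value:", fit.pvalues[1])
```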
International Metadata Standards and Enterprise Data Quality Metadata Systems
NASA Astrophysics Data System (ADS)
Habermann, T.
2016-12-01
Well-documented data quality is critical in situations where scientists and decision-makers need to combine multiple datasets from different disciplines and collection systems to address scientific questions or difficult decisions. Standardized data quality metadata could be very helpful in these situations. Many efforts at developing data quality standards falter because of the diversity of approaches to measuring and reporting data quality. The "one size fits all" paradigm does not generally work well in this situation. The ISO data quality standard (ISO 19157) takes a different approach with the goal of systematically describing how data quality is measured rather than how it should be measured. It introduces the idea of standard data quality measures that can be well documented in a measure repository and used for consistently describing how data quality is measured across an enterprise. The standard includes recommendations for properties of these measures that include unique identifiers, references, illustrations and examples. Metadata records can reference these measures using the unique identifier and reuse them along with details (and references) that describe how the measure was applied to a particular dataset. A second important feature of ISO 19157 is the inclusion of citations to existing papers or reports that describe quality of a dataset. This capability allows users to find this information in a single location, i.e. the dataset metadata, rather than searching the web or other catalogs. I will describe these and other capabilities of ISO 19157 with examples of how they are being used to describe data quality across the NASA EOS Enterprise and also compare these approaches with other standards.
Kerns, James R; Followill, David S; Lowenstein, Jessica; Molineu, Andrea; Alvarez, Paola; Taylor, Paige A; Stingo, Francesco C; Kry, Stephen F
2016-05-01
Accurate data regarding linear accelerator (Linac) radiation characteristics are important for treatment planning system modeling as well as regular quality assurance of the machine. The Imaging and Radiation Oncology Core-Houston (IROC-H) has measured the dosimetric characteristics of numerous machines through their on-site dosimetry review protocols. Photon data are presented and can be used as a secondary check of acquired values, as a means to verify commissioning of a new machine, or in preparation for an IROC-H site visit. Photon data from IROC-H on-site reviews from 2000 to 2014 were compiled and analyzed. Specifically, data from approximately 500 Varian machines were analyzed. Each dataset consisted of point measurements of several dosimetric parameters at various locations in a water phantom to assess the percentage depth dose, jaw output factors, multileaf collimator small field output factors, off-axis factors, and wedge factors. The data were analyzed by energy and parameter, with similarly performing machine models being assimilated into classes. Common statistical metrics are presented for each machine class. Measurement data were compared against other reference data where applicable. Distributions of the parameter data were shown to be robust and to follow a Student's t distribution. Based on statistical and clinical criteria, all machine models could be classified into two or three classes for each energy, except for 6 MV, for which there were eight classes. Quantitative analysis of the measurements for 6, 10, 15, and 18 MV photon beams is presented for each parameter; supplementary material has also been made available which contains further statistical information. IROC-H has collected numerous data on Varian Linacs, and the results of photon measurements from the past 15 years are presented. The data can be used as a comparison check of a physicist's acquired values. Acquired values that are well outside the expected distribution should be verified by the physicist to identify whether the measurements are valid. Comparison of values to these reference data provides a redundant check to help prevent gross dosimetric treatment errors.
Smith, Joseph M.; Mather, Martha E.
2012-01-01
Ecological indicators are science-based tools used to assess how human activities have impacted environmental resources. For monitoring and environmental assessment, existing species assemblage data can be used to make these comparisons through time or across sites. An impediment to using assemblage data, however, is that these data are complex and need to be simplified in an ecologically meaningful way. Because multivariate statistics are mathematical relationships, statistical groupings may not make ecological sense and will not have utility as indicators. Our goal was to define a process to select defensible and ecologically interpretable statistical simplifications of assemblage data in which researchers and managers can have confidence. For this, we chose a suite of statistical methods, compared the groupings that resulted from these analyses, identified convergence among groupings, then we interpreted the groupings using species and ecological guilds. When we tested this approach using a statewide stream fish dataset, not all statistical methods worked equally well. For our dataset, logistic regression (Log), detrended correspondence analysis (DCA), cluster analysis (CL), and non-metric multidimensional scaling (NMDS) provided consistent, simplified output. Specifically, the Log, DCA, CL-1, and NMDS-1 groupings were ≥60% similar to each other, overlapped with the fluvial-specialist ecological guild, and contained a common subset of species. Groupings based on number of species (e.g., Log, DCA, CL and NMDS) outperformed groupings based on abundance [e.g., principal components analysis (PCA) and Poisson regression]. Although the specific methods that worked on our test dataset have generality, here we are advocating a process (e.g., identifying convergent groupings with redundant species composition that are ecologically interpretable) rather than the automatic use of any single statistical tool. We summarize this process in step-by-step guidance for the future use of these commonly available ecological and statistical methods in preparing assemblage data for use in ecological indicators.
NASA Astrophysics Data System (ADS)
Anker, Y.; Hershkovitz, Y.; Gasith, A.; Ben-Dor, E.
2011-12-01
Although remote sensing of fluvial ecosystems is well developed, the tradeoff between spectral and spatial resolutions prevents its application in small streams (<3 m width). In the current study, a remote sensing approach for the monitoring and research of small stream ecosystems was developed. The method is based on differentiation between two indicative vegetation species out of the ecosystem flora. Since, at the time of the study, the channel was covered mostly by a filamentous green alga (Cladophora glomerata) and watercress (Nasturtium officinale), these species were chosen as indicative; common reed (Phragmites australis) was also classified in order to exclude it from the stream ROI. The procedure included: A. For both section- and habitat-scale classifications, acquisition of aerial digital RGB datasets. B. For section-scale classification, hyperspectral (HSR) dataset acquisition. C. For calibration, HSR reflectance measurements of specific ground targets, in close proximity to each dataset acquisition swath. D. For habitat-scale classification, manual in-stream classification of flora grid transects. The digital RGB datasets were converted to reflectance units by spectral calibration against colored reference plates. These red, green, blue, white, and black EVA foam reference plates were measured by an ASD field spectrometer and each was given a spectral value. Each spectral value was later applied to the spectral calibration and radiometric correction of the spectral RGB (SRGB) cube. Spectral calibration of the HSR dataset was done using the empirical line method, based on reference values of progressive grey-scale targets. Differentiation between the vegetation species was done by supervised classification, both for the HSR and for the SRGB datasets. This procedure used the Spectral Angle Mapper function with the spectral pattern of each vegetation species as a spectral end member. Comparison between the two remote sensing techniques, and between the SRGB classification and the in-situ transects, indicates that: A. Stream vegetation classification resolution is about 4 cm with the SRGB method compared to about 1 m with HSR; moreover, this resolution is also higher than that of the manual grid transect classification. B. The SRGB method is by far the most cost-efficient. The combination of spectral information (rather than cognitive color) and the high spatial resolution of aerial photography provides noise filtration and better sub-water detection capabilities than the HSR technique. C. Only the SRGB method applies at both habitat and section scales; hence, its application, together with in-situ grid transects for validation, may be optimal for use in similar scenarios.
The HSR dataset was first degraded to 17 bands with the same spectral range as the RGB dataset and also to a dataset with 3 equivalent bands
NASA Astrophysics Data System (ADS)
Tian, D.; Medina, H.
2017-12-01
Post-processing of medium-range reference evapotranspiration (ETo) forecasts based on numerical weather prediction (NWP) models has the potential to improve the quality and utility of these forecasts. This work compares the performance of several post-processing methods for correcting ETo forecasts over the continental U.S. generated from The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE) database, using data from Europe (EC), the United Kingdom (MO), and the United States (NCEP). The post-processing techniques considered are simple bias correction, the use of multimodels, Ensemble Model Output Statistics (EMOS; Gneiting et al., 2005) and Bayesian Model Averaging (BMA; Raftery et al., 2005). ETo estimates based on quality-controlled U.S. Regional Climate Reference Network measurements, computed with the FAO 56 Penman-Monteith equation, are adopted as the baseline. EMOS and BMA are generally the most efficient post-processing techniques for the ETo forecasts. Nevertheless, simple bias correction of the best model is commonly much more rewarding than using raw multimodel forecasts. Our results demonstrate the potential of different forecasting and post-processing frameworks in operational evapotranspiration and irrigation advisory systems at the national scale.
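Of the techniques listed, only the simplest (an additive bias correction of the forecast against a training period) is easy to show compactly; EMOS and BMA require full distributional fitting and are not sketched here. The arrays below are toy data, not TIGGE or reference-network values.

```python
# Sketch of the simplest post-processing option mentioned above (additive bias correction);
# EMOS and BMA require full distributional fitting and are not shown. Toy data only.
import numpy as np

rng = np.random.default_rng(1)
obs_train = rng.gamma(4.0, 1.2, 365)                   # "observed" ETo (mm/day), training year
fc_train = obs_train + 0.8 + rng.normal(0, 0.5, 365)   # raw ensemble-mean forecast, biased high

bias = np.mean(fc_train - obs_train)                   # mean error over the training period

fc_new = np.array([5.1, 4.7, 6.0])                     # new raw forecasts to correct
print("corrected:", fc_new - bias)
```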
Tutkuviene, Janina; Cattaneo, Cristina; Obertová, Zuzana; Ratnayake, Melanie; Poppa, Pasquale; Barkus, Arunas; Khalaj-Hedayati, Kerstin; Schroeder, Inge; Ritz-Timme, Stefanie
2016-11-01
Craniofacial growth changes in young children are not yet completely understood. Up-to-date references for craniofacial measurements are crucial for clinical assessment of orthodontic anomalies, craniofacial abnormalities and subsequent planning of interventions. To provide normal reference data and to identify growth patterns for craniofacial dimensions of European boys and girls aged 3-6 years. Using standard anthropometric methodology, body weight, body height and 23 craniofacial measurements were acquired for a cross-sectional sample of 681 healthy children (362 boys and 319 girls) aged 3-6 years from Germany, Italy and Lithuania. Descriptive statistics, correlation coefficients, percentage annual changes and percentage growth rates were used to analyse the dataset. Between the ages of 3-6 years, craniofacial measurements showed age- and sex-related patterns independent from patterns observed for body weight and body height. Sex-related differences were observed in the majority of craniofacial measurements. In both sexes, face heights and face depths showed the strongest correlation with age. Growth patterns differed by craniofacial measurement and can be summarised into eight distinct age- and sex-related patterns. This study provided reference data and identified sex- and age-related growth patterns of the craniofacial complex of young European children, which may be used for detailed assessment of normal growth in paediatrics, maxillofacial reconstructive surgery and possibly for forensic age assessment.
Time Series Expression Analyses Using RNA-seq: A Statistical Approach
Oh, Sunghee; Song, Seongho; Grabowski, Gregory; Zhao, Hongyu; Noonan, James P.
2013-01-01
RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis. PMID:23586021
Caritat, Patrice de; Reimann, Clemens; Smith, David; Wang, Xueqiu
2017-01-01
During the last 10-20 years, Geological Surveys around the world have undertaken a major effort towards delivering fully harmonized and tightly quality-controlled low-density multi-element soil geochemical maps and datasets of vast regions including up to whole continents. Concentrations of between 45 and 60 elements commonly have been determined in a variety of different regolith types (e.g., sediment, soil). The multi-element datasets are published as complete geochemical atlases and made available to the general public. Several other geochemical datasets covering smaller areas but generally at a higher spatial density are also available. These datasets may, however, not be found by superficial internet-based searches because the elements are not mentioned individually either in the title or in the keyword lists of the original references. This publication attempts to increase the visibility and discoverability of these fundamental background datasets covering large areas up to whole continents.
Multi-Centrality Graph Spectral Decompositions and Their Application to Cyber Intrusion Detection
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Pin-Yu; Choudhury, Sutanay; Hero, Alfred
Many modern datasets can be represented as graphs and hence spectral decompositions such as graph principal component analysis (PCA) can be useful. Distinct from previous graph decomposition approaches based on subspace projection of a single topological feature, e.g., the centered graph adjacency matrix (graph Laplacian), we propose spectral decomposition approaches to graph PCA and graph dictionary learning that integrate multiple features, including graph walk statistics, centrality measures and graph distances to reference nodes. In this paper we propose a new PCA method for single graph analysis, called multi-centrality graph PCA (MC-GPCA), and a new dictionary learning method for ensembles of graphs, called multi-centrality graph dictionary learning (MC-GDL), both based on spectral decomposition of multi-centrality matrices. As an application to cyber intrusion detection, MC-GPCA can be an effective indicator of anomalous connectivity patterns and MC-GDL can provide a discriminative basis for attack classification.
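The multi-centrality idea can be illustrated with a small sketch: assemble a node-by-centrality feature matrix with NetworkX and take its principal components. This is only an illustration of the feature-integration concept, not the MC-GPCA algorithm from the paper; the karate-club graph and the particular centrality measures are arbitrary choices.

```python
# Minimal sketch: build a node-by-centrality feature matrix and take its principal components.
# Illustrates the "multi-centrality" idea only; it is not the MC-GPCA algorithm from the paper.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()                        # stand-in graph
features = np.column_stack([
    list(nx.degree_centrality(G).values()),
    list(nx.betweenness_centrality(G).values()),
    list(nx.closeness_centrality(G).values()),
    list(nx.eigenvector_centrality_numpy(G).values()),
])

centered = features - features.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt.T                          # node scores on the principal components
print(scores[:5, :2])                             # first two components for the first five nodes
```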
Long-term (in)stability of the climate-streamflow relationship
NASA Astrophysics Data System (ADS)
Saft, Margarita; Peel, Murray; Coxon, Gemma; Freer, Jim; Parajka, Juraj; Woods, Ross
2017-04-01
Land use changes have long been known to alter streamflow production for a given climatic input. Recently, extended shifts in climate were also shown to be capable of altering catchment internal functioning and streamflow production for a given climatic input. This study investigates the stability of climate-streamflow relationships in natural catchments in different regions of the world for the first time, using datasets of natural/reference catchments from Europe, US, and Australia. Changes in climate-streamflow relationships are investigated statistically on the interannual to interdecadal timescale and related to interdecadal climate variability. We compare the frequency and magnitude of shifts in climate-streamflow relationship between different regions, and discuss what any differences in shift frequency and magnitude might be related to. This study draws attention to the issues of catchment vulnerability to changes in external factors, catchment-climate co-evolution, and long-term catchment memory.
Improved image reconstruction of low-resolution multichannel phase contrast angiography
P. Krishnan, Akshara; Joy, Ajin; Paul, Joseph Suresh
2016-01-01
In low-resolution phase contrast magnetic resonance angiography, the maximum intensity projected channel images will be blurred, with consequent loss of vascular detail. The channel images are enhanced using a stabilized deblurring filter, applied to each channel prior to combining the individual channel images. The stabilized deblurring is obtained by the addition of a nonlocal regularization term to the reverse heat equation, referred to as the nonlocally stabilized reverse diffusion filter. Unlike the reverse diffusion filter, which is highly unstable and amplifies noise, nonlocal stabilization enhances intensity projected parallel images uniformly. Application to multichannel vessel enhancement is illustrated using both volunteer data and simulated multichannel angiograms. Robustness of the filter applied to volunteer datasets is shown using statistically validated improvement in flow quantification. Improved performance in terms of preserving vascular structures and phased array reconstruction in both simulated and real data is demonstrated using the structureness measure and contrast ratio. PMID:26835501
Analysis of energy-based algorithms for RNA secondary structure prediction
2012-01-01
Background RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. Results We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms. Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). Conclusions Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. 
with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large data sets. PMID:22296803
Analysis of energy-based algorithms for RNA secondary structure prediction.
Hajiaghayi, Monir; Condon, Anne; Hoos, Holger H
2012-02-01
RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms. Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large data sets.
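As a small illustration of the bootstrap percentile method referred to above, the sketch below computes a 95 % interval for the mean F-measure over a benchmark set. The per-RNA F-measures are synthetic stand-ins; this shows the generic resampling idea, not the authors' benchmarking code.

```python
# Bootstrap percentile interval for the mean F-measure over a benchmark set (synthetic values).
# A sketch of the general resampling idea, not the authors' exact benchmarking code.
import numpy as np

rng = np.random.default_rng(0)
f_measures = rng.beta(7, 3, size=2000)             # stand-in per-RNA F-measures for ~2000 RNAs

boot_means = np.array([
    rng.choice(f_measures, size=f_measures.size, replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean F = {f_measures.mean():.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```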
Verdin, Kristine L.; Godt, Jonathan W.; Funk, Christopher C.; Pedreros, Diego; Worstell, Bruce; Verdin, James
2007-01-01
Landslides resulting from earthquakes can cause widespread loss of life and damage to critical infrastructure. The U.S. Geological Survey (USGS) has developed an alarm system, PAGER (Prompt Assessment of Global Earthquakes for Response), that aims to provide timely information to emergency relief organizations on the impact of earthquakes. Landslides are responsible for many of the damaging effects following large earthquakes in mountainous regions, and thus data defining the topographic relief and slope are critical to the PAGER system. A new global topographic dataset was developed to aid in rapidly estimating landslide potential following large earthquakes. We used the remotely-sensed elevation data collected as part of the Shuttle Radar Topography Mission (SRTM) to generate a slope dataset with nearly global coverage. Slopes from the SRTM data, computed at 3-arc-second resolution, were summarized at 30-arc-second resolution, along with statistics developed to describe the distribution of slope within each 30-arc-second pixel. Because there are many small areas lacking SRTM data and the northern limit of the SRTM mission was lat 60°N., statistical methods referencing other elevation data were used to fill the voids within the dataset and to extrapolate the data north of 60°N. The dataset will be used in the PAGER system to rapidly assess the susceptibility of areas to landsliding following large earthquakes.
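The aggregation step described above (summarizing 3-arc-second slopes into 30-arc-second statistics) can be sketched as follows. The array is synthetic, each coarse cell is assumed to cover a 10 x 10 block of fine cells, and the void-filling and extrapolation steps are not shown; this is not the USGS processing chain.

```python
# Sketch of summarizing fine-resolution slope into coarse-pixel statistics
# (each 30-arc-second cell aggregates an assumed 10x10 block of 3-arc-second slopes).
# Synthetic array; void filling and northern extrapolation are not shown.
import numpy as np

rng = np.random.default_rng(0)
slope_3as = rng.gamma(2.0, 5.0, size=(1200, 1200))         # slope in degrees, synthetic

blocks = slope_3as.reshape(120, 10, 120, 10).swapaxes(1, 2).reshape(120, 120, 100)
slope_mean = blocks.mean(axis=-1)
slope_max = blocks.max(axis=-1)
slope_p90 = np.percentile(blocks, 90, axis=-1)
print(slope_mean.shape, slope_max.shape, slope_p90.shape)   # each (120, 120)
```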
Liang, Li-Jung; Weiss, Robert E; Redelings, Benjamin; Suchard, Marc A
2009-10-01
Statistical analyses of phylogenetic data culminate in uncertain estimates of underlying model parameters. Lack of additional data hinders the ability to reduce this uncertainty, as the original phylogenetic dataset is often complete, containing the entire gene or genome information available for the given set of taxa. Informative priors in a Bayesian analysis can reduce posterior uncertainty; however, publicly available phylogenetic software specifies vague priors for model parameters by default. We build objective and informative priors using hierarchical random effect models that combine additional datasets whose parameters are not of direct interest but are similar to the analysis of interest. We propose principled statistical methods that permit more precise parameter estimates in phylogenetic analyses by creating informative priors for parameters of interest. Using additional sequence datasets from our lab or public databases, we construct a fully Bayesian semiparametric hierarchical model to combine datasets. A dynamic iteratively reweighted Markov chain Monte Carlo algorithm conveniently recycles posterior samples from the individual analyses. We demonstrate the value of our approach by examining the insertion-deletion (indel) process in the enolase gene across the Tree of Life using the phylogenetic software BALI-PHY; we incorporate prior information about indels from 82 curated alignments downloaded from the BAliBASE database.
Innovations in user-defined analysis: dynamic grouping and customized user datasets in VistaPHw.
Solet, David; Glusker, Ann; Laurent, Amy; Yu, Tianji
2006-01-01
Flexible, ready access to community health assessment data is a feature of innovative Web-based data query systems. An example is VistaPHw, which provides access to Washington state data and statistics used in community health assessment. Because of its flexible analysis options, VistaPHw customizes local, population-based results to be relevant to public health decision-making. The advantages of two innovations, dynamic grouping and the Custom Data Module, are described. Dynamic grouping permits the creation of user-defined aggregations of geographic areas, age groups, race categories, and years. Standard VistaPHw measures such as rates, confidence intervals, and other statistics may then be calculated for the new groups. Dynamic grouping has provided data for major, successful grant proposals, building partnerships with local governments and organizations, and informing program planning for community organizations. The Custom Data Module allows users to prepare virtually any dataset so it may be analyzed in VistaPHw. Uses for this module may include datasets too sensitive to be placed on a Web server or datasets that are not standardized across the state. Limitations and other system needs are also discussed.
NASA Astrophysics Data System (ADS)
Han, Keesook J.; Hodge, Matthew; Ross, Virginia W.
2011-06-01
For monitoring network traffic, there is an enormous cost in collecting, storing, and analyzing network traffic datasets. Data-mining-based network traffic analysis is of growing interest in the cyber security community, but is computationally expensive for finding correlations between attributes in massive network traffic datasets. To lower the cost and reduce computational complexity, it is desirable to perform feasible statistical processing on effective reduced datasets instead of on the original full datasets. Because of the dynamic behavior of network traffic, traffic traces exhibit mixtures of heavy tailed statistical distributions or overdispersion. Heavy tailed network traffic characterization and visualization are important and essential tasks in measuring network performance for Quality of Service. However, heavy tailed distributions are limited in their ability to characterize real-time network traffic due to the difficulty of parameter estimation. The Entropy-Based Heavy Tailed Distribution Transformation (EHTDT) was developed to convert the heavy tailed distribution into a transformed distribution in order to find a linear approximation. The EHTDT linearization has the advantage of being amenable to characterizing and aggregating overdispersion of network traffic in real time. Results of applying the EHTDT for innovative visual analytics to real network traffic data are presented.
SDCLIREF - A sub-daily gridded reference dataset
NASA Astrophysics Data System (ADS)
Wood, Raul R.; Willkofer, Florian; Schmid, Franz-Josef; Trentini, Fabian; Komischke, Holger; Ludwig, Ralf
2017-04-01
Climate change is expected to impact the intensity and frequency of hydrometeorological extreme events. In order to adequately capture and analyze extreme rainfall events, in particular when assessing flood and flash flood situations, data is required at high spatial and sub-daily resolution, which is often not available in sufficient density and over extended time periods. The ClimEx project (Climate Change and Hydrological Extreme Events) addresses the alteration of hydrological extreme events under climate change conditions. In order to differentiate between a clear climate change signal and the limits of natural variability, unique Single-Model Regional Climate Model Ensembles (CRCM5 driven by CanESM2, RCP8.5) were created for a European and a North-American domain, each comprising 50 members of 150 years (1951-2100). In combination with the CORDEX database, this newly created ClimEx ensemble is a one-of-a-kind model dataset to analyze changes of sub-daily extreme events. For the purpose of bias-correcting the regional climate model ensembles, as well as for the baseline calibration and validation of hydrological catchment models, a new sub-daily (3h) high-resolution (500m) gridded reference dataset (SDCLIREF) was created for a domain covering the Upper Danube and Main watersheds (~100,000 km²). As the sub-daily observations lack a continuous time series for the reference period 1980-2010, the need arose for a suitable method to bridge the gap of the discontinuous time series. The Method of Fragments (Sharma and Srikanthan (2006); Westra et al. (2012)) was applied to transform daily observations to sub-daily rainfall events, extending the time series and densifying the station network. Prior to applying the Method of Fragments and creating the gridded dataset using rigorous interpolation routines, observations operated by several institutions in three countries (Germany, Austria, Switzerland) were collected and subjected to quality control. Among others, the quality control checked for steps, extensive dry seasons, temporal consistency and maximum hourly values. The resulting SDCLIREF dataset provides a robust precipitation reference for hydrometeorological applications in unprecedented high spatio-temporal resolution. References: Sharma, A.; Srikanthan, S. (2006): Continuous Rainfall Simulation: A Nonparametric Alternative. In: 30th Hydrology and Water Resources Symposium 4-7 December 2006, Launceston, Tasmania. Westra, S.; Mehrotra, R.; Sharma, A.; Srikanthan, R. (2012): Continuous rainfall simulation. 1. A regionalized subdaily disaggregation approach. In: Water Resour. Res. 48 (1). DOI: 10.1029/2011WR010489.
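In a highly simplified form, the Method of Fragments cited above amounts to borrowing the within-day pattern ("fragments") of the historical day with the most similar daily total and rescaling it to the day being disaggregated. The sketch below assumes 3-hourly fragments and a simple nearest-neighbour donor choice; the regionalized nonparametric scheme of Westra et al. (2012) used for SDCLIREF is considerably more elaborate.

```python
# Highly simplified sketch of the Method of Fragments idea: borrow the within-day pattern
# of the historical day whose daily total is closest, and rescale it. Synthetic data;
# the regionalized nonparametric scheme of Westra et al. (2012) is far more elaborate.
import numpy as np

rng = np.random.default_rng(0)
hist_3h = rng.gamma(0.4, 2.0, size=(300, 8))        # 300 historical days x eight 3-hour totals (mm)
hist_daily = hist_3h.sum(axis=1)

def disaggregate(daily_total: float) -> np.ndarray:
    """Return eight 3-hourly values summing to daily_total."""
    donor = np.argmin(np.abs(hist_daily - daily_total))    # nearest-neighbour donor day
    fragments = hist_3h[donor] / max(hist_daily[donor], 1e-9)
    return daily_total * fragments

print(disaggregate(14.2))   # sub-daily pattern for a 14.2 mm day
```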
P-MartCancer–Interactive Online Software to Enable Analysis of Shotgun Cancer Proteomic Datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Webb-Robertson, Bobbie-Jo M.; Bramer, Lisa M.; Jensen, Jeffrey L.
P-MartCancer is a new interactive web-based software environment that enables biomedical and biological scientists to perform in-depth analyses of global proteomics data without requiring direct interaction with the data or with statistical software. P-MartCancer offers a series of statistical modules associated with quality assessment, peptide and protein statistics, protein quantification and exploratory data analyses, driven by the user via customized workflows and interactive visualization. Currently, P-MartCancer offers access to multiple cancer proteomic datasets generated through the Clinical Proteomics Tumor Analysis Consortium (CPTAC) at the peptide, gene and protein levels. P-MartCancer is deployed using Azure technologies (http://pmart.labworks.org/cptac.html), the web service is alternatively available via Docker Hub (https://hub.docker.com/r/pnnl/pmart-web/), and many statistical functions can be utilized directly from an R package available on GitHub (https://github.com/pmartR).
Quantile regression for the statistical analysis of immunological data with many non-detects.
Eilers, Paul H C; Röder, Esther; Savelkoul, Huub F J; van Wijk, Roy Gerth
2012-07-07
Immunological parameters are hard to measure. A well-known problem is the occurrence of values below the detection limit, the non-detects. Non-detects are a nuisance, because classical statistical analyses, like ANOVA and regression, cannot be applied. The more advanced statistical techniques currently available for the analysis of datasets with non-detects can only be used if a small percentage of the data are non-detects. Quantile regression, a generalization of percentiles to regression models, models the median or higher percentiles and tolerates very high numbers of non-detects. We present a non-technical introduction and illustrate it with an application to real data from a clinical trial. We show that by using quantile regression, groups can be compared and that meaningful linear trends can be computed, even if more than half of the data consists of non-detects. Quantile regression is a valuable addition to the statistical methods that can be used for the analysis of immunological datasets with non-detects.
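A minimal sketch of the approach follows, assuming the statsmodels quantile regression implementation and toy data: non-detects are set to the detection limit, and the median (or a higher percentile) is modelled, which is unaffected by the exact values assigned below the limit as long as the modelled quantile lies above the non-detect fraction.

```python
# Minimal sketch of median (quantile) regression with non-detects, using statsmodels.
# Values below the detection limit are set to the limit itself; the median fit is unaffected
# as long as the quantile being modelled lies above the non-detect fraction. Toy data only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, n)                         # e.g. treatment vs control
true = np.exp(1.0 + 0.8 * group + rng.normal(0, 1.0, n))
LOD = 2.0
y = np.where(true < LOD, LOD, true)                   # censor non-detects at the detection limit

df = pd.DataFrame({"y": y, "group": group})
fit = smf.quantreg("y ~ group", df).fit(q=0.5)        # median regression
print(fit.params, f"({(true < LOD).mean():.0%} non-detects)")
```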
Analysis of Parasite and Other Skewed Counts
Alexander, Neal
2012-01-01
Objective To review methods for the statistical analysis of parasite and other skewed count data. Methods Statistical methods for skewed count data are described and compared, with reference to those used over a ten year period of Tropical Medicine and International Health. Two parasitological datasets are used for illustration. Results Ninety papers were identified, 89 with descriptive and 60 with inferential analysis. A lack of clarity is noted in identifying measures of location, in particular the Williams and geometric mean. The different measures are compared, emphasizing the legitimacy of the arithmetic mean for skewed data. In the published papers, the t test and related methods were often used on untransformed data, which is likely to be invalid. Several approaches to inferential analysis are described, emphasizing 1) non-parametric methods, while noting that they are not simply comparisons of medians, and 2) generalized linear modelling, in particular with the negative binomial distribution. Additional methods, such as the bootstrap, with potential for greater use are described. Conclusions Clarity is recommended when describing transformations and measures of location. It is suggested that non-parametric methods and generalized linear models are likely to be sufficient for most analyses. PMID:22943299
NASA Astrophysics Data System (ADS)
Labzovskii, Lev D.; Papayannis, Alexandros; Binietoglou, Ioannis; Banks, Robert F.; Baldasano, Jose M.; Toanca, Florica; Tzanis, Chris G.; Christodoulakis, John
2018-02-01
Accurate continuous measurements of relative humidity (RH) vertical profiles in the lower troposphere have become a significant scientific challenge. In recent years a synergy of various ground-based remote sensing instruments has been successfully used for RH vertical profiling, which has resulted in the improvement of spatial resolution and, in some cases, of the accuracy of the measurement. Some studies have also suggested the use of high-resolution model simulations as input datasets into RH vertical profiling techniques. In this paper we apply two synergetic methods for RH profiling, namely the synergy of lidar with a microwave radiometer and with high-resolution atmospheric modeling. The two methods are employed for RH retrieval between 100 and 6000 m with increased spatial resolution, based on datasets from the HygrA-CD (Hygroscopic Aerosols to Cloud Droplets) campaign conducted in Athens, Greece from May to June 2014. RH profiles from the synergetic methods are then compared with those retrieved using single instruments or simulated by high-resolution models. Our proposed technique for RH profiling improves the statistical agreement with reference radiosoundings by 27 % when the lidar-radiometer approach is used (in comparison with radiometer measurements) and by 15 % when the lidar-model approach is used (in comparison with WRF-model simulations). Mean uncertainty of RH due to temperature bias in RH profiling was ~4.34 % for the lidar-radiometer and ~1.22 % for the lidar-model method. However, the maximum uncertainty in RH retrievals due to temperature bias showed that the lidar-model method is more reliable at heights greater than 2000 m. Overall, our results demonstrate the capability of both combined methods for daytime measurements at heights between 100 and 6000 m when lidar-radiometer or lidar-WRF combined datasets are available.
Inter-algorithm lesion volumetry comparison of real and 3D simulated lung lesions in CT
NASA Astrophysics Data System (ADS)
Robins, Marthony; Solomon, Justin; Hoye, Jocelyn; Smith, Taylor; Ebner, Lukas; Samei, Ehsan
2017-03-01
The purpose of this study was to establish volumetric exchangeability between real and computational lung lesions in CT. We compared the overall relative volume estimation performance of segmentation tools when used to measure real lesions in actual patient CT images and computational lesions virtually inserted into the same patient images (i.e., hybrid datasets). Pathologically confirmed malignancies from 30 thoracic patient cases from the Reference Image Database to Evaluate Therapy Response (RIDER) were modeled and used as the basis for the comparison. Lesions included isolated nodules as well as those attached to the pleura or other lung structures. Patient images were acquired using a 16- or 64-detector-row CT scanner (Lightspeed 16 or VCT; GE Healthcare). Scans were acquired using standard chest protocols during a single breath-hold. Virtual 3D lesion models based on real lesions were developed in Duke Lesion Tool (Duke University), and inserted using a validated image-domain insertion program. Nodule volumes were estimated using multiple commercial segmentation tools (iNtuition, TeraRecon, Inc.; Syngo.via, Siemens Healthcare; and IntelliSpace, Philips Healthcare). Consensus-based volume comparison showed consistent trends in volume measurement between real and virtual lesions across all software. The average percent bias (+/- standard error) was -9.2+/-3.2% for real lesions versus -6.7+/-1.2% for virtual lesions with tool A, 3.9+/-2.5% and 5.0+/-0.9% for tool B, and 5.3+/-2.3% and 1.8+/-0.8% for tool C, respectively. Virtual lesion volumes were statistically similar to those of real lesions (< 4% difference) with p > .05 in most cases. Results suggest that hybrid datasets had similar inter-algorithm variability compared to real datasets.
NASA Astrophysics Data System (ADS)
Szekely, Tanguy; Killick, Rachel; Gourrion, Jerome; Reverdin, Gilles
2017-04-01
CORA and EN4 are both global, delayed-mode, validated in-situ ocean temperature and salinity datasets distributed by the Met Office (http://www.metoffice.gov.uk/) and Copernicus (www.marine.copernicus.eu). A large part of the profiles distributed by CORA and EN4 in recent years are Argo profiles from the Argo DAC, but profiles are also extracted from the World Ocean Database, and TESAC profiles from GTSPP. In the case of CORA, data coming from the EUROGOOS Regional Operational Observing Systems (ROOS) operated by European institutes not managed by National Data Centres, as well as other profile datasets provided by scientific sources (sea mammal profiles from MEOP, XBT datasets from cruises, ...), can also be found. (EN4 also takes data from the ASBO dataset to supplement observations in the Arctic.) The first advantage of this new merged product is to enhance the space and time coverage at global and European scales for the period from 1950 until a year before the current year. This product is updated once a year, and T&S gridded fields are also generated for the period from 1990 to year n-1. The enhancement compared to the previous CORA product will be presented. Although the profiles distributed by both datasets are mostly the same, the quality control procedures developed by the Met Office and Copernicus teams differ, sometimes leading to different quality control flags for the same profile. In 2016 a new study started that aims to compare both validation procedures and move towards a Copernicus Marine Service dataset with the best features of CORA and EN4 validation. A reference dataset composed of the full set of in-situ temperature and salinity measurements collected by Coriolis during 2015 is used. These measurements were made with a wide range of instruments (XBTs, CTDs, Argo floats, instrumented sea mammals, ...) covering the global ocean. The reference dataset has been validated simultaneously by both teams. An exhaustive comparison of the validation test results is now performed to find the best features of both datasets. The study shows the differences between the EN4 and CORA validation results. It highlights the complementarity between the EN4 and CORA higher-order tests. The design of the CORA and EN4 validation charts is discussed to understand how a different approach to the dataset scope can lead to differences in data validation. The new validation chart of the Copernicus Marine Service dataset is presented.
2013-06-01
[Fragment of a report abstract] ...benefitting from rapid, automated discrimination of specific predefined signals, and is free-standing (requiring no other plugins or packages). The ... previously labeled dataset, and comparing two labeled datasets. Subject terms: artifact, signal detection, EEG, MATLAB, toolbox.
Jeffrey T. Walton
2008-01-01
Two datasets of percent urban tree canopy cover were compared. The first dataset was based on a 1991 AVHRR forest density map. The second was the US Geological Survey's National Land Cover Database (NLCD) 2001 sub-pixel tree canopy. A comparison of these two tree canopy layers was conducted in 36 census designated places of western New York State. Reference data...
Ruane, Sara; Raxworthy, Christopher J; Lemmon, Alan R; Lemmon, Emily Moriarty; Burbrink, Frank T
2015-10-12
Using molecular data generated by high throughput next generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86 % of Madagascar's serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species tree approaches. We also examined the individual gene trees in comparison to the 377-locus species tree using the program MetaTree. Using the full anchored dataset under a variety of methods gave us the same, well-supported phylogeny for pseudoxyrhophiines. The African pseudoxyrhophiine Duberria is the sister taxon to the Malagasy pseudoxyrhophiine genera, providing evidence for a monophyletic radiation in Madagascar. In addition, within Madagascar, the two major clades inferred correspond largely to the aglyphous and opisthoglyphous genera, suggesting that feeding specializations associated with tooth venom delivery may have played a major role in the early diversification of this radiation. The comparison of tree topologies from the concatenated and species-tree methods using different datasets indicated that the 5-locus dataset cannot be used to infer a correct phylogeny for the pseudoxyrhophiines under any method tested here and that summary statistics methods require 50 or more loci to consistently recover the species tree inferred using the complete anchored dataset. However, as few as 15 loci may infer the correct topology when using the full coalescent species tree method *BEAST. MetaTree analyses of each gene tree from the Sanger and anchored datasets found that none of the individual gene trees matched the 377-locus species tree, and that no gene trees were identical with respect to topology.
Our results suggest that ≥50 loci may be necessary to confidently infer phylogenies when using summary species-tree methods, but that the coalescent-based method *BEAST consistently recovers the same topology using only 15 loci. These results reinforce that datasets with small numbers of markers may result in misleading topologies, and further, that the method of inference used to generate a phylogeny also has a major influence on the number of loci necessary to infer robust species trees.
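A minimal sketch of the topology-comparison step (Robinson-Foulds distances between trees inferred from different datasets) is shown below, using the DendroPy library on toy Newick trees; the trees and taxon labels are hypothetical and only illustrate the kind of comparison described.

    # Illustrative sketch (toy trees, not the study's data): comparing topologies
    # from different datasets with the Robinson-Foulds (symmetric difference)
    # distance via DendroPy.
    import dendropy
    from dendropy.calculate import treecompare

    tns = dendropy.TaxonNamespace()
    t_full = dendropy.Tree.get(data="((A,B),(C,(D,E)));", schema="newick",
                               taxon_namespace=tns)    # e.g. tree from the full locus set
    t_small = dendropy.Tree.get(data="((A,(B,C)),(D,E));", schema="newick",
                                taxon_namespace=tns)   # e.g. tree from a reduced locus set

    t_full.encode_bipartitions()
    t_small.encode_bipartitions()
    print(treecompare.symmetric_difference(t_full, t_small))   # RF distance (0 = identical)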
Pujar, Shashikant; O’Leary, Nuala A; Farrell, Catherine M; Mudge, Jonathan M; Wallin, Craig; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bult, Carol J; Frankish, Adam; Pruitt, Kim D
2018-01-01
Abstract The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. PMID:29126148
SparRec: An effective matrix completion framework of missing data imputation for GWAS
NASA Astrophysics Data System (ADS)
Jiang, Bo; Ma, Shiqian; Causey, Jason; Qiao, Linbo; Hardin, Matthew Price; Bitts, Ian; Johnson, Daniel; Zhang, Shuzhong; Huang, Xiuzhen
2016-10-01
Genome-wide association studies present computational challenges for missing data imputation, while advances in genotyping technologies are generating datasets of large sample sizes with sample sets genotyped on multiple SNP chips. We present a new framework SparRec (Sparse Recovery) for imputation, with the following properties: (1) The optimization models of SparRec, based on low rank and a low number of co-clusters of matrices, are different from current statistical methods. While our low-rank matrix completion (LRMC) model is similar to Mendel-Impute, our matrix co-clustering factorization (MCCF) model is completely new. (2) SparRec, like other matrix completion methods, can be flexibly applied to missing data imputation for large meta-analyses with different cohorts genotyped on different sets of SNPs, even when there is no reference panel. This kind of meta-analysis is very challenging for current statistics-based methods. (3) SparRec has consistent performance and achieves high recovery accuracy even when the missing data rate is as high as 90%. Compared with Mendel-Impute, our low-rank based method achieves similar accuracy and efficiency, while the co-clustering based method has advantages in running time. The testing results show that SparRec has significant advantages and competitive performance over other state-of-the-art existing statistical methods including Beagle and fastPhase.
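For orientation, the sketch below shows a generic low-rank matrix-completion iteration of the SoftImpute type in NumPy; it is not the SparRec LRMC or MCCF code, and the toy genotype matrix and rank are invented for illustration.

    # Generic low-rank matrix-completion sketch (not SparRec): missing genotype
    # entries are filled by iteratively truncating the SVD of the completed matrix.
    import numpy as np

    def low_rank_impute(X, rank=10, n_iter=100):
        """X: 2-D array with np.nan marking missing entries."""
        mask = ~np.isnan(X)
        filled = np.where(mask, X, np.nanmean(X))     # start from the global mean
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(filled, full_matrices=False)
            s[rank:] = 0.0                             # keep only the leading singular values
            low_rank = (U * s) @ Vt
            filled = np.where(mask, X, low_rank)       # keep observed entries fixed
        return filled

    rng = np.random.default_rng(2)
    block = rng.integers(0, 3, size=(10, 40)).astype(float)
    true = np.repeat(block, 10, axis=0)                # toy 0/1/2 matrix with low-rank structure
    X = true.copy()
    X[rng.random(X.shape) < 0.3] = np.nan              # 30% missing genotypes
    imputed = low_rank_impute(X, rank=10)
    print(np.mean(np.round(imputed[np.isnan(X)]) == true[np.isnan(X)]))   # recovery accuracy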
NASA Astrophysics Data System (ADS)
Rosner, A.; Letcher, B. H.; Vogel, R. M.
2014-12-01
Predicting streamflow in headwaters and over a broad spatial scale poses unique challenges due to limited data availability. Flow observation gages for headwater streams are less common than for larger rivers, and gages with record lengths of ten years or more are even more scarce. Thus, there is a great need for estimating streamflows in ungaged or sparsely gaged headwaters. Further, there is often insufficient basin information to develop rainfall-runoff models that could be used to predict future flows under various climate scenarios. Headwaters in the northeastern U.S. are of particular concern to aquatic biologists, as these streams serve as essential habitat for native coldwater fish. In order to understand fish response to past or future environmental drivers, estimates of seasonal streamflow are needed. While there is limited flow data, there is a wealth of data for historic weather conditions. Observed data have been modeled to interpolate a spatially continuous historic weather dataset (Maurer et al. 2002). We present a statistical model developed by pairing streamflow observations with precipitation and temperature information for the same and preceding time-steps. We demonstrate this model's use to predict flow metrics at the seasonal time-step. While not a physical model, this statistical model represents the weather drivers. Since this model can predict flows not directly tied to reference gages, we can generate flow estimates for historic as well as potential future conditions.
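A minimal sketch of such a statistical flow model, regressing a seasonal flow metric on same-season and preceding-season precipitation and temperature, is given below with synthetic data; the predictors, coefficients and flow series are hypothetical and only illustrate the pairing of flows with lagged weather drivers.

    # Illustrative sketch (synthetic data, not the authors' dataset): a seasonal flow
    # metric regressed on current-season and previous-season weather.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(3)
    n_seasons = 120
    precip = rng.gamma(4.0, 25.0, n_seasons)          # seasonal precipitation totals
    temp = rng.normal(10.0, 8.0, n_seasons)           # seasonal mean temperatures
    # toy "observed" mean seasonal flow responding to current and previous-season weather
    flow = 0.6 * precip + 0.3 * np.roll(precip, 1) - 2.0 * temp + rng.normal(0, 10, n_seasons)

    # Predictors: same-season and preceding-season precipitation and temperature
    X = np.column_stack([precip, np.roll(precip, 1), temp, np.roll(temp, 1)])[1:]
    y = flow[1:]
    model = LinearRegression().fit(X, y)
    print(model.coef_, model.score(X, y))             # fitted weather-driver coefficients, R^2

Because the predictors come from a gridded weather dataset rather than from a reference gage, a fitted model of this form can produce flow estimates for ungaged sites and for alternative climate inputs.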
NASA Astrophysics Data System (ADS)
Faqih, A.
2017-03-01
Providing information regarding future climate scenarios is very important in climate change studies. Climate scenarios can be used as basic information to support adaptation and mitigation studies. In order to deliver future climate scenarios over a specific region, baseline and projection data from the outputs of global climate models (GCMs) are needed. However, due to their coarse resolution, the data have to be downscaled and bias corrected in order to obtain scenario data with better spatial resolution that match the characteristics of the observed data. Generating this downscaled data is often difficult for scientists who do not have a specific background, experience and skill in dealing with the complex data from the GCM outputs. In this regard, it is necessary to develop a tool that can be used to simplify the downscaling process in order to help scientists, especially in Indonesia, generate future climate scenario data that can be used for their climate change-related studies. In this paper, we introduce a tool called “Statistical Bias Correction for Climate Scenarios (SiBiaS)”. The tool is specially designed to facilitate the use of CMIP5 GCM data outputs and process their statistical bias corrections relative to reference data from observations. It is prepared for supporting capacity building in climate modeling in Indonesia as part of the Indonesia 3rd National Communication (TNC) project activities.
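The sketch below illustrates one common form of statistical bias correction, empirical quantile mapping of model output onto an observed reference distribution; it is offered only as a generic example of what a tool like SiBiaS automates, with synthetic rainfall samples, and is not SiBiaS code.

    # Minimal empirical quantile-mapping sketch (generic bias correction, not SiBiaS).
    import numpy as np

    def quantile_map(gcm_hist, obs_ref, gcm_future):
        """Map model values onto the observed distribution via empirical CDFs."""
        quantiles = np.linspace(0.01, 0.99, 99)
        gcm_q = np.quantile(gcm_hist, quantiles)      # model baseline quantiles
        obs_q = np.quantile(obs_ref, quantiles)       # observed reference quantiles
        # For each projected value: locate it in the model baseline distribution,
        # then read off the corresponding observed value.
        return np.interp(gcm_future, gcm_q, obs_q)

    rng = np.random.default_rng(4)
    obs = rng.gamma(2.0, 5.0, 3000)                   # observed daily rainfall (reference)
    gcm_hist = rng.gamma(2.0, 7.0, 3000)              # biased GCM baseline
    gcm_future = rng.gamma(2.0, 7.5, 3000)            # GCM projection
    corrected = quantile_map(gcm_hist, obs, gcm_future)
    print(obs.mean(), gcm_future.mean(), corrected.mean())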
Liu, Junting; Wang, Liang; Sun, Jinghui; Liu, Gongshu; Yan, Weili; Xi, Bo; Xiong, Feng; Ding, Wenqing; Huang, Guimin; Heymsfield, Steven; Mi, Jie
2017-05-29
No nationwide paediatric reference standards for bone mineral density (BMD) are available in China. We aimed to provide sex-specific BMD reference values for Chinese children and adolescents (3-18 years). Data (10 818 participants aged 3-18 years) were obtained from cross-sectional surveys of the China Child and Adolescent Cardiovascular Health study in 2015, which included four municipality cities and three provinces. BMD was measured using a Hologic Discovery Dual Energy X-ray Absorptiometry (DXA) scanner. The DXA measures were modelled against age, with height as an independent variable. The LMS statistical method using a curve fitting procedure was used to construct reference smooth cross-sectional centile curves for dependent versus independent variables. Children residing in Northeast China had the highest total body less head (TBLH) BMD while children residing in Shandong Province had the lowest values. Among children, TBLH BMD was higher for boys as compared with girls, but it increased with age and height in both sexes. Furthermore, TBLH BMD was higher among US children as compared with Chinese children. There was a large difference in BMD for height among children from these two countries. US children had a much higher BMD at each percentile (P) than Chinese children; the largest observed difference was at P50 and P3 and the smallest difference was at P97. This is the first study to present a sex-specific reference dataset for Chinese children aged 3-18 years. The data can help clinicians improve interpretation, assessment and monitoring of densitometry results. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
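For readers unfamiliar with the LMS method mentioned above, the sketch below converts a measurement into a z-score from the L (skewness), M (median) and S (coefficient of variation) parameters of a reference curve; the numeric L, M, S values are invented placeholders, not values from the Chinese reference dataset.

    # Cole's LMS transformation: z = ((y/M)**L - 1) / (L*S), or ln(y/M)/S when L == 0.
    import numpy as np

    def lms_zscore(y, L, M, S):
        """Z-score of measurement y against reference parameters L, M, S."""
        if np.isclose(L, 0.0):
            return np.log(y / M) / S
        return ((y / M) ** L - 1.0) / (L * S)

    # hypothetical reference values for one age group (not from the study)
    L, M, S = -1.2, 0.65, 0.11          # skewness, median TBLH BMD (g/cm2), coef. of variation
    print(lms_zscore(0.72, L, M, S))    # z-score of an observed BMD of 0.72 g/cm2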
1000 Norms Project: protocol of a cross-sectional study cataloging human variation.
McKay, Marnee J; Baldwin, Jennifer N; Ferreira, Paulo; Simic, Milena; Vanicek, Natalie; Hiller, Claire E; Nightingale, Elizabeth J; Moloney, Niamh A; Quinlan, Kate G; Pourkazemi, Fereshteh; Sman, Amy D; Nicholson, Leslie L; Mousavi, Seyed J; Rose, Kristy; Raymond, Jacqueline; Mackey, Martin G; Chard, Angus; Hübscher, Markus; Wegener, Caleb; Fong Yan, Alycia; Refshauge, Kathryn M; Burns, Joshua
2016-03-01
Clinical decision-making regarding diagnosis and management largely depends on comparison with healthy or 'normal' values. Physiotherapists and researchers therefore need access to robust patient-centred outcome measures and appropriate reference values. However there is a lack of high-quality reference data for many clinical measures. The aim of the 1000 Norms Project is to generate a freely accessible database of musculoskeletal and neurological reference values representative of the healthy population across the lifespan. In 2012 the 1000 Norms Project Consortium defined the concept of 'normal', established a sampling strategy and selected measures based on clinical significance, psychometric properties and the need for reference data. Musculoskeletal and neurological items tapping the constructs of dexterity, balance, ambulation, joint range of motion, strength and power, endurance and motor planning will be collected in this cross-sectional study. Standardised questionnaires will evaluate quality of life, physical activity, and musculoskeletal health. Saliva DNA will be analysed for the ACTN3 genotype ('gene for speed'). A volunteer cohort of 1000 participants aged 3 to 100 years will be recruited according to a set of self-reported health criteria. Descriptive statistics will be generated, creating tables of mean values and standard deviations stratified for age and gender. Quantile regression equations will be used to generate age charts and age-specific centile values. This project will be a powerful resource to assist physiotherapists and clinicians across all areas of healthcare to diagnose pathology, track disease progression and evaluate treatment response. This reference dataset will also contribute to the development of robust patient-centred clinical trial outcome measures. Copyright © 2015 Chartered Society of Physiotherapy. Published by Elsevier Ltd. All rights reserved.
Felyx : A Free Open Software Solution for the Analysis of Large Earth Observation Datasets
NASA Astrophysics Data System (ADS)
Piolle, Jean-Francois; Shutler, Jamie; Poulter, David; Guidetti, Veronica; Donlon, Craig
2014-05-01
The GHRSST project, by assembling large collections of earth observation data from various sources and agencies, has also raised the need for providing the user community with tools to inter-compare them, assess and monitor their quality. The ESA/Medspiration project, which implemented the first operating node of the GHRSST system for Europe, also paved the way successfully towards such generic analytics tools by developing the High Resolution Diagnostic Dataset System (HR-DDS) and Satellite to In situ Multi-sensor Match-up Databases. Building on this heritage, ESA is now funding the development by IFREMER, PML and Pelamis of felyx, a web tool merging the two capabilities into a single software solution. It will consist of a free, open software solution, written in Python and JavaScript, whose aim is to provide Earth Observation data producers and users with an open-source, flexible and reusable tool to allow the quality and performance of data streams (satellite, in situ and model) to be easily monitored and studied. The primary concept of Felyx is to work as an extraction tool, subsetting source data over predefined target areas (which can be static or moving): these data subsets, and associated metrics, can then be accessed by users or client applications either as raw files, automatic alerts and reports generated periodically, or through a flexible web interface enabling statistical analysis and visualization. Felyx presents itself as an open-source suite of tools, written in Python and JavaScript, enabling:
* subsetting large local or remote collections of Earth Observation data over predefined sites (geographical boxes) or moving targets (ship, buoy, hurricane), storing locally the extracted data (referred to as miniProds). These miniProds constitute a much smaller representative subset of the original collection on which one can perform any kind of processing or assessment without having to cope with heavy volumes of data.
* computing statistical metrics over these miniProds using, for instance, a set of usual statistical operators (mean, median, rms, ...), fully extensible and applicable to any variable of a dataset. These metrics are stored in a fast search engine, queryable by humans and automated applications.
* reporting or alerting, based on user-defined inference rules, through various media (emails, twitter feeds, ...) and devices (phones, tablets).
* analysing miniProds and metrics through a web interface that allows users to dig into this base of information and extract useful knowledge through multidimensional interactive display functions (time series, scatterplots, histograms, maps).
The services provided by felyx will be generic, deployable at users' own premises and adaptable enough to integrate any kind of parameters. Users will be able to operate their own felyx instance at any location, on datasets and parameters of their own interest, and the various instances will be able to interact with each other, creating a web of felyx systems enabling aggregation and cross comparison of miniProds and metrics from multiple sources. Initially two instances will be operated simultaneously during a 6-month demonstration phase, at IFREMER - on sea surface temperature (for the GHRSST community) and ocean wave datasets - and at PML - on ocean colour. We will present results from the Felyx project, demonstrate how the GHRSST community can exploit Felyx and demonstrate how the wider community can make use of the GHRSST data within Felyx.
Yilmaz, E; Kayikcioglu, T; Kayipmaz, S
2017-07-01
In this article, we propose a decision support system for effective classification of dental periapical cyst and keratocystic odontogenic tumor (KCOT) lesions obtained via cone beam computed tomography (CBCT). CBCT has been effectively used in recent years for diagnosing dental pathologies and determining their boundaries and content. Unlike other imaging techniques, CBCT provides detailed and distinctive information about the pathologies by enabling a three-dimensional (3D) image of the region to be displayed. We employed 50 CBCT 3D image dataset files as the full dataset of our study. These datasets were identified by experts as periapical cyst and KCOT lesions according to the clinical, radiographic and histopathologic features. Segmentation operations were performed on the CBCT images using viewer software that we developed. Using the tools of this software, we marked the lesional volume of interest and calculated and applied the order statistics and 3D gray-level co-occurrence matrix for each CBCT dataset. A feature vector of the lesional region, including 636 different feature items, was created from those statistics. Six classifiers were used for the classification experiments. The Support Vector Machine (SVM) classifier achieved the best classification performance, with 100% accuracy and a 100% F-score (F1), in the experiments in which a ten-fold cross-validation method was used with a forward feature selection algorithm. SVM achieved the best classification performance, with 96.00% accuracy and a 96.00% F1 score, in the experiments in which a split-sample validation method was used with a forward feature selection algorithm. SVM additionally achieved the best performance, with 94.00% accuracy and a 93.88% F1 score, in the experiments in which a leave-one-out cross-validation (LOOCV) method was used with a forward feature selection algorithm. Based on the results, we determined that periapical cyst and KCOT lesions can be classified with a high accuracy with the models that we built using the new dataset selected for this study. The studies mentioned in this article, along with the selected 3D dataset, 3D statistics calculated from the dataset, and performance results of the different classifiers, comprise an important contribution to the field of computer-aided diagnosis of dental apical lesions. Copyright © 2017 Elsevier B.V. All rights reserved.
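The evaluation pattern described (forward feature selection feeding an SVM, scored by cross-validation) can be sketched generically with scikit-learn as below; the synthetic features stand in for the 636 lesion statistics, and this is not the authors' software or data.

    # Generic sketch of forward feature selection + SVM with 10-fold cross-validation
    # on synthetic texture-like features (not the CBCT dataset).
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # 50 "lesions" x 60 statistical features, two classes (cyst vs KCOT stand-in)
    X, y = make_classification(n_samples=50, n_features=60, n_informative=8,
                               random_state=0)

    svm = SVC(kernel="rbf", C=1.0)
    selector = SequentialFeatureSelector(svm, n_features_to_select=8,
                                         direction="forward", cv=5)
    pipe = make_pipeline(StandardScaler(), selector, svm)
    scores = cross_val_score(pipe, X, y, cv=10)        # 10-fold cross-validation accuracy
    print(scores.mean(), scores.std())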
Embling, Laura A K; Zagami, Debbie; Sriram, Krishna Bajee; Gordon, Robert J; Sivakumaran, Pathmanathan
2016-12-01
The categorisation of lung disease into obstructive ventilatory defect (OVD) and tendency to a restrictive ventilatory defect (TRVD) patterns using spirometry is used to guide both prognostication and treatment. The effectiveness of categorisation depends upon having reference ranges that accurately represent the population they describe. The Global Lung Initiative 2012 (GLI 2012) has spirometry reference ranges drawn from the largest sample size to date. This study aimed to determine whether using spirometry reference ranges from the new GLI 2012 dataset, compared to the previously used National Health and Nutritional Examination Survey III (NHANES III) dataset, resulted in a change in diagnosis between OVD, TRVD and normal ventilatory pattern (NVP). Spirometry data were collected from 301 patients, aged 18-80 years, undergoing investigation at the Gold Coast Hospital and Health Service (GCHHS) throughout February and March 2014. OVD was defined as a forced expiratory volume in 1 second (FEV 1 ) divided by forced vital capacity (FVC) less than lower limit of normal (LLN). TRVD was defined as FEV 1 /FVC ≥ LLN, FEV 1 < LLN, and FVC < LLN. The LLN values were determined by equations from the GLI and NHANES datasets. Spirometry interpreted using the NHANES III equations showed: 102 individuals (33.9%) with normal spirometry, 136 (45.2%) with an OVD pattern, 52 (17.3%) with a TRVD pattern, and 11 (3.7%) with a mixed pattern. When the spirometry data were interpreted using the GLI 2012 equations 2 (0.7%) individuals changed from OVD to NVP, 2 (0.7%) changed from NVP to OVD and 14 (4.7%) changed from TRVD to NVP. Using the GLI 2012 reference range resulted in a change in diagnosis of lung disease in 5.9% of the individuals included in this study. This variance in diagnosis when changing reference ranges should be taken into account by clinicians as it may affect patient management.
Statistical Exploration of Electronic Structure of Molecules from Quantum Monte-Carlo Simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prabhat, Mr; Zubarev, Dmitry; Lester, Jr., William A.
In this report, we present results from analysis of Quantum Monte Carlo (QMC) simulation data with the goal of determining internal structure of a 3N-dimensional phase space of an N-electron molecule. We are interested in mining the simulation data for patterns that might be indicative of the bond rearrangement as molecules change electronic states. We examined simulation output that tracks the positions of two coupled electrons in the singlet and triplet states of an H2 molecule. The electrons trace out a trajectory, which was analyzed with a number of statistical techniques. This project was intended to address the following scientific questions: (1) Do high-dimensional phase spaces characterizing electronic structure of molecules tend to cluster in any natural way? Do we see a change in clustering patterns as we explore different electronic states of the same molecule? (2) Since it is hard to understand the high-dimensional space of trajectories, can we project these trajectories to a lower dimensional subspace to gain a better understanding of patterns? (3) Do trajectories inherently lie in a lower-dimensional manifold? Can we recover that manifold? After extensive statistical analysis, we are now in a better position to respond to these questions. (1) We definitely see clustering patterns, and differences between the H2 and H2tri datasets. These are revealed by the pamk method in a fairly reliable manner and can potentially be used to distinguish bonded and non-bonded systems and get insight into the nature of bonding. (2) Projecting to a lower dimensional subspace (~4-5) using PCA or Kernel PCA reveals interesting patterns in the distribution of scalar values, which can be related to the existing descriptors of electronic structure of molecules. Also, these results can be immediately used to develop robust tools for analysis of noisy data obtained during QMC simulations. (3) All dimensionality reduction and estimation techniques that we tried seem to indicate that one needs 4 or 5 components to account for most of the variance in the data, hence this 5D dataset does not necessarily lie on a well-defined, low dimensional manifold. In terms of specific clustering techniques, K-means was generally useful in exploring the dataset. The partition around medoids (pam) technique produced the most definitive results for our data showing distinctive patterns for both a sample of the complete data and time-series. The gap statistic with the Tibshirani criterion did not provide any distinction across the 2 datasets. The gap statistic w/ DandF criteria, model-based clustering and hierarchical modeling simply failed to run on our datasets. Thankfully, the vanilla PCA technique was successful in handling our entire dataset. PCA revealed some interesting patterns for the scalar value distribution. Kernel PCA techniques (vanilladot, RBF, Polynomial) and MDS failed to run on the entire dataset, or even a significant fraction of the dataset, and we resorted to creating an explicit feature map followed by conventional PCA. Clustering using K-means and PAM in the new basis set seems to produce promising results. Understanding the new basis set in the scientific context of the problem is challenging, and we are currently working to further examine and interpret the results.
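The PCA-plus-clustering part of such an exploration can be sketched with scikit-learn as below; the synthetic 6-D point cloud merely stands in for the electron-trajectory samples, and the R-based tools actually used in the report (pamk, gap statistic, kernel PCA variants) are not reproduced.

    # Small sketch of exploratory dimensionality reduction and clustering
    # (not the report's data or its R tooling).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(5)
    # toy 6-D "trajectory" samples drawn from two overlapping clouds
    a = rng.normal(0.0, 1.0, size=(500, 6))
    b = rng.normal(1.5, 1.0, size=(500, 6))
    X = np.vstack([a, b])

    pca = PCA(n_components=5)
    Z = pca.fit_transform(X)
    print(pca.explained_variance_ratio_.cumsum())      # how many components carry the variance

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
    print(np.bincount(labels))                          # cluster sizes in the reduced space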
DOE Office of Scientific and Technical Information (OSTI.GOV)
al-Saffar, Sinan; Joslyn, Cliff A.; Chappell, Alan R.
As semantic datasets grow to be very large and divergent, there is a need to identify and exploit their inherent semantic structure for discovery and optimization. Towards that end, we present here a novel methodology to identify the semantic structures inherent in an arbitrary semantic graph dataset. We first present the concept of an extant ontology as a statistical description of the semantic relations present amongst the typed entities modeled in the graph. This serves as a model of the underlying semantic structure to aid in discovery and visualization. We then describe a method of ontological scaling in which the ontology is employed as a hierarchical scaling filter to infer different resolution levels at which the graph structures are to be viewed or analyzed. We illustrate these methods on three large and publicly available semantic datasets containing more than one billion edges each. Keywords: Semantic Web; Visualization; Ontology; Multi-resolution Data Mining.
Integrative Analysis of “-Omics” Data Using Penalty Functions
Zhao, Qing; Shi, Xingjie; Huang, Jian; Liu, Jin; Li, Yang; Ma, Shuangge
2014-01-01
In the analysis of omics data, integrative analysis provides an effective way of pooling information across multiple datasets or multiple correlated responses, and can be more effective than single-dataset (response) analysis. Multiple families of integrative analysis methods have been proposed in the literature. The current review focuses on the penalization methods. Special attention is paid to sparse meta-analysis methods that pool summary statistics across datasets, and integrative analysis methods that pool raw data across datasets. We discuss their formulation and rationale. Beyond “standard” penalized selection, we also review contrasted penalization and Laplacian penalization which accommodate finer data structures. The computational aspects, including computational algorithms and tuning parameter selection, are examined. This review concludes with possible limitations and extensions. PMID:25691921
Differential privacy based on importance weighting
Ji, Zhanglong
2014-01-01
This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights are importance sampling weights, but to protect privacy, they are regularized and have noise added. The weights allow statistical queries to be answered approximately while provably guaranteeing differential privacy. We derive an expression for the asymptotic variance of the approximate answers. Experiments show that the new mechanism performs well even when the privacy budget is small, and when the public and private datasets are drawn from different populations. PMID:24482559
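A highly simplified sketch of the general idea, weighting a public dataset toward a private one and adding noise before releasing a weighted query answer, is given below; the classifier-based weight estimate, the clipping bound and the noise scale are illustrative assumptions and do not reproduce the paper's mechanism or its privacy accounting.

    # Crude illustration only: importance weights make public data resemble private
    # data, and noise is added to the weighted query answer before release.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(6)
    private = rng.normal(1.0, 1.0, size=(500, 2))       # sensitive data (never released)
    public = rng.normal(0.0, 1.5, size=(2000, 2))       # public or synthetic data

    # density-ratio estimate via a classifier separating public (0) from private (1)
    X = np.vstack([public, private])
    z = np.concatenate([np.zeros(len(public)), np.ones(len(private))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    p = clf.predict_proba(public)[:, 1]
    w = np.clip(p / (1 - p), 0.0, 5.0)                  # regularized (clipped) importance weights

    epsilon = 1.0                                       # privacy budget (illustrative)
    query = np.average(public[:, 0], weights=w)         # weighted answer to "mean of feature 0"
    noisy = query + rng.laplace(scale=5.0 / (epsilon * w.sum()))   # crude Laplace noise
    print(query, noisy, private[:, 0].mean())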
Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset
NASA Technical Reports Server (NTRS)
Ramasso, Emannuel; Saxena, Abhinav
2014-01-01
Benchmarking of prognostic algorithms has been challenging due to limited availability of common datasets suitable for prognostics. In an attempt to alleviate this problem several benchmarking datasets have been collected by NASA's prognostic center of excellence and made available to the Prognostics and Health Management (PHM) community to allow evaluation and comparison of prognostics algorithms. Among those datasets are five C-MAPSS datasets that have been extremely popular due to their unique characteristics making them suitable for prognostics. The C-MAPSS datasets pose several challenges that have been tackled by different methods in the PHM literature. In particular, management of high variability due to sensor noise, effects of operating conditions, and presence of multiple simultaneous fault modes are some factors that have great impact on the generalization capabilities of prognostics algorithms. More than 70 publications have used the C-MAPSS datasets for developing data-driven prognostic algorithms. The C-MAPSS datasets are also shown to be well-suited for development of new machine learning and pattern recognition tools for several key preprocessing steps such as feature extraction and selection, failure mode assessment, operating conditions assessment, health status estimation, uncertainty management, and prognostics performance evaluation. This paper summarizes a comprehensive literature review of publications using C-MAPSS datasets and provides guidelines and references to further usage of these datasets in a manner that allows clear and consistent comparison between different approaches.
The Importance of Variance in Statistical Analysis: Don't Throw Out the Baby with the Bathwater.
ERIC Educational Resources Information Center
Peet, Martha W.
This paper analyzes what happens to the effect size of a given dataset when the variance is removed by categorization for the purpose of applying "OVA" methods (analysis of variance, analysis of covariance). The dataset is from a classic study by Holzinger and Swineford (1939) in which more than 20 ability tests were administered to 301…
Risk model of prolonged intensive care unit stay in Chinese patients undergoing heart valve surgery.
Wang, Chong; Zhang, Guan-xin; Zhang, Hao; Lu, Fang-lin; Li, Bai-ling; Xu, Ji-bin; Han, Lin; Xu, Zhi-yun
2012-11-01
The aim of this study was to develop a preoperative risk prediction model and a scorecard for prolonged intensive care unit length of stay (PrlICULOS) in adult patients undergoing heart valve surgery. This is a retrospective observational study of collected data on 3925 consecutive patients older than 18 years, who had undergone heart valve surgery between January 2000 and December 2010. Data were randomly split into a development dataset (n=2401) and a validation dataset (n=1524). A multivariate logistic regression analysis was undertaken using the development dataset to identify independent risk factors for PrlICULOS. Performance of the model was then assessed by observed and expected rates of PrlICULOS on the development and validation datasets. Model calibration and discriminatory ability were analysed by the Hosmer-Lemeshow goodness-of-fit statistic and the area under the receiver operating characteristic (ROC) curve, respectively. There were 491 patients that required PrlICULOS (12.5%). Preoperative independent predictors of PrlICULOS are shown with odds ratios as follows: (1) age, 1.4; (2) chronic obstructive pulmonary disease (COPD), 1.8; (3) atrial fibrillation, 1.4; (4) left bundle branch block, 2.7; (5) ejection fraction, 1.4; (6) left ventricle weight, 1.5; (7) New York Heart Association class III-IV, 1.8; (8) critical preoperative state, 2.0; (9) perivalvular leakage, 6.4; (10) tricuspid valve replacement, 3.8; (11) concurrent CABG, 2.8; and (12) concurrent other cardiac surgery, 1.8. The Hosmer-Lemeshow goodness-of-fit statistic was not statistically significant in either the development or the validation dataset (P=0.365 vs P=0.310). The area under the ROC curve for the prediction of PrlICULOS in the development and validation datasets was 0.717 and 0.700, respectively. We developed and validated a local risk prediction model for PrlICULOS after adult heart valve surgery. This model can be used to calculate patient-specific risk with an equivalent predicted risk at our centre in future clinical practice. Copyright © 2012 Australian and New Zealand Society of Cardiac and Thoracic Surgeons (ANZSCTS) and the Cardiac Society of Australia and New Zealand (CSANZ). Published by Elsevier B.V. All rights reserved.
Statistical analysis of co-occurrence patterns in microbial presence-absence datasets.
Mainali, Kumar P; Bewick, Sharon; Thielen, Peter; Mehoke, Thomas; Breitwieser, Florian P; Paudel, Shishir; Adhikari, Arjun; Wolfe, Joshua; Slud, Eric V; Karig, David; Fagan, William F
2017-01-01
Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially for whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson's correlation coefficient (r) and Jaccard's index (J)-two of the most common metrics for correlation analysis of presence-absence data-can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson's correlation coefficient can result in artificial inflation of positive taxon relationships and how this is a particular problem for microbiome studies. We then illustrate how Jaccard's index of similarity (J) can yield improvements over Pearson's correlation coefficient. However, the standard null model for Jaccard's index is flawed, and thus introduces its own set of spurious conclusions. We thus identify a better null model based on a hypergeometric distribution, which appropriately corrects for species prevalence. This model is available from recent statistics literature, and can be used for evaluating the significance of any value of an empirically observed Jaccard's index. The resulting simple, yet effective method for handling correlation analysis of microbial presence-absence datasets provides a robust means of testing and finding relationships and/or shared environmental responses among microbial taxa.
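The hypergeometric null model advocated above can be applied to a single species pair as in the sketch below, using SciPy; the presence-absence vectors are randomly generated stand-ins, not the study's microbiome data.

    # Minimal sketch: test a species pair's co-occurrence against a hypergeometric
    # null that conditions on each species' prevalence.
    import numpy as np
    from scipy.stats import hypergeom

    rng = np.random.default_rng(7)
    n_samples = 100
    sp1 = rng.random(n_samples) < 0.30        # presence-absence of taxon 1 (30% prevalence)
    sp2 = rng.random(n_samples) < 0.08        # a rare taxon (8% prevalence)

    a, b = sp1.sum(), sp2.sum()
    k = np.logical_and(sp1, sp2).sum()        # observed co-occurrences
    jaccard = k / (a + b - k)

    # Under the null, k ~ Hypergeometric(M=n_samples, n=a, N=b)
    p_pos = hypergeom.sf(k - 1, n_samples, a, b)   # P(co-occurrence >= k): positive association
    p_neg = hypergeom.cdf(k, n_samples, a, b)      # P(co-occurrence <= k): negative association
    print(jaccard, p_pos, p_neg)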
Large-scale seismic waveform quality metric calculation using Hadoop
Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.; ...
2016-05-27
Here in this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data of which 5.1 TB of data were processed with the traditional architecture, and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance was deteriorating with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.
A hybrid approach to select features and classify diseases based on medical data
NASA Astrophysics Data System (ADS)
AbdelLatif, Hisham; Luo, Jiawei
2018-03-01
Feature selection is a popular problem in the classification of diseases in clinical medicine. Here, we develop a hybrid methodology to classify diseases, based on three medical datasets: Arrhythmia, Breast cancer, and Hepatitis. This methodology, called k-means ANOVA Support Vector Machine (K-ANOVA-SVM), uses K-means clustering with ANOVA statistics to preprocess the data and select the significant features, and Support Vector Machines in the classification process. To compare and evaluate the performance, we chose three classification algorithms (decision tree, Naïve Bayes, and Support Vector Machines) and applied the medical datasets directly to these algorithms. Our methodology gave much better classification accuracy, 98% on the Arrhythmia dataset, 92% on the Breast cancer dataset and 88% on the Hepatitis dataset, compared to using the medical data directly with decision tree, Naïve Bayes, and Support Vector Machines. The ROC curve and precision achieved with K-ANOVA-SVM were also better than those of the other algorithms.
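A rough scikit-learn sketch of this kind of comparison (ANOVA-based feature selection feeding an SVM versus classifiers on the raw features) is shown below; it omits the K-means step of the authors' pipeline and uses scikit-learn's built-in breast-cancer data in place of the three medical datasets.

    # Rough sketch of the comparison pattern (not the authors' K-ANOVA-SVM code).
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "ANOVA+SVM": make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), SVC()),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "NaiveBayes": GaussianNB(),
        "DecisionTree": DecisionTreeClassifier(random_state=0),
    }

    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=10).mean()   # 10-fold CV accuracy
        print(f"{name}: {acc:.3f}")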
Use of Electronic Health-Related Datasets in Nursing and Health-Related Research.
Al-Rawajfah, Omar M; Aloush, Sami; Hewitt, Jeanne Beauchamp
2015-07-01
Datasets of gigabyte size are common in medical sciences. There is increasing consensus that significant untapped knowledge lies hidden in these large datasets. This review article aims to discuss Electronic Health-Related Datasets (EHRDs) in terms of types, features, advantages, limitations, and possible use in nursing and health-related research. Major scientific databases, MEDLINE, ScienceDirect, and Scopus, were searched for studies or review articles regarding the use of EHRDs in research. A total of 442 articles were located. After application of study inclusion criteria, 113 articles were included in the final review. EHRDs were categorized into Electronic Administrative Health-Related Datasets and Electronic Clinical Health-Related Datasets. Subcategories of each major category were identified. EHRDs are invaluable assets for nursing and health-related research. Advanced research skills, such as using analytical software, applying advanced statistical procedures, and dealing with missing data and missing variables, will maximize the efficient utilization of EHRDs in research. © The Author(s) 2014.
2012-01-01
Background ChIP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, detecting allelic imbalance from a single ChIP-seq dataset often has low statistical power since only sequence reads mapped to heterozygote SNPs are informative for discriminating two alleles. Results We develop a new method iASeq to address this issue by jointly analyzing multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve detection of allelic imbalance. Application of iASeq to 77 ChIP-seq samples from 40 ENCODE datasets and 1 genomic DNA sample in GM12878 cells reveals that allele-specificity of multiple proteins are highly correlated, and demonstrates the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately. Conclusions iASeq illustrates the value of integrating multiple datasets in the allele-specificity inference and offers a new tool to better analyze ASB. PMID:23194258
A statistical framework for evaluating neural networks to predict recurrent events in breast cancer
NASA Astrophysics Data System (ADS)
Gorunescu, Florin; Gorunescu, Marina; El-Darzi, Elia; Gorunescu, Smaranda
2010-07-01
Breast cancer is the second leading cause of cancer deaths in women today. Sometimes, breast cancer can return after primary treatment. A medical diagnosis of recurrent cancer is often a more challenging task than the initial one. In this paper, we investigate the potential contribution of neural networks (NNs) to support health professionals in diagnosing such events. The NN algorithms are tested and applied to two different datasets. An extensive statistical analysis has been performed to verify our experiments. The results show that a simple network structure for both the multi-layer perceptron and radial basis function can produce equally good results, not all attributes are needed to train these algorithms and, finally, the classification performances of all algorithms are statistically robust. Moreover, we have shown that the best performing algorithm will strongly depend on the features of the datasets, and hence, there is not necessarily a single best classifier.
NASA Astrophysics Data System (ADS)
Bhuiyan, M. A. E.; Nikolopoulos, E. I.; Anagnostou, E. N.
2017-12-01
Quantifying the uncertainty of global precipitation datasets is beneficial when using these precipitation products in hydrological applications, because precipitation uncertainty propagation through hydrologic modeling can significantly affect the accuracy of the simulated hydrologic variables. In this research, the Iberian Peninsula is used as the study area, with a study period spanning eleven years (2000-2010). This study evaluates the performance of multiple hydrologic models forced with combined global rainfall estimates derived based on a Quantile Regression Forests (QRF) technique. In the QRF technique, three satellite precipitation products (CMORPH, PERSIANN, and 3B42 (V7)), an atmospheric reanalysis precipitation and air temperature dataset, satellite-derived near-surface daily soil moisture data, and a terrain elevation dataset are utilized. A high-resolution, ground-based, observation-driven precipitation dataset (named SAFRAN), available at 5 km/1 h resolution, is used as reference. Through the QRF blending framework, the stochastic error model produces error-adjusted ensemble precipitation realizations, which are used to force four global hydrological models (JULES (Joint UK Land Environment Simulator), WaterGAP3 (Water-Global Assessment and Prognosis), ORCHIDEE (Organizing Carbon and Hydrology in Dynamic Ecosystems) and SURFEX (Surface Externalisée)) to simulate three hydrologic variables (surface runoff, subsurface runoff and evapotranspiration). The models are forced with the reference precipitation to generate reference-based hydrologic simulations. This study presents a comparative analysis of multiple hydrologic model simulations for different hydrologic variables and the impact of the blending algorithm on the simulated hydrologic variables. Results show how precipitation uncertainty propagates through the different hydrologic model structures, manifesting as a reduction of error in the hydrologic variables.
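The quantile-forest idea behind the blending can be approximated crudely as below, by reading quantiles off the per-tree predictions of a random forest; the predictors and reference precipitation are synthetic, and this is only a sketch of the spirit of Quantile Regression Forests, not the authors' error model.

    # Very rough sketch: predictive quantiles from per-tree random-forest predictions.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(8)
    n = 2000
    X = rng.random((n, 4))           # e.g. satellite QPE, reanalysis P, soil moisture, elevation
    y = 10 * X[:, 0] + 5 * X[:, 1] + rng.gamma(2.0, 1.0, n)    # "reference" precipitation

    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    x_new = rng.random((1, 4))
    per_tree = np.array([t.predict(x_new)[0] for t in rf.estimators_])   # per-tree predictions
    quantiles = np.percentile(per_tree, [5, 25, 50, 75, 95])             # crude predictive quantiles
    print(quantiles)      # ensemble precipitation realizations could be drawn within these bounds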
Evaluating Soil Moisture Retrievals from ESA's SMOS and NASA's SMAP Brightness Temperature Datasets
NASA Technical Reports Server (NTRS)
Al-Yaari, A.; Wigernon, J.-P.; Kerr, Y.; Rodriguez-Fernandez, N.; O'Neill, P. E.; Jackson, T. J.; De Lannoy, G. J. M.; Al Bitar, A.; Mialon, A.; Richaume, P.;
2017-01-01
Two satellites are currently monitoring surface soil moisture (SM) using L-band observations: SMOS (Soil Moisture and Ocean Salinity), a joint ESA (European Space Agency), CNES (Centre national d'études spatiales), and CDTI (the Spanish government agency with responsibility for space) satellite launched on November 2, 2009 and SMAP (Soil Moisture Active Passive), a National Aeronautics and Space Administration (NASA) satellite successfully launched in January 2015. In this study, we used a multilinear regression approach to retrieve SM from SMAP data to create a global dataset of SM, which is consistent with SM data retrieved from SMOS. This was achieved by calibrating coefficients of the regression model using the CATDS (Centre Aval de Traitement des Données) SMOS Level 3 SM and the horizontally and vertically polarized brightness temperatures (TB) at 40 deg incidence angle, over the 2013 - 2014 period. Next, this model was applied to SMAP L3 TB data from Apr 2015 to Jul 2016. The retrieved SM from SMAP (referred to here as SMAP_Reg) was compared to: (i) the operational SMAP L3 SM (SMAP_SCA), retrieved using the baseline Single Channel retrieval Algorithm (SCA); and (ii) the operational SMOSL3 SM, derived from the multiangular inversion of the L-MEB model (L-MEB algorithm) (SMOSL3). This inter-comparison was made against in situ soil moisture measurements from more than 400 sites spread over the globe, which are used here as a reference soil moisture dataset. The in situ observations were obtained from the International Soil Moisture Network (ISMN; https://ismn.geo.tuwien.ac.at/) in North of America (PBO_H2O, SCAN, SNOTEL, iRON, and USCRN), in Australia (Oznet), Africa (DAHRA), and in Europe (REMEDHUS, SMOSMANIA, FMI, and RSMN). The agreement was analyzed in terms of four classical statistical criteria: Root Mean Squared Error (RMSE), Bias, Unbiased RMSE (UnbRMSE), and correlation coefficient (R). Results of the comparison of these various products with in situ observations show that the performance of both SMAP products i.e. SMAP_SCA and SMAP_Reg is similar and marginally better to that of the SMOSL3 product particularly over the PBO_H2O, SCAN, and USCRN sites. However, SMOSL3 SM was closer to the in situ observations over the DAHRA and Oznet sites. We found that the correlation between all three datasets and in situ measurements is best (R > 0.80) over the Oznet sites and worst (R = 0.58) over the SNOTEL sites for SMAP_SCA and over the DAHRA and SMOSMANIA sites (R = 0.51 and R = 0.45 for SMAP_Reg and SMOSL3, respectively). The Bias values showed that all products are generally dry, except over RSMN, DAHRA, and Oznet (and FMI for SMAP_SCA). Finally, our analysis provided interesting insights that can be useful to improve the consistency between SMAP and SMOS datasets.
Evaluating soil moisture retrievals from ESA’s SMOS and NASA’s SMAP brightness temperature datasets
Al-Yaari, A.; Wigneron, J.-P.; Kerr, Y.; Rodriguez-Fernandez, N.; O’Neill, P. E.; Jackson, T. J.; De Lannoy, G.J.M.; Al Bitar, A; Mialon, A.; Richaume, P.; Walker, JP; Mahmoodi, A.; Yueh, S.
2018-01-01
Two satellites are currently monitoring surface soil moisture (SM) using L-band observations: SMOS (Soil Moisture and Ocean Salinity), a joint ESA (European Space Agency), CNES (Centre national d’études spatiales), and CDTI (the Spanish government agency with responsibility for space) satellite launched on November 2, 2009 and SMAP (Soil Moisture Active Passive), a National Aeronautics and Space Administration (NASA) satellite successfully launched in January 2015. In this study, we used a multilinear regression approach to retrieve SM from SMAP data to create a global dataset of SM, which is consistent with SM data retrieved from SMOS. This was achieved by calibrating coefficients of the regression model using the CATDS (Centre Aval de Traitement des Données) SMOS Level 3 SM and the horizontally and vertically polarized brightness temperatures (TB) at 40° incidence angle, over the 2013 – 2014 period. Next, this model was applied to SMAP L3 TB data from Apr 2015 to Jul 2016. The retrieved SM from SMAP (referred to here as SMAP_Reg) was compared to: (i) the operational SMAP L3 SM (SMAP_SCA), retrieved using the baseline Single Channel retrieval Algorithm (SCA); and (ii) the operational SMOSL3 SM, derived from the multiangular inversion of the L-MEB model (L-MEB algorithm) (SMOSL3). This inter-comparison was made against in situ soil moisture measurements from more than 400 sites spread over the globe, which are used here as a reference soil moisture dataset. The in situ observations were obtained from the International Soil Moisture Network (ISMN; https://ismn.geo.tuwien.ac.at/) in North America (PBO_H2O, SCAN, SNOTEL, iRON, and USCRN), in Australia (Oznet), Africa (DAHRA), and in Europe (REMEDHUS, SMOSMANIA, FMI, and RSMN). The agreement was analyzed in terms of four classical statistical criteria: Root Mean Squared Error (RMSE), Bias, Unbiased RMSE (UnbRMSE), and correlation coefficient (R). Results of the comparison of these various products with in situ observations show that the performance of both SMAP products, i.e. SMAP_SCA and SMAP_Reg, is similar and marginally better than that of the SMOSL3 product, particularly over the PBO_H2O, SCAN, and USCRN sites. However, SMOSL3 SM was closer to the in situ observations over the DAHRA and Oznet sites. We found that the correlation between all three datasets and in situ measurements is best (R > 0.80) over the Oznet sites and worst (R = 0.58) over the SNOTEL sites for SMAP_SCA and over the DAHRA and SMOSMANIA sites (R = 0.51 and R = 0.45 for SMAP_Reg and SMOSL3, respectively). The Bias values showed that all products are generally dry, except over RSMN, DAHRA, and Oznet (and FMI for SMAP_SCA). Finally, our analysis provided interesting insights that can be useful to improve the consistency between SMAP and SMOS datasets. PMID:29743730
Evaluating soil moisture retrievals from ESA's SMOS and NASA's SMAP brightness temperature datasets.
Al-Yaari, A; Wigneron, J-P; Kerr, Y; Rodriguez-Fernandez, N; O'Neill, P E; Jackson, T J; De Lannoy, G J M; Al Bitar, A; Mialon, A; Richaume, P; Walker, J P; Mahmoodi, A; Yueh, S
2017-05-01
Two satellites are currently monitoring surface soil moisture (SM) using L-band observations: SMOS (Soil Moisture and Ocean Salinity), a joint ESA (European Space Agency), CNES (Centre national d'études spatiales), and CDTI (the Spanish government agency with responsibility for space) satellite launched on November 2, 2009 and SMAP (Soil Moisture Active Passive), a National Aeronautics and Space Administration (NASA) satellite successfully launched in January 2015. In this study, we used a multilinear regression approach to retrieve SM from SMAP data to create a global dataset of SM, which is consistent with SM data retrieved from SMOS. This was achieved by calibrating coefficients of the regression model using the CATDS (Centre Aval de Traitement des Données) SMOS Level 3 SM and the horizontally and vertically polarized brightness temperatures (TB) at 40° incidence angle, over the 2013 - 2014 period. Next, this model was applied to SMAP L3 TB data from Apr 2015 to Jul 2016. The retrieved SM from SMAP (referred to here as SMAP_Reg) was compared to: (i) the operational SMAP L3 SM (SMAP_SCA), retrieved using the baseline Single Channel retrieval Algorithm (SCA); and (ii) the operational SMOSL3 SM, derived from the multiangular inversion of the L-MEB model (L-MEB algorithm) (SMOSL3). This inter-comparison was made against in situ soil moisture measurements from more than 400 sites spread over the globe, which are used here as a reference soil moisture dataset. The in situ observations were obtained from the International Soil Moisture Network (ISMN; https://ismn.geo.tuwien.ac.at/) in North America (PBO_H2O, SCAN, SNOTEL, iRON, and USCRN), in Australia (Oznet), Africa (DAHRA), and in Europe (REMEDHUS, SMOSMANIA, FMI, and RSMN). The agreement was analyzed in terms of four classical statistical criteria: Root Mean Squared Error (RMSE), Bias, Unbiased RMSE (UnbRMSE), and correlation coefficient (R). Results of the comparison of these various products with in situ observations show that the performance of both SMAP products, i.e. SMAP_SCA and SMAP_Reg, is similar and marginally better than that of the SMOSL3 product, particularly over the PBO_H2O, SCAN, and USCRN sites. However, SMOSL3 SM was closer to the in situ observations over the DAHRA and Oznet sites. We found that the correlation between all three datasets and in situ measurements is best (R > 0.80) over the Oznet sites and worst (R = 0.58) over the SNOTEL sites for SMAP_SCA and over the DAHRA and SMOSMANIA sites (R = 0.51 and R = 0.45 for SMAP_Reg and SMOSL3, respectively). The Bias values showed that all products are generally dry, except over RSMN, DAHRA, and Oznet (and FMI for SMAP_SCA). Finally, our analysis provided interesting insights that can be useful to improve the consistency between SMAP and SMOS datasets.
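A minimal sketch of the kind of multilinear regression calibration described above: coefficients relating H- and V-polarized brightness temperatures to a reference soil moisture product are fitted by least squares and then applied to new brightness temperatures. All numbers and the simple two-predictor form are illustrative assumptions, not the SMAP_Reg model itself.

import numpy as np

# Hypothetical calibration sample: reference SM (m3/m3) and brightness temperatures (K) at 40 deg
rng = np.random.default_rng(8)
n = 500
tb_h = 230 + 30 * rng.random(n)
tb_v = 250 + 25 * rng.random(n)
sm_ref = 0.5 - 0.004 * (tb_h - 230) - 0.002 * (tb_v - 250) + 0.01 * rng.normal(size=n)

# Calibrate SM = a*TB_H + b*TB_V + c against the reference soil moisture
A = np.column_stack([tb_h, tb_v, np.ones(n)])
coeffs, *_ = np.linalg.lstsq(A, sm_ref, rcond=None)

# Apply the calibrated model to a new (hypothetical) pair of brightness temperatures
tb_h_new, tb_v_new = 245.0, 262.0
sm_retrieved = coeffs @ np.array([tb_h_new, tb_v_new, 1.0])
print(coeffs.round(4), round(float(sm_retrieved), 3))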
Data evaluation of trace elements determined in Nigerian coal using cluster procedures.
Ewa, I O B
2004-05-01
Large data-sets of elements determined by instrumental neutron activation analysis (INAA) require meaningful interpretation in order to determine the pattern of their existence in host matrices. This can be achieved using cluster procedures. Element abundances (Al, As, Ba, Br, Ca, Ce, Cs, Dy, Eu, Fe, Ga, Gd, Hf, K, La, Lu, Mg, Mn, Na, O, Rb, Sb, Sc, Sm, Sr, Ta, Tb, Th, Ti, U, V, Yb, Zn and Zr) of prepared and run-of-mine coals from eight principal mines (Onyeama, Ogbete, Enugu, Gombe, Asaba-Ugwashi, Okaba, Afikpo and Lafia) in Nigeria were determined by INAA. Quality control of the measurements was assured by the re-determination of a standard reference material, NIST 1632a. These data-sets were then subjected to multivariate statistics using METHOD = SINGLE in the cluster procedure. The computer-assisted package SAS was used to generate the dendrograms, with the clustering based on Euclidean distances. The results showed a recognition pattern, useful for the interpretation of coalification histories and the prediction of fuel ranking for Nigerian coals. High segregation of coal fly ash was observed, while metallurgical coal grouped together with the high-ranking coals of Okaba, Enugu and Obi (Lafia). Further work revealed some of these coals as having high gross calorific value (7908 kcal kg(-1) for Enugu coal; 7200 kcal kg(-1) for Okaba) and low sulphur, thereby making them efficient fuel materials.
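A minimal sketch of the single-linkage cluster procedure on an element-abundance matrix, using SciPy as a stand-in for the SAS cluster procedure (METHOD=SINGLE). The abundance values and sample labels below are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Hypothetical element-abundance matrix: rows = coal samples, columns = elements (Al, As, Ba, ...)
rng = np.random.default_rng(1)
samples = ["Onyeama", "Ogbete", "Enugu", "Gombe", "Asaba-Ugwashi", "Okaba", "Afikpo", "Lafia"]
abundances = rng.lognormal(mean=0.0, sigma=1.0, size=(len(samples), 34))

# Euclidean distances followed by single-linkage agglomeration
dist = pdist(abundances, metric="euclidean")
tree = linkage(dist, method="single")

# The dendrogram carries the merge order used to read off groupings of the mines
dendro = dendrogram(tree, labels=samples, no_plot=True)
print(dendro["ivl"])  # leaf ordering of the samples in the dendrogram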
Catanuto, Giuseppe; Taher, Wafa; Rocco, Nicola; Catalano, Francesca; Allegra, Dario; Milotta, Filippo Luigi Maria; Stanco, Filippo; Gallo, Giovanni; Nava, Maurizio Bruno
2018-03-20
Breast shape is usually described with mainly qualitative assessments (full, flat, ptotic) or with estimates, such as volume or distances between reference points, that cannot describe it reliably. We quantitatively describe breast shape with two parameters derived from a statistical methodology known as principal component analysis (PCA). We created a heterogeneous dataset of breast shapes acquired with a commercial infrared 3-dimensional scanner on which PCA was performed. We plotted on a Cartesian plane the two highest values of PCA for each breast (principal components 1 and 2). The methodology was tested by two operators on a preoperative and postoperative surgical case and in a test-retest assessment. The first two principal components derived from PCA are able to characterize the shape of the breasts included in the dataset. The test-retest demonstrated that different operators are able to obtain very similar values of PCA. The system is also able to identify major changes between the preoperative and postoperative stages of a two-stage reconstruction. Even minor changes were correctly detected by the system. This methodology can reliably describe the shape of a breast. An expert operator and a newly trained operator can reach similar results in a test/re-testing validation. Once developed and after further validation, this methodology could be employed as a good tool for outcome evaluation, auditing, and benchmarking.
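A minimal sketch of describing each scanned shape by its first two principal components, assuming (hypothetically) that every 3-D scan has been registered and flattened into a fixed-length vector; the scan data below are random placeholders.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: each breast surface scan flattened into a fixed-length vector
# (e.g. z-coordinates sampled on a common 50x50 grid after registration)
rng = np.random.default_rng(2)
scans = rng.normal(size=(30, 50 * 50))

pca = PCA(n_components=2)
scores = pca.fit_transform(scans)          # the two shape parameters per scan
print(scores.shape)                        # (30, 2): points on the Cartesian shape plane
print(pca.explained_variance_ratio_)       # fraction of shape variability captured

# A pre/post-operative pair would be compared by the displacement between its two score points
pre, post = scores[0], scores[1]
print(np.linalg.norm(post - pre))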
Sloto, Ronald A.; Stuckey, Marla H.; Hoffman, Scott A.
2017-05-10
The current (2015) streamgage network in Pennsylvania and the Susquehanna River Basin in Pennsylvania and New York was evaluated in order to design a network that would meet the hydrologic needs of many partners and serve a variety of purposes and interests, including estimation of streamflow statistics at ungaged sites. This study was done by the U.S. Geological Survey, in cooperation with the Pennsylvania Department of Environmental Protection and the Susquehanna River Basin Commission. The study area includes the Commonwealth of Pennsylvania and the Susquehanna River Basin in Pennsylvania and New York. For this study, 229 streamgages were identified as reference streamgages that could be used to represent ungaged watersheds. Criteria for a reference streamgage are a minimum of 10 years of continuous record, minimally altered streamflow, and a drainage area less than 1,500 square miles. Some of the reference streamgages have been discontinued but provide historical hydrologic information valuable in the determination of streamflow characteristics of ungaged watersheds. Watersheds in the study area not adequately represented by a reference streamgage were identified by examining a range of basin characteristics, the extent of geographic coverage, and the strength of estimated streamflow correlations between gaged and ungaged sites. Basin characteristics were determined for the reference streamgage watersheds and the 1,662 12-digit hydrologic unit code (HUC12) subwatersheds in Pennsylvania and the Susquehanna River Basin using a geographic information system (GIS) spatial analysis and nationally available GIS datasets. Basin characteristics selected for this study include drainage area, mean basin elevation, mean basin slope, percentage of urbanized area, percentage of forested area, percentage of carbonate bedrock, mean annual precipitation, and soil thickness. A GIS spatial analysis was used to identify HUC12 subwatersheds outside the range of basin characteristics of the reference streamgages. There were 320 HUC12 subwatersheds, or 19 percent of the study area, with basin characteristics outside the range represented by the reference streamgage watersheds. A GIS spatial analysis was used to identify geographic gaps in the streamgage network. For each streamgage, a watershed area, called the gage statistical area (GSA), was delineated. The GSA shows the drainage area within a specific drainage-area ratio of the streamgage for transfer of streamflow statistics from that streamgage to ungaged sites on the valid statistical reach of the GSA for a streamgage. In Pennsylvania, a drainage-area ratio of 0.33–3 times the drainage area of the ungaged site was found to perform as well as, if not better than, more traditional ratios such as 0.5–1.5 (or 2) for transfer of selected streamflow statistics. A total of 1,102 HUC12 subwatersheds, or 66 percent of the study area, are outside the GSA for a reference streamgage. The USGS Baseline Streamflow Estimator (BaSE) program was used to determine how well HUC12 subwatersheds outside the streamgage GSAs are represented by the reference streamgage network in Pennsylvania, based on estimated streamflow correlation. The centroid of each HUC12 subwatershed was run through the BaSE program to determine the reference streamgage with the highest estimated streamflow correlation.
There were 929 HUC12 subwatersheds in Pennsylvania, or 56 percent of the State, with an estimated correlation coefficient less than 0.96. The results from the basin characteristic, geographic, and streamflow correlation analyses were combined to identify 1,405 HUC12 subwatersheds in Pennsylvania and the Susquehanna River Basin in Pennsylvania and New York that lack a representative reference streamgage, based on at least one identified gap. Of the 1,405 HUC12 subwatersheds, 139 exhibited all three gaps, indicating an 8-percent gap in the reference streamgage network. Streamgages in areas with similar hydrologic characteristics and in close proximity to one another can potentially provide similar information (termed streamgages with high substitution potential). Streamgages were considered to have a high substitution potential with a nearby streamgage(s) if (1) the streamflow correlation coefficient was equal to or greater than 0.96, (2) the streamgages had 10 years of concurrent record, and (3) the streamgages are in the same watershed within the GSA of the streamgage. Seventy-four current (2015) streamgages with high substitution potential with at least one other streamgage were identified in the study area. Although these identified streamgages have a high substitution potential, they provide valuable streamflow information to stakeholders. Selected primary uses of these streamgages were identified to determine the overall need for an individual streamgage.
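A minimal sketch of the drainage-area-ratio idea referenced above, where a streamflow statistic is transferred to an ungaged site in proportion to drainage area and the 0.33–3 ratio range acts as the validity check. The simple linear-in-area scaling and the example numbers are assumptions for illustration, not the USGS implementation.

def transfer_flow_statistic(q_gaged, area_gaged_mi2, area_ungaged_mi2,
                            min_ratio=0.33, max_ratio=3.0):
    """Transfer a streamflow statistic from a gaged to an ungaged site by drainage-area ratio.

    Returns None when the ungaged drainage area falls outside the acceptable
    ratio range (0.33-3 times the gaged area, per the study's Pennsylvania finding).
    """
    ratio = area_ungaged_mi2 / area_gaged_mi2
    if not (min_ratio <= ratio <= max_ratio):
        return None
    return q_gaged * ratio

# Hypothetical example: a low-flow statistic of 12 cfs at a 150 mi^2 reference streamgage
print(transfer_flow_statistic(12.0, 150.0, 240.0))   # within range -> scaled estimate
print(transfer_flow_statistic(12.0, 150.0, 900.0))   # ratio 6 -> outside range -> None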
Statistical testing and power analysis for brain-wide association study.
Gong, Weikang; Wan, Lin; Lu, Wenlian; Ma, Liang; Cheng, Fan; Cheng, Wei; Grünewald, Stefan; Feng, Jianfeng
2018-04-05
The identification of connexel-wise associations, which involves examining functional connectivities between pairwise voxels across the whole brain, is both statistically and computationally challenging. Although such a connexel-wise methodology has recently been adopted by brain-wide association studies (BWAS) to identify connectivity changes in several mental disorders, such as schizophrenia, autism and depression, the multiple correction and power analysis methods designed specifically for connexel-wise analysis are still lacking. Therefore, we herein report the development of a rigorous statistical framework for connexel-wise significance testing based on the Gaussian random field theory. It includes controlling the family-wise error rate (FWER) of multiple hypothesis tests using topological inference methods, and calculating power and sample size for a connexel-wise study. Our theoretical framework can control the false-positive rate accurately, as validated empirically using two resting-state fMRI datasets. Compared with Bonferroni correction and false discovery rate (FDR), it can reduce the false-positive rate and increase statistical power by appropriately utilizing the spatial information of fMRI data. Importantly, our method bypasses the need for non-parametric permutation to correct for multiple comparisons; thus, it can efficiently tackle large datasets with high-resolution fMRI images. The utility of our method is shown in a case-control study. Our approach can identify altered functional connectivities in a major depression disorder dataset, whereas existing methods fail. A software package is available at https://github.com/weikanggong/BWAS. Copyright © 2018 Elsevier B.V. All rights reserved.
Exploring Relationships in Big Data
NASA Astrophysics Data System (ADS)
Mahabal, A.; Djorgovski, S. G.; Crichton, D. J.; Cinquini, L.; Kelly, S.; Colbert, M. A.; Kincaid, H.
2015-12-01
Big Data are characterized by several different 'V's: Volume, Veracity, Volatility, Value and so on. For many datasets, Volumes inflated by redundant features make the data noisier and more difficult to extract Value from. This is especially true if one is comparing or combining different datasets and the metadata are diverse. We have been exploring ways to exploit such datasets through a variety of statistical machinery and visualization. We show how we have applied these to time-series from large astronomical sky-surveys. This was done in the Virtual Observatory framework. More recently we have been doing similar work for a completely different domain, viz. biology/cancer. The methodology reuse involves application to diverse datasets gathered through the various centers associated with the Early Detection Research Network (EDRN) for cancer, an initiative of the National Cancer Institute (NCI). Application to Geo datasets is a natural extension.
A robust dataset-agnostic heart disease classifier from Phonocardiogram.
Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M
2017-07-01
Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demography is a challenge. In this paper we have analysed a wide range of Phonocardiogram (PCG) features in the time and frequency domains, along with morphological and statistical features, to construct a robust and discriminative feature set for dataset-agnostic classification of normal and cardiac patients. The large, open-access database made available in the PhysioNet 2016 challenge was used for feature selection, internal validation and creation of training models. A second dataset of 41 PCG segments, collected using our in-house smartphone-based digital stethoscope from an Indian hospital, was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75 respectively on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior-art approaches when applied on the same dataset.
NASA Astrophysics Data System (ADS)
Titov, A. G.; Okladnikov, I. G.; Gordov, E. P.
2017-11-01
The use of large geospatial datasets in climate change studies requires the development of a set of Spatial Data Infrastructure (SDI) elements, including geoprocessing and cartographical visualization web services. This paper presents the architecture of a geospatial OGC web service system as an integral part of a virtual research environment (VRE) general architecture for statistical processing and visualization of meteorological and climatic data. The architecture is a set of interconnected standalone SDI nodes with corresponding data storage systems. Each node runs a specialized software, such as a geoportal, cartographical web services (WMS/WFS), a metadata catalog, and a MySQL database of technical metadata describing geospatial datasets available for the node. It also contains geospatial data processing services (WPS) based on a modular computing backend realizing statistical processing functionality and, thus, providing analysis of large datasets with the results of visualization and export into files of standard formats (XML, binary, etc.). Some cartographical web services have been developed in a system’s prototype to provide capabilities to work with raster and vector geospatial data based on OGC web services. The distributed architecture presented allows easy addition of new nodes, computing and data storage systems, and provides a solid computational infrastructure for regional climate change studies based on modern Web and GIS technologies.
Tošić, Tamara; Sellers, Kristin K; Fröhlich, Flavio; Fedotenkova, Mariia; Beim Graben, Peter; Hutt, Axel
2015-01-01
For decades, research in neuroscience has supported the hypothesis that brain dynamics exhibits recurrent metastable states connected by transients, which together encode fundamental neural information processing. To understand the system's dynamics it is important to detect such recurrence domains, but it is challenging to extract them from experimental neuroscience datasets due to the large trial-to-trial variability. The proposed methodology extracts recurrent metastable states in univariate time series by transforming datasets into their time-frequency representations and computing recurrence plots based on instantaneous spectral power values in various frequency bands. Additionally, a new statistical inference analysis compares different trial recurrence plots with corresponding surrogates to obtain statistically significant recurrent structures. This combination of methods is validated by applying it to two artificial datasets. In a final study of visually-evoked Local Field Potentials in partially anesthetized ferrets, the methodology is able to reveal recurrence structures of neural responses with trial-to-trial variability. Focusing on different frequency bands, the δ-band activity is much less recurrent than α-band activity. Moreover, α-activity is susceptible to pre-stimuli, while δ-activity is much less sensitive to pre-stimuli. This difference in recurrence structures in different frequency bands indicates diverse underlying information processing steps in the brain.
Tošić, Tamara; Sellers, Kristin K.; Fröhlich, Flavio; Fedotenkova, Mariia; beim Graben, Peter; Hutt, Axel
2016-01-01
For decades, research in neuroscience has supported the hypothesis that brain dynamics exhibits recurrent metastable states connected by transients, which together encode fundamental neural information processing. To understand the system's dynamics it is important to detect such recurrence domains, but it is challenging to extract them from experimental neuroscience datasets due to the large trial-to-trial variability. The proposed methodology extracts recurrent metastable states in univariate time series by transforming datasets into their time-frequency representations and computing recurrence plots based on instantaneous spectral power values in various frequency bands. Additionally, a new statistical inference analysis compares different trial recurrence plots with corresponding surrogates to obtain statistically significant recurrent structures. This combination of methods is validated by applying it to two artificial datasets. In a final study of visually-evoked Local Field Potentials in partially anesthetized ferrets, the methodology is able to reveal recurrence structures of neural responses with trial-to-trial variability. Focusing on different frequency bands, the δ-band activity is much less recurrent than α-band activity. Moreover, α-activity is susceptible to pre-stimuli, while δ-activity is much less sensitive to pre-stimuli. This difference in recurrence structures in different frequency bands indicates diverse underlying information processing steps in the brain. PMID:26834580
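A minimal sketch of a recurrence plot built from instantaneous spectral power, the core object of the methodology above: a binary matrix marking pairs of time points whose power values are similar. The power trace, threshold and epoch structure are hypothetical.

import numpy as np

def recurrence_plot(values, threshold):
    """Binary recurrence matrix R[i, j] = 1 when two time points have similar values."""
    v = np.asarray(values, dtype=float)
    dist = np.abs(v[:, None] - v[None, :])
    return (dist < threshold).astype(int)

# Hypothetical instantaneous alpha-band spectral power with two recurrent metastable epochs
rng = np.random.default_rng(7)
power = np.concatenate([
    1.0 + 0.05 * rng.normal(size=50),   # state A
    2.0 + 0.05 * rng.normal(size=50),   # state B
    1.0 + 0.05 * rng.normal(size=50),   # return to state A
])
R = recurrence_plot(power, threshold=0.3)
print(R.shape, R[:5, 100:105])   # off-diagonal blocks of ones mark the recurrent A-epochs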
Dynamic association rules for gene expression data analysis.
Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung
2015-10-14
The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association based on gene expression profile has been used to determine whether the induction/repression of genes corresponds to phenotypic variations including cell regulations, clinical diagnoses and drug development. Statistical analyses on microarray data have been developed to resolve the gene selection issue. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm) which helps one efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of Leukemia patients, the Microarray Quality Control (MAQC) dataset and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of Leukemia patients was conducted. We developed a statistical way, based on the concept of confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in a single step. The DAR algorithm was then developed for gene expression data analysis. Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with those of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In this paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence interval and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie the phenotypic variance.
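A minimal sketch of the association-rule quantities involved, support and confidence for a rule A -> B, together with a one-sided confidence-interval lower bound used as a significance screen. The normal-approximation bound and the binary "gene up-regulated" indicators are illustrative assumptions, not the DAR algorithm as published.

import numpy as np
from scipy import stats

# Hypothetical binary transaction matrix: rows = samples, columns = "gene up-regulated" indicators
rng = np.random.default_rng(3)
data = rng.random((200, 2)) < 0.4
A, B = data[:, 0], data[:, 1]

support = np.mean(A & B)                    # P(A and B)
confidence = (A & B).sum() / A.sum()        # P(B | A)

# One-sided normal-approximation lower bound, in the spirit of a CI-based threshold
alpha = 0.05
z = stats.norm.ppf(1 - alpha)
se_conf = np.sqrt(confidence * (1 - confidence) / A.sum())
lower_conf = confidence - z * se_conf

# Declare the rule A -> B "meaningful" only if its lower bound clears a minimum threshold
min_confidence = 0.3
print(support, confidence, lower_conf, lower_conf > min_confidence)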
Statistical rice yield modeling using blended MODIS-Landsat based crop phenology metrics in Taiwan
NASA Astrophysics Data System (ADS)
Chen, C. R.; Chen, C. F.; Nguyen, S. T.; Lau, K. V.
2015-12-01
Taiwan is a populated island with a majority of residents settled in the western plains where soils are suitable for rice cultivation. Rice is not only the most important commodity, but also plays a critical role in agricultural and food marketing. Information on rice production is thus important for policymakers to devise timely plans for ensuring sustainable socioeconomic development. Because rice fields in Taiwan are generally small, and crop monitoring requires crop phenology information at a spatiotemporal resolution matching that of satellite data, this study used Landsat-MODIS fusion data for rice yield modeling in Taiwan. We processed the data for the first crop (Feb-Mar to Jun-Jul) and the second crop (Aug-Sep to Nov-Dec) in 2014 through five main steps: (1) data pre-processing to account for geometric and radiometric errors of Landsat data, (2) Landsat-MODIS data fusion using the spatial-temporal adaptive reflectance fusion model, (3) construction of the smooth time-series enhanced vegetation index 2 (EVI2), (4) rice yield modeling using EVI2-based crop phenology metrics, and (5) error verification. A comparison between EVI2 derived from the fusion image and that from the reference Landsat image indicated close agreement between the two datasets (R2 > 0.8). We analysed the smooth EVI2 curves to extract phenology metrics or phenological variables for establishment of rice yield models. The results indicated that the established yield models significantly explained more than 70% of the variability in the data (p-value < 0.001). The comparison results between the estimated yields and the government's yield statistics for the first and second crops indicated a close significant relationship between the two datasets (R2 > 0.8) in both cases. The root mean square error (RMSE) and mean absolute error (MAE) used to measure the model accuracy revealed the consistency between the estimated yields and the government's yield statistics. This study demonstrates the advantages of using EVI2-based phenology metrics (derived from Landsat-MODIS fusion data) for rice yield estimation in Taiwan prior to the harvest period.
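A minimal sketch of the two-band EVI2 (Jiang et al. 2008) and of simple seasonal phenology metrics of the kind used as yield-model predictors; the reflectance values below are hypothetical.

import numpy as np

def evi2(nir, red):
    """Two-band Enhanced Vegetation Index: 2.5 * (NIR - Red) / (NIR + 2.4 * Red + 1)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)

# Hypothetical fused Landsat-MODIS reflectances for one pixel across a rice season
nir = np.array([0.25, 0.30, 0.45, 0.55, 0.50, 0.35])
red = np.array([0.12, 0.10, 0.07, 0.05, 0.06, 0.10])
series = evi2(nir, red)

# Simple phenology metrics: seasonal peak and a crude area under the curve
peak = float(series.max())
area = float(series.sum())
print(series.round(3), round(peak, 3), round(area, 3))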
Apples and pears? A comparison of two sources of national lung cancer audit data in England
Jack, Ruth H.; Vernon, Sally; Dickinson, Rosie; Wood, Natasha; Harden, Susan; Beckett, Paul; Woolhouse, Ian; Hubbard, Richard B.
2017-01-01
In 2014, the method of data collection from NHS trusts in England for the National Lung Cancer Audit (NLCA) was changed from a bespoke dataset called LUCADA (Lung Cancer Data): under the new contract, data are submitted via the Cancer Outcome and Service Dataset (COSD) system and linked additional cancer registry datasets. In 2014, trusts were given the opportunity to submit LUCADA data as well as registry data. 132 NHS trusts submitted LUCADA data, and all 151 trusts submitted COSD data. This transitional year therefore provided the opportunity to compare both datasets for data completeness and reliability. We linked the two datasets at the patient level to assess the completeness of key patient and treatment variables. We also assessed the inter-dataset agreement of these variables using Cohen's kappa statistic, κ. We identified 26 001 patients in both datasets. Overall, the recording of sex, age, performance status and stage had more than 90% agreement between datasets, but there were more patients with missing performance status in the registry dataset. Although levels of agreement for surgery, chemotherapy and external-beam radiotherapy were high between datasets, the new COSD system identified more instances of active treatment. There seems to be high agreement of data between the datasets, and the findings suggest that the registry dataset coupled with COSD provides a richer dataset than LUCADA. However, it lagged behind LUCADA in performance status recording, which needs to improve over time. PMID:28748189
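A minimal sketch of the kind of inter-dataset agreement check described above, computing Cohen's kappa for a variable recorded in both sources. The stage labels are hypothetical placeholders, not NLCA data.

from sklearn.metrics import cohen_kappa_score

# Hypothetical stage recorded for the same patients in the LUCADA and COSD/registry datasets
lucada_stage = ["IV", "IIIB", "IA", "IV", "IIB", "IV", "IA", "IIIA"]
cosd_stage   = ["IV", "IIIB", "IA", "IV", "IIIB", "IV", "IB", "IIIA"]

# kappa = 1 means perfect inter-dataset agreement, 0 means chance-level agreement
kappa = cohen_kappa_score(lucada_stage, cosd_stage)
print(round(kappa, 3))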
Brokering technologies to realize the hydrology scenario in NSF BCube
NASA Astrophysics Data System (ADS)
Boldrini, Enrico; Easton, Zachary; Fuka, Daniel; Pearlman, Jay; Nativi, Stefano
2015-04-01
In the National Science Foundation (NSF) BCube project an international team composed of cyber infrastructure experts, geoscientists, social scientists and educators is working together to explore the use of brokering technologies, initially focusing on four domains: hydrology, oceans, polar, and weather. In the hydrology domain, environmental models are fundamental to understanding the behaviour of hydrological systems. A specific model usually requires datasets coming from different disciplines for its initialization (e.g. elevation models from Earth observation, weather data from Atmospheric sciences, etc.). Scientific datasets are usually available on heterogeneous publishing services, such as inventory and access services (e.g. OGC Web Coverage Service, THREDDS Data Server, etc.). Indeed, datasets are published according to different protocols; moreover, they usually come in different formats, resolutions, and Coordinate Reference Systems (CRSs): in short, different grid environments depending on the original data and the publishing service processing capabilities. Scientists can thus be impeded by the burden of discovering, accessing and normalizing the desired datasets to the grid environment required by the model. These technological tasks of course divert scientists from their main, scientific goals. The use of the GI-axe brokering framework has been experimented with in a hydrology scenario where scientists needed to compare a particular hydrological model with two different input datasets (digital elevation models): - the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) dataset, v.2. - the Shuttle Radar Topography Mission (SRTM) dataset, v.3. These datasets were published by means of Hyrax Server technology, which can provide NetCDF files at their original resolution and CRS. Scientists had their model running on ArcGIS, so the main goal was to import the datasets using the available ArcPy library and to obtain them in EPSG:4326 on a common resolution grid, so that model outputs could be compared. ArcPy, however, is able to access only GeoTIFF datasets that are published by an OGC Web Coverage Service (WCS). The GI-axe broker was then deployed between the client application and the data providers. It was configured to broker the two different Hyrax service endpoints and republish the data content through a WCS interface for the use of the ArcPy library. Finally, scientists were able to easily run the model, and to concentrate on the comparison of the different results obtained according to the selected input dataset. The use of a third-party broker to perform such technological tasks has also shown the potential advantage of increasing the repeatability of a study among different researchers.
A novel bi-level meta-analysis approach: applied to biological pathway analysis.
Nguyen, Tin; Tagett, Rebecca; Donato, Michele; Mitrea, Cristina; Draghici, Sorin
2016-02-01
The accumulation of high-throughput data in public repositories creates a pressing need for integrative analysis of multiple datasets from independent experiments. However, study heterogeneity, study bias, outliers and the lack of power of available methods present a real challenge in integrating genomic data. One practical drawback of many P-value-based meta-analysis methods, including Fisher's, Stouffer's, minP and maxP, is that they are sensitive to outliers. Another drawback is that, because they perform just one statistical test for each individual experiment, they may not fully exploit the potentially large number of samples within each study. We propose a novel bi-level meta-analysis approach that employs the additive method and the Central Limit Theorem within each individual experiment and also across multiple experiments. We prove that the bi-level framework is robust against bias, less sensitive to outliers than other methods, and more sensitive to small changes in signal. For comparative analysis, we demonstrate that the intra-experiment analysis has more power than the equivalent statistical test performed on a single large experiment. For pathway analysis, we compare the proposed framework versus classical meta-analysis approaches (Fisher's, Stouffer's and the additive method) as well as against a dedicated pathway meta-analysis package (MetaPath), using 1252 samples from 21 datasets related to three human diseases: acute myeloid leukemia (9 datasets), type II diabetes (5 datasets) and Alzheimer's disease (7 datasets). Our framework outperforms its competitors in correctly identifying pathways relevant to the phenotypes. The framework is sufficiently general to be applied to any type of statistical meta-analysis. The R scripts are available on demand from the authors. sorin@wayne.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
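A minimal sketch contrasting two of the p-value combination methods named above, Stouffer's method and a CLT-based additive method (sum of p-values treated as approximately normal under the null). This is a generic illustration of the combination step only, not the authors' full bi-level framework.

import numpy as np
from scipy import stats

def stouffer(pvals):
    """Stouffer's method for one-sided p-values."""
    z = stats.norm.isf(np.asarray(pvals))            # convert each p to a z-score
    return stats.norm.sf(z.sum() / np.sqrt(len(pvals)))

def additive_clt(pvals):
    """Additive method: under H0 each p ~ U(0,1), so sum(p) ~ N(k/2, k/12) by the CLT."""
    p = np.asarray(pvals)
    k = len(p)
    z = (p.sum() - k / 2) / np.sqrt(k / 12)
    return stats.norm.cdf(z)                         # small when the p-values are collectively small

pvals = [0.04, 0.20, 0.01, 0.33, 0.05]
print(stouffer(pvals), additive_clt(pvals))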
NASA Astrophysics Data System (ADS)
Jeffries, G. R.; Cohn, A.
2016-12-01
Soy-corn double cropping (DC) has been widely adopted in Central Brazil alongside single cropped (SC) soybean production. DC involves different cropping calendars and soy varieties, and may be associated with different crop yield patterns and volatility than SC. Study of the performance of the region's agriculture in a changing climate depends on tracking differences in the productivity of SC vs. DC, but has been limited by crop yield data that conflate the two systems. We predicted SC and DC yields across Central Brazil, drawing on field observations and remotely sensed data. We first modeled field yield estimates as a function of remotely sensed DC status and vegetation index (VI) metrics, and other management and biophysical factors. We then used the estimated statistical model to predict SC and DC soybean yields at each 500 m2 grid cell of Central Brazil for harvest years 2001 - 2015. The yield estimation model was constructed using 1) a repeated cross-sectional survey of soybean yields and management factors for years 2007-2015, 2) a custom agricultural land cover classification dataset which assimilates earlier datasets for the region, and 3) 500m 8-day MODIS image composites used to calculate the wide dynamic range vegetation index (WDRVI) and derivative metrics such as area under the curve for WDRVI values in critical crop development periods. A statistical yield estimation model which primarily entails WDRVI metrics, DC status, and spatial fixed effects was developed on a subset of the yield dataset. Model validation was conducted by predicting previously withheld yield records, and then assessing error and goodness-of-fit for predicted values with metrics including root mean squared error (RMSE), mean squared error (MSE), and R2. We found a statistical yield estimation model which incorporates WDRVI metrics and DC status to be an effective way to estimate crop yields over the region. Statistical properties of the resulting gridded yield dataset may be valuable for understanding linkages between crop yields, farm management factors, and climate.
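A minimal sketch of the WDRVI (Gitelson 2004) and of simple per-pixel metrics of the kind fed into such a yield model; the reflectance series and the weighting parameter value are hypothetical.

import numpy as np

def wdrvi(nir, red, alpha=0.2):
    """Wide Dynamic Range Vegetation Index: (alpha*NIR - Red) / (alpha*NIR + Red); alpha ~ 0.1-0.2."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (alpha * nir - red) / (alpha * nir + red)

# Hypothetical 8-day MODIS reflectances across the soybean critical development period
nir = np.array([0.30, 0.42, 0.55, 0.60, 0.52, 0.40])
red = np.array([0.10, 0.08, 0.05, 0.04, 0.05, 0.09])
series = wdrvi(nir, red)

# Peak value and a crude area under the seasonal curve, two typical predictors
print(series.round(3))
print(float(series.max()), float(series.sum()))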
Salehizadeh, Seyed M. A.; Dao, Duy; Bolkhovsky, Jeffrey; Cho, Chae; Mendelson, Yitzhak; Chon, Ki H.
2015-01-01
Accurate estimation of heart rates from photoplethysmogram (PPG) signals during intense physical activity is a very challenging problem. This is because strenuous and high intensity exercise can result in severe motion artifacts in PPG signals, making accurate heart rate (HR) estimation difficult. In this study we investigated a novel technique to accurately reconstruct motion-corrupted PPG signals and HR based on time-varying spectral analysis. The algorithm is called Spectral filter algorithm for Motion Artifacts and heart rate reconstruction (SpaMA). The idea is to calculate the power spectral density of both PPG and accelerometer signals for each time shift of a windowed data segment. By comparing time-varying spectra of PPG and accelerometer data, those frequency peaks resulting from motion artifacts can be distinguished from the PPG spectrum. The SpaMA approach was applied to three different datasets and four types of activities: (1) training datasets from the 2015 IEEE Signal Process. Cup Database recorded from 12 subjects while performing treadmill exercise from 1 km/h to 15 km/h; (2) test datasets from the 2015 IEEE Signal Process. Cup Database recorded from 11 subjects while performing forearm and upper arm exercise. (3) Chon Lab dataset including 10 min recordings from 10 subjects during treadmill exercise. The ECG signals from all three datasets provided the reference HRs which were used to determine the accuracy of our SpaMA algorithm. The performance of the SpaMA approach was calculated by computing the mean absolute error between the estimated HR from the PPG and the reference HR from the ECG. The average estimation errors using our method on the first, second and third datasets are 0.89, 1.93 and 1.38 beats/min respectively, while the overall error on all 33 subjects is 1.86 beats/min and the performance on only treadmill experiment datasets (22 subjects) is 1.11 beats/min. Moreover, it was found that dynamics of heart rate variability can be accurately captured using the algorithm where the mean Pearson’s correlation coefficient between the power spectral densities of the reference and the reconstructed heart rate time series was found to be 0.98. These results show that the SpaMA method has a potential for PPG-based HR monitoring in wearable devices for fitness tracking and health monitoring during intense physical activities. PMID:26703618
Salehizadeh, Seyed M A; Dao, Duy; Bolkhovsky, Jeffrey; Cho, Chae; Mendelson, Yitzhak; Chon, Ki H
2015-12-23
Accurate estimation of heart rates from photoplethysmogram (PPG) signals during intense physical activity is a very challenging problem. This is because strenuous and high intensity exercise can result in severe motion artifacts in PPG signals, making accurate heart rate (HR) estimation difficult. In this study we investigated a novel technique to accurately reconstruct motion-corrupted PPG signals and HR based on time-varying spectral analysis. The algorithm is called Spectral filter algorithm for Motion Artifacts and heart rate reconstruction (SpaMA). The idea is to calculate the power spectral density of both PPG and accelerometer signals for each time shift of a windowed data segment. By comparing time-varying spectra of PPG and accelerometer data, those frequency peaks resulting from motion artifacts can be distinguished from the PPG spectrum. The SpaMA approach was applied to three different datasets and four types of activities: (1) training datasets from the 2015 IEEE Signal Process. Cup Database recorded from 12 subjects while performing treadmill exercise from 1 km/h to 15 km/h; (2) test datasets from the 2015 IEEE Signal Process. Cup Database recorded from 11 subjects while performing forearm and upper arm exercise. (3) Chon Lab dataset including 10 min recordings from 10 subjects during treadmill exercise. The ECG signals from all three datasets provided the reference HRs which were used to determine the accuracy of our SpaMA algorithm. The performance of the SpaMA approach was calculated by computing the mean absolute error between the estimated HR from the PPG and the reference HR from the ECG. The average estimation errors using our method on the first, second and third datasets are 0.89, 1.93 and 1.38 beats/min respectively, while the overall error on all 33 subjects is 1.86 beats/min and the performance on only treadmill experiment datasets (22 subjects) is 1.11 beats/min. Moreover, it was found that dynamics of heart rate variability can be accurately captured using the algorithm where the mean Pearson's correlation coefficient between the power spectral densities of the reference and the reconstructed heart rate time series was found to be 0.98. These results show that the SpaMA method has a potential for PPG-based HR monitoring in wearable devices for fitness tracking and health monitoring during intense physical activities.
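A minimal sketch of the spectral idea behind SpaMA: compare the power spectra of a PPG window and an accelerometer window, discard PPG peaks that coincide with accelerometer (motion) peaks, and take the strongest remaining peak as the heart rate. The sampling rate, signal model and thresholds are synthetic assumptions, not the published algorithm's parameters.

import numpy as np
from scipy.signal import welch, find_peaks

fs = 125.0                     # Hz, hypothetical sampling rate
t = np.arange(0, 8, 1 / fs)    # one 8-second analysis window

# Hypothetical window: PPG = cardiac component (1.5 Hz ~ 90 bpm) + motion artifact (2.4 Hz) + noise
rng = np.random.default_rng(4)
ppg = np.sin(2 * np.pi * 1.5 * t) + 0.8 * np.sin(2 * np.pi * 2.4 * t) + 0.3 * rng.normal(size=t.size)
acc = np.sin(2 * np.pi * 2.4 * t) + 0.3 * rng.normal(size=t.size)   # accelerometer sees the motion only

f_ppg, p_ppg = welch(ppg, fs=fs, nperseg=512)
f_acc, p_acc = welch(acc, fs=fs, nperseg=512)

# Dominant accelerometer frequencies are treated as motion-artifact candidates
acc_peaks, _ = find_peaks(p_acc, height=0.1 * p_acc.max())
motion_freqs = f_acc[acc_peaks]

# Keep PPG peaks that are not within 0.1 Hz of any motion frequency, then pick the strongest
ppg_peaks, _ = find_peaks(p_ppg, height=0.05 * p_ppg.max())
clean = [i for i in ppg_peaks if np.all(np.abs(f_ppg[i] - motion_freqs) > 0.1)]
hr_hz = f_ppg[clean[np.argmax(p_ppg[clean])]]
print(round(hr_hz * 60, 1), "bpm")   # recovers roughly 90 bpm, up to the spectral resolution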
Point cloud registration from local feature correspondences-Evaluation on challenging datasets.
Petricek, Tomas; Svoboda, Tomas
2017-01-01
Registration of laser scans, or point clouds in general, is a crucial step of localization and mapping with mobile robots or in object modeling pipelines. A coarse alignment of the point clouds is generally needed before applying local methods such as the Iterative Closest Point (ICP) algorithm. We propose a feature-based approach to point cloud registration and evaluate the proposed method and its individual components on challenging real-world datasets. For a moderate overlap between the laser scans, the method provides a superior registration accuracy compared to state-of-the-art methods including Generalized ICP, 3D Normal-Distribution Transform, Fast Point-Feature Histograms, and 4-Points Congruent Sets. Compared to the surface normals, the points as the underlying features yield higher performance in both keypoint detection and establishing local reference frames. Moreover, sign disambiguation of the basis vectors proves to be an important aspect in creating repeatable local reference frames. A novel method for sign disambiguation is proposed which yields highly repeatable reference frames.
CORUM: the comprehensive resource of mammalian protein complexes
Ruepp, Andreas; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Stransky, Michael; Waegele, Brigitte; Schmidt, Thorsten; Doudieu, Octave Noubibou; Stümpflen, Volker; Mewes, H. Werner
2008-01-01
Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The CORUM (http://mips.gsf.de/genre/proj/corum/index.html) database is a collection of experimentally verified mammalian protein complexes. Information is manually derived by expert annotators through critical reading of the scientific literature. Information about protein complexes includes protein complex names, subunits, literature references as well as the function of the complexes. For functional annotation, we use the FunCat catalogue, which enables us to organize the protein complex space into biologically meaningful subsets. The database contains more than 1750 protein complexes that are built from 2400 different genes, thus representing 12% of the protein-coding genes in humans. A web-based system is available to query, view and download the data. CORUM provides a comprehensive dataset of protein complexes for discoveries in systems biology, analyses of protein networks and protein complex-associated diseases. Comparable to the MIPS reference dataset of protein complexes from yeast, CORUM intends to serve as a reference for mammalian protein complexes. PMID:17965090
NASA Technical Reports Server (NTRS)
Wang, Weile; Nemani, Ramakrishna R.; Michaelis, Andrew; Hashimoto, Hirofumi; Dungan, Jennifer L.; Thrasher, Bridget L.; Dixon, Keith W.
2016-01-01
The NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset is comprised of downscaled climate projections that are derived from 21 General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) and across two of the four greenhouse gas emissions scenarios (RCP4.5 and RCP8.5). Each of the climate projections includes daily maximum temperature, minimum temperature, and precipitation for the periods from 1950 through 2100, and the spatial resolution is 0.25 degrees (approximately 25 km x 25 km). The GDDP dataset has been warmly welcomed by the science community for conducting studies of climate change impacts at local to regional scales, but a comprehensive evaluation of its uncertainties is still missing. In this study, we apply the Perfect Model Experiment framework (Dixon et al. 2016) to quantify the key sources of uncertainties from the observational baseline dataset, the downscaling algorithm, and some intrinsic assumptions (e.g., the stationary assumption) inherent to the statistical downscaling techniques. We developed a set of metrics to evaluate downscaling errors resulting from bias-correction ("quantile-mapping"), spatial disaggregation, as well as the temporal-spatial non-stationarity of climate variability. Our results highlight the spatial disaggregation (or interpolation) errors, which dominate the overall uncertainties of the GDDP dataset, especially over heterogeneous and complex terrains (e.g., mountains and coastal areas). In comparison, the temporal errors in the GDDP dataset tend to be more constrained. Our results also indicate that the downscaled daily precipitation has relatively larger uncertainties than the temperature fields, reflecting the rather stochastic nature of precipitation in space. Therefore, our results provide insights into improving statistical downscaling algorithms and products in the future.
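A minimal sketch of empirical quantile-mapping bias correction, the "quantile-mapping" step named above, for a single grid cell and variable: each future model value is mapped to its quantile in the historical model distribution and replaced by the observed value at that quantile. The gamma-distributed samples are synthetic and this is not the exact BCSD implementation used for NEX-GDDP.

import numpy as np

def quantile_map(model_hist, obs_hist, model_future):
    """Empirical quantile-mapping bias correction (simplified, one grid cell, one variable)."""
    quantiles = np.linspace(0, 1, 101)
    model_q = np.quantile(model_hist, quantiles)
    obs_q = np.quantile(obs_hist, quantiles)
    # Map future model value -> its quantile in the historical model -> observed value at that quantile
    future_q = np.interp(model_future, model_q, quantiles)
    return np.interp(future_q, quantiles, obs_q)

rng = np.random.default_rng(5)
model_hist = rng.gamma(2.0, 2.0, 5000)     # hypothetical GCM daily precipitation, historical period
obs_hist = rng.gamma(2.0, 3.0, 5000)       # hypothetical observed baseline (wetter than the GCM)
model_future = rng.gamma(2.0, 2.2, 1000)   # hypothetical future GCM precipitation

corrected = quantile_map(model_hist, obs_hist, model_future)
print(round(float(model_future.mean()), 2), round(float(corrected.mean()), 2))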
Pantazatos, Spiro P.; Li, Jianrong; Pavlidis, Paul; Lussier, Yves A.
2009-01-01
An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n = 50), and precision of the semantic mapping between these terms across datasets was 98% (n = 100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets. PMID:20495688
CrossLink: a novel method for cross-condition classification of cancer subtypes.
Ma, Chifeng; Sastry, Konduru S; Flore, Mario; Gehani, Salah; Al-Bozom, Issam; Feng, Yusheng; Serpedin, Erchin; Chouchane, Lotfi; Chen, Yidong; Huang, Yufei
2016-08-22
We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset. The conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets always have the same distribution for all different conditions as the class-specific gene signatures change with the condition. Therefore, the trained classifier would work well under one condition but not under another. To address the problem of current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. In contrast, it exploits the fact that the signature is unique to its associated class under any condition and thus employs an unsupervised clustering algorithm to discover this unique signature. We assessed the performance of CL for cross-condition predictions of PAM50 subtypes of breast cancer by using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, and datasets with known and unknown PAM50 classification. CL achieved prediction accuracy >73 %, highest among other methods we evaluated. We also applied the algorithm to a set of breast cancer tumors derived from Arabic population to assign a PAM50 classification to each tumor based on their gene expression profiles. A novel algorithm CrossLink for cross-condition prediction of cancer classes was proposed. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.
Cooper, P David; Smart, David R
2017-06-01
Recent Australian attempts to facilitate disinvestment in healthcare, by identifying instances of 'inappropriate' care from large Government datasets, are subject to significant methodological flaws. Amongst other criticisms has been the fact that the Government datasets utilized for this purpose correlate poorly with datasets collected by relevant professional bodies. Government data derive from official hospital coding, collected retrospectively by clerical personnel, whilst professional body data derive from unit-specific databases, collected contemporaneously with care by clinical personnel. Assessment of accuracy of official hospital coding data for hyperbaric services in a tertiary referral hospital. All official hyperbaric-relevant coding data submitted to the relevant Australian Government agencies by the Royal Hobart Hospital, Tasmania, Australia for financial year 2010-2011 were reviewed and compared against actual hyperbaric unit activity as determined by reference to original source documents. Hospital coding data contained one or more errors in diagnoses and/or procedures in 70% of patients treated with hyperbaric oxygen that year. Multiple discrete error types were identified, including (but not limited to): missing patients; missing treatments; 'additional' treatments; 'additional' patients; incorrect procedure codes and incorrect diagnostic codes. Incidental observations of errors in surgical, anaesthetic and intensive care coding within this cohort suggest that the problems are not restricted to the specialty of hyperbaric medicine alone. Publications from other centres indicate that these problems are not unique to this institution or State. Current Government datasets are irretrievably compromised and not fit for purpose. Attempting to inform the healthcare policy debate by reference to these datasets is inappropriate. Urgent clinical engagement with hospital coding departments is warranted.
Lucas, Rico; Groeneveld, Jürgen; Harms, Hauke; Johst, Karin; Frank, Karin; Kleinsteuber, Sabine
2017-01-01
In times of global change and intensified resource exploitation, advanced knowledge of ecophysiological processes in natural and engineered systems driven by complex microbial communities is crucial for both safeguarding environmental processes and optimising rational control of biotechnological processes. To gain such knowledge, high-throughput molecular techniques are routinely employed to investigate microbial community composition and dynamics within a wide range of natural or engineered environments. However, for molecular dataset analyses no consensus about a generally applicable alpha diversity concept and no appropriate benchmarking of corresponding statistical indices exist yet. To overcome this, we listed criteria for the appropriateness of an index for such analyses and systematically scrutinised commonly employed ecological indices describing diversity, evenness and richness based on artificial and real molecular datasets. We identified appropriate indices warranting interstudy comparability and intuitive interpretability. The unified diversity concept based on 'effective numbers of types' provides the mathematical framework for describing community composition. Additionally, the Bray-Curtis dissimilarity as a beta-diversity index was found to reflect compositional changes. The employed statistical procedure is presented comprising commented R-scripts and example datasets for user-friendly trial application. © FEMS 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
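A minimal sketch of the 'effective numbers of types' (Hill numbers) that underpin the unified diversity concept, and of the Bray-Curtis dissimilarity mentioned as the beta-diversity index; the community count vectors are hypothetical.

import numpy as np

def hill_number(counts, q):
    """Effective number of types of order q: q=0 richness, q=1 exp(Shannon), q=2 inverse Simpson."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    if q == 1:
        return np.exp(-np.sum(p * np.log(p)))
    return np.sum(p ** q) ** (1.0 / (1.0 - q))

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.abs(x - y).sum() / (x + y).sum()

# Hypothetical OTU/ASV count vectors for two microbial community samples
sample_a = np.array([120, 80, 40, 10, 5, 0, 0])
sample_b = np.array([60, 90, 50, 30, 10, 5, 2])

for q in (0, 1, 2):
    print(q, round(hill_number(sample_a, q), 2), round(hill_number(sample_b, q), 2))
print("Bray-Curtis dissimilarity:", round(bray_curtis(sample_a, sample_b), 3))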
Data preparation techniques for a perinatal psychiatric study based on linked data.
Xu, Fenglian; Hilder, Lisa; Austin, Marie-Paule; Sullivan, Elizabeth A
2012-06-08
In recent years there has been an increase in the use of population-based linked data. However, there is little literature that describes the method of linked data preparation. This paper describes the method for merging data, calculating the statistical variable (SV), recoding psychiatric diagnoses and summarizing hospital admissions for a perinatal psychiatric study. The data preparation techniques described in this paper are based on linked birth data from the New South Wales (NSW) Midwives Data Collection (MDC), the Register of Congenital Conditions (RCC), the Admitted Patient Data Collection (APDC) and the Pharmaceutical Drugs of Addiction System (PHDAS). The master dataset is the meaningfully linked data, which includes all, or the major, study data collections. The master dataset can be used to improve the data quality, calculate the SV, and can be tailored for different analyses. To identify hospital admissions in the periods before pregnancy, during pregnancy and after birth, a statistical variable of time interval (SVTI) needs to be calculated. The methods and SPSS syntax for building a master dataset, calculating the SVTI, recoding the principal diagnoses of mental illness and summarizing hospital admissions are described. Linked data preparation, including building the master dataset and calculating the SV, can improve data quality and enhance data function.
Vu, Trung N; Valkenborg, Dirk; Smets, Koen; Verwaest, Kim A; Dommisse, Roger; Lemière, Filip; Verschoren, Alain; Goethals, Bart; Laukens, Kris
2011-10-20
Nuclear magnetic resonance spectroscopy (NMR) is a powerful technique to reveal and compare quantitative metabolic profiles of biological tissues. However, chemical and physical sample variations make the analysis of the data challenging, and typically require the application of a number of preprocessing steps prior to data interpretation. For example, noise reduction, normalization, baseline correction, peak picking, spectrum alignment and statistical analysis are indispensable components in any NMR analysis pipeline. We introduce a novel suite of informatics tools for the quantitative analysis of NMR metabolomic profile data. The core of the processing cascade is a novel peak alignment algorithm, called hierarchical Cluster-based Peak Alignment (CluPA). The algorithm aligns a target spectrum to the reference spectrum in a top-down fashion by building a hierarchical cluster tree from peak lists of reference and target spectra and then dividing the spectra into smaller segments based on the most distant clusters of the tree. To reduce the computational time to estimate the spectral misalignment, the method makes use of Fast Fourier Transformation (FFT) cross-correlation. Since the method returns a high-quality alignment, we can propose a simple methodology to study the variability of the NMR spectra. For each aligned NMR data point the ratio of the between-group and within-group sum of squares (BW-ratio) is calculated to quantify the difference in variability between and within predefined groups of NMR spectra. This differential analysis is related to the calculation of the F-statistic or a one-way ANOVA, but without distributional assumptions. Statistical inference based on the BW-ratio is achieved by bootstrapping the null distribution from the experimental data. The workflow performance was evaluated using a previously published dataset. Correlation maps, spectral and grey scale plots show clear improvements in comparison to other methods, and the down-to-earth quantitative analysis works well for the CluPA-aligned spectra. The whole workflow is embedded into a modular and statistically sound framework that is implemented as an R package called "speaq" ("spectrum alignment and quantitation"), which is freely available from http://code.google.com/p/speaq/.
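A minimal sketch of the FFT cross-correlation step used to estimate the misalignment between a reference and a target spectral segment, the core of reducing the computational cost of peak alignment. The Gaussian "peaks" and segment length are hypothetical, and this is only the shift-estimation step, not the full hierarchical CluPA procedure.

import numpy as np

def fft_shift_estimate(reference, target):
    """Estimate the lateral shift between two spectral segments via FFT cross-correlation."""
    n = len(reference)
    # Circular cross-correlation through the frequency domain (adequate for small shifts)
    corr = np.fft.ifft(np.fft.fft(reference) * np.conj(np.fft.fft(target))).real
    lag = int(np.argmax(corr))
    return lag if lag <= n // 2 else lag - n       # map to a signed shift

# Hypothetical NMR segment: the target peak is the reference peak shifted right by 7 points
x = np.arange(512)
reference = np.exp(-0.5 * ((x - 200) / 4.0) ** 2)
target = np.exp(-0.5 * ((x - 207) / 4.0) ** 2)

print(fft_shift_estimate(reference, target))   # -7: the target peak sits 7 points right of the reference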
Mercer, Theresa G; Frostick, Lynne E; Walmsley, Anthony D
2011-10-15
This paper presents a statistical technique that can be applied to environmental chemistry data where missing values and limit of detection levels prevent the application of standard statistical methods. A working example is taken from an environmental leaching study that was set up to determine if there were significant differences in levels of leached arsenic (As), chromium (Cr) and copper (Cu) between lysimeters containing preservative-treated wood waste and those containing untreated wood. Fourteen lysimeters were set up and left in natural conditions for 21 weeks. The resultant leachate was analysed by ICP-OES to determine the As, Cr and Cu concentrations. However, due to the variation inherent in each lysimeter combined with the limits of detection offered by ICP-OES, the collected quantitative data were somewhat incomplete. Initial data analysis was hampered by the number of 'missing values' in the data. To recover the dataset, the statistical tool of Statistical Multiple Imputation (SMI) was applied, and the data were re-analysed successfully. It was demonstrated that using SMI did not affect the variance in the data, but facilitated analysis of the complete dataset. Copyright © 2011 Elsevier B.V. All rights reserved.
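The following is a hedged illustration of the general idea, not the authors' exact SMI procedure: values below the limit of detection are marked missing and filled by repeated imputation, after which each completed dataset is summarized and the results pooled. The mock concentrations, detection limit and imputer choice are assumptions for the sketch.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
conc = rng.lognormal(mean=0.0, sigma=1.0, size=(14, 3))   # mock As, Cr, Cu leachate data
lod = 0.5
censored = np.where(conc < lod, np.nan, conc)             # below-LOD values treated as missing

estimates = []
for seed in range(5):                                      # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(censored)
    estimates.append(completed.mean(axis=0))               # per-element mean concentration

pooled_mean = np.mean(estimates, axis=0)                   # pooled point estimates across imputations
print(pooled_mean)
```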
NASA Astrophysics Data System (ADS)
Hoell, Simon; Omenzetter, Piotr
2017-07-01
Considering jointly damage sensitive features (DSFs) of signals recorded by multiple sensors, applying advanced transformations to these DSFs and assessing systematically their contribution to damage detectability and localisation can significantly enhance the performance of structural health monitoring systems. This philosophy is explored here for partial autocorrelation coefficients (PACCs) of acceleration responses. They are interrogated with the help of the linear discriminant analysis based on the Fukunaga-Koontz transformation using datasets of the healthy and selected reference damage states. Then, a simple but efficient fast forward selection procedure is applied to rank the DSF components with respect to statistical distance measures specialised for either damage detection or localisation. For the damage detection task, the optimal feature subsets are identified based on the statistical hypothesis testing. For damage localisation, a hierarchical neuro-fuzzy tool is developed that uses the DSF ranking to establish its own optimal architecture. The proposed approaches are evaluated experimentally on data from non-destructively simulated damage in a laboratory scale wind turbine blade. The results support our claim of being able to enhance damage detectability and localisation performance by transforming and optimally selecting DSFs. It is demonstrated that the optimally selected PACCs from multiple sensors or their Fukunaga-Koontz transformed versions can not only improve the detectability of damage via statistical hypothesis testing but also increase the accuracy of damage localisation when used as inputs into a hierarchical neuro-fuzzy network. Furthermore, the computational effort of employing these advanced soft computing models for damage localisation can be significantly reduced by using transformed DSFs.
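As a small illustration of the damage-sensitive features named above, the sketch below extracts partial autocorrelation coefficients (PACCs) from multi-sensor acceleration records and stacks them into a joint feature vector; the signal lengths and number of lags are illustrative, not values from the paper.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(2)
acceleration = rng.normal(size=(4, 2048))          # 4 sensors x 2048 samples per record

n_lags = 10
dsf = np.vstack([pacf(channel, nlags=n_lags)[1:]   # drop lag 0 (always equal to 1)
                 for channel in acceleration])     # (n_sensors, n_lags) feature matrix
feature_vector = dsf.ravel()                       # joint DSF vector over all sensors
print(feature_vector.shape)
```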
Earth History databases and visualization - the TimeScale Creator system
NASA Astrophysics Data System (ADS)
Ogg, James; Lugowski, Adam; Gradstein, Felix
2010-05-01
The "TimeScale Creator" team (www.tscreator.org) and the Subcommission on Stratigraphic Information (stratigraphy.science.purdue.edu) of the International Commission on Stratigraphy (www.stratigraphy.org) has worked with numerous geoscientists and geological surveys to prepare reference datasets for global and regional stratigraphy. All events are currently calibrated to Geologic Time Scale 2004 (Gradstein et al., 2004, Cambridge Univ. Press) and Concise Geologic Time Scale (Ogg et al., 2008, Cambridge Univ. Press); but the array of intercalibrations enable dynamic adjustment to future numerical age scales and interpolation methods. The main "global" database contains over 25,000 events/zones from paleontology, geomagnetics, sea-level and sequence stratigraphy, igneous provinces, bolide impacts, plus several stable isotope curves and image sets. Several regional datasets are provided in conjunction with geological surveys, with numerical ages interpolated using a similar flexible inter-calibration procedure. For example, a joint program with Geoscience Australia has compiled an extensive Australian regional biostratigraphy and a full array of basin lithologic columns with each formation linked to public lexicons of all Proterozoic through Phanerozoic basins - nearly 500 columns of over 9,000 data lines plus hot-curser links to oil-gas reference wells. Other datapacks include New Zealand biostratigraphy and basin transects (ca. 200 columns), Russian biostratigraphy, British Isles regional stratigraphy, Gulf of Mexico biostratigraphy and lithostratigraphy, high-resolution Neogene stable isotope curves and ice-core data, human cultural episodes, and Circum-Arctic stratigraphy sets. The growing library of datasets is designed for viewing and chart-making in the free "TimeScale Creator" JAVA package. This visualization system produces a screen display of the user-selected time-span and the selected columns of geologic time scale information. The user can change the vertical-scale, column widths, fonts, colors, titles, ordering, range chart options and many other features. Mouse-activated pop-ups provide additional information on columns and events; including links to external Internet sites. The graphics can be saved as SVG (scalable vector graphics) or PDF files for direct import into Adobe Illustrator or other common drafting software. Users can load additional regional datapacks, and create and upload their own datasets. The "Pro" version has additional dataset-creation tools, output options and the ability to edit and re-save merged datasets. The databases and visualization package are envisioned as a convenient reference tool, chart-production assistant, and a window into the geologic history of our planet.
Sauzet, Odile; Peacock, Janet L
2017-07-20
The analysis of perinatal outcomes often involves datasets with some multiple births. These are datasets mostly formed of independent observations and a limited number of clusters of size two (twins) and maybe of size three or more. This non-independence needs to be accounted for in the statistical analysis. Using simulated data based on a dataset of preterm infants, we have previously investigated the performance of several approaches to the analysis of continuous outcomes in the presence of some clusters of size two. Mixed models have been developed for binomial outcomes but very little is known about their reliability when only a limited number of small clusters are present. Using simulated data based on a dataset of preterm infants, we investigated the performance of several approaches to the analysis of binomial outcomes in the presence of some clusters of size two. Logistic models, several methods of estimation for the logistic random intercept models and generalised estimating equations were compared. The presence of even a small percentage of twins means that a logistic regression model will underestimate all parameters, but a logistic random intercept model fails to estimate the correlation between siblings if the percentage of twins is too small and will provide similar estimates to logistic regression. The method that seems to provide the best balance between estimation of the standard error and of the parameter, for any percentage of twins, is generalised estimating equations. This study has shown that the number of covariates or the level-two variance does not necessarily affect the performance of the various methods used to analyse datasets containing twins, but when the percentage of small clusters is too small, mixed models cannot capture the dependence between siblings.
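A minimal sketch of the approach the abstract favours is given below: a logistic GEE model with an exchangeable working correlation, where clusters are mothers (singletons are clusters of size one, twins of size two). The data, covariate and effect sizes are simulated for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for m in range(300):
    n_babies = 2 if rng.random() < 0.05 else 1        # roughly 5% twin clusters
    u = rng.normal(scale=0.8)                         # shared mother-level effect
    for _ in range(n_babies):
        x = rng.normal()
        p = 1 / (1 + np.exp(-(-1.0 + 0.7 * x + u)))
        rows.append({"mother": m, "x": x, "outcome": rng.binomial(1, p)})
data = pd.DataFrame(rows)

model = smf.gee("outcome ~ x", groups="mother", data=data,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```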
Annotating spatio-temporal datasets for meaningful analysis in the Web
NASA Astrophysics Data System (ADS)
Stasch, Christoph; Pebesma, Edzer; Scheider, Simon
2014-05-01
More and more environmental datasets that vary in space and time are available on the Web. This comes with the advantage that the data can be used for purposes other than originally foreseen, but also with the danger that users may apply inappropriate analysis procedures due to a lack of awareness of important assumptions made during the data collection process. In order to guide towards a meaningful (statistical) analysis of spatio-temporal datasets available on the Web, we have developed a Higher-Order-Logic formalism that captures some relevant assumptions in our previous work [1]. It allows meaningful spatial prediction and aggregation to be proved in a semi-automated fashion. In this poster presentation, we will present a concept for annotating spatio-temporal datasets available on the Web with concepts defined in our formalism. To this end, we have defined a subset of the formalism as a Web Ontology Language (OWL) pattern. It allows capturing the distinction between the different spatio-temporal variable types, i.e. point patterns, fields, lattices and trajectories, that in turn determine whether a particular dataset can be interpolated or aggregated in a meaningful way using a certain procedure. The actual annotations that link spatio-temporal datasets with the concepts in the ontology pattern are provided as Linked Data. In order to allow data producers to add the annotations to their datasets, we have implemented a Web portal that uses a triple store at the backend to store the annotations and to make them available in the Linked Data cloud. Furthermore, we have implemented functions in the statistical environment R to retrieve the RDF annotations and, based on these annotations, to support a stronger typing of spatio-temporal datatypes guiding towards a meaningful analysis in R. [1] Stasch, C., Scheider, S., Pebesma, E., Kuhn, W. (2014): "Meaningful spatial prediction and aggregation", Environmental Modelling & Software, 51, 149-165.
Establishing Consensus Turbulence Statistics for Hot Subsonic Jets
NASA Technical Reports Server (NTRS)
Bridges, James; Werner, Mark P.
2010-01-01
Many tasks in fluids engineering require knowledge of the turbulence in jets. There is a strong, although fragmented, literature base for low-order statistics, such as jet spread and other mean-velocity field characteristics. Some sources, particularly for low-speed cold jets, also provide turbulence intensities that are required for validating Reynolds-averaged Navier-Stokes (RANS) Computational Fluid Dynamics (CFD) codes. There are far fewer sources for jet spectra and for space-time correlations of turbulent velocity required for aeroacoustics applications, although there have been many singular publications with various unique statistics, such as Proper Orthogonal Decomposition, designed to uncover an underlying low-order dynamical description of turbulent jet flow. As the complexity of the statistic increases, the number of flows for which the data have been categorized and assembled decreases, making it difficult to systematically validate prediction codes that require high-level statistics over a broad range of jet flow conditions. For several years, researchers at NASA have worked on developing and validating jet noise prediction codes. One such class of codes, loosely called CFD-based or statistical methods, uses RANS CFD to predict jet mean and turbulent intensities in velocity and temperature. These flow quantities serve as the input to the acoustic source models and flow-sound interaction calculations that yield predictions of far-field jet noise. To develop this capability, a catalog of turbulent jet flows has been created with statistics ranging from mean velocity to space-time correlations of Reynolds stresses. The present document describes this catalog and assesses the accuracy of the data, e.g., establishes uncertainties for the data. This paper covers the following five tasks: (1) document acquisition and processing procedures used to create the particle image velocimetry (PIV) datasets; (2) compare PIV data with hot-wire and laser Doppler velocimetry (LDV) data published in the open literature; (3) compare different datasets acquired at roughly the same flow conditions to establish uncertainties; (4) create a consensus dataset for a range of hot jet flows, including uncertainty bands; and (5) analyze this consensus dataset for self-consistency and compare jet characteristics to those of the open literature. One final objective fulfilled by this work was the demonstration of a universal scaling for the jet flow fields, at least within the region of interest to aeroacoustics. The potential core length and the spread rate of the half-velocity radius were used to collapse the mean and turbulent velocity fields over the first 20 jet diameters in a highly satisfying manner.
Campbell, J. Peter; Kalpathy-Cramer, Jayashree; Erdogmus, Deniz; Tian, Peng; Kedarisetti, Dharanish; Moleta, Chace; Reynolds, James D.; Hutcheson, Kelly; Shapiro, Michael J.; Repka, Michael X.; Ferrone, Philip; Drenser, Kimberly; Horowitz, Jason; Sonmez, Kemal; Swan, Ryan; Ostmo, Susan; Jonas, Karyn E.; Chan, R.V. Paul; Chiang, Michael F.
2016-01-01
Objective To identify patterns of inter-expert discrepancy in plus disease diagnosis in retinopathy of prematurity (ROP). Design We developed two datasets of clinical images of varying disease severity (100 images and 34 images) as part of the Imaging and Informatics in ROP study, and determined a consensus reference standard diagnosis (RSD) for each image, based on 3 independent image graders and the clinical exam. We recruited 8 expert ROP clinicians to classify these images and compared the distribution of classifications between experts and the RSD. Subjects, Participants, and/or Controls Images obtained during routine ROP screening in neonatal intensive care units. Eight participating experts with >10 years of clinical ROP experience and >5 peer-reviewed ROP publications. Methods, Intervention, or Testing Expert classification of images of plus disease in ROP. Main Outcome Measures Inter-expert agreement (weighted kappa statistic), and agreement and bias on ordinal classification between experts (ANOVA) and the RSD (percent agreement). Results There was variable inter-expert agreement on diagnostic classifications between the 8 experts and the RSD (weighted kappa 0-0.75, mean 0.30). RSD agreement ranged from 80% to 94% for the dataset of 100 images, and from 29% to 79% for the dataset of 34 images. However, when images were ranked in order of disease severity (by average expert classification), the pattern of expert classification revealed a consistent systematic bias for each expert consistent with unique cut points for the diagnosis of plus disease and pre-plus disease. The two-way ANOVA model suggested a highly significant effect of both image and user on the average score (P<0.05, adjusted R2=0.82 for dataset A; P<0.05, adjusted R2=0.6615 for dataset B). Conclusions and Relevance There is wide variability in the classification of plus disease by ROP experts, which occurs because experts have different “cut-points” for the amounts of vascular abnormality required for presence of plus and pre-plus disease. This has important implications for research, teaching and patient care for ROP, and suggests that a continuous ROP plus disease severity score may more accurately reflect the behavior of expert ROP clinicians, and may better standardize classification in the future. PMID:27591053
Campbell, J Peter; Kalpathy-Cramer, Jayashree; Erdogmus, Deniz; Tian, Peng; Kedarisetti, Dharanish; Moleta, Chace; Reynolds, James D; Hutcheson, Kelly; Shapiro, Michael J; Repka, Michael X; Ferrone, Philip; Drenser, Kimberly; Horowitz, Jason; Sonmez, Kemal; Swan, Ryan; Ostmo, Susan; Jonas, Karyn E; Chan, R V Paul; Chiang, Michael F
2016-11-01
To identify patterns of interexpert discrepancy in plus disease diagnosis in retinopathy of prematurity (ROP). We developed 2 datasets of clinical images as part of the Imaging and Informatics in ROP study and determined a consensus reference standard diagnosis (RSD) for each image based on 3 independent image graders and the clinical examination results. We recruited 8 expert ROP clinicians to classify these images and compared the distribution of classifications between experts and the RSD. Eight participating experts with more than 10 years of clinical ROP experience and more than 5 peer-reviewed ROP publications who analyzed images obtained during routine ROP screening in neonatal intensive care units. Expert classification of images of plus disease in ROP. Interexpert agreement (weighted κ statistic) and agreement and bias on ordinal classification between experts (analysis of variance [ANOVA]) and the RSD (percent agreement). There was variable interexpert agreement on diagnostic classifications between the 8 experts and the RSD (weighted κ, 0-0.75; mean, 0.30). The RSD agreement ranged from 80% to 94% for the dataset of 100 images and from 29% to 79% for the dataset of 34 images. However, when images were ranked in order of disease severity (by average expert classification), the pattern of expert classification revealed a consistent systematic bias for each expert consistent with unique cut points for the diagnosis of plus disease and preplus disease. The 2-way ANOVA model suggested a highly significant effect of both image and user on the average score (dataset A: P < 0.05 and adjusted R2 = 0.82; dataset B: P < 0.05 and adjusted R2 = 0.6615). There is wide variability in the classification of plus disease by ROP experts, which occurs because experts have different cut points for the amounts of vascular abnormality required for presence of plus and preplus disease. This has important implications for research, teaching, and patient care for ROP and suggests that a continuous ROP plus disease severity score may reflect more accurately the behavior of expert ROP clinicians and may better standardize classification in the future. Copyright © 2016 American Academy of Ophthalmology. Published by Elsevier Inc. All rights reserved.
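The agreement measures used in the two records above (weighted kappa and percent agreement) are straightforward to compute; the sketch below shows one way to do so for a single expert against the reference standard, with invented labels and a quadratic weighting scheme assumed for illustration.

```python
from sklearn.metrics import cohen_kappa_score

levels = {"normal": 0, "pre-plus": 1, "plus": 2}
rsd    = ["normal", "pre-plus", "plus", "plus", "normal", "pre-plus"]
expert = ["normal", "normal",   "plus", "pre-plus", "normal", "plus"]

kappa = cohen_kappa_score([levels[x] for x in rsd],
                          [levels[x] for x in expert],
                          weights="quadratic")       # weighted kappa for ordinal classes
agreement = sum(a == b for a, b in zip(rsd, expert)) / len(rsd)   # raw percent agreement
print(f"weighted kappa = {kappa:.2f}, percent agreement = {agreement:.0%}")
```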
The ISPRS Benchmark on Indoor Modelling
NASA Astrophysics Data System (ADS)
Khoshelham, K.; Díaz Vilariño, L.; Peter, M.; Kang, Z.; Acharya, D.
2017-09-01
Automated generation of 3D indoor models from point cloud data has been a topic of intensive research in recent years. While results on various datasets have been reported in the literature, a comparison of the performance of different methods has not been possible due to the lack of benchmark datasets and a common evaluation framework. The ISPRS benchmark on indoor modelling aims to address this issue by providing a public benchmark dataset and an evaluation framework for performance comparison of indoor modelling methods. In this paper, we present the benchmark dataset comprising several point clouds of indoor environments captured by different sensors. We also discuss the evaluation and comparison of indoor modelling methods based on manually created reference models and appropriate quality evaluation criteria. The benchmark dataset is available for download at: http://www2.isprs.org/commissions/comm4/wg5/benchmark-on-indoor-modelling.html.
Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models
Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.
2016-01-01
An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777
Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.
Ernst, Jason; Kellis, Manolis
2015-04-01
With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.
Improved analyses using function datasets and statistical modeling
John S. Hogland; Nathaniel M. Anderson
2014-01-01
Raster modeling is an integral component of spatial analysis. However, conventional raster modeling techniques can require a substantial amount of processing time and storage space and have limited statistical functionality and machine learning algorithms. To address this issue, we developed a new modeling framework using C# and ArcObjects and integrated that framework...
Providing Geographic Datasets as Linked Data in SDI
NASA Astrophysics Data System (ADS)
Hietanen, E.; Lehto, L.; Latvala, P.
2016-06-01
In this study, a prototype service to provide data from a Web Feature Service (WFS) as linked data is implemented. First, persistent and unique Uniform Resource Identifiers (URIs) are created for all spatial objects in the dataset. The objects are available from those URIs in Resource Description Framework (RDF) data format. Next, a Web Ontology Language (OWL) ontology is created to describe the dataset information content using the Open Geospatial Consortium's (OGC) GeoSPARQL vocabulary. The existing data model is modified in order to take into account the linked data principles. The implemented service produces an HTTP response dynamically. The data for the response are first fetched from an existing WFS. Then the Geographic Markup Language (GML) format output of the WFS is transformed on-the-fly to the RDF format. Content negotiation is used to serve the data in different RDF serialization formats. This solution facilitates the use of a dataset in different applications without replicating the whole dataset. In addition, individual spatial objects in the dataset can be referred to with URIs. Furthermore, the needed information content of the objects can be easily extracted from the RDF serializations available from those URIs. A solution for linking data objects to the dataset URI is also introduced by using the Vocabulary of Interlinked Datasets (VoID). The dataset is divided into subsets and each subset is given its own persistent and unique URI. This enables the whole dataset to be explored with a web browser and all individual objects to be indexed by search engines.
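As a small illustration of the general pattern (our own sketch, not the cited prototype's code), the snippet below mints a URI for one spatial feature and describes it with GeoSPARQL terms using rdflib; the base URI and feature identifier are hypothetical.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

GEO = Namespace("http://www.opengis.net/ont/geosparql#")
EX = Namespace("http://example.org/dataset/")          # hypothetical base URI for the dataset

g = Graph()
g.bind("geo", GEO)

feature = URIRef(EX["feature/12345"])                  # persistent URI for one spatial object
geometry = URIRef(EX["geometry/12345"])
g.add((feature, RDF.type, GEO.Feature))
g.add((feature, GEO.hasGeometry, geometry))
g.add((geometry, RDF.type, GEO.Geometry))
g.add((geometry, GEO.asWKT, Literal("POINT(24.95 60.17)", datatype=GEO.wktLiteral)))

print(g.serialize(format="turtle"))                    # one of several negotiable serializations
```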
Leveling data in geochemical mapping: scope of application, pros and cons of existing methods
NASA Astrophysics Data System (ADS)
Pereira, Benoît; Vandeuren, Aubry; Sonnet, Philippe
2017-04-01
Geochemical mapping has successfully met a range of needs, from mineral exploration to environmental management. In Europe and around the world, numerous geochemical datasets already exist. These datasets may originate from geochemical mapping projects or from the collection of sample analyses requested by environmental protection regulatory bodies. Combining datasets can be highly beneficial for establishing geochemical maps with increased resolution and/or coverage area. However, this practice requires assessing the equivalence between datasets and, if needed, applying data leveling to remove possible biases between datasets. In the literature, several procedures for assessing dataset equivalence and leveling data are proposed. Daneshfar & Cameron (1998) proposed a method for the leveling of two adjacent datasets while Pereira et al. (2016) proposed two methods for the leveling of datasets that contain records located within the same geographical area. Each discussed method requires its own set of assumptions (underlying populations of data, spatial distribution of data, etc.). Here we discuss the scope of application, the pros and cons, and practical recommendations for each method. This work is illustrated with several case studies in Wallonia (Southern Belgium) and in Europe involving trace element geochemical datasets. References: Daneshfar, B. & Cameron, E. (1998), Leveling geochemical data between map sheets, Journal of Geochemical Exploration 63(3), 189-201. Pereira, B.; Vandeuren, A.; Govaerts, B. B. & Sonnet, P. (2016), Assessing dataset equivalence and leveling data in geochemical mapping, Journal of Geochemical Exploration 168, 36-48.
Del Carratore, Francesco; Jankevics, Andris; Eisinga, Rob; Heskes, Tom; Hong, Fangxin; Breitling, Rainer
2017-09-01
The Rank Product (RP) is a statistical technique widely used to detect differentially expressed features in molecular profiling experiments such as transcriptomics, metabolomics and proteomics studies. An implementation of the RP and the closely related Rank Sum (RS) statistics has been available in the RankProd Bioconductor package for several years. However, several recent advances in the understanding of the statistical foundations of the method have made a complete refactoring of the existing package desirable. We implemented a completely refactored version of the RankProd package, which provides a more principled implementation of the statistics for unpaired datasets. Moreover, the permutation-based P-value estimation methods have been replaced by exact methods, providing faster and more accurate results. RankProd 2.0 is available at Bioconductor (https://www.bioconductor.org/packages/devel/bioc/html/RankProd.html) and as part of the mzMatch pipeline (http://www.mzmatch.sourceforge.net). Contact: rainer.breitling@manchester.ac.uk. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
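For orientation, the basic rank product statistic itself is simple: genes are ranked within each replicate comparison and the per-gene ranks are combined by a geometric mean. The sketch below shows only this core statistic on simulated fold changes, not the RankProd 2.0 implementation or its exact P-value computation.

```python
import numpy as np

def rank_product(fold_changes):
    """fold_changes: (n_genes, n_replicates); rank 1 = strongest up-regulation."""
    ranks = np.argsort(np.argsort(-fold_changes, axis=0), axis=0) + 1
    return np.exp(np.log(ranks).mean(axis=1))          # geometric mean of ranks per gene

rng = np.random.default_rng(4)
fc = rng.normal(size=(1000, 4))
fc[:10] += 2.0                                          # 10 truly up-regulated genes
rp = rank_product(fc)
top_genes = np.argsort(rp)[:10]                         # genes with the smallest rank product
print(top_genes)
```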
RepExplore: addressing technical replicate variance in proteomics and metabolomics data analysis.
Glaab, Enrico; Schneider, Reinhard
2015-07-01
High-throughput omics datasets often contain technical replicates included to account for technical sources of noise in the measurement process. Although summarizing these replicate measurements by using robust averages may help to reduce the influence of noise on downstream data analysis, the information on the variance across the replicate measurements is lost in the averaging process and therefore typically disregarded in subsequent statistical analyses. We introduce RepExplore, a web-service dedicated to exploit the information captured in the technical replicate variance to provide more reliable and informative differential expression and abundance statistics for omics datasets. The software builds on previously published statistical methods, which have been applied successfully to biomedical omics data but are difficult to use without prior experience in programming or scripting. RepExplore facilitates the analysis by providing a fully automated data processing and interactive ranking tables, whisker plot, heat map and principal component analysis visualizations to interpret omics data and derived statistics. Freely available at http://www.repexplore.tk. Contact: enrico.glaab@uni.lu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Effect of the absolute statistic on gene-sampling gene-set analysis methods.
Nam, Dougu
2017-06-01
Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
SPAR: small RNA-seq portal for analysis of sequencing experiments.
Kuksa, Pavel P; Amlie-Wolf, Alexandre; Katanic, Živadin; Valladares, Otto; Wang, Li-San; Leung, Yuk Yee
2018-05-04
The introduction of new high-throughput small RNA sequencing protocols that generate large-scale genomics datasets along with increasing evidence of the significant regulatory roles of small non-coding RNAs (sncRNAs) have highlighted the urgent need for tools to analyze and interpret large amounts of small RNA sequencing data. However, it remains challenging to systematically and comprehensively discover and characterize sncRNA genes and specifically-processed sncRNA products from these datasets. To fill this gap, we present Small RNA-seq Portal for Analysis of sequencing expeRiments (SPAR), a user-friendly web server for interactive processing, analysis, annotation and visualization of small RNA sequencing data. SPAR supports sequencing data generated from various experimental protocols, including smRNA-seq, short total RNA sequencing, microRNA-seq, and single-cell small RNA-seq. Additionally, SPAR includes publicly available reference sncRNA datasets from our DASHR database and from ENCODE across 185 human tissues and cell types to produce highly informative small RNA annotations across all major small RNA types and other features such as co-localization with various genomic features, precursor transcript cleavage patterns, and conservation. SPAR allows the user to compare the input experiment against reference ENCODE/DASHR datasets. SPAR currently supports analyses of human (hg19, hg38) and mouse (mm10) sequencing data. SPAR is freely available at https://www.lisanwanglab.org/SPAR.
Pujar, Shashikant; O'Leary, Nuala A; Farrell, Catherine M; Loveland, Jane E; Mudge, Jonathan M; Wallin, Craig; Girón, Carlos G; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; Martin, Fergal J; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Suner, Marie-Marthe; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bruford, Elspeth A; Bult, Carol J; Frankish, Adam; Murphy, Terence; Pruitt, Kim D
2018-01-04
The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.
Sornborger, Andrew; Broder, Josef; Majumder, Anirban; Srinivasamoorthy, Ganesh; Porter, Erika; Reagin, Sean S; Keith, Charles; Lauderdale, James D
2008-09-01
Ratiometric fluorescent indicators are used for making quantitative measurements of a variety of physiological variables. Their utility is often limited by noise. This is the second in a series of papers describing statistical methods for denoising ratiometric data with the aim of obtaining improved quantitative estimates of variables of interest. Here, we outline a statistical optimization method that is designed for the analysis of ratiometric imaging data in which multiple measurements have been taken of systems responding to the same stimulation protocol. This method takes advantage of correlated information across multiple datasets for objectively detecting and estimating ratiometric signals. We demonstrate our method by showing results of its application on multiple, ratiometric calcium imaging experiments.
Data Analysis and Statistical Methods for the Assessment and Interpretation of Geochronologic Data
NASA Astrophysics Data System (ADS)
Reno, B. L.; Brown, M.; Piccoli, P. M.
2007-12-01
Ages are traditionally reported as a weighted mean with an uncertainty based on least squares analysis of analytical error on individual dates. This method does not take into account geological uncertainties, and cannot accommodate asymmetries in the data. In most instances, this method will understate uncertainty on a given age, which may lead to overinterpretation of age data. Geologic uncertainty is difficult to quantify, but is typically greater than analytical uncertainty. These factors make traditional statistical approaches inadequate to fully evaluate geochronologic data. We propose a protocol to assess populations within multi-event datasets and to calculate age and uncertainty from each population of dates interpreted to represent a single geologic event using robust and resistant statistical methods. To assess whether populations thought to represent different events are statistically separate, exploratory data analysis is undertaken using a box plot, where the range of the data is represented by a 'box' of length given by the interquartile range, divided at the median of the data, with 'whiskers' that extend to the furthest datapoint that lies within 1.5 times the interquartile range beyond the box. If the boxes representing the populations do not overlap, they are interpreted to represent statistically different sets of dates. Ages are calculated from statistically distinct populations using a robust tool such as the tanh method of Kelsey et al. (2003, CMP, 146, 326-340), which is insensitive to any assumptions about the underlying probability distribution from which the data are drawn. Therefore, this method takes into account the full range of data, and is not drastically affected by outliers. The interquartile range of each population of dates gives a first pass at expressing uncertainty, which accommodates asymmetry in the dataset; outliers have a minor effect on the uncertainty. To better quantify the uncertainty, a resistant tool that is insensitive to local misbehavior of data is preferred, such as the normalized median absolute deviations proposed by Powell et al. (2002, Chem Geol, 185, 191-204). We illustrate the method using a dataset of 152 monazite dates determined using EPMA chemical data from a single sample from the Neoproterozoic Brasília Belt, Brazil. Results are compared with ages and uncertainties calculated using traditional methods to demonstrate the differences. The dataset was manually culled into three populations representing discrete compositional domains within chemically-zoned monazite grains. The weighted mean ages and least squares uncertainties for these populations are 633±6 (2σ) Ma for a core domain, 614±5 (2σ) Ma for an intermediate domain and 595±6 (2σ) Ma for a rim domain. Probability distribution plots indicate asymmetric distributions of all populations, which cannot be accounted for with traditional statistical tools. These three domains record distinct ages outside the interquartile range for each population of dates, with the core domain lying in the subrange 642-624 Ma, the intermediate domain 617-609 Ma and the rim domain 606-589 Ma. The tanh estimator yields ages of 631±7 (2σ) for the core domain, 616±7 (2σ) for the intermediate domain and 601±8 (2σ) for the rim domain.
Although the uncertainties derived using resistant statistical tools are larger than those derived from traditional statistical tools, they are more realistic, better reflect the spread in the dataset and account for asymmetry in the data.
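The resistant summaries discussed above are easy to reproduce; the sketch below computes the median, interquartile range, a box-plot overlap check and a normalised median absolute deviation (nMAD) for two invented populations of dates, standing in for the culled monazite domains.

```python
import numpy as np

def robust_summary(dates_ma):
    dates_ma = np.asarray(dates_ma, dtype=float)
    med = np.median(dates_ma)
    q1, q3 = np.percentile(dates_ma, [25, 75])
    nmad = 1.4826 * np.median(np.abs(dates_ma - med))   # consistent with sigma for normal data
    return {"median": med, "iqr": (q1, q3), "nmad": nmad}

core = [642, 638, 635, 633, 630, 628, 624]              # hypothetical EPMA dates (Ma)
rim  = [606, 603, 601, 599, 596, 592, 589]
s_core, s_rim = robust_summary(core), robust_summary(rim)

# Boxes (IQRs) that do not overlap suggest statistically separate populations.
separate = s_core["iqr"][0] > s_rim["iqr"][1] or s_rim["iqr"][0] > s_core["iqr"][1]
print(s_core, s_rim, "separate populations:", separate)
```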
Evaluation of reference evapotranspiration methods in arid, semiarid, and humid regions
Fei Gao; Gary Feng; Ying Ouyang; Huixiao Wang; Daniel Fisher; Ardeshir Adeli; Johnie Jenkins
2017-01-01
It is often necessary to find a simpler method in different climatic regions to calculate reference crop evapotranspiration (ETo) since the application of the FAO-56 Penman-Monteith method is often restricted due to the unavailability of a comprehensive weather dataset. Seven ETo methods, namely the standard FAO-56 Penman-Monteith, the FAO-24 Radiation, FAO-24 Blaney...
Datasets on hub-height wind speed comparisons for wind farms in California.
Wang, Meina; Ullrich, Paul; Millstein, Dev
2018-08-01
This article describes the datasets related to the research article entitled "The future of wind energy in California: Future projections with the Variable-Resolution CESM"[1], with reference number RENE_RENE-D-17-03392. Datasets from the Variable-Resolution CESM, Det Norske Veritas Germanischer Lloyd Virtual Met, MERRA-2, CFSR, NARR, ISD surface observations, and upper-air sounding observations were used for calculating and comparing hub-height wind speed at multiple major wind farms across California. Information on hub-height wind speed interpolation and power curves at each wind farm site is also presented. All datasets, except Det Norske Veritas Germanischer Lloyd Virtual Met, are publicly available for future analysis.
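One common way to bring near-surface wind speeds to hub height is a power-law vertical extrapolation; the snippet below is a hedged sketch of that idea, with an illustrative shear exponent and heights rather than the values used in the cited datasets.

```python
def hub_height_speed(v_ref, z_ref, z_hub, alpha=0.14):
    """Power-law extrapolation: v_hub = v_ref * (z_hub / z_ref) ** alpha."""
    return v_ref * (z_hub / z_ref) ** alpha

v10 = 6.2                                               # m/s measured at 10 m (e.g., a surface station)
print(hub_height_speed(v10, z_ref=10.0, z_hub=80.0))    # estimated speed at an 80 m hub
```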
NASA Astrophysics Data System (ADS)
Yiannikopoulou, I.; Philippopoulos, K.; Deligiorgi, D.
2012-04-01
The vertical thermal structure of the atmosphere is defined by a combination of dynamic and radiation transfer processes and plays an important role in describing the meteorological conditions at local scales. The scope of this work is to develop and quantify the predictive ability of a hybrid dynamic-statistical downscaling procedure to estimate the vertical profile of ambient temperature at finer spatial scales. The study focuses on the warm period of the year (June-August) and the method is applied to an urban coastal site (Hellinikon), located in the eastern Mediterranean. The two-step methodology initially involves the dynamic downscaling of coarse resolution climate data via the RegCM4.0 regional climate model and subsequently the statistical downscaling of the modeled outputs by developing and training site-specific artificial neural networks (ANN). The 2.5° × 2.5° gridded NCEP-DOE Reanalysis 2 dataset is used as initial and boundary conditions for the dynamic downscaling element of the methodology, which enhances the regional representativeness of the dataset to 20 km and provides modeled fields in 18 vertical levels. The regional climate modeling results are compared with the upper-air Hellinikon radiosonde observations and the mean absolute error (MAE) is calculated between the four grid point values nearest to the station and the ambient temperature at the standard and significant pressure levels. The statistical downscaling element of the methodology consists of an ensemble of ANN models, one for each pressure level, which are trained separately and employ the regional-scale RegCM4.0 output. The ANN models are theoretically capable of estimating any measurable input-output function to any desired degree of accuracy. In this study they are used as non-linear function approximators for identifying the relationship between a number of predictor variables and the ambient temperature at the various vertical levels. An insight into the statistically derived input-output transfer functions is obtained by utilizing the ANN weights method, which quantifies the relative importance of the predictor variables in the estimation procedure. The overall downscaling performance evaluation incorporates a set of correlation and statistical measures along with appropriate statistical tests. The hybrid downscaling method presented in this work can be extended to various locations by training different site-specific ANN models and the results, depending on the application, can be used for assisting the understanding of the past, present and future climatology. Acknowledgment: This research has been co-financed by the European Union and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: Heracleitus II: Investing in knowledge society through the European Social Fund.
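A minimal sketch of the statistical element described above follows: one small neural network per pressure level, mapping regional-model predictors to the observed temperature at that level. The predictors, sample sizes and network size are placeholders, not the study's configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n_days, n_predictors, n_levels = 300, 12, 18
X = rng.normal(size=(n_days, n_predictors))             # RegCM-derived predictors per day (mock)
T_obs = rng.normal(size=(n_days, n_levels))              # radiosonde temperature per level (mock)

models = []
for lev in range(n_levels):                               # one ANN per pressure level
    ann = make_pipeline(StandardScaler(),
                        MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000,
                                     random_state=lev))
    ann.fit(X[:250], T_obs[:250, lev])
    models.append(ann)

mae = [np.abs(m.predict(X[250:]) - T_obs[250:, lev]).mean()
       for lev, m in enumerate(models)]                    # per-level mean absolute error
print(mae[:3])
```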
IBM Watson Analytics: Automating Visualization, Descriptive, and Predictive Statistics
2016-01-01
Background We live in an era of explosive data generation that will continue to grow and involve all industries. One of the results of this explosion is the need for newer and more efficient data analytics procedures. Traditionally, data analytics required a substantial background in statistics and computer science. In 2015, International Business Machines Corporation (IBM) released the IBM Watson Analytics (IBMWA) software that delivered advanced statistical procedures based on the Statistical Package for the Social Sciences (SPSS). The latest entry of Watson Analytics into the field of analytical software products provides users with enhanced functions that are not available in many existing programs. For example, Watson Analytics automatically analyzes datasets, examines data quality, and determines the optimal statistical approach. Users can request exploratory, predictive, and visual analytics. Using natural language processing (NLP), users are able to submit additional questions for analyses in a quick response format. This analytical package is available free to academic institutions (faculty and students) that plan to use the tools for noncommercial purposes. Objective To report the features of IBMWA and discuss how this software subjectively and objectively compares to other data mining programs. Methods The salient features of the IBMWA program were examined and compared with other common analytical platforms, using validated health datasets. Results Using a validated dataset, IBMWA delivered similar predictions compared with several commercial and open source data mining software applications. The visual analytics generated by IBMWA were similar to results from programs such as Microsoft Excel and Tableau Software. In addition, assistance with data preprocessing and data exploration was an inherent component of the IBMWA application. Sensitivity and specificity were not included in the IBMWA predictive analytics results, nor were odds ratios, confidence intervals, or a confusion matrix. Conclusions IBMWA is a new alternative for data analytics software that automates descriptive, predictive, and visual analytics. This program is very user-friendly but requires data preprocessing, statistical conceptual understanding, and domain expertise. PMID:27729304
A knowledge-based T2-statistic to perform pathway analysis for quantitative proteomic data
Chen, Yi-Hau
2017-01-01
Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data remains difficult because of limited sample size. This limitation also leads to the practice of using a competitive null as a common approach, which fundamentally treats genes or proteins as independent units. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, sample covariance may not be a precise estimation if the sample size is very limited, which is usually the case for the data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed by the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways of sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication. We implemented the T2-statistic into an R package T2GA, which is available at https://github.com/roqe/T2GA. PMID:28622336
A knowledge-based T2-statistic to perform pathway analysis for quantitative proteomic data.
Lai, En-Yu; Chen, Yi-Hau; Wu, Kun-Pin
2017-06-01
Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data remains difficult because of limited sample size. This limitation also leads to the practice of using a competitive null as a common approach, which fundamentally treats genes or proteins as independent units. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, sample covariance may not be a precise estimation if the sample size is very limited, which is usually the case for the data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed by the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways of sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication. We implemented the T2-statistic into an R package T2GA, which is available at https://github.com/roqe/T2GA.
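The sketch below illustrates the core idea only, not the T2GA package itself: a T2-type statistic for a pathway's mean log-ratio vector, with the covariance matrix assembled from external interaction confidence scores rather than the sample covariance. All numbers are toy values, and the chi-square reference distribution assumes the covariance is treated as known.

```python
import numpy as np
from scipy import stats

log_ratios = np.array([[0.8, 0.5, 0.9],        # 4 replicate runs x 3 pathway proteins (mock)
                       [0.6, 0.4, 1.1],
                       [0.9, 0.7, 0.8],
                       [0.7, 0.6, 1.0]])
n, p = log_ratios.shape

confidence = np.array([[1.0, 0.7, 0.2],         # knowledge-based association scores in [0, 1]
                       [0.7, 1.0, 0.4],
                       [0.2, 0.4, 1.0]])
scale = log_ratios.std(axis=0, ddof=1)
sigma = confidence * np.outer(scale, scale)     # turn scores into a covariance surrogate

mean_vec = log_ratios.mean(axis=0)
t2 = n * mean_vec @ np.linalg.inv(sigma) @ mean_vec    # self-contained null: mean vector = 0
p_value = stats.chi2.sf(t2, df=p)               # chi-square reference for a fixed covariance
print(t2, p_value)
```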
IBM Watson Analytics: Automating Visualization, Descriptive, and Predictive Statistics.
Hoyt, Robert Eugene; Snider, Dallas; Thompson, Carla; Mantravadi, Sarita
2016-10-11
We live in an era of explosive data generation that will continue to grow and involve all industries. One of the results of this explosion is the need for newer and more efficient data analytics procedures. Traditionally, data analytics required a substantial background in statistics and computer science. In 2015, International Business Machines Corporation (IBM) released the IBM Watson Analytics (IBMWA) software that delivered advanced statistical procedures based on the Statistical Package for the Social Sciences (SPSS). The latest entry of Watson Analytics into the field of analytical software products provides users with enhanced functions that are not available in many existing programs. For example, Watson Analytics automatically analyzes datasets, examines data quality, and determines the optimal statistical approach. Users can request exploratory, predictive, and visual analytics. Using natural language processing (NLP), users are able to submit additional questions for analyses in a quick response format. This analytical package is available free to academic institutions (faculty and students) that plan to use the tools for noncommercial purposes. To report the features of IBMWA and discuss how this software subjectively and objectively compares to other data mining programs. The salient features of the IBMWA program were examined and compared with other common analytical platforms, using validated health datasets. Using a validated dataset, IBMWA delivered similar predictions compared with several commercial and open source data mining software applications. The visual analytics generated by IBMWA were similar to results from programs such as Microsoft Excel and Tableau Software. In addition, assistance with data preprocessing and data exploration was an inherent component of the IBMWA application. Sensitivity and specificity were not included in the IBMWA predictive analytics results, nor were odds ratios, confidence intervals, or a confusion matrix. IBMWA is a new alternative for data analytics software that automates descriptive, predictive, and visual analytics. This program is very user-friendly but requires data preprocessing, statistical conceptual understanding, and domain expertise.
NASA Astrophysics Data System (ADS)
Xu, Xianjin; Yan, Chengfei; Zou, Xiaoqin
2017-08-01
The growing number of protein-ligand complex structures, particularly the structures of proteins co-bound with different ligands, in the Protein Data Bank helps us tackle two major challenges in molecular docking studies: the protein flexibility and the scoring function. Here, we introduced a systematic strategy by using the information embedded in the known protein-ligand complex structures to improve both binding mode and binding affinity predictions. Specifically, a ligand similarity calculation method was employed to search for a receptor structure with a bound ligand sharing high similarity with the query ligand for docking. The strategy was applied to the two datasets (HSP90 and MAP4K4) in the recent D3R Grand Challenge 2015. In addition, for the HSP90 dataset, a system-specific scoring function (ITScore2_hsp90) was generated by recalibrating our statistical potential-based scoring function (ITScore2) using the known protein-ligand complex structures and the statistical mechanics-based iterative method. For the HSP90 dataset, better performances were achieved for both binding mode and binding affinity predictions compared with the original ITScore2 and with ensemble docking. For the MAP4K4 dataset, although there were only eight known protein-ligand complex structures, our docking strategy achieved a performance comparable to that of ensemble docking. Our method for receptor conformational selection and iterative method for the development of system-specific statistical potential-based scoring functions can be easily applied to other protein targets that have a number of protein-ligand complex structures available to improve predictions on binding.
Szabolcsi, Zoltán; Farkas, Zsuzsa; Borbély, Andrea; Bárány, Gusztáv; Varga, Dániel; Heinrich, Attila; Völgyi, Antónia; Pamjav, Horolma
2015-11-01
When the DNA profile from a crime scene matches that of a suspect, the weight of DNA evidence depends on the unbiased estimation of the match probability of the profiles. For this reason, it is necessary to establish and expand databases that reflect the actual allele frequencies in the relevant population. A total of 21,473 complete DNA profiles from Databank samples were used to establish the allele frequency database to represent the population of Hungarian suspects. We used fifteen STR loci (PowerPlex ESI16) including five new ESS loci. The aim was to calculate the statistical, forensic efficiency parameters for the Databank samples and compare the newly detected data to the earlier report. The population substructure caused by relatedness may influence the frequency of profiles estimated. As our Databank profiles were considered non-random samples, possible relationships between the suspects can be assumed. Therefore, the population inbreeding effect was estimated using the FIS calculation. The overall inbreeding parameter was found to be 0.0106. Furthermore, we tested the impact of the two allele frequency datasets on 101 randomly chosen STR profiles, including full and partial profiles. The 95% confidence interval estimates for the profile frequencies (pM) resulted in a tighter range when we used the new dataset compared to the previously published ones. We found that the FIS had less effect on frequency values in the 21,473 samples than the application of minimum allele frequency. No genetic substructure was detected by STRUCTURE analysis. Due to the low level of the inbreeding effect and the high number of samples, the new dataset provides unbiased and precise estimates of LR for statistical interpretation of forensic casework and allows us to use lower allele frequencies. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
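For illustration, the sketch below assembles an inbreeding-adjusted random match probability from per-locus allele frequencies using the textbook Wright adjustment P(AA) = p² + F·p·(1−p) and P(AB) = 2·p·q·(1−F); this is a generic formulation, not necessarily the authors' exact procedure, and the frequencies and loci shown are invented.

```python
FIS = 0.0106   # overall inbreeding parameter reported in the abstract

def genotype_freq(p, q=None, fis=FIS):
    if q is None:                                   # homozygote at this locus
        return p * p + fis * p * (1 - p)
    return 2 * p * q * (1 - fis)                    # heterozygote at this locus

profile = {                                         # locus -> (allele freq 1, allele freq 2 or None)
    "D3S1358": (0.25, 0.15),
    "VWA":     (0.11, None),                        # homozygous locus
    "D21S11":  (0.30, 0.05),
}
match_probability = 1.0
for p, q in profile.values():
    match_probability *= genotype_freq(p, q)        # product rule across independent loci
print(f"profile match probability: {match_probability:.3e}")
```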
Forcino, Frank L; Leighton, Lindsey R; Twerdy, Pamela; Cahill, James F
2015-01-01
Community ecologists commonly perform multivariate techniques (e.g., ordination, cluster analysis) to assess patterns and gradients of taxonomic variation. A critical requirement for a meaningful statistical analysis is accurate information on the taxa found within an ecological sample. However, oversampling (too many individuals counted per sample) also comes at a cost, particularly for ecological systems in which identification and quantification is substantially more resource consuming than the field expedition itself. In such systems, an increasingly larger sample size will eventually result in diminishing returns in improving any pattern or gradient revealed by the data, but will also lead to continually increasing costs. Here, we examine 396 datasets: 44 previously published and 352 created datasets. Using meta-analytic and simulation-based approaches, the research within the present paper seeks (1) to determine minimal sample sizes required to produce robust multivariate statistical results when conducting abundance-based, community ecology research. Furthermore, we seek (2) to determine the dataset parameters (i.e., evenness, number of taxa, number of samples) that require larger sample sizes, regardless of resource availability. We found that in the 44 previously published and the 220 created datasets with randomly chosen abundances, a conservative estimate of a sample size of 58 produced the same multivariate results as all larger sample sizes. However, this minimal number varies as a function of evenness, where increased evenness resulted in increased minimal sample sizes. Sample sizes as small as 58 individuals are sufficient for a broad range of multivariate abundance-based research. In cases when resource availability is the limiting factor for conducting a project (e.g., small university, time to conduct the research project), statistically viable results can still be obtained with less of an investment.
Accuracy of Digitally Fabricated Wax Denture Bases and Conventional Completed Complete Dentures.
Stawarczyk, Bogna; Lümkemann, Nina; Eichberger, Marlis; Wimmer, Timea
2017-12-19
The purpose of this investigation was to analyze the accuracy of digitally fabricated wax trial dentures and conventionally finalized complete dentures in comparison to a surface tessellation language (STL) dataset. A generated dataset for the denture bases and the tooth sockets was used, converted into STL format, and saved as the reference. Five mandibular and 5 maxillary denture bases were milled from wax blanks and denture teeth were waxed into their tooth sockets. Each complete denture was checked for fit, waxed onto the dental cast, and digitized using an optical laboratory scanning device. The complete dentures were completed conventionally using the injection method, finished, and scanned. The resulting STL datasets were exported into the three-dimensional (3D) software GOM Inspect. Each of the 5 mandibular and 5 maxillary complete dentures was aligned with the STL reference dataset and the wax trial denture dataset. Alignment was performed based on a best-fit algorithm. A three-dimensional analysis of the spatial divergences in the x-, y-, and z-axes was performed by the 3D software and visualized in a color-coded illustration. The mean positive and negative deviations between the datasets were calculated automatically. In a direct comparison between maxillary wax trial dentures and complete dentures, the complete dentures showed higher deviations from the STL dataset than the wax trial dentures. The deviations occurred in the area of the teeth as well as in the distal area of the denture bases. In contrast, the highest deviations in both the mandibular wax trial dentures and the mandibular complete dentures were observed in the distal area. The complete dentures showed higher deviations on the occlusal surfaces of the teeth compared to the wax dentures. Computer-aided design/computer-aided manufacturing (CAD/CAM)-fabricated wax dentures exhibited fewer deviations from the STL reference than the complete dentures. The deviations were significantly greater in the vicinity of the denture teeth and the denture bases. The conventional transfer of CAD/CAM-fabricated wax dentures into acrylic resin leads to the highest deviations from the STL reference.
Automated Analysis of Renewable Energy Datasets ('EE/RE Data Mining')
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bush, Brian; Elmore, Ryan; Getman, Dan
This poster illustrates methods to substantially improve the understanding of renewable energy data sets and the depth and efficiency of their analysis through the application of statistical learning methods ('data mining') in the intelligent processing of these often large and messy information sources. The six examples apply methods for anomaly detection, data cleansing, and pattern mining to time-series data (measurements from metering points in buildings) and spatiotemporal data (renewable energy resource datasets).
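As a generic illustration of the anomaly-detection step mentioned above (not the poster's actual code), a rolling z-score can flag suspect readings in a building metering time series; the file and column names are hypothetical.

```python
# Illustrative sketch: flag anomalies in a metering time series with a rolling z-score.
import pandas as pd

def flag_anomalies(series: pd.Series, window: int = 96, z_thresh: float = 4.0) -> pd.Series:
    """Return a boolean mask marking points far from the rolling mean."""
    mu = series.rolling(window, min_periods=window // 2).mean()
    sd = series.rolling(window, min_periods=window // 2).std()
    z = (series - mu) / sd
    return z.abs() > z_thresh

# Hypothetical 15-minute meter readings with columns "timestamp" and "kWh".
df = pd.read_csv("meter_readings.csv", parse_dates=["timestamp"], index_col="timestamp")
df["anomaly"] = flag_anomalies(df["kWh"])
print(df["anomaly"].sum(), "suspect readings")
```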
Simultaneous comparison and assessment of eight remotely sensed maps of Philippine forests
NASA Astrophysics Data System (ADS)
Estoque, Ronald C.; Pontius, Robert G.; Murayama, Yuji; Hou, Hao; Thapa, Rajesh B.; Lasco, Rodel D.; Villar, Merlito A.
2018-05-01
This article compares and assesses eight remotely sensed maps of Philippine forest cover in the year 2010. We examined eight Forest versus Non-Forest maps reclassified from eight land cover products: the Philippine Land Cover, the Climate Change Initiative (CCI) Land Cover, the Landsat Vegetation Continuous Fields (VCF), the MODIS VCF, the MODIS Land Cover Type product (MCD12Q1), the Global Tree Canopy Cover, the ALOS-PALSAR Forest/Non-Forest Map, and the GlobeLand30. The reference data consisted of 9852 randomly distributed sample points interpreted from Google Earth. We created methods to assess the maps and their combinations. Results show that the percentage of the Philippines covered by forest ranges among the maps from a low of 23% for the Philippine Land Cover to a high of 67% for GlobeLand30. Landsat VCF estimates 36% forest cover, which is closest to the 37% estimate based on the reference data. The eight maps plus the reference data agree unanimously on 30% of the sample points, of which 11% are attributable to forest and 19% to non-forest. The overall disagreement between the reference data and Philippine Land Cover is 21%, which is the least among the eight Forest versus Non-Forest maps. About half of the 9852 points have a nested structure such that the forest in a given dataset is a subset of the forest in the datasets that have more forest than the given dataset. The variation among the maps regarding forest quantity and allocation relates to the combined effects of the various definitions of forest and classification errors. Scientists and policy makers must consider these insights when producing future forest cover maps and when establishing benchmarks for forest cover monitoring.
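A minimal sketch of the per-point comparison logic described above (not the authors' implementation): given the reference label and each map's Forest/Non-Forest label at every sample point, compute the forest percentage, the per-map disagreement with the reference, and the fraction of points on which all sources agree. File and column names are assumptions.

```python
# Sketch: agreement statistics for Forest vs Non-Forest maps against reference points.
import pandas as pd

pts = pd.read_csv("sample_points.csv")                        # one row per reference point
map_cols = [c for c in pts.columns if c.startswith("map_")]   # e.g. map_CCI, map_VCF, ...

for c in map_cols:
    forest_pct = 100 * (pts[c] == "Forest").mean()
    disagreement = 100 * (pts[c] != pts["reference"]).mean()
    print(f"{c}: {forest_pct:.1f}% forest, {disagreement:.1f}% disagreement with reference")

# Fraction of points where all maps and the reference carry the same label.
unanimous = (pts[map_cols + ["reference"]].nunique(axis=1) == 1).mean()
print(f"all maps + reference agree on {100 * unanimous:.1f}% of points")
```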
Comparison of the Equine Reference Sequence with Its Sanger Source Data and New Illumina Reads
Rebolledo-Mendez, Jovan; Hestand, Matthew S.; Coleman, Stephen J.; Zeng, Zheng; Orlando, Ludovic; MacLeod, James N.; Kalbfleisch, Ted
2015-01-01
The reference assembly for the domestic horse, EquCab2, published in 2009, was built using approximately 30 million Sanger reads from a Thoroughbred mare named Twilight. Contiguity in the assembly was facilitated using nearly 315 thousand BAC end sequences from Twilight's half-brother Bravo. Since then, it has served as the foundation for many genome-wide analyses that include not only the modern horse, but ancient horses and other equid species as well. As data mapped to this reference have accumulated, consistent variation between mapped datasets and the reference, in terms of regions with no read coverage, single nucleotide variants, and small insertions/deletions, has become apparent. In many cases, it is not clear whether these differences are the result of true sequence variation between the research subjects' genomes and Twilight's genome or due to errors in the reference. EquCab2 is regarded as "The Twilight Assembly." The objective of this study was to identify inconsistencies between the EquCab2 assembly and the source Twilight Sanger data used to build it. To that end, the original Sanger and BAC end reads have been mapped back to this equine reference and assessed with the addition of approximately 40X coverage of new Illumina Paired-End sequence data. The resulting mapped datasets identify those regions with low Sanger read coverage, as well as variation in genomic content that is not consistent with either the original Twilight Sanger data or the new genomic sequence data generated from Twilight on the Illumina platform. As the haploid EquCab2 reference assembly was created using Sanger reads derived largely from a single individual, the vast majority of variation detected in a mapped dataset comprised of those same Sanger reads should be heterozygous. In contrast, homozygous variations would represent either errors in the reference or contributions from Bravo's BAC end sequences. Our analysis identifies 720,843 homozygous discrepancies between new, high-throughput genomic sequence data generated for Twilight and the EquCab2 reference assembly. Most of these represent errors in the assembly, while approximately 10,000 are demonstrated to be contributions from another horse. Other results are presented that include the binary alignment map file of the mapped Sanger reads, a list of variants identified as discrepancies between the source data and resulting reference, and a BED annotation file that lists the regions of the genome whose consensus was likely derived from low coverage alignments. PMID:26107638
Using Electronic Data Interchange to Report Product Quality
1993-03-01
[Transaction-set table not legible in this extraction; the original lists EDI segments such as SPS (Sampling Parameters for Summary Statistics), DTM (Date/Time Reference), REF (Reference Numbers), STA (Statistics), and Measurements, with their occurrence requirements.]
Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L
2015-02-01
Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used for both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
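Conceptually, the cohort-matching step reduces to a parameterized query against the trial tables; the sketch below uses Python's sqlite3 rather than the paper's JavaScript/SQL stack, and the table and column names are hypothetical.

```python
# Conceptual sketch of a cohort-matching query against a local copy of trial data.
import sqlite3

def similar_cohort(conn, age, pack_years, nodule_mm, tol_mm=2.0):
    """Return cohort size and cancer frequency among participants resembling the index patient."""
    sql = """
        SELECT COUNT(*)                   AS n,
               AVG(confirmed_lung_cancer) AS cancer_rate
        FROM participants
        WHERE ABS(age - ?) <= 5
          AND ABS(pack_years - ?) <= 10
          AND ABS(max_nodule_diameter_mm - ?) <= ?
    """
    return conn.execute(sql, (age, pack_years, nodule_mm, tol_mm)).fetchone()

conn = sqlite3.connect("nlst_subset.db")   # hypothetical local database
print(similar_cohort(conn, age=62, pack_years=40, nodule_mm=7.5))
```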
Nuclear Forensic Inferences Using Iterative Multidimensional Statistics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Robel, M; Kristo, M J; Heller, M A
2009-06-09
Nuclear forensics involves the analysis of interdicted nuclear material for specific material characteristics (referred to as 'signatures') that imply specific geographical locations, production processes, culprit intentions, etc. Predictive signatures rely on expert knowledge of physics, chemistry, and engineering to develop inferences from these material characteristics. Comparative signatures, on the other hand, rely on comparison of the material characteristics of the interdicted sample (the 'questioned sample' in FBI parlance) with those of a set of known samples. In the ideal case, the set of known samples would be a comprehensive nuclear forensics database, a database which does not currently exist. In fact, our ability to analyze interdicted samples and produce an extensive list of precise materials characteristics far exceeds our ability to interpret the results. Therefore, as we seek to develop the extensive databases necessary for nuclear forensics, we must also develop the methods necessary to produce the necessary inferences from comparison of our analytical results with these large, multidimensional sets of data. In the work reported here, we used a large, multidimensional dataset of results from quality control analyses of uranium ore concentrate (UOC, sometimes called 'yellowcake'). We have found that traditional multidimensional techniques, such as principal components analysis (PCA), are especially useful for understanding such datasets and drawing relevant conclusions. In particular, we have developed an iterative partial least squares-discriminant analysis (PLS-DA) procedure that has proven especially adept at identifying the production location of unknown UOC samples. By removing classes which fell far outside the initial decision boundary, and then rebuilding the PLS-DA model, we have consistently produced better and more definitive attributions than with a single pass classification approach. Performance of the iterative PLS-DA method compared favorably to that of classification and regression tree (CART) and k nearest neighbor (KNN) algorithms, with the best combination of accuracy and robustness, as tested by classifying samples measured independently in our laboratories against the vendor QC based reference set.
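A hedged sketch of an iterative PLS-DA loop in the spirit of the procedure described above (scikit-learn here; the data, the number of latent variables, and the class-retention rule are illustrative assumptions, not the authors' exact method):

```python
# Illustrative iterative PLS-DA: fit PLS on one-hot class labels, score a questioned sample,
# drop the least plausible classes, and rebuild the model on the remaining classes.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def plsda_scores(X, y_labels, x_query, n_components=2):
    """Fit PLS on one-hot class labels; return per-class predicted scores for x_query."""
    classes = np.unique(y_labels)
    Y = (y_labels[:, None] == classes[None, :]).astype(float)   # one-hot encoding
    pls = PLSRegression(n_components=n_components).fit(X, Y)
    return classes, pls.predict(x_query.reshape(1, -1)).ravel()

def iterative_plsda(X, y_labels, x_query, keep_fraction=0.5, n_iter=3):
    """X: (n, p) array of known-sample measurements; y_labels: their production locations."""
    classes = np.unique(y_labels)
    for _ in range(n_iter):
        if len(classes) <= 2:
            break
        mask = np.isin(y_labels, classes)
        cls, scores = plsda_scores(X[mask], y_labels[mask], x_query)
        # retain only the highest-scoring classes, then refit on that reduced set
        n_keep = max(2, int(len(cls) * keep_fraction))
        classes = cls[np.argsort(scores)[::-1][:n_keep]]
    return classes   # remaining candidate production locations
```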
Larson, Derek W.; Currie, Philip J.
2013-01-01
Isolated small theropod teeth are abundant in vertebrate microfossil assemblages, and are frequently used in studies of species diversity in ancient ecosystems. However, determining the taxonomic affinities of these teeth is problematic due to an absence of associated diagnostic skeletal material. Species such as Dromaeosaurus albertensis, Richardoestesia gilmorei, and Saurornitholestes langstoni are known from skeletal remains that have been recovered exclusively from the Dinosaur Park Formation (Campanian). It is therefore likely that teeth from different formations widely disparate in age or geographic position are not referable to these species. Tooth taxa without any associated skeletal material, such as Paronychodon lacustris and Richardoestesia isosceles, have also been identified from multiple localities of disparate ages throughout the Late Cretaceous. To address this problem, a dataset of measurements of 1183 small theropod teeth (the most specimen-rich theropod tooth dataset ever constructed) from North America ranging in age from Santonian through Maastrichtian were analyzed using multivariate statistical methods: canonical variate analysis, pairwise discriminant function analysis, and multivariate analysis of variance. The results indicate that teeth referred to the same taxon from different formations are often quantitatively distinct. In contrast, isolated teeth found in time equivalent formations are not quantitatively distinguishable from each other. These results support the hypothesis that small theropod taxa, like other dinosaurs in the Late Cretaceous, tend to be exclusive to discrete host formations. The methods outlined have great potential for future studies of isolated teeth worldwide, and may be the most useful non-destructive technique known of extracting the most data possible from isolated and fragmentary specimens. The ability to accurately assess species diversity and turnover through time based on isolated teeth will help illuminate patterns of evolution and extinction in these groups and potentially others in greater detail than has previously been thought possible without more complete skeletal material. PMID:23372708
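To illustrate the kind of multivariate workflow named above (canonical variate analysis and MANOVA), a minimal sketch with scikit-learn and statsmodels; the measurement columns stand in for the tooth variables and are hypothetical.

```python
# Sketch: canonical variate scores via LDA, plus a MANOVA of measurements by formation.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from statsmodels.multivariate.manova import MANOVA

teeth = pd.read_csv("tooth_measurements.csv")    # hypothetical columns: FABL, CH, DSDI, MSDI, formation
X = teeth[["FABL", "CH", "DSDI", "MSDI"]]
groups = teeth["formation"]

cva = LinearDiscriminantAnalysis().fit(X, groups)
scores = cva.transform(X)                        # canonical variate scores (for ordination plots)

manova = MANOVA.from_formula("FABL + CH + DSDI + MSDI ~ formation", data=teeth)
print(manova.mv_test())                          # Wilks' lambda, Pillai's trace, etc.
```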
A Computational Geometry Approach to Automated Pulmonary Fissure Segmentation in CT Examinations
Pu, Jiantao; Leader, Joseph K; Zheng, Bin; Knollmann, Friedrich; Fuhrman, Carl; Sciurba, Frank C; Gur, David
2010-01-01
Identification of pulmonary fissures, which form the boundaries between the lobes in the lungs, may be useful during clinical interpretation of CT examinations to assess the early presence and characterization of manifestation of several lung diseases. Motivated by the unique nature of the surface shape of pulmonary fissures in three-dimensional space, we developed a new automated scheme using computational geometry methods to detect and segment fissures depicted on CT images. After a geometric modeling of the lung volume using the Marching Cube Algorithm, Laplacian smoothing is applied iteratively to enhance pulmonary fissures by depressing non-fissure structures while smoothing the surfaces of lung fissures. Next, an Extended Gaussian Image based procedure is used to locate the fissures in a statistical manner that approximates the fissures using a set of plane “patches.” This approach has several advantages such as independence of anatomic knowledge of the lung structure except the surface shape of fissures, limited sensitivity to other lung structures, and ease of implementation. The scheme performance was evaluated by two experienced thoracic radiologists using a set of 100 images (slices) randomly selected from 10 screening CT examinations. In this preliminary evaluation 98.7% and 94.9% of scheme segmented fissure voxels are within 2 mm of the fissures marked independently by two radiologists in the testing image dataset. Using the scheme detected fissures as reference, 89.4% and 90.1% of manually marked fissure points have distance ≤ 2 mm to the reference suggesting a possible under-segmentation of the scheme. The case-based RMS (root-mean-square) distances (“errors”) between our scheme and the radiologist ranged from 1.48±0.92 to 2.04±3.88 mm. The discrepancy of fissure detection results between the automated scheme and either radiologist is smaller in this dataset than the inter-reader variability. PMID:19272987
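A small sketch of the evaluation metric only (the fraction of scheme-segmented voxels within 2 mm of the radiologist markings, plus an RMS distance), not of the segmentation scheme itself; the point arrays are assumed inputs in millimetres.

```python
# Sketch: surface agreement between automatically segmented and manually marked fissure points.
import numpy as np
from scipy.spatial import cKDTree

def surface_agreement(auto_pts, ref_pts, tol_mm=2.0):
    """auto_pts, ref_pts: (N, 3) arrays of point coordinates in millimetres."""
    d, _ = cKDTree(ref_pts).query(auto_pts)      # distance to nearest reference point
    within = np.mean(d <= tol_mm)                # fraction within the 2 mm tolerance
    rms = np.sqrt(np.mean(d ** 2))               # case-based RMS distance
    return within, rms

# Hypothetical usage with two point clouds:
# frac, rms = surface_agreement(scheme_voxels_mm, radiologist_voxels_mm)
```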
DOE Office of Scientific and Technical Information (OSTI.GOV)
P-Mart was designed specifically to allow cancer researchers to perform robust statistical processing of publicly available cancer proteomic datasets. Prior to P-Mart, no online statistical processing suite for proteomics existed. The P-Mart software is designed to allow statistical programmers to utilize its algorithms through packages in the R programming language, and it also offers a web-based interface built on the Azure cloud technology. The Azure cloud technology also allows the release of the software via Docker containers.
Bayesian correlated clustering to integrate multiple datasets
Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.
2012-01-01
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558
The influence of climate change on Tanzania's hydropower sustainability
NASA Astrophysics Data System (ADS)
Sperna Weiland, Frederiek; Boehlert, Brent; Meijer, Karen; Schellekens, Jaap; Magnell, Jan-Petter; Helbrink, Jakob; Kassana, Leonard; Liden, Rikard
2015-04-01
Economic costs induced by current climate variability are large for Tanzania and may further increase due to future climate change. The Tanzanian National Climate Change Strategy addressed the need for stabilization of hydropower generation and strengthening of water resources management. Increased hydropower generation can contribute to sustainable use of energy resources and stabilization of the national electricity grid. To support Tanzania, the World Bank financed this study, in which the impact of climate change on the water resources and related hydropower generation capacity of Tanzania is assessed. To this end, an ensemble of 78 GCM projections from both the CMIP3 and CMIP5 datasets was bias-corrected and down-scaled to 0.5 degrees resolution following the BCSD technique, using the Princeton Global Meteorological Forcing Dataset as a reference. To quantify the hydrological impacts of climate change by 2035, the global hydrological model PCR-GLOBWB was set up for Tanzania at a resolution of 3 minutes and run with all 78 GCM datasets. From the full set of projections, a probable (median) and a worst-case (95th percentile) scenario were selected based upon (1) the country-average Climate Moisture Index and (2) discharge statistics of relevance to hydropower generation. Although precipitation from the Princeton dataset shows deviations from local station measurements and the global hydrological model does not perfectly reproduce local-scale hydrographs, the main discharge characteristics and precipitation patterns are represented well. The modeled natural river flows were adjusted for water demand and irrigation within the water resources model RIBASIM (both historical values and future scenarios). Potential hydropower capacity was assessed with the power market simulation model PoMo-C, which considers both reservoir inflows obtained from RIBASIM and overall electricity generation costs. Results of the study show that climate change is unlikely to negatively affect the average potential of future hydropower production; it will likely make hydropower more profitable. Yet, the uncertainty in climate change projections remains large and risks are significant; adaptation strategies should ideally consider a worst-case scenario to ensure robust power generation. Overall, a diversified power generation portfolio, anchored in hydropower and supported by other renewables and fossil fuel-based energy sources, is the best solution for Tanzania.
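The BCSD approach mentioned above relies on quantile mapping against the reference forcing; the simplified sketch below illustrates that idea only and is not the project's implementation.

```python
# Simplified empirical quantile mapping: correct GCM values against a reference climatology.
import numpy as np

def quantile_map(gcm_hist, ref_hist, gcm_values):
    """Bias-correct gcm_values using empirical CDFs from the historical overlap period."""
    quantiles = np.linspace(0.0, 1.0, 101)
    gcm_q = np.quantile(gcm_hist, quantiles)     # GCM climatology quantiles
    ref_q = np.quantile(ref_hist, quantiles)     # reference (e.g. Princeton forcing) quantiles
    # position of each value in the GCM distribution, then read off the reference value
    pos = np.interp(gcm_values, gcm_q, quantiles)
    return np.interp(pos, quantiles, ref_q)
```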
Potential Applications of Remote Sensing Precipitation Data on Urban Stormwater Modeling
NASA Astrophysics Data System (ADS)
Maggioni, V.; Tarantola, R.; Ferreira, C.
2014-12-01
Although stormwater modeling is widely used to plan, manage and operate stormwater systems in the urban environment, accuracy in model development and calibration is still problematic. Precipitation is the major forcing of stormwater modeling and one of the most important variables for accurate representation of the water cycle in urban areas. However, rainfall data availability at adequate temporal and spatial scales is scarce. Here we investigate the potential to apply satellite precipitation products to small-scale urban watersheds, with a focus on real-time data for operational use and historical data for model calibration and planning. We present a case study in Northern Virginia, part of the Washington, D.C. metropolitan region. We compare several rainfall datasets from satellites, radar and rain gauges during 2002-2008, using two multi-satellite precipitation products. The first one is the NASA TRMM TMPA at daily/0.25° time/space resolution, which is available in two forms: 3B42-Real Time and 3B42-Version 7, where the latter is a post-processed product, corrected with ground-based observations. The second one is the NOAA CMORPH at 3hrs/0.25° time/space resolution. The NOAA Climate Prediction Center (CPC) data and NCEP Stage IV radar-based product are used as reference datasets for TMPA and CMORPH, respectively. Statistical analyses are conducted to compare these datasets: correlation coefficient, RMSE, bias, and probability of correct no-rain detection and of false alarm were computed with a focus on Fairfax County, VA. Preliminary results show that the TMPA products outperform CMORPH when compared to rain gauges and radar data over the county. Moreover, no appreciable difference is detected between TMPA-V7 and TMPA-RT, which demonstrates that real-time data could be used over the urban watershed with results that are comparable to the adjusted product. Analyses are under way to investigate higher temporal resolutions and to include a comparison with the Fairfax County rain gauge data. Future work will also evaluate the impacts of different precipitation datasets on stormwater runoff for Fairfax County, using the EPA-SWMM5 stormwater model.
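A minimal sketch of the comparison statistics listed above (correlation, RMSE, bias, probability of correct no-rain detection, false-alarm ratio) for collocated daily values; the variable names and the 0.1 mm rain threshold are assumptions.

```python
# Sketch: verification statistics for a satellite product against a reference dataset.
import numpy as np

def verify(sat, ref, rain_thresh=0.1):
    """sat, ref: 1-D arrays of collocated daily rainfall (mm)."""
    corr = np.corrcoef(sat, ref)[0, 1]
    rmse = np.sqrt(np.mean((sat - ref) ** 2))
    bias = np.mean(sat - ref)
    sat_rain, ref_rain = sat >= rain_thresh, ref >= rain_thresh
    # among reference-dry days, fraction the satellite also reports as dry
    correct_norain = np.mean(~sat_rain[~ref_rain]) if (~ref_rain).any() else np.nan
    # among satellite-rain days, fraction where the reference is actually dry
    false_alarm = np.mean(~ref_rain[sat_rain]) if sat_rain.any() else np.nan
    return dict(corr=corr, rmse=rmse, bias=bias,
                p_correct_norain=correct_norain, false_alarm_ratio=false_alarm)
```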
Quality Assessments of Long-Term Quantitative Proteomic Analysis of Breast Cancer Xenograft Tissues
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, Jian-Ying; Chen, Lijun; Zhang, Bai
The identification of protein biomarkers requires large-scale analysis of human specimens to achieve statistical significance. In this study, we evaluated the long-term reproducibility of an iTRAQ (isobaric tags for relative and absolute quantification) based quantitative proteomics strategy using one channel for universal normalization across all samples. A total of 307 liquid chromatography tandem mass spectrometric (LC-MS/MS) analyses were completed, generating 107 one-dimensional (1D) LC-MS/MS datasets and 8 offline two-dimensional (2D) LC-MS/MS datasets (25 fractions for each set) for human-in-mouse breast cancer xenograft tissues representative of basal and luminal subtypes. Such large-scale studies require the implementation of robust metrics to assess the contributions of technical and biological variability in the qualitative and quantitative data. Accordingly, we developed a quantification confidence score based on the quality of each peptide-spectrum match (PSM) to remove quantification outliers from each analysis. After combining confidence score filtering and statistical analysis, reproducible protein identification and quantitative results were achieved from LC-MS/MS datasets collected over a 16-month period.
Prolonged Instability Prior to a Regime Shift
Regime shifts are generally defined as the point of ‘abrupt’ change in the state of a system. However, a seemingly abrupt transition can be the product of a system reorganization that has been ongoing much longer than is evident in statistical analysis of a single component of the system. Using both univariate and multivariate statistical methods, we tested a long-term high-resolution paleoecological dataset with a known change in species assemblage for a regime shift. Analysis of this dataset with Fisher Information and multivariate time series modeling showed that there was a ∼2000-year period of instability prior to the regime shift. This period of instability and the subsequent regime shift coincide with regional climate change, indicating that the system is undergoing extrinsic forcing. Paleoecological records offer a unique opportunity to test tools for the detection of thresholds and stable states, and thus to examine the long-term stability of ecosystems over periods of multiple millennia. This manuscript explores various methods of assessing the transition between alternative states in an ecological system described by a long-term high-resolution paleoecological dataset.
Weidner, Christopher; Fischer, Cornelius; Sauer, Sascha
2014-12-01
We introduce PHOXTRACK (PHOsphosite-X-TRacing Analysis of Causal Kinases), a user-friendly freely available software tool for analyzing large datasets of post-translational modifications of proteins, such as phosphorylation, which are commonly gained by mass spectrometry detection. In contrast to other currently applied data analysis approaches, PHOXTRACK uses full sets of quantitative proteomics data and applies non-parametric statistics to calculate whether defined kinase-specific sets of phosphosite sequences indicate statistically significant concordant differences between various biological conditions. PHOXTRACK is an efficient tool for extracting post-translational information of comprehensive proteomics datasets to decipher key regulatory proteins and to infer biologically relevant molecular pathways. PHOXTRACK will be maintained over the next years and is freely available as an online tool for non-commercial use at http://phoxtrack.molgen.mpg.de. Users will also find a tutorial at this Web site and can additionally give feedback at https://groups.google.com/d/forum/phoxtrack-discuss. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Variable Selection in the Presence of Missing Data: Imputation-based Methods.
Zhao, Yize; Long, Qi
2017-01-01
Variable selection plays an essential role in regression analysis as it identifies important variables that are associated with outcomes and is known to improve the predictive accuracy of resulting models. Variable selection methods have been widely investigated for fully observed data. However, in the presence of missing data, methods for variable selection need to be carefully designed to account for missing data mechanisms and the statistical techniques used for handling missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, valid under the assumptions of missing at random (MAR) and missing completely at random (MCAR), largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines the variable selection results across all imputed datasets. The second strategy applies existing variable selection methods to stacked imputed datasets. The third strategy combines resampling techniques such as the bootstrap with imputation. Despite recent advances, this area remains under-developed and offers fertile ground for further research.
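As a concrete illustration of the first strategy (selection within each imputed dataset, then combination), a hedged sketch: lasso selection per imputed dataset with a majority-vote combination. The voting threshold and data are illustrative assumptions, not a method prescribed by the article.

```python
# Sketch: combine lasso-based variable selection across multiply imputed datasets.
import numpy as np
from sklearn.linear_model import LassoCV

def mi_selection(imputed_datasets, y, vote_threshold=0.5):
    """imputed_datasets: list of (n, p) arrays from multiple imputation of the same X."""
    p = imputed_datasets[0].shape[1]
    votes = np.zeros(p)
    for X in imputed_datasets:
        coef = LassoCV(cv=5).fit(X, y).coef_     # select within this imputed dataset
        votes += (coef != 0)
    freq = votes / len(imputed_datasets)         # selection frequency per variable
    return np.where(freq >= vote_threshold)[0]   # indices of variables kept by majority vote
```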
Selecting minimum dataset soil variables using PLSR as a regressive multivariate method
NASA Astrophysics Data System (ADS)
Stellacci, Anna Maria; Armenise, Elena; Castellini, Mirko; Rossi, Roberta; Vitti, Carolina; Leogrande, Rita; De Benedetto, Daniela; Ferrara, Rossana M.; Vivaldi, Gaetano A.
2017-04-01
Long-term field experiments and science-based tools that characterize soil status (namely the soil quality indices, SQIs) assume a strategic role in assessing the effect of agronomic techniques and thus in improving soil management especially in marginal environments. Selecting key soil variables able to best represent soil status is a critical step for the calculation of SQIs. Current studies show the effectiveness of statistical methods for variable selection to extract relevant information deriving from multivariate datasets. Principal component analysis (PCA) has been mainly used, however supervised multivariate methods and regressive techniques are progressively being evaluated (Armenise et al., 2013; de Paul Obade et al., 2016; Pulido Moncada et al., 2014). The present study explores the effectiveness of partial least square regression (PLSR) in selecting critical soil variables, using a dataset comparing conventional tillage and sod-seeding on durum wheat. The results were compared to those obtained using PCA and stepwise discriminant analysis (SDA). The soil data derived from a long-term field experiment in Southern Italy. On samples collected in April 2015, the following set of variables was quantified: (i) chemical: total organic carbon and nitrogen (TOC and TN), alkali-extractable C (TEC and humic substances - HA-FA), water extractable N and organic C (WEN and WEOC), Olsen extractable P, exchangeable cations, pH and EC; (ii) physical: texture, dry bulk density (BD), macroporosity (Pmac), air capacity (AC), and relative field capacity (RFC); (iii) biological: carbon of the microbial biomass quantified with the fumigation-extraction method. PCA and SDA were previously applied to the multivariate dataset (Stellacci et al., 2016). PLSR was carried out on mean centered and variance scaled data of predictors (soil variables) and response (wheat yield) variables using the PLS procedure of SAS/STAT. In addition, variable importance for projection (VIP) statistics was used to quantitatively assess the predictors most relevant for response variable estimation and then for variable selection (Andersen and Bro, 2010). PCA and SDA returned TOC and RFC as influential variables both on the set of chemical and physical data analyzed separately as well as on the whole dataset (Stellacci et al., 2016). Highly weighted variables in PCA were also TEC, followed by K, and AC, followed by Pmac and BD, in the first PC (41.2% of total variance); Olsen P and HA-FA in the second PC (12.6%), Ca in the third (10.6%) component. Variables enabling maximum discrimination among treatments for SDA were WEOC, on the whole dataset, humic substances, followed by Olsen P, EC and clay, in the separate data analyses. The highest PLS-VIP statistics were recorded for Olsen P and Pmac, followed by TOC, TEC, pH and Mg for chemical variables and clay, RFC and AC for the physical variables. Results show that different methods may provide different ranking of the selected variables and the presence of a response variable, in regressive techniques, may affect variable selection. Further investigation with different response variables and with multi-year datasets would allow to better define advantages and limits of single or combined approaches. 
Acknowledgment The work was supported by the projects "BIOTILLAGE, approcci innovative per il miglioramento delle performances ambientali e produttive dei sistemi cerealicoli no-tillage", financed by PSR-Basilicata 2007-2013, and "DESERT, Low-cost water desalination and sensor technology compact module" financed by ERANET-WATERWORKS 2014. References Andersen C.M. and Bro R., 2010. Variable selection in regression - a tutorial. Journal of Chemometrics, 24 728-737. Armenise et al., 2013. Developing a soil quality index to compare soil fitness for agricultural use under different managements in the mediterranean environment. Soil and Tillage Research, 130:91-98. de Paul Obade et al., 2016. A standardized soil quality index for diverse field conditions. Sci. Total Env. 541:424-434. Pulido Moncada et al., 2014. Data-driven analysis of soil quality indicators using limited data. Geoderma, 235:271-278. Stellacci et al., 2016. Comparison of different multivariate methods to select key soil variables for soil quality indices computation. XLV Congress of the Italian Society of Agronomy (SIA), Sassari, 20-22 September 2016.
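The PLS-VIP statistic used above can be computed from any fitted PLS model; a common formulation is sketched below with scikit-learn attribute names (a generic illustration, not the SAS procedure used in the study; the soil-variable names are hypothetical).

```python
# Sketch: variable importance in projection (VIP) scores from a fitted PLS regression.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls: PLSRegression) -> np.ndarray:
    """VIP score per predictor for a fitted scikit-learn PLSRegression model."""
    t = pls.x_scores_                    # (n, A) latent scores
    w = pls.x_weights_                   # (p, A) predictor weights
    q = pls.y_loadings_                  # (m, A) response loadings
    p, _ = w.shape
    ss = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)   # Y variance explained per component
    w_norm = (w / np.linalg.norm(w, axis=0)) ** 2
    return np.sqrt(p * (w_norm @ ss) / ss.sum())

# Hypothetical usage: X = mean-centred, scaled soil variables; y = wheat yield
# pls = PLSRegression(n_components=3).fit(X, y)
# important = np.array(soil_variable_names)[vip_scores(pls) > 1.0]
```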
Lidar - ND Halo Scanning Doppler, Boardman - Reviewed Data
Otarola, Sebastian
2017-10-23
The University of Notre Dame (ND) scanning LiDAR dataset used for the WFIP2 Campaign is provided. The LiDAR is a Halo Photonics Stream Line Scanning Doppler LiDAR. **It is highly recommended to discuss any planned use of these data with University of Notre Dame scientists**. For more information refer to Section 4.c) in the updated version of the "WFIP2 Project (lidar.z07)" Readme file, where the lidar.z07.b0 dataset is fully explained.
Robust Statistical Fusion of Image Labels
Landman, Bennett A.; Asman, Andrew J.; Scoggins, Andrew G.; Bogovic, John A.; Xing, Fangxu; Prince, Jerry L.
2011-01-01
Image labeling and parcellation (i.e. assigning structure to a collection of voxels) are critical tasks for the assessment of volumetric and morphometric features in medical imaging data. The process of image labeling is inherently error prone as images are corrupted by noise and artifacts. Even expert interpretations are subject to subjectivity and the precision of the individual raters. Hence, all labels must be considered imperfect with some degree of inherent variability. One may seek multiple independent assessments to both reduce this variability and quantify the degree of uncertainty. Existing techniques have exploited maximum a posteriori statistics to combine data from multiple raters and simultaneously estimate rater reliabilities. Although quite successful, wide-scale application has been hampered by unstable estimation with practical datasets, for example, with label sets with small or thin objects to be labeled or with partial or limited datasets. As well, these approaches have required each rater to generate a complete dataset, which is often impossible given both human foibles and the typical turnover rate of raters in a research or clinical environment. Herein, we propose a robust approach to improve estimation performance with small anatomical structures, allow for missing data, account for repeated label sets, and utilize training/catch trial data. With this approach, numerous raters can label small, overlapping portions of a large dataset, and rater heterogeneity can be robustly controlled while simultaneously estimating a single, reliable label set and characterizing uncertainty. The proposed approach enables many individuals to collaborate in the construction of large datasets for labeling tasks (e.g., human parallel processing) and reduces the otherwise detrimental impact of rater unavailability. PMID:22010145
EBprot: Statistical analysis of labeling-based quantitative proteomics data.
Koh, Hiromi W L; Swa, Hannah L F; Fermin, Damian; Ler, Siok Ghee; Gunaratne, Jayantha; Choi, Hyungwon
2015-08-01
Labeling-based proteomics is a powerful method for detection of differentially expressed proteins (DEPs). The current data analysis platform typically relies on protein-level ratios, which is obtained by summarizing peptide-level ratios for each protein. In shotgun proteomics, however, some proteins are quantified with more peptides than others, and this reproducibility information is not incorporated into the differential expression (DE) analysis. Here, we propose a novel probabilistic framework EBprot that directly models the peptide-protein hierarchy and rewards the proteins with reproducible evidence of DE over multiple peptides. To evaluate its performance with known DE states, we conducted a simulation study to show that the peptide-level analysis of EBprot provides better receiver-operating characteristic and more accurate estimation of the false discovery rates than the methods based on protein-level ratios. We also demonstrate superior classification performance of peptide-level EBprot analysis in a spike-in dataset. To illustrate the wide applicability of EBprot in different experimental designs, we applied EBprot to a dataset for lung cancer subtype analysis with biological replicates and another dataset for time course phosphoproteome analysis of EGF-stimulated HeLa cells with multiplexed labeling. Through these examples, we show that the peptide-level analysis of EBprot is a robust alternative to the existing statistical methods for the DE analysis of labeling-based quantitative datasets. The software suite is freely available on the Sourceforge website http://ebprot.sourceforge.net/. All MS data have been deposited in the ProteomeXchange with identifier PXD001426 (http://proteomecentral.proteomexchange.org/dataset/PXD001426/). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Wong, Gerard; Leckie, Christopher; Gorringe, Kylie L; Haviv, Izhak; Campbell, Ian G; Kowalczyk, Adam
2010-04-15
High-density single nucleotide polymorphism (SNP) genotyping arrays are efficient and cost effective platforms for the detection of copy number variation (CNV). To ensure accuracy in probe synthesis and to minimize production costs, short oligonucleotide probe sequences are used. The use of short probe sequences limits the specificity of binding targets in the human genome. The specificity of these short probeset sequences has yet to be fully analysed against a normal reference human genome. Sequence similarity can artificially elevate or suppress copy number measurements, and hence reduce the reliability of affected probe readings. For the purpose of detecting narrow CNVs reliably down to the width of a single probeset, sequence similarity is an important issue that needs to be addressed. We surveyed the Affymetrix Human Mapping SNP arrays for probeset sequence similarity against the reference human genome. Utilizing sequence similarity results, we identified a collection of fine-scaled putative CNVs between gender from autosomal probesets whose sequence matches various loci on the sex chromosomes. To detect these variations, we utilized our statistical approach, Detecting REcurrent Copy number change using rank-order Statistics (DRECS), and showed that its performance was superior and more stable than the t-test in detecting CNVs. Through the application of DRECS on the HapMap population datasets with multi-matching probesets filtered, we identified biologically relevant SNPs in aberrant regions across populations with known association to physical traits, such as height, covered by the span of a single probe. This provided empirical confirmation of the existence of naturally occurring narrow CNVs as well as the sensitivity of the Affymetrix SNP array technology in detecting them. The MATLAB implementation of DRECS is available at http://ww2.cs.mu.oz.au/~gwong/DRECS/index.html.
Dimova, Violeta; Oertel, Bruno G; Lötsch, Jörn
2017-01-01
Skin sensitivity to sensory stimuli varies among different body areas. A standardized clinical quantitative sensory testing (QST) battery, established for the diagnosis of neuropathic pain, was used to assess whether the magnitude of differences between test sites reaches clinical significance. Ten different sensory QST measures derived from thermal and mechanical stimuli were obtained from 21 healthy volunteers (10 men) and used to create somatosensory profiles bilateral from the dorsum of the hands (the standard area for the assessment of normative values for the upper extremities as proposed by the German Research Network on Neuropathic Pain) and bilateral at volar forearms as a neighboring nonstandard area. The parameters obtained were statistically compared between test sites. Three of the 10 QST parameters differed significantly with respect to the "body area," that is, warmth detection, thermal sensory limen, and mechanical pain thresholds. After z-transformation and interpretation according to the QST battery's standard instructions, 22 abnormal values were obtained at the hand. Applying the same procedure to parameters assessed at the nonstandard site forearm, that is, z-transforming them to the reference values for the hand, 24 measurements values emerged as abnormal, which was not significantly different compared with the hand (P=0.4185). Sensory differences between neighboring body areas are statistically significant, reproducing prior knowledge. This has to be considered in scientific assessments where a small variation of the tested body areas may not be an option. However, the magnitude of these differences was below the difference in sensory parameters that is judged as abnormal, indicating a robustness of the QST instrument against protocol deviations with respect to the test area when using the method of comparison with a 95 % confidence interval of a reference dataset.
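The z-transformation and abnormality criterion referred to above take a simple form; a minimal sketch follows (the reference numbers are placeholders, not the German Research Network on Neuropathic Pain normative data).

```python
# Sketch: z-transform a QST measurement against reference values and flag abnormal results.
def qst_z(value: float, ref_mean: float, ref_sd: float) -> float:
    """Express a measurement relative to the reference mean and standard deviation."""
    return (value - ref_mean) / ref_sd

z = qst_z(value=2.1, ref_mean=1.2, ref_sd=0.4)          # hypothetical parameter and reference
print(z, "abnormal" if abs(z) > 1.96 else "within the 95% reference interval")
```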
Bohner, Lauren Oliveira Lima; De Luca Canto, Graziela; Marció, Bruno Silva; Laganá, Dalva Cruz; Sesma, Newton; Tortamano Neto, Pedro
2017-11-01
The internal and marginal adaptation of a computer-aided design and computer-aided manufacturing (CAD-CAM) prosthesis relies on the quality of the 3-dimensional image. The quality of imaging systems requires evaluation. The purpose of this in vitro study was to evaluate and compare the trueness of intraoral and extraoral scanners in scanning prepared teeth. Ten acrylic resin teeth to be used as a reference dataset were prepared according to standard guidelines and scanned with an industrial computed tomography system. Data were acquired with 4 scanner devices (n=10): the Trios intraoral scanner (TIS), the D250 extraoral scanner (DES), the Cerec Bluecam intraoral scanner (CBIS), and the Cerec InEosX5 extraoral scanner (CIES). For intraoral scanners, each tooth was digitized individually. Extraoral scanning was obtained from dental casts of each prepared tooth. The discrepancy between each scan and its respective reference model was obtained by deviation analysis (μm) and volume/area difference (μm). Statistical analysis was performed using linear models for repeated measurement factors test and 1-way ANOVA (α=.05). No significant differences in deviation values were found among scanners. For CBIS and CIES, the deviation was significantly higher (P<.05) for occlusal and cervical surfaces. With regard to volume differences, no statistically significant differences were found (TIS=340 ±230 μm; DES=380 ±360 μm; CBIS=780 ±770 μm; CIES=340 ±300 μm). Intraoral and extraoral scanners showed similar trueness in scanning prepared teeth. Higher discrepancies are expected to occur in the cervical region and on the occlusal surface. Copyright © 2017 Editorial Council for the Journal of Prosthetic Dentistry. Published by Elsevier Inc. All rights reserved.
Di, Yanming; Schafer, Daniel W.; Wilhelm, Larry J.; Fox, Samuel E.; Sullivan, Christopher M.; Curzon, Aron D.; Carrington, James C.; Mockler, Todd C.; Chang, Jeff H.
2011-01-01
GENE-counter is a complete Perl-based computational pipeline for analyzing RNA-Sequencing (RNA-Seq) data for differential gene expression. In addition to its use in studying transcriptomes of eukaryotic model organisms, GENE-counter is applicable for prokaryotes and non-model organisms without an available genome reference sequence. For alignments, GENE-counter is configured for CASHX, Bowtie, and BWA, but an end user can use any Sequence Alignment/Map (SAM)-compliant program of preference. To analyze data for differential gene expression, GENE-counter can be run with any one of three statistics packages that are based on variations of the negative binomial distribution. The default method is a new and simple statistical test we developed based on an over-parameterized version of the negative binomial distribution. GENE-counter also includes three different methods for assessing differentially expressed features for enriched gene ontology (GO) terms. Results are transparent and data are systematically stored in a MySQL relational database to facilitate additional analyses as well as quality assessment. We used next generation sequencing to generate a small-scale RNA-Seq dataset derived from the heavily studied defense response of Arabidopsis thaliana and used GENE-counter to process the data. Collectively, the support from analysis of microarrays as well as the observed and substantial overlap in results from each of the three statistics packages demonstrates that GENE-counter is well suited for handling the unique characteristics of small sample sizes and high variability in gene counts. PMID:21998647
Dynamic and thermodynamic processes driving the January 2014 precipitation record in southern UK
NASA Astrophysics Data System (ADS)
Oueslati, B.; Yiou, P.; Jezequel, A.
2017-12-01
Regional precipitation extremes are projected to intensify in response to planetary climate change, with important impacts on societies. Understanding and anticipating those events remains a major challenge. In this study, we revisit the mechanisms of the record winter precipitation that occurred in the southern United Kingdom in January 2014. The physical drivers of this event are analyzed using the water vapor budget. Precipitation changes are decomposed into dynamic contributions, related to changes in atmospheric circulation, and thermodynamic contributions, related to changes in water vapor. We attempt to quantify the relative importance of the two contributions during this event and examine the applicability of Clausius-Clapeyron scaling. This work provides a physical interpretation of the mechanisms associated with southern UK's wettest event, which is complementary to other studies based on statistical approaches (Schaller et al., 2016; Yiou et al., 2017). The analysis is carried out using the ERA-Interim reanalysis, motivated by the horizontal resolution of this dataset. It is then applied to present-day simulations and future projections of CMIP5 models on selected extreme precipitation events in southern UK that are comparable to January 2014 in terms of atmospheric circulation. References: Schaller, N., et al.: Human influence on climate in the 2014 southern England winter floods and their impacts, Nature Clim. Change, 6, 627-634, 2016. Yiou, P., et al.: A statistical framework for conditional extreme event attribution, Advances in Statistical Climatology, Meteorology and Oceanography, 3, 17-31, 2017.
Privacy-Preserving Data Exploration in Genome-Wide Association Studies.
Johnson, Aaron; Shmatikov, Vitaly
2013-08-01
Genome-wide association studies (GWAS) have become a popular method for analyzing sets of DNA sequences in order to discover the genetic basis of disease. Unfortunately, statistics published as the result of GWAS can be used to identify individuals participating in the study. To prevent privacy breaches, even previously published results have been removed from public databases, impeding researchers' access to the data and hindering collaborative research. Existing techniques for privacy-preserving GWAS focus on answering specific questions, such as correlations between a given pair of SNPs (DNA sequence variations). This does not fit the typical GWAS process, where the analyst may not know in advance which SNPs to consider and which statistical tests to use, how many SNPs are significant for a given dataset, etc. We present a set of practical, privacy-preserving data mining algorithms for GWAS datasets. Our framework supports exploratory data analysis, where the analyst does not know a priori how many and which SNPs to consider. We develop privacy-preserving algorithms for computing the number and location of SNPs that are significantly associated with the disease, the significance of any statistical test between a given SNP and the disease, any measure of correlation between SNPs, and the block structure of correlations. We evaluate our algorithms on real-world datasets and demonstrate that they produce significantly more accurate results than prior techniques while guaranteeing differential privacy.
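A generic differential-privacy building block in the spirit of the approach described above (not the authors' specific algorithms): the Laplace mechanism for releasing a noisy count, assuming the statistic's per-participant sensitivity has been bounded separately.

```python
# Sketch: Laplace mechanism for an epsilon-differentially-private release of a count.
import numpy as np

def laplace_release(true_count: float, sensitivity: float, epsilon: float) -> float:
    """Add Laplace(sensitivity/epsilon) noise; sensitivity must be established for the statistic."""
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical release of the number of significant SNPs, assuming sensitivity 1.
print(laplace_release(true_count=17, sensitivity=1.0, epsilon=0.5))
```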
NASA Astrophysics Data System (ADS)
Li, Ming-Xia; Palchykov, Vasyl; Jiang, Zhi-Qiang; Kaski, Kimmo; Kertész, János; Miccichè, Salvatore; Tumminello, Michele; Zhou, Wei-Xing; Mantegna, Rosario N.
2014-08-01
Big data open up unprecedented opportunities for investigating complex systems, including society. In particular, communication data serve as major sources for computational social sciences, but they have to be cleaned and filtered as they may contain spurious information due to recording errors as well as interactions, like commercial and marketing activities, not directly related to the social network. The network constructed from communication data can only be considered as a proxy for the network of social relationships. Here we apply a systematic method, based on multiple-hypothesis testing, to statistically validate the links and then construct the corresponding Bonferroni network, generalized to the directed case. We study two large datasets of mobile phone records, one from Europe and the other from China. For both datasets we compare the raw data networks with the corresponding Bonferroni networks and point out significant differences in the structures and in the basic network measures. We show evidence that the Bonferroni network provides a better proxy for the network of social interactions than the original one. Using the filtered networks, we investigated the statistics and temporal evolution of small directed 3-motifs and concluded that closed communication triads have a formation time scale, which is quite fast and typically intraday. We also find that open communication triads preferentially evolve into other open triads with a higher fraction of reciprocated calls. These stylized facts were observed for both datasets.
Teaching Students to Use Summary Statistics and Graphics to Clean and Analyze Data
ERIC Educational Resources Information Center
Holcomb, John; Spalsbury, Angela
2005-01-01
Textbooks and websites today abound with real data. One neglected issue is that statistical investigations often require a good deal of "cleaning" to ready data for analysis. The purpose of this dataset and exercise is to teach students to use exploratory tools to identify erroneous observations. This article discusses the merits of such…
Our study assesses the value of both in vitro assay and quantitative structure activity relationship (QSAR) data in predicting in vivo toxicity using numerous statistical models and approaches to process the data. Our models are built on datasets of (i) 586 chemicals for which bo...
DMRfinder: efficiently identifying differentially methylated regions from MethylC-seq data.
Gaspar, John M; Hart, Ronald P
2017-11-29
DNA methylation is an epigenetic modification that is studied at a single-base resolution with bisulfite treatment followed by high-throughput sequencing. After alignment of the sequence reads to a reference genome, methylation counts are analyzed to determine genomic regions that are differentially methylated between two or more biological conditions. Even though a variety of software packages is available for different aspects of the bioinformatics analysis, they often produce results that are biased or require excessive computational requirements. DMRfinder is a novel computational pipeline that identifies differentially methylated regions efficiently. Following alignment, DMRfinder extracts methylation counts and performs a modified single-linkage clustering of methylation sites into genomic regions. It then compares methylation levels using beta-binomial hierarchical modeling and Wald tests. Among its innovative attributes are the analyses of novel methylation sites and methylation linkage, as well as the simultaneous statistical analysis of multiple sample groups. To demonstrate its efficiency, DMRfinder is benchmarked against other computational approaches using a large published dataset. Contrasting two replicates of the same sample yielded minimal genomic regions with DMRfinder, whereas two alternative software packages reported a substantial number of false positives. Further analyses of biological samples revealed fundamental differences between DMRfinder and another software package, despite the fact that they utilize the same underlying statistical basis. For each step, DMRfinder completed the analysis in a fraction of the time required by other software. Among the computational approaches for identifying differentially methylated regions from high-throughput bisulfite sequencing datasets, DMRfinder is the first that integrates all the post-alignment steps in a single package. Compared to other software, DMRfinder is extremely efficient and unbiased in this process. DMRfinder is free and open-source software, available on GitHub ( github.com/jsh58/DMRfinder ); it is written in Python and R, and is supported on Linux.
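The region-building step described above can be illustrated with a simple gap-based clustering of CpG positions (similar in spirit to the modified single-linkage clustering; the gap and minimum-site parameters are illustrative, not DMRfinder defaults).

```python
# Sketch: cluster sorted CpG positions on one chromosome into candidate regions.
def cluster_sites(positions, max_gap=100, min_sites=3):
    """positions: sorted list of CpG coordinates; returns (start, end, n_sites) tuples."""
    regions, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] <= max_gap:
            current.append(pos)                  # extend the current cluster
        else:
            if len(current) >= min_sites:
                regions.append((current[0], current[-1], len(current)))
            current = [pos]                      # start a new cluster
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions

print(cluster_sites([100, 130, 190, 1500, 1550, 1580, 1610]))
# -> [(100, 190, 3), (1500, 1610, 4)]
```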
NASA Astrophysics Data System (ADS)
Rieder, H. E.; Staehelin, J.; Maeder, J. A.; Ribatet, M.; Davison, A. C.
2009-04-01
Various generations of satellites (e.g. TOMS, GOME, OMI) made spatial datasets of column ozone available to the scientific community. This study has a special focus on column ozone over the northern mid-latitudes. Tools from geostatistics and extreme value theory are applied to analyze variability, long-term trends and frequency distributions of extreme events in total ozone. In a recent case study (Rieder et al., 2009) new tools from extreme value theory (Coles, 2001; Ribatet, 2007) have been applied to the world's longest total ozone record from Arosa, Switzerland (e.g. Staehelin 1998a,b), in order to describe extreme events in low and high total ozone. Within the current study this analysis is extended to satellite datasets for the northern mid-latitudes. Further special emphasis is given on patterns and spatial correlations and the influence of changes in atmospheric dynamics (e.g. tropospheric and lower stratospheric pressure systems) on column ozone. References: Coles, S.: An Introduction to Statistical Modeling of Extreme Values, Springer Series in Statistics, ISBN:1852334592, Springer, Berlin, 2001. Ribatet, M.: POT: Modelling peaks over a threshold, R News, 7, 34-36, 2007. Rieder, H.E., Staehelin, J., Maeder, J.A., Ribatet, M., Stübi, R., Weihs, P., Holawe, F., Peter, T., and Davison, A.C.: From ozone mini holes and mini highs towards extreme value theory: New insights from extreme events and non stationarity, submitted to J. Geophys. Res., 2009. Staehelin, J., Kegel, R., and Harris, N. R.: Trend analysis of the homogenized total ozone series of Arosa (Switzerland), 1929-1996, J. Geophys. Res., 103(D7), 8389-8400, doi:10.1029/97JD03650, 1998a. Staehelin, J., Renaud, A., Bader, J., McPeters, R., Viatte, P., Hoegger, B., Bugnion, V., Giroud, M., and Schill, H.: Total ozone series at Arosa (Switzerland): Homogenization and data comparison, J. Geophys. Res., 103(D5), 5827-5842, doi:10.1029/97JD02402, 1998b.
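A minimal peaks-over-threshold sketch with a generalized Pareto fit, analogous in spirit to the extreme-value tools cited above (scipy here rather than the R POT package); the ozone series, threshold choice, and return period are hypothetical.

```python
# Sketch: peaks-over-threshold analysis of a column-ozone series with a GPD fit.
import numpy as np
from scipy.stats import genpareto

ozone = np.loadtxt("total_ozone_du.txt")           # hypothetical daily column ozone (DU)
threshold = np.quantile(ozone, 0.95)               # threshold for high-ozone extremes
exceedances = ozone[ozone > threshold] - threshold

shape, loc, scale = genpareto.fit(exceedances, floc=0.0)
# Return level for a 20-year return period, given the mean exceedance rate per year.
n_per_year = 365.25 * (ozone > threshold).mean()
ret_level = threshold + genpareto.ppf(1 - 1 / (20 * n_per_year), shape, loc=0.0, scale=scale)
print(shape, scale, ret_level)
```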
NASA Astrophysics Data System (ADS)
Zolina, Olga; Simmer, Clemens; Kapala, Alice; Mächel, Hermann; Gulev, Sergey; Groisman, Pavel
2014-05-01
We present new high-resolution daily precipitation grids developed at the Meteorological Institute of the University of Bonn and the German Weather Service (DWD) under the STAMMEX project (Spatial and Temporal Scales and Mechanisms of Extreme Precipitation Events over Central Europe). Daily precipitation grids have been developed from the daily-observing precipitation network of DWD, which runs one of the world's densest rain gauge networks comprising more than 7500 stations. Several quality-controlled daily gridded products with homogenized sampling were developed covering the periods 1931-onwards (with 0.5 degree resolution), 1951-onwards (0.25 degree and 0.5 degree), and 1971-2000 (0.1 degree). Different methods were tested to select the best gridding methodology that minimizes errors of integral grid estimates over hilly terrain. Besides daily precipitation values with uncertainty estimates (which include standard estimates of the kriging uncertainty as well as error estimates derived by a bootstrapping algorithm), the STAMMEX data sets include a variety of statistics that characterize temporal and spatial dynamics of the precipitation distribution (quantiles, extremes, wet/dry spells, etc.). Comparisons with existing continental-scale daily precipitation grids (e.g., CRU, ECA E-OBS, GCOS), which include considerably fewer observations than those used in STAMMEX, demonstrate the added value of high-resolution grids for extreme rainfall analyses. These data exhibit spatial variability patterns and trends in precipitation extremes, which are missed or incorrectly reproduced over Central Europe by coarser-resolution grids based on sparser networks. The STAMMEX dataset can be used for high-quality climate diagnostics of precipitation variability, as a reference for reanalyses and remotely-sensed precipitation products (including the upcoming Global Precipitation Mission products), and for input into regional climate and operational weather forecast models. We will present numerous applications of the STAMMEX grids, ranging from case studies of the major Central European floods to long-term changes in different precipitation statistics, including those accounting for the alternation of dry and wet periods and precipitation intensities associated with prolonged rainy episodes.
Modeling potential habitats for alien species Dreissena polymorpha in continental USA
Mingyang, Li; Yunwei, Ju; Kumar, Sunil; Stohlgren, Thomas J.
2008-01-01
An effective measure to minimize the damage caused by invasive species is to prevent potential invaders from entering suitable areas. A total of 1,864 occurrence points with GPS coordinates and 34 environmental variables from Daymet datasets were gathered, and 4 modeling methods, i.e., Logistic Regression (LR), Classification and Regression Trees (CART), Genetic Algorithm for Rule-Set Prediction (GARP), and the maximum entropy method (Maxent), were used to generate potential geographic distributions for the invasive species Dreissena polymorpha in the continental USA. Then, 3 statistical criteria (the area under the Receiver Operating Characteristic curve (AUC), Pearson correlation (COR) and the Kappa value) were calculated to evaluate the performance of the models, followed by analyses of the major contributing variables. The results showed that, in terms of the 3 statistical criteria, the predictions of the 4 ecological niche models were either excellent or outstanding, with Maxent outperforming the others in 3 respects: predicting current distribution habitats, selecting major contributing factors, and quantifying the influence of environmental variables on habitats. Distance to water, elevation, frequency of precipitation and solar radiation were the 4 major environmental forcing factors. The approach suggested in this paper can serve as a reference for modeling habitats of alien species in China and can help guide efforts to prevent Mytilopsis sallei along the Chinese coastline.
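For reference, the three evaluation criteria named in this abstract (AUC, COR and Kappa) can be computed as follows; the presence/absence labels and model scores are invented solely to show the calculation, and the 0.5 cutoff used for Kappa is an assumption.

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score, cohen_kappa_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # observed presence/absence
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])   # model habitat suitability

auc = roc_auc_score(y_true, y_score)                 # threshold-independent discrimination
cor, _ = pearsonr(y_true, y_score)                   # point-biserial Pearson correlation
kappa = cohen_kappa_score(y_true, (y_score >= 0.5).astype(int))  # requires a cutoff
print(auc, cor, kappa)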
Remote visual analysis of large turbulence databases at multiple scales
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pulido, Jesus; Livescu, Daniel; Kanov, Kalin
The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.
Remote visual analysis of large turbulence databases at multiple scales
Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...
2018-06-15
The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.
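The wavelet-based multi-resolution idea underlying the framework can be illustrated with a small PyWavelets sketch on a synthetic 2-D field; the wavelet family (db4), decomposition level and the crude truncation used here are assumptions for demonstration and do not reproduce the database's actual compression scheme.

import numpy as np
import pywt

field = np.random.default_rng(1).standard_normal((256, 256))  # stand-in for one DNS velocity slice

# Three-level 2-D wavelet decomposition (Daubechies-4)
coeffs = pywt.wavedec2(field, wavelet="db4", level=3)

# Crude "compression": discard the finest-scale detail coefficients before reconstruction
coeffs[-1] = tuple(np.zeros_like(d) for d in coeffs[-1])
approx = pywt.waverec2(coeffs, wavelet="db4")
approx = approx[: field.shape[0], : field.shape[1]]   # trim possible boundary padding

print("relative L2 error:", np.linalg.norm(field - approx) / np.linalg.norm(field))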
Statistical analysis of co-occurrence patterns in microbial presence-absence datasets
Bewick, Sharon; Thielen, Peter; Mehoke, Thomas; Breitwieser, Florian P.; Paudel, Shishir; Adhikari, Arjun; Wolfe, Joshua; Slud, Eric V.; Karig, David; Fagan, William F.
2017-01-01
Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially for whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson’s correlation coefficient (r) and Jaccard’s index (J)–two of the most common metrics for correlation analysis of presence-absence data–can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson’s correlation coefficient can result in artificial inflation of positive taxon relationships and how this is a particular problem for microbiome studies. We then illustrate how Jaccard’s index of similarity (J) can yield improvements over Pearson’s correlation coefficient. However, the standard null model for Jaccard’s index is flawed, and thus introduces its own set of spurious conclusions. We thus identify a better null model based on a hypergeometric distribution, which appropriately corrects for species prevalence. This model is available from recent statistics literature, and can be used for evaluating the significance of any value of an empirically observed Jaccard’s index. The resulting simple, yet effective method for handling correlation analysis of microbial presence-absence datasets provides a robust means of testing and finding relationships and/or shared environmental responses among microbial taxa. PMID:29145425
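A minimal sketch of the approach advocated above, computing Jaccard's index for a species pair and a one-sided co-occurrence p-value under a hypergeometric null that conditions on each species' prevalence, is shown below; the tail convention and the toy presence/absence vectors are assumptions for illustration.

import numpy as np
from scipy.stats import hypergeom

def jaccard_and_pvalue(x, y):
    """x, y: binary presence/absence vectors over the same samples.
    Returns Jaccard's index and a one-sided p-value for positive co-occurrence
    under a hypergeometric null conditioned on each species' prevalence."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    n = x.size
    a, b = int(x.sum()), int(y.sum())
    k = int(np.logical_and(x, y).sum())
    union = int(np.logical_or(x, y).sum())
    jaccard = k / union if union else 0.0
    # P(co-occurrences >= k) when b samples are "drawn" from n, a of which hold species x
    p_value = hypergeom.sf(k - 1, n, a, b)
    return jaccard, p_value

x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
print(jaccard_and_pvalue(x, y))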
Enabling Linked Science in Global Climate Uncertainty Quantification (UQ) Research
NASA Astrophysics Data System (ADS)
Elsethagen, T.; Stephan, E.; Lin, G.; Williams, D.; Banks, E.
2012-12-01
This paper shares a real-world global climate UQ science use case and illustrates how a linked science application called Provenance Environment (ProvEn), currently being developed, enables scientific teams to publish, share, link, and discover new links over their UQ research results. UQ results include terascale datasets that are published to an Earth Systems Grid Federation (ESGF) repository. ProvEn demonstrates how a scientific team conducting UQ studies can discover dataset links using its domain knowledgebase, allowing them to better understand the UQ study research objectives, the experimental protocol used, the resulting dataset lineage, related analytical findings, and ancillary literature citations, along with the social network of scientists associated with the study. This research claims that the linked science approach not only allows scientists to benefit greatly from understanding a particular dataset within a knowledge context, but also provides benefits through the cross-referencing of knowledge among the numerous UQ studies stored in ESGF. ProvEn collects native forms of data provenance resources as the UQ study is carried out. The native data provenance resources can be collected from a variety of sources such as scripts, a workflow engine log, simulation log files, scientific team members, etc. Schema alignment is used to translate the native forms of provenance into a set of W3C PROV-O semantic statements used as a common interchange format, which also contains URI references back to resources in the UQ study dataset for querying and cross-referencing. ProvEn leverages Fedora Commons' digital object model in a Resource Oriented Architecture (ROA) (i.e. a RESTful framework) to logically organize and partition native and translated provenance resources by UQ study. The ROA also provides scientists with the means to search both native and translated forms of provenance.
Yue, Lilly Q
2012-01-01
In the evaluation of medical products, including drugs, biological products, and medical devices, comparative observational studies could play an important role when properly conducted randomized, well-controlled clinical trials are infeasible for ethical or practical reasons. However, various biases could be introduced at every stage and into every aspect of an observational study, and consequently the interpretation of the resulting statistical inference would be of concern. While statistical techniques do exist for addressing some of the challenging issues, often based on propensity score methodology, these statistical tools have probably not been as widely employed in prospectively designing observational studies as they should be. There are also times when they are implemented in an unscientific manner, such as performing propensity score model selection on a dataset that also contains the outcome data, so that the integrity of the observational study design and the interpretability of the outcome analysis results could be compromised. In this paper, regulatory considerations on prospective study design using propensity scores are shared and illustrated with hypothetical examples.
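The design principle discussed above, specifying and checking a propensity score model on baseline covariates and treatment assignment before any outcome data are touched, can be sketched as follows; the covariate names and data are hypothetical and the balance diagnostic shown is one common choice, not a regulatory prescription.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
design = pd.DataFrame({
    "age": rng.normal(65, 10, 500),
    "diabetes": rng.integers(0, 2, 500),
    "ejection_fraction": rng.normal(55, 8, 500),
    "treated": rng.integers(0, 2, 500),      # device vs. control
})
# Note: no outcome column is present at this stage -- the propensity model is
# specified and assessed using baseline covariates and treatment assignment only.
covariates = ["age", "diabetes", "ejection_fraction"]
ps_model = LogisticRegression(max_iter=1000).fit(design[covariates], design["treated"])
design["propensity"] = ps_model.predict_proba(design[covariates])[:, 1]

# Covariate balance check (standardized mean differences), done before any outcome analysis
for c in covariates:
    t, u = design[design.treated == 1][c], design[design.treated == 0][c]
    smd = (t.mean() - u.mean()) / np.sqrt(0.5 * (t.var() + u.var()))
    print(c, round(smd, 3))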
Software for the Integration of Multiomics Experiments in Bioconductor.
Ramos, Marcel; Schiffer, Lucas; Re, Angela; Azhar, Rimsha; Basunia, Azfar; Rodriguez, Carmen; Chan, Tiffany; Chapman, Phil; Davis, Sean R; Gomez-Cabrero, David; Culhane, Aedin C; Haibe-Kains, Benjamin; Hansen, Kasper D; Kodali, Hanish; Louis, Marie S; Mer, Arvind S; Riester, Markus; Morgan, Martin; Carey, Vince; Waldron, Levi
2017-11-01
Multiomics experiments are increasingly commonplace in biomedical research and add layers of complexity to experimental design, data integration, and analysis. R and Bioconductor provide a generic framework for statistical analysis and visualization, as well as specialized data classes for a variety of high-throughput data types, but methods are lacking for integrative analysis of multiomics experiments. The MultiAssayExperiment software package, implemented in R and leveraging Bioconductor software and design principles, provides for the coordinated representation of, storage of, and operation on multiple diverse genomics data. We provide the unrestricted multiple 'omics data for each cancer tissue in The Cancer Genome Atlas as ready-to-analyze MultiAssayExperiment objects and demonstrate in these and other datasets how the software simplifies data representation, statistical analysis, and visualization. The MultiAssayExperiment Bioconductor package reduces major obstacles to efficient, scalable, and reproducible statistical analysis of multiomics data and enhances data science applications of multiple omics datasets. Cancer Res; 77(21); e39-42. ©2017 American Association for Cancer Research.
MOnthly TEmperature DAtabase of Spain 1951-2010: MOTEDAS. (1) Quality control
NASA Astrophysics Data System (ADS)
Peña-Angulo, Dhais; Cortesi, Nicola; Simolo, Claudia; Stepanek, Peter; Brunetti, Michele; González-Hidalgo, José Carlos
2014-05-01
The HIDROCAES project (Impactos Hidrológicos del Calentamiento Global en España, Spanish Ministry of Research CGL2011-27574-C02-01) focuses on high-resolution analysis of warming processes over continental Spain during 1951-2010. To this end, the Department of Geography (University of Zaragoza, Spain), the Hydrometeorological Service (Brno Division, Czech Republic) and the ISAC-CNR (Bologna, Italy) are developing the new dataset MOTEDAS (MOnthly TEmperature DAtabase of Spain), from which we present a collection of posters showing (1) the general structure of the dataset and its quality control; (2) the analyses of spatial correlation of monthly mean values of maximum (Tmax) and minimum (Tmin) temperature; (3) the reconstruction of series and the development of a high-resolution grid; and (4) the initial results of trend analyses of annual, seasonal and monthly range mean values. MOTEDAS has been created after exhaustive analyses and quality control of the original digitized data of the Spanish National Meteorological Agency (Agencia Estatal de Meteorología, AEMET). Quality control was applied without any prior reconstruction, i.e. on the original series. From the total number of series stored in the AEMET archives (more than 4680), we selected only those with at least 10 years of data (i.e. 120 months; 3066 series) for the quality-control and reconstruction processes (see Poster MOTEDAS 3). Quality control was based on checks of the series (length of series, Tmax and Tmin values, upper and lower thresholds of absolute data, etc.) and on comparison with reference series (see Poster MOTEDAS 3, about reconstruction). Data were considered anomalous when the difference between the candidate and reference series was higher than three times the interquartile distance. The total amount of monthly suspicious data recognized and discarded at the end of these analyses was 7832 values for Tmin and 8063 for Tmax; they represent less than 0.8% of the original monthly data, for both Tmax and Tmin. No spatial pattern was detected in the suspicious data; month by month, Tmin shows maximum detection in summer months, while Tmax does not show any monthly pattern. Secondly, the homogeneity analysis was performed on the series free of anomalous data using an array of tests (SNHT, Bivariate, Student's t and Pettitt), with new reference series calculated from the data free of anomalies. The tests were applied at monthly, seasonal and annual scales (i.e. 17 times per method). Statistical inhomogeneity detections were accepted as follows: three annual detections (monthly, seasonal, annual) must be found by the SNHT or Bivariate test; the total number of detections by the four tests must be greater than 5% of the total possible detections per year. Before any correction, we examined the candidate and reference series charts. ProClimDB and AnClim software were used throughout the process. The total number of series affected by inhomogeneities was 1013 (Tmax) and 1011 (Tmin), i.e. about one third of the original series was considered inhomogeneous. We note that the inhomogeneous series identified for Tmax and Tmin usually do not coincide. This apparently small number of affected series compared with previous work could arise because the mean length of the series is around 15-20 years. References: Stepánek P., 2008a. AnClim - software for time series analysis (for Windows 95/NT). Department of Geography, Faculty of Natural Sciences, MU, Brno, 1.47 B. Stepánek P., 2008b. ProClimDB - Software for Processing Climatological Datasets. CHMI, Regional office, Brno.
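One plausible reading of the anomaly rule described above (candidate-minus-reference differences beyond three times the interquartile distance) is sketched below in Python; the fence construction and the toy monthly values are assumptions for illustration only.

import numpy as np

def flag_anomalies(candidate, reference, k=3.0):
    """Flag months where the candidate-minus-reference difference departs from
    the typical difference by more than k times the interquartile range."""
    diff = np.asarray(candidate, float) - np.asarray(reference, float)
    q1, q3 = np.nanpercentile(diff, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (diff < lower) | (diff > upper)

cand = np.array([12.1, 13.1, 14.2, 25.0, 15.1, 14.8])   # candidate monthly Tmax series (degC)
ref  = np.array([12.0, 13.0, 14.0, 14.9, 15.0, 14.7])   # reference series
print(flag_anomalies(cand, ref))   # only the fourth month is flagged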
U.S. Department of Energy Reference Model Program RM1: Experimental Results.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hill, Craig; Neary, Vincent Sinclair; Gunawan, Budi
The Reference Model Project (RMP), sponsored by the U.S. Department of Energy’s (DOE) Wind and Water Power Technologies Program within the Office of Energy Efficiency & Renewable Energy (EERE), aims at expediting industry growth and efficiency by providing nonproprietary Reference Models (RM) of MHK technology designs as study objects for open-source research and development (Neary et al. 2014a,b). As part of this program, MHK turbine models were tested in a large open channel facility at the University of Minnesota’s St. Anthony Falls Laboratory (UMN-SAFL). Reference Model 1 (RM1) is a 1:40 geometric scale dual-rotor axial flow horizontal axis device with counter-rotating rotors, each with a rotor diameter dT = 0.5 m. Precise blade angular position and torque measurements were synchronized with three acoustic Doppler velocimeters (ADVs) aligned with each rotor and the midpoint for RM1. Flow conditions for each case were controlled such that depth, h = 1 m, and volumetric flow rate, Qw = 2.425 m3 s-1, resulting in a hub height velocity of approximately Uhub = 1.05 m s-1 and blade chord length Reynolds numbers of Rec ≈ 3.0x10^5. Vertical velocity profiles collected in the wake of each device from 1 to 10 rotor diameters are used to estimate the velocity recovery and turbulent characteristics in the wake, as well as the interaction of the counter-rotating rotor wakes. The development of this high resolution laboratory investigation provides a robust dataset that enables assessing turbulence performance models and their ability to accurately predict device performance metrics, including computational fluid dynamics (CFD) models that can be used to predict turbulent inflow environments, reproduce wake velocity deficit, recovery and higher order turbulent statistics, as well as device performance metrics.
Abràmoff, Michael David; Lou, Yiyue; Erginay, Ali; Clarida, Warren; Amelon, Ryan; Folk, James C; Niemeijer, Meindert
2016-10-01
To compare the performance of a deep-learning enhanced algorithm for automated detection of diabetic retinopathy (DR) with the previously published performance of that algorithm without deep learning components, the Iowa Detection Program (IDP), on the same publicly available set of fundus images and the previously reported consensus reference standard set by three US Board-certified retinal specialists. We used the previously reported consensus reference standard of referable DR (rDR), defined as International Clinical Classification of Diabetic Retinopathy moderate, severe nonproliferative (NPDR), proliferative DR, and/or macular edema (ME). Neither the Messidor-2 images nor the three retinal specialists setting the Messidor-2 reference standard were used for training IDx-DR version X2.1. Sensitivity, specificity, negative predictive value, area under the curve (AUC), and their confidence intervals (CIs) were calculated. Sensitivity was 96.8% (95% CI: 93.3%-98.8%), specificity was 87.0% (95% CI: 84.2%-89.4%), with 6/874 false negatives, resulting in a negative predictive value of 99.0% (95% CI: 97.8%-99.6%). No cases of severe NPDR, PDR, or ME were missed. The AUC was 0.980 (95% CI: 0.968-0.992). Sensitivity was not statistically different from the published IDP sensitivity, which had a CI of 94.4% to 99.3%, but specificity was significantly better than the published IDP specificity CI of 55.7% to 63.0%. A deep-learning enhanced algorithm for the automated detection of DR achieves significantly better performance than a previously reported, otherwise essentially identical, algorithm that does not employ deep learning. Deep-learning enhanced algorithms have the potential to improve the efficiency of DR screening, and thereby to prevent visual loss and blindness from this devastating disease.
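The diagnostic-accuracy summaries quoted above can be reproduced from a 2x2 confusion table as sketched below; the counts are invented and the Wilson interval is one common way to obtain the confidence limits, not necessarily the method used in the study.

from statsmodels.stats.proportion import proportion_confint

# Hypothetical confusion-table counts against the reference standard
tp, fn = 180, 6        # referable DR cases detected / missed
tn, fp = 600, 88       # non-referable cases correctly passed / flagged

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
npv = tn / (tn + fn)

sens_ci = proportion_confint(tp, tp + fn, alpha=0.05, method="wilson")
spec_ci = proportion_confint(tn, tn + fp, alpha=0.05, method="wilson")
npv_ci = proportion_confint(tn, tn + fn, alpha=0.05, method="wilson")
print(sensitivity, sens_ci)
print(specificity, spec_ci)
print(npv, npv_ci)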
Howie, Bryan N.; Donnelly, Peter; Marchini, Jonathan
2009-01-01
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions. PMID:19543373
Nahid, Abdullah-Al; Mehrabi, Mohamad Ali; Kong, Yinan
2018-01-01
Breast cancer is a serious threat and one of the leading causes of death among women throughout the world. The identification of cancer largely depends on the analysis of digital biomedical images, such as histopathological images, by doctors and physicians. Analyzing histopathological images is a nontrivial task, and decisions from the investigation of these kinds of images always require specialised knowledge. However, Computer Aided Diagnosis (CAD) techniques can help the doctor make more reliable decisions. The state-of-the-art Deep Neural Network (DNN) has recently been introduced for biomedical image analysis. Normally each image contains structural and statistical information. This paper classifies a set of biomedical breast cancer images (BreakHis dataset) using novel DNN techniques guided by structural and statistical information derived from the images. Specifically a Convolutional Neural Network (CNN), a Long-Short-Term-Memory (LSTM), and a combination of CNN and LSTM are proposed for breast cancer image classification. Softmax and Support Vector Machine (SVM) layers have been used for the decision-making stage after extracting features utilising the proposed novel DNN models. In this experiment the best Accuracy value of 91.00% is achieved on the 200x dataset, the best Precision value of 96.00% is achieved on the 40x dataset, and the best F-Measure value is achieved on both the 40x and 100x datasets.
Development and application of GIS-based PRISM integration through a plugin approach
NASA Astrophysics Data System (ADS)
Lee, Woo-Seop; Chun, Jong Ahn; Kang, Kwangmin
2014-05-01
A PRISM (Parameter-elevation Regressions on Independent Slopes Model) QGIS-plugin was developed on the Quantum GIS platform in this study. This Quantum GIS plugin system provides user-friendly graphical user interfaces (GUIs) so that users can obtain gridded meteorological data at high resolution (1 km × 1 km). Also, this software is designed to run on a personal computer, so it does not require internet access or a sophisticated computer system. This module is a user-friendly system with which a user can generate PRISM data with ease. The proposed PRISM QGIS-plugin is a hybrid statistical-geographic model system that uses coarse-resolution datasets (APHRODITE datasets in this study) together with digital elevation data to generate fine-resolution gridded precipitation. To validate the performance of the software, the Prek Thnot River Basin in Kandal, Cambodia, was selected for application. Overall statistical analysis shows promising outputs generated by the proposed plugin. Error measures such as RMSE (Root Mean Square Error) and MAPE (Mean Absolute Percentage Error) were used to evaluate the performance of the developed PRISM QGIS-plugin. Evaluation results using RMSE and MAPE were 2.76 mm and 4.2%, respectively. This study suggests that the plugin can be used to generate high-resolution precipitation datasets for hydrological and climatological studies in watersheds where observed weather datasets are limited.
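The two error measures used in this validation, RMSE and MAPE, are computed as in the following sketch; the observed and predicted values are placeholders.

import numpy as np

def rmse(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.sqrt(np.mean((pred - obs) ** 2))

def mape(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs((pred - obs) / obs))

observed = [110.0, 95.0, 130.0, 80.0]     # e.g. gauge precipitation totals (mm)
predicted = [108.0, 99.0, 126.0, 83.0]    # corresponding gridded estimates
print(rmse(observed, predicted), mape(observed, predicted))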
NASA Technical Reports Server (NTRS)
Witte, J. C.; Thompson, A. M.; Schmidlin, F. J.; Oltmans, S. J.; McPeters, R. D.; Smit, H. G. J.
2003-01-01
A network of 12 southern hemisphere tropical and subtropical stations in the Southern Hemisphere ADditional OZonesondes (SHADOZ) project has provided over 2000 profiles of stratospheric and tropospheric ozone since 1998. Balloon-borne electrochemical concentration cell (ECC) ozonesondes are used with standard radiosondes for pressure, temperature and relative humidity measurements. The archived data are available at: http://croc.gsfc.nasa.gov/shadoz. In Thompson et al., accuracies and imprecisions in the SHADOZ 1998-2000 dataset were examined using ground-based instruments and the TOMS total ozone measurement (version 7) as references. Small variations in ozonesonde technique introduced possible biases from station-to-station. SHADOZ total ozone column amounts are now compared to version 8 TOMS; discrepancies between the two datasets are reduced 2% on average. An evaluation of ozone variations among the stations is made using the results of a series of chamber simulations of ozone launches (JOSIE-2000, Juelich Ozonesonde Intercomparison Experiment) in which a standard reference ozone instrument was employed with the various sonde techniques used in SHADOZ. A number of variations in SHADOZ ozone data are explained when differences in solution strength, data processing and instrument type (manufacturer) are taken into account.
Sunspot Pattern Classification using PCA and Neural Networks (Poster)
NASA Technical Reports Server (NTRS)
Rajkumar, T.; Thompson, D. E.; Slater, G. L.
2005-01-01
The sunspot classification scheme presented in this paper is treated as a 2-D classification problem on archived datasets, and is not a real-time system. As a first step, it mirrors the Zuerich/McIntosh historical classification system and reproduces classification of sunspot patterns based on preprocessing and neural net training datasets. Ultimately, the project intends to move beyond more rudimentary schemes to develop spatial-temporal-spectral classes derived by correlating spatial and temporal variations in various wavelengths to the brightness fluctuation spectrum of the sun in those wavelengths. Once the approach is generalized, the focus will naturally move from a 2-D to an n-D classification, where "n" includes time and frequency. Here, the 2-D perspective refers both to the actual SOHO Michelson Doppler Imager (MDI) images that are processed and to the fact that a 2-D matrix is created from each image during preprocessing. The 2-D matrix is the result of running Principal Component Analysis (PCA) over the selected dataset images, and the resulting matrices and their eigenvalues are the objects that are stored in a database, classified, and compared. These matrices are indexed according to the standard McIntosh classification scheme.
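The preprocessing described above, running PCA over a set of images and storing the resulting components and eigenvalues, might look like the following scikit-learn sketch; the synthetic image stack and the number of retained components are assumptions.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
images = rng.random((100, 64, 64))            # stand-in for 100 MDI image cut-outs

X = images.reshape(len(images), -1)           # flatten each image to a feature vector
pca = PCA(n_components=10)
scores = pca.fit_transform(X)                 # per-image coordinates in eigen-space

# The component matrix, eigenvalues and scores are what would be stored and compared
print(pca.components_.shape)                  # (10, 4096)
print(pca.explained_variance_[:3])
print(scores.shape)                           # (100, 10)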
Enrichment of OpenStreetMap Data Completeness with Sidewalk Geometries Using Data Mining Techniques.
Mobasheri, Amin; Huang, Haosheng; Degrossi, Lívia Castro; Zipf, Alexander
2018-02-08
Tailored routing and navigation services utilized by wheelchair users require certain information about sidewalk geometries and their attributes to execute efficiently. Except some minor regions/cities, such detailed information is not present in current versions of crowdsourced mapping databases including OpenStreetMap. CAP4Access European project aimed to use (and enrich) OpenStreetMap for making it fit to the purpose of wheelchair routing. In this respect, this study presents a modified methodology based on data mining techniques for constructing sidewalk geometries using multiple GPS traces collected by wheelchair users during an urban travel experiment. The derived sidewalk geometries can be used to enrich OpenStreetMap to support wheelchair routing. The proposed method was applied to a case study in Heidelberg, Germany. The constructed sidewalk geometries were compared to an official reference dataset ("ground truth dataset"). The case study shows that the constructed sidewalk network overlays with 96% of the official reference dataset. Furthermore, in terms of positional accuracy, a low Root Mean Square Error (RMSE) value (0.93 m) is achieved. The article presents our discussion on the results as well as the conclusion and future research directions.
GAGES-II: Geospatial Attributes of Gages for Evaluating Streamflow
Falcone, James A.
2011-01-01
This dataset, termed "GAGES II", an acronym for Geospatial Attributes of Gages for Evaluating Streamflow, version II, provides geospatial data and classifications for 9,322 stream gages maintained by the U.S. Geological Survey (USGS). It is an update to the original GAGES, which was published as a Data Paper on the journal Ecology's website (Falcone and others, 2010b) in 2010. The GAGES II dataset consists of gages which have had either 20+ complete years (not necessarily continuous) of discharge record since 1950, or are currently active, as of water year 2009, and whose watersheds lie within the United States, including Alaska, Hawaii, and Puerto Rico. Reference gages were identified based on indicators that they were the least-disturbed watersheds within the framework of broad regions, based on 12 major ecoregions across the United States. Of the 9,322 total sites, 2,057 are classified as reference, and 7,265 as non-reference. Of the 2,057 reference sites, 1,633 have (through 2009) 20+ years of record since 1950. Some sites have very long flow records: a number of gages have been in continuous service since 1900 (at least), and have 110 years of complete record (1900-2009) to date. The geospatial data include several hundred watershed characteristics compiled from national data sources, including environmental features (e.g. climate – including historical precipitation, geology, soils, topography) and anthropogenic influences (e.g. land use, road density, presence of dams, canals, or power plants). The dataset also includes comments from local USGS Water Science Centers, based on Annual Data Reports, pertinent to hydrologic modifications and influences. The data posted also include watershed boundaries in GIS format. This overall dataset is different in nature to the USGS Hydro-Climatic Data Network (HCDN; Slack and Landwehr 1992), whose data evaluation ended with water year 1988. The HCDN identifies stream gages which at some point in their history had periods which represented natural flow, and the years in which those natural flows occurred were identified (i.e. not all HCDN sites were in reference condition even in 1988, for example, 02353500). The HCDN remains a valuable indication of historic natural streamflow data. However, the goal of this dataset was to identify watersheds which currently have near-natural flow conditions, and the 2,057 reference sites identified here were derived independently of the HCDN. A subset, however, noted in the BasinID worksheet as “HCDN-2009”, has been identified as an updated list of 743 sites for potential hydro-climatic study. The HCDN-2009 sites fulfill all of the following criteria: (a) have 20 years of complete and continuous flow record in the last 20 years (water years 1990-2009), and were thus also currently active as of 2009, (b) are identified as being in current reference condition according to the GAGES-II classification, (c) have less than 5 percent imperviousness as measured from the NLCD 2006, and (d) were not eliminated by a review from participating state Water Science Center evaluators. The data posted here consist of the following items: this point shapefile, with summary data for the 9,322 gages; a zip file containing basin characteristics, variable definitions, and a more detailed report; a zip file containing shapefiles of basin boundaries, organized by classification and aggregated ecoregion; and a zip file containing mainstem stream lines (Arc line coverages) for each gage.
Miller, Robert; Stalder, Tobias; Jarczok, Marc; Almeida, David M.; Badrick, Ellena; Bartels, Meike; Boomsma, Dorret I.; Coe, Christopher L.; Dekker, Marieke C. J.; Donzella, Bonny; Fischer, Joachim E.; Gunnar, Megan R.; Kumari, Meena; Lederbogen, Florian; Oldehinkel, Albertine J.; Power, Christine; Rosmalen, Judith G.; Ryff, Carol D.; Subramanian, S V; Tiemeier, Henning; Watamura, Sarah E.; Kirschbaum, Clemens
2016-01-01
Diurnal salivary cortisol profiles are valuable indicators of adrenocortical functioning in epidemiological research and clinical practice. However, normative reference values derived from a large number of participants and across a wide age range are still missing. To fill this gap, data were compiled from 15 independently conducted field studies with a total of 104,623 salivary cortisol samples obtained from 18,698 unselected individuals (mean age: 48.3 years, age range: 0.5 to 98.5 years, 39% females). Besides providing a descriptive analysis of the complete dataset, we also performed mixed-effects growth curve modeling of diurnal salivary cortisol (i.e., 1 to 16 hours after awakening). Cortisol decreased significantly across the day and was influenced by both age and sex. Intriguingly, we also found a pronounced impact of sampling season, with elevated diurnal cortisol in spring and decreased levels in autumn. However, the majority of variance was accounted for by between-participant and between-study variance components. Based on these analyses, reference ranges (LC/MS-MS calibrated) for cortisol concentrations in saliva were derived for different times across the day, with more specific reference ranges generated for males and females in different age categories. This integrative summary provides important reference values on salivary cortisol to aid basic scientists and clinicians in interpreting deviations from the normal diurnal cycle. PMID:27448524
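A mixed-effects growth-curve model of the kind described could be specified as below with statsmodels; the column names, the quadratic time term and the random intercept per contributing study are assumptions made for illustration, not the authors' exact model.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 600
df = pd.DataFrame({
    "hours_since_waking": rng.uniform(1, 16, n),
    "age": rng.uniform(10, 80, n),
    "female": rng.integers(0, 2, n),
    "study": rng.integers(0, 15, n),          # 15 contributing field studies
})
df["log_cortisol"] = (2.5 - 0.08 * df.hours_since_waking
                      + 0.001 * df.age + rng.normal(0, 0.3, n))

# Random intercept per study; fixed effects for time-of-day (linear + quadratic), age and sex
model = smf.mixedlm("log_cortisol ~ hours_since_waking + I(hours_since_waking**2) + age + female",
                    data=df, groups=df["study"])
result = model.fit()
print(result.params)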
Zhang, Yiyan; Xin, Yi; Li, Qin; Ma, Jianshe; Li, Shuai; Lv, Xiaodan; Lv, Weiqi
2017-11-02
Various kinds of data mining algorithms are continually being proposed as related disciplines develop, and these algorithms differ in their applicable scopes and performance. Hence, finding a suitable algorithm for a given dataset is becoming important for biomedical researchers who need to solve practical problems promptly. In this paper, seven sophisticated, widely used algorithms, namely C4.5, support vector machine, AdaBoost, k-nearest neighbor, naïve Bayes, random forest, and logistic regression, were selected as the research objects. The seven algorithms were applied to the 12 most frequently accessed UCI public datasets with the task of classification, and their performances were compared through induction and analysis. The sample size, number of attributes, number of missing values, sample size of each class, correlation coefficients between variables, class entropy of the task variable, and the ratio of the sample size of the largest class to that of the smallest class were calculated to characterize the 12 research datasets. The two ensemble algorithms reach high classification accuracy on most datasets. Moreover, random forest performs better than AdaBoost on unbalanced datasets with multi-class tasks. Simple algorithms, such as naïve Bayes and logistic regression, are suitable for small datasets with high correlation between the task variable and the other, non-task attribute variables. The k-nearest neighbor and C4.5 decision tree algorithms perform well on both binary- and multi-class task datasets. The support vector machine is more adept on balanced small datasets with binary-class tasks. No algorithm maintains the best performance across all datasets. The applicability of the seven data mining algorithms to datasets with different characteristics was summarized to provide a reference for biomedical researchers or beginners in different fields.
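A compact version of this kind of comparison, scoring several scikit-learn classifiers on a single public dataset with cross-validated accuracy, is sketched below; note that scikit-learn's DecisionTreeClassifier implements CART and is used here only as a stand-in for C4.5, and the dataset choice is illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=5000),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")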
Harnessing Multivariate Statistics for Ellipsoidal Data in Structural Geology
NASA Astrophysics Data System (ADS)
Roberts, N.; Davis, J. R.; Titus, S.; Tikoff, B.
2015-12-01
Most structural geology articles do not state significance levels, report confidence intervals, or perform regressions to find trends. This is, in part, because structural data tend to include directions, orientations, ellipsoids, and tensors, which are not treatable by elementary statistics. We describe a full procedural methodology for the statistical treatment of ellipsoidal data. We use a reconstructed dataset of deformed ooids in Maryland from Cloos (1947) to illustrate the process. Normalized ellipsoids have five degrees of freedom and can be represented by a second order tensor. This tensor can be permuted into a five dimensional vector that belongs to a vector space and can be treated with standard multivariate statistics. Cloos made several claims about the distribution of deformation in the South Mountain fold, Maryland, and we reexamine two particular claims using hypothesis testing: 1) octahedral shear strain increases towards the axial plane of the fold; 2) finite strain orientation varies systematically along the trend of the axial trace as it bends with the Appalachian orogen. We then test the null hypothesis that the southern segment of South Mountain is the same as the northern segment. This test illustrates the application of ellipsoidal statistics, which combine both orientation and shape. We report confidence intervals for each test, and graphically display our results with novel plots. This poster illustrates the importance of statistics in structural geology, especially when working with noisy or small datasets.
Data on the application of Functional Data Analysis in food fermentations.
Ruiz-Bellido, M A; Romero-Gil, V; García-García, P; Rodríguez-Gómez, F; Arroyo-López, F N; Garrido-Fernández, A
2016-12-01
This article refers to the paper "Assessment of table olive fermentation by functional data analysis" (Ruiz-Bellido et al., 2016) [1]. The dataset includes pH, titratable acidity, yeast count and area values obtained during the fermentation process (380 days) of Aloreña de Málaga olives subjected to five different fermentation systems: i) control of acidified cured olives, ii) highly acidified cured olives, iii) intermediate acidified cured olives, iv) control of traditional cracked olives, and v) traditional olives cracked after 72 h of exposure to air. Many of the tables and figures shown in that paper were derived by applying Functional Data Analysis to the raw data, using a routine executed under R, to compare treatments through the transformation of raw data into smooth curves and the application of a new battery of statistical tools (functional pointwise estimation of the averages and standard deviations, maximum, minimum, first and second derivatives, functional regression, and functional F and t-tests).
Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics
Giacomoni, Franck; Le Corguillé, Gildas; Monsoor, Misharl; Landi, Marion; Pericard, Pierre; Pétéra, Mélanie; Duperier, Christophe; Tremblay-Franco, Marie; Martin, Jean-François; Jacob, Daniel; Goulitquer, Sophie; Thévenot, Etienne A.; Caron, Christophe
2015-01-01
Summary: The complex, rapidly evolving field of computational metabolomics calls for collaborative infrastructures where the large volume of new algorithms for data pre-processing, statistical analysis and annotation can be readily integrated whatever the language, evaluated on reference datasets and chained to build ad hoc workflows for users. We have developed Workflow4Metabolomics (W4M), the first fully open-source and collaborative online platform for computational metabolomics. W4M is a virtual research environment built upon the Galaxy web-based platform technology. It enables ergonomic integration, exchange and running of individual modules and workflows. Alternatively, the whole W4M framework and computational tools can be downloaded as a virtual machine for local installation. Availability and implementation: http://workflow4metabolomics.org homepage enables users to open a private account and access the infrastructure. W4M is developed and maintained by the French Bioinformatics Institute (IFB) and the French Metabolomics and Fluxomics Infrastructure (MetaboHUB). Contact: contact@workflow4metabolomics.org PMID:25527831
Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics.
Giacomoni, Franck; Le Corguillé, Gildas; Monsoor, Misharl; Landi, Marion; Pericard, Pierre; Pétéra, Mélanie; Duperier, Christophe; Tremblay-Franco, Marie; Martin, Jean-François; Jacob, Daniel; Goulitquer, Sophie; Thévenot, Etienne A; Caron, Christophe
2015-05-01
The complex, rapidly evolving field of computational metabolomics calls for collaborative infrastructures where the large volume of new algorithms for data pre-processing, statistical analysis and annotation can be readily integrated whatever the language, evaluated on reference datasets and chained to build ad hoc workflows for users. We have developed Workflow4Metabolomics (W4M), the first fully open-source and collaborative online platform for computational metabolomics. W4M is a virtual research environment built upon the Galaxy web-based platform technology. It enables ergonomic integration, exchange and running of individual modules and workflows. Alternatively, the whole W4M framework and computational tools can be downloaded as a virtual machine for local installation. http://workflow4metabolomics.org homepage enables users to open a private account and access the infrastructure. W4M is developed and maintained by the French Bioinformatics Institute (IFB) and the French Metabolomics and Fluxomics Infrastructure (MetaboHUB). contact@workflow4metabolomics.org. © The Author 2014. Published by Oxford University Press.
MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit
Li, Junhua; Chen, Weineng; Chen, Hua; Mende, Daniel R.; Arumugam, Manimozhiyan; Pan, Qi; Liu, Binghang; Qin, Junjie; Wang, Jun; Bork, Peer
2012-01-01
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/. PMID:23082188
Seismo-ionospheric anomalies in DEMETER observationsduring the Wenchuan M7.9 earthquake
NASA Astrophysics Data System (ADS)
Huang, C. C.; Liu, J. Y. G.
2014-12-01
This paper examines pre-earthquake ionospheric anomalies (PEIAs) observed by the French satellite DEMETER (Detection of Electro-Magnetic Emissions Transmitted from Earthquake Regions) during the 12 May 2008 M7.9 Wenchuan earthquake. Both daytime and nighttime electron density (Ne), electron temperature (Te), ion density (Ni) and ion temperature (Ti) are investigated. A statistical analysis based on the box-and-whisker method is utilized to see whether the four DEMETER datasets 1-6 days before and after the earthquake are significantly different. The analysis is employed to investigate the epicenter and three reference areas along the same magnetic latitude and to discriminate the earthquake-related anomalies from global effects. Results show that the nighttime Ne and Ni over the epicenter significantly decrease 1-6 days before the earthquake. The ionospheric total electron content (TEC) of the global ionosphere map (GIM) over the epicenter is further inspected to find the sensitive local time for detecting the PEIAs of the M7.9 Wenchuan earthquake.
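The box-and-whisker screening described above amounts to comparing pre-earthquake values against quartile-based bounds built from a reference sample; the sketch below uses the conventional 1.5 x IQR whiskers on synthetic nighttime Ne values, which is an assumption about the exact convention used.

import numpy as np

def whisker_bounds(reference_values, k=1.5):
    """Lower/upper whiskers of a box plot built from a reference sample."""
    q1, q3 = np.percentile(reference_values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

rng = np.random.default_rng(5)
reference_ne = rng.normal(1.0e4, 1.5e3, 120)     # background nighttime Ne (el/cm^3)
pre_quake_ne = np.array([5.1e3, 4.8e3, 5.5e3, 9.8e3, 5.0e3, 5.3e3])

lower, upper = whisker_bounds(reference_ne)
print((pre_quake_ne < lower) | (pre_quake_ne > upper))  # flag days outside the whisker bounds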
Lottering, Nicolene; MacGregor, Donna M; Alston, Clair L; Watson, Debbie; Gregory, Laura S
2016-01-01
Contemporary, population-specific ossification timings of the cranium are lacking in current literature due to challenges in obtaining large repositories of documented subadult material, forcing Australian practitioners to rely on North American, arguably antiquated reference standards for age estimation. This study assessed the temporal pattern of ossification of the cranium and provides recalibrated probabilistic information for age estimation of modern Australian children. Fusion status of the occipital and frontal bones, atlas, and axis was scored using a modified two- to four-tier system from cranial/cervical DICOM datasets of 585 children aged birth to 10 years. Transition analysis was applied to elucidate maximum-likelihood estimates between consecutive fusion stages, in conjunction with Bayesian statistics to calculate credible intervals for age estimation. Results demonstrate significant sex differences in skeletal maturation (p < 0.05) and earlier timings in comparison with major literary sources, underscoring the requisite of updated standards for age estimation of modern individuals. © 2015 American Academy of Forensic Sciences.
TreSpEx—Detection of Misleading Signal in Phylogenetic Reconstructions Based on Tree Information
Struck, Torsten H
2014-01-01
Phylogenies of species or genes are commonplace nowadays in many areas of comparative biological studies. However, phylogenetic reconstructions can be confounded by artificial signals such as paralogy, long-branch attraction, saturation, or conflict between different datasets. These signals might eventually mislead the reconstruction even in phylogenomic studies employing hundreds of genes. Unfortunately, there has been no program allowing the detection of such effects in combination with an implementation into automatic process pipelines. TreSpEx (Tree Space Explorer) now combines different approaches (including statistical tests), which utilize tree-based information like nodal support or patristic distances (PDs) to identify misleading signals. The program enables the parallel analysis of hundreds of trees and/or predefined gene partitions, and being command-line driven, it can be integrated into automatic process pipelines. TreSpEx is implemented in Perl and supported on Linux, Mac OS X, and MS Windows. Source code, binaries, and additional material are freely available at http://www.annelida.de/research/bioinformatics/software.html. PMID:24701118
MOCAT: a metagenomics assembly and gene prediction toolkit.
Kultima, Jens Roat; Sunagawa, Shinichi; Li, Junhua; Chen, Weineng; Chen, Hua; Mende, Daniel R; Arumugam, Manimozhiyan; Pan, Qi; Liu, Binghang; Qin, Junjie; Wang, Jun; Bork, Peer
2012-01-01
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
2012-01-01
Background: It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. The latest generation of Affymetrix GeneChip Human Exon 1.0 ST Arrays offers a more detailed view of the gene expression profile, providing information on AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows analyses to be performed at both the gene and exon levels. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon array datasets. It combines a data warehouse approach with rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at http://beat.ba.itb.cnr.it. Results: BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza. To highlight all possible AS events, alternative names, accession Ids, Gene Ontology terms and biochemical pathway annotations are integrated with exon- and gene-level expression plots. The user can customize the results by choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis. Conclusions: Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics. PMID:22536968
Statistical Analysis of Large Simulated Yield Datasets for Studying Climate Effects
NASA Technical Reports Server (NTRS)
Makowski, David; Asseng, Senthold; Ewert, Frank; Bassu, Simona; Durand, Jean-Louis; Martre, Pierre; Adam, Myriam; Aggarwal, Pramod K.; Angulo, Carlos; Baron, Christian;
2015-01-01
Many studies have been carried out during the last decade to study the effect of climate change on crop yields and other key crop characteristics. In these studies, one or several crop models were used to simulate crop growth and development for different climate scenarios that correspond to different projections of atmospheric CO2 concentration, temperature, and rainfall changes (Semenov et al., 1996; Tubiello and Ewert, 2002; White et al., 2011). The Agricultural Model Intercomparison and Improvement Project (AgMIP; Rosenzweig et al., 2013) builds on these studies with the goal of using an ensemble of multiple crop models in order to assess effects of climate change scenarios for several crops in contrasting environments. These studies generate large datasets, including thousands of simulated crop yield data. They include series of yield values obtained by combining several crop models with different climate scenarios that are defined by several climatic variables (temperature, CO2, rainfall, etc.). Such datasets potentially provide useful information on the possible effects of different climate change scenarios on crop yields. However, it is sometimes difficult to analyze these datasets and to summarize them in a useful way due to their structural complexity; simulated yield data can differ among contrasting climate scenarios, sites, and crop models. Another issue is that it is not straightforward to extrapolate the results obtained for the scenarios to alternative climate change scenarios not initially included in the simulation protocols. Additional dynamic crop model simulations for new climate change scenarios are an option but this approach is costly, especially when a large number of crop models are used to generate the simulated data, as in AgMIP. Statistical models have been used to analyze responses of measured yield data to climate variables in past studies (Lobell et al., 2011), but the use of a statistical model to analyze yields simulated by complex process-based crop models is a rather new idea. We demonstrate herewith that statistical methods can play an important role in analyzing simulated yield data sets obtained from the ensembles of process-based crop models. Formal statistical analysis is helpful to estimate the effects of different climatic variables on yield, and to describe the between-model variability of these effects.
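The kind of statistical summary model advocated here can be illustrated by regressing simulated yields on the climate variables defining the scenarios; the synthetic ensemble, the linear functional form and the crop-model fixed effect below are assumptions for demonstration only.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 400
sims = pd.DataFrame({
    "temperature_change": rng.uniform(0, 5, n),      # degC above baseline
    "co2": rng.uniform(360, 720, n),                  # ppm
    "rainfall_change": rng.uniform(-30, 30, n),       # percent
    "crop_model": rng.integers(0, 8, n),              # ensemble member id
})
sims["sim_yield"] = (6.0 - 0.35 * sims.temperature_change
                     + 0.004 * (sims.co2 - 360)
                     + 0.02 * sims.rainfall_change
                     + rng.normal(0, 0.4, n))

# Linear response surface fitted to the ensemble of simulated yields
emulator = smf.ols("sim_yield ~ temperature_change + co2 + rainfall_change + C(crop_model)",
                   data=sims).fit()
print(emulator.params[["temperature_change", "co2", "rainfall_change"]])
# The fitted surface can then be evaluated for scenarios not in the original protocol.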
Westbrook, John D.; Shao, Chenghua; Feng, Zukang; Zhuravleva, Marina; Velankar, Sameer; Young, Jasmine
2015-01-01
Summary: The Chemical Component Dictionary (CCD) is a chemical reference data resource that describes all residue and small molecule components found in Protein Data Bank (PDB) entries. The CCD contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands and solvent molecules. Each chemical definition includes descriptions of chemical properties such as stereochemical assignments, chemical descriptors, systematic chemical names and idealized coordinates. The content, preparation, validation and distribution of this CCD chemical reference dataset are described. Availability and implementation: The CCD is updated regularly in conjunction with the scheduled weekly release of new PDB structure data. The CCD and amino acid variant reference datasets are hosted in the public PDB ftp repository at ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif.gz, ftp://ftp.wwpdb.org/pub/pdb/data/monomers/aa-variants-v1.cif.gz, and its mirror sites, and can be accessed from http://wwpdb.org. Contact: jwest@rcsb.rutgers.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25540181
Nine martian years of dust optical depth observations: A reference dataset
NASA Astrophysics Data System (ADS)
Montabone, Luca; Forget, Francois; Kleinboehl, Armin; Kass, David; Wilson, R. John; Millour, Ehouarn; Smith, Michael; Lewis, Stephen; Cantor, Bruce; Lemmon, Mark; Wolff, Michael
2016-07-01
We present a multi-annual reference dataset of the horizontal distribution of airborne dust from martian year 24 to 32 using observations of the martian atmosphere from April 1999 to June 2015 made by the Thermal Emission Spectrometer (TES) aboard Mars Global Surveyor, the Thermal Emission Imaging System (THEMIS) aboard Mars Odyssey, and the Mars Climate Sounder (MCS) aboard Mars Reconnaissance Orbiter (MRO). Our methodology to build the dataset works by gridding the available retrievals of column dust optical depth (CDOD) from TES and THEMIS nadir observations, as well as the estimates of this quantity from MCS limb observations. The resulting (irregularly) gridded maps (one per sol) were validated with independent observations of CDOD by PanCam cameras and Mini-TES spectrometers aboard the Mars Exploration Rovers "Spirit" and "Opportunity", by the Surface Stereo Imager aboard the Phoenix lander, and by the Compact Reconnaissance Imaging Spectrometer for Mars aboard MRO. Finally, regular maps of CDOD are produced by spatially interpolating the irregularly gridded maps using a kriging method. These latter maps are used as dust scenarios in the Mars Climate Database (MCD) version 5, and are useful in many modelling applications. The two datasets (daily irregularly gridded maps and regularly kriged maps) for the nine available martian years are publicly available as NetCDF files and can be downloaded from the MCD website at the URL: http://www-mars.lmd.jussieu.fr/mars/dust_climatology/index.html
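As an illustration of the final interpolation step, here is a minimal ordinary-kriging sketch using the third-party pykrige package; the coordinates, values, grid, and variogram choice are placeholders, not the settings used for the MCD dust scenarios:

```python
# Sketch: interpolate irregularly gridded column dust optical depth (CDOD)
# values onto a regular longitude/latitude grid. All inputs are placeholders.
import numpy as np
from pykrige.ok import OrdinaryKriging

lon = np.array([10.0, 45.0, 120.0, 200.0, 310.0])   # hypothetical observation longitudes
lat = np.array([-60.0, -20.0, 0.0, 30.0, 55.0])      # hypothetical observation latitudes
cdod = np.array([0.3, 0.5, 0.8, 0.4, 0.2])           # hypothetical CDOD values

grid_lon = np.arange(0.0, 360.0, 5.0)
grid_lat = np.arange(-90.0, 90.0, 5.0)

ok = OrdinaryKriging(lon, lat, cdod, variogram_model="spherical")
cdod_grid, variance = ok.execute("grid", grid_lon, grid_lat)  # kriged map + kriging variance
```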
Li, Xuejian; Wang, Youqing
2016-12-01
Offline general-type models are widely used for patient monitoring in intensive care units (ICUs); they are developed from past collected datasets comprising thousands of patients. However, these models may fail to adapt to the changing states of ICU patients. Thus, to be more robust and effective, the monitoring models should be adaptable to individual patients. A novel combination of just-in-time learning (JITL) and principal component analysis (PCA), referred to as learning-type PCA (L-PCA), was proposed for adaptive online monitoring of patients in ICUs. JITL was used to gather the most relevant data samples for adaptive modeling of complex physiological processes. PCA was used to build an online individual-type model and calculate monitoring statistics, and then to judge whether the patient's status is normal or not. The adaptability of L-PCA lies in the usage of individual data and the continuous updating of the training dataset. Twelve subjects were selected from the PhysioBank Multi-parameter Intelligent Monitoring for Intensive Care II (MIMIC II) database, and five vital signs of each subject were chosen. The proposed method was compared with traditional PCA and fast moving-window PCA (Fast MWPCA). The experimental results demonstrated that the fault detection rate increased by 20% and 47% compared with PCA and Fast MWPCA, respectively. L-PCA is introduced into ICU patient monitoring for the first time and achieves the best monitoring performance in terms of adaptability to changes in patient status and sensitivity for abnormality detection.
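A minimal sketch of the just-in-time local PCA idea described above, assuming a historical matrix of vital-sign samples; the neighbour count, number of components, and data are illustrative, not the settings of the study:

```python
# Sketch: for each new sample, fit a local PCA model on its nearest historical
# neighbours and compute Hotelling's T^2 and SPE (Q) monitoring statistics.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
history = rng.normal(size=(2000, 5))   # hypothetical past vital-sign samples (5 signs)
x_new = rng.normal(size=(1, 5))        # incoming sample to monitor

# Just-in-time learning: gather the most relevant (nearest) historical samples.
nn = NearestNeighbors(n_neighbors=200).fit(history)
_, idx = nn.kneighbors(x_new)
local = history[idx[0]]

pca = PCA(n_components=3).fit(local)
scores = pca.transform(x_new)
t2 = float(np.sum(scores**2 / pca.explained_variance_))    # Hotelling's T^2
residual = x_new - pca.inverse_transform(scores)
spe = float(np.sum(residual**2))                            # SPE / Q statistic
print(t2, spe)  # compare against control limits derived from the local model
```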
Computational Functional Analysis of Lipid Metabolic Enzymes.
Bagnato, Carolina; Have, Arjen Ten; Prados, María B; Beligni, María V
2017-01-01
The computational analysis of enzymes that participate in lipid metabolism has both common and unique challenges when compared to the whole protein universe. Some of the hurdles that interfere with the functional annotation of lipid metabolic enzymes that are common to other pathways include the definition of proper starting datasets, the construction of reliable multiple sequence alignments, the definition of appropriate evolutionary models, and the reconstruction of phylogenetic trees with high statistical support, particularly for large datasets. Most enzymes that take part in lipid metabolism belong to complex superfamilies with many members that are not involved in lipid metabolism. In addition, some enzymes that do not have sequence similarity catalyze similar or even identical reactions. Some of the challenges that, albeit not unique, are more specific to lipid metabolism refer to the high compartmentalization of the routes, the catalysis in hydrophobic environments and, related to this, the function near or in biological membranes. In this work, we provide guidelines intended to assist in the proper functional annotation of lipid metabolic enzymes, based on previous experiences related to the phospholipase D superfamily and the annotation of the triglyceride synthesis pathway in algae. We describe a pipeline that starts with the definition of an initial set of sequences to be used in similarity-based searches and ends in the reconstruction of phylogenies. We also mention the main issues that have to be taken into consideration when using tools to analyze subcellular localization, hydrophobicity patterns, or presence of transmembrane domains in lipid metabolic enzymes.
2013-01-01
Background and purpose Guidelines for fracture treatment and evaluation require a valid classification. Classifications especially designed for children are available, but they might lead to reduced accuracy, considering the relative infrequency of childhood fractures in a general orthopedic department. We tested the reliability and accuracy of the Müller classification when used for long bone fractures in children. Methods We included all long bone fractures in children aged < 16 years who were treated in 2008 at the surgical ward of Stavanger University Hospital. 20 surgeons recorded 232 fractures. Datasets were generated for intra- and inter-rater analysis, as well as a reference dataset for accuracy calculations. We present proportion of agreement (PA) and kappa (κ) statistics. Results For intra-rater analysis, the overall κ was 0.75 (95% CI: 0.68–0.81) and PA was 79%. For inter-rater assessment, κ was 0.71 (95% CI: 0.61–0.80) and PA was 77%. Accuracy was estimated as κ = 0.72 (95% CI: 0.64–0.79) and PA = 76%. Interpretation The Müller classification (slightly adjusted for pediatric fractures) showed substantial to excellent accuracy among general orthopedic surgeons when applied to long bone fractures in children. However, separate knowledge about the child-specific fracture pattern, the maturity of the bone, and the degree of displacement must be considered when the treatment and the prognosis of the fractures are evaluated. PMID:23245225
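For readers unfamiliar with the reported statistics, a minimal sketch of how proportion of agreement and Cohen's kappa can be computed from two raters' classifications (toy labels, not the study data):

```python
# Sketch: proportion of agreement (PA) and Cohen's kappa for two raters.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array(["32-A1", "32-B2", "22-A1", "32-A1", "42-A2"])  # toy classification codes
rater_b = np.array(["32-A1", "32-B2", "22-A2", "32-A1", "42-A2"])

pa = np.mean(rater_a == rater_b)             # proportion of agreement
kappa = cohen_kappa_score(rater_a, rater_b)  # chance-corrected agreement
print(f"PA = {pa:.0%}, kappa = {kappa:.2f}")
```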
Öztoprak, Hüseyin; Toycan, Mehmet; Alp, Yaşar Kemal; Arıkan, Orhan; Doğutepe, Elvin; Karakaş, Sirel
2017-12-01
Attention-deficit/hyperactivity disorder (ADHD) is the most frequent diagnosis among children who are referred to psychiatry departments. Although ADHD was discovered at the beginning of the 20th century, its diagnosis is still confronted with many problems. A novel classification approach is presented that discriminates ADHD and non-ADHD groups using time-frequency domain features of event-related potential (ERP) recordings taken during the Stroop task. The Time-Frequency Hermite-Atomizer (TFHA) technique is used for the extraction of high-resolution features that are highly localized in the time-frequency domain. Based on an extensive investigation, Support Vector Machine-Recursive Feature Elimination (SVM-RFE) was used to obtain the best discriminating features. When the best three features were used, the classification accuracy for the training dataset reached 98%, and the use of five features further improved the accuracy to 99.5%. The accuracy was 100% for the testing dataset. Based on extensive experiments, the delta band emerged as the most contributing frequency band and statistical parameters emerged as the most contributing feature group. The classification performance of this study suggests that TFHA can be employed as an auxiliary component of the diagnostic and prognostic procedures for ADHD. The features obtained in this study can potentially contribute to the neuroelectrical understanding and clinical diagnosis of ADHD. Copyright © 2017 International Federation of Clinical Neurophysiology. Published by Elsevier B.V. All rights reserved.
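A minimal sketch of SVM-based recursive feature elimination as used for feature ranking here, with synthetic data standing in for the time-frequency ERP features:

```python
# Sketch: rank features with SVM-RFE and keep the top discriminating ones.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 40))       # hypothetical time-frequency features
y = rng.integers(0, 2, size=60)     # hypothetical ADHD / non-ADHD labels

# A linear SVM provides feature weights; RFE iteratively removes the weakest features.
selector = RFE(SVC(kernel="linear"), n_features_to_select=5, step=1)
selector.fit(X, y)
top_features = np.flatnonzero(selector.support_)
print("selected feature indices:", top_features)
```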
Song, Minsun; Wheeler, William; Caporaso, Neil E; Landi, Maria Teresa; Chatterjee, Nilanjan
2018-03-01
Genome-wide association studies (GWAS) are now routinely imputed for untyped single nucleotide polymorphisms (SNPs) based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele counts for imputed SNPs as the dosage variable is known to produce a valid score test for genetic association. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic associations incorporating gene-environment interactions. We focus on case-control association studies where inference for an underlying logistic regression model can be performed using alternative methods that rely to varying degrees on an assumption of gene-environment independence in the underlying population. As increasingly large-scale GWAS are being performed through consortia efforts, where it is preferable to share only summary-level information across studies, we also describe simple mechanisms for implementing score tests based on standard meta-analysis of "one-step" maximum-likelihood estimates across studies. Applications of the methods in simulation studies and a dataset from a GWAS of lung cancer illustrate the ability of the proposed methods to maintain type-I error rates for the underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, the retrospective methods can lead to considerable efficiency gain for modeling of gene-environment interactions under the assumption of gene-environment independence. Methods are made available for public use through the CGEN R software package. © 2017 WILEY PERIODICALS, INC.
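As a simplified, prospective-analysis sketch of testing a gene-environment interaction with an imputed SNP, using the expected allele count as the dosage variable (this is not the retrospective method of the paper, and all variable names and data are hypothetical):

```python
# Sketch: logistic regression of case/control status on imputed dosage (0-2),
# an environmental exposure, and their interaction; Wald test on the G x E term.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
dosage = np.clip(rng.normal(0.6, 0.4, n), 0, 2)   # imputed expected allele count
env = rng.integers(0, 2, n)                        # hypothetical binary exposure
logit = -2.0 + 0.2 * dosage + 0.5 * env + 0.3 * dosage * env
case = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([dosage, env, dosage * env]))
fit = sm.Logit(case, X).fit(disp=0)
print("G x E p-value:", fit.pvalues[-1])
```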
Primer for Using RE-Powering Data to Screen Sites for Renewable Energy Potential
This reference guide provides users with tips for using the RE-Powering Screening Dataset spreadsheet, which contains detailed site information on over 60,000 contaminated lands, landfills, and mine sites.
Luo, Li; Zhu, Yun; Xiong, Momiao
2012-01-01
The genome-wide association studies (GWAS) designed for next-generation sequencing data involve testing association of genomic variants, including common, low frequency, and rare variants. The current strategies for association studies are well developed for identifying association of common variants with common diseases, but may be ill-suited when large amounts of allelic heterogeneity are present in sequence data. Recently, group tests that analyze the collective frequency differences of multiple variants between cases and controls have shifted the current variant-by-variant analysis paradigm for GWAS of common variants to the collective test of multiple variants in the association analysis of rare variants. However, group tests ignore differences in genetic effects among SNPs at different genomic locations. As an alternative to group tests, we developed a novel genome-information content-based statistic for testing association of the entire allele frequency spectrum of genomic variation with disease. To evaluate the performance of the proposed statistic, we use large-scale simulations based on whole-genome low-coverage pilot data in the 1000 Genomes Project to calculate the type 1 error rates and power of seven alternative statistics: a genome-information content-based statistic, the generalized T², the collapsing method, the combined multivariate and collapsing (CMC) method, the individual χ² test, the weighted-sum statistic, and the variable threshold statistic. Finally, we apply the seven statistics to a published resequencing dataset from the ANGPTL3, ANGPTL4, ANGPTL5, and ANGPTL6 genes in the Dallas Heart Study. We report that the genome-information content-based statistic has significantly improved type 1 error rates and higher power compared with the other six statistics in both simulated and empirical datasets. PMID:22651812
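For context, a minimal sketch of the simplest of the compared group tests, a collapsing (burden-style) test that pools rare variants into a single carrier indicator (toy genotype matrix, not the ANGPTL data):

```python
# Sketch: collapse rare variants into a carrier indicator and test the
# case/control difference with a 2x2 chi-square test.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(3)
genotypes = rng.binomial(1, 0.01, size=(1000, 20))  # toy rare-variant carrier matrix
status = rng.integers(0, 2, size=1000)              # toy case (1) / control (0) labels

carrier = genotypes.any(axis=1).astype(int)          # carries at least one rare variant
table = np.array([
    [np.sum((carrier == 1) & (status == 1)), np.sum((carrier == 0) & (status == 1))],
    [np.sum((carrier == 1) & (status == 0)), np.sum((carrier == 0) & (status == 0))],
])
chi2, p, _, _ = chi2_contingency(table)
print("collapsing test p-value:", p)
```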
Gioutlakis, Aris; Klapa, Maria I.
2017-01-01
It has been acknowledged that source databases recording experimentally supported human protein-protein interactions (PPIs) exhibit limited overlap. Thus, the reconstruction of a comprehensive PPI network requires appropriate integration of multiple heterogeneous primary datasets, presenting the PPIs at various genetic reference levels. Existing PPI meta-databases perform integration via normalization; namely, PPIs are merged after being converted to a certain target level. Hence, the node set of the integrated network depends each time on the number and type of the combined datasets. Moreover, the irreversible a priori normalization process hinders the identification of normalization artifacts in the integrated network, which originate from the nonlinearity characterizing the genetic information flow. PICKLE (Protein InteraCtion KnowLedgebasE) 2.0 implements a new architecture for this recently introduced human PPI meta-database. Its main novel feature over the existing meta-databases is its approach to primary PPI dataset integration via genetic information ontology. Building upon the PICKLE principles of using the reviewed human complete proteome (RHCP) of UniProtKB/Swiss-Prot as the reference protein interactor set, and of filtering out protein interactions with a low probability of being direct based on the available evidence, PICKLE 2.0 first assembles the RHCP genetic information ontology network by connecting the corresponding genes, nucleotide sequences (mRNAs) and proteins (UniProt entries) and then integrates PPI datasets by superimposing them on the ontology network without any a priori transformations. Importantly, this process allows the resulting heterogeneous integrated network to be reversibly normalized to any level of genetic reference without loss of the original information, the latter being used for identification of normalization biases, and enables the appraisal of potential false positive interactions through PPI source database cross-checking. The PICKLE web-based interface (www.pickle.gr) allows for the simultaneous query of multiple entities and provides integrated human PPI networks at either the protein (UniProt) or the gene level, at three PPI filtering modes. PMID:29023571
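A minimal sketch of the ontology-superposition idea, assuming a toy gene-mRNA-protein graph built with networkx; the identifiers and edges are illustrative, not PICKLE content:

```python
# Sketch: connect gene -> mRNA -> protein nodes, attach PPIs at the protein
# level, then "normalize" an interaction to the gene level by walking the
# ontology edges instead of transforming the source data.
import networkx as nx

onto = nx.DiGraph()
onto.add_edges_from([
    ("GENE:EGFR", "mRNA:NM_005228"), ("mRNA:NM_005228", "UniProt:P00533"),
    ("GENE:GRB2", "mRNA:NM_002086"), ("mRNA:NM_002086", "UniProt:P62993"),
])  # toy genetic information ontology

ppi = [("UniProt:P00533", "UniProt:P62993")]  # toy protein-level interaction

def gene_of(protein):
    """Follow ontology edges backwards from a protein node to its gene node."""
    node = protein
    while True:
        preds = list(onto.predecessors(node))
        if not preds:
            return node
        node = preds[0]

gene_level = [(gene_of(a), gene_of(b)) for a, b in ppi]
print(gene_level)  # [('GENE:EGFR', 'GENE:GRB2')]
```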
The CMEMS L3 scatterometer wind product
NASA Astrophysics Data System (ADS)
de Kloe, Jos; Stoffelen, Ad; Verhoef, Anton
2017-04-01
Within the Copernicus Marine Environment Monitoring Service, KNMI produces several ocean surface Level 3 wind products. These are daily updated global maps, on a regular grid, of the available scatterometer wind observations and derived properties, produced by linear interpolation from our EUMETSAT Ocean and Sea Ice Satellite Application Facility (OSI SAF) operational near-real-time (NRT) Level 2 swath-based wind products. Currently available products are the ASCAT on Metop-A/B stress-equivalent wind vectors, accompanied by reference stress-equivalent winds from the operational ECMWF NWP model. For each ASCAT scatterometer we provide products at two different resolutions, 0.25 and 0.125 degrees. In addition we provide wind stress vectors and derivative fields (curl and divergence) for stress-equivalent wind and wind stress, both for the observations and for the NWP reference winds. New NRT scatterometer products will be made available when additional scatterometer instruments become available and NRT access to the data can be arranged. We hope OSCAT on the Indian ScatSat-1 satellite will be the next NRT product to be added. In addition, multi-year reprocessing datasets have been made available for ASCAT on Metop-A (1 Jan 2007 to 31 Mar 2014) and SeaWinds on QuikSCAT (19 Jul 1999 to 21 Nov 2009). For ASCAT, products are provided at 0.25 and 0.125 degree resolution; for QuikSCAT, at 0.50 and 0.25 degree resolution. These products are based on reprocessing the L2 scatterometer products with the latest processing software version and include reference winds from the ECMWF ERA-Interim model. Additional reprocessing datasets will be added when reprocessed L2 datasets become available; this will hopefully include the ERS-1 and ERS-2 scatterometer datasets (1992-2001), which will extend the available date range back to 1992. These products are available for download through the CMEMS portal website: http://marine.copernicus.eu/
McArt, Darragh G.; Dunne, Philip D.; Blayney, Jaine K.; Salto-Tellez, Manuel; Van Schaeybroeck, Sandra; Hamilton, Peter W.; Zhang, Shu-Dong
2013-01-01
The advent of next-generation sequencing (NGS) technologies has expanded the area of genomic research, offering high coverage and increased sensitivity over older microarray platforms. Although the current cost of next-generation sequencing still exceeds that of microarray approaches, the rapid advances in NGS will likely make it the platform of choice for future research in differential gene expression. Connectivity mapping is a procedure for examining the connections among diseases, genes and drugs through differential gene expression; it was initially based on microarray technology, with which a large collection of compound-induced reference gene expression profiles has been accumulated. In this work, we aim to test the feasibility of incorporating NGS RNA-Seq data into the current connectivity mapping framework by utilizing the microarray-based reference profiles and constructing a differentially expressed gene signature from an NGS dataset. This would allow connections to be established between the NGS gene signature and those microarray reference profiles, avoiding the cost of re-creating drug profiles with NGS technology. We examined the connectivity mapping approach on a publicly available NGS dataset with androgen stimulation of LNCaP cells in order to extract candidate compounds that could inhibit the proliferative phenotype of LNCaP cells and to elucidate their potential in a laboratory setting. In addition, we also analyzed an independent microarray dataset of similar experimental settings. We found a high level of concordance between the top compounds identified using the gene signatures from the two datasets. The nicotine derivative cotinine was returned as the top candidate among the overlapping compounds with potential to suppress this proliferative phenotype. Subsequent lab experiments validated this connectivity mapping hit, showing that cotinine inhibits cell proliferation in an androgen-dependent manner. Thus the results of this study suggest a promising prospect for integrating NGS data with connectivity mapping. PMID:23840550
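A minimal sketch of a simple rank-based connectivity score between an NGS-derived gene signature and a compound-induced reference profile; this is a simplified score for illustration, not the exact statistic of the framework, and all gene names and values are placeholders:

```python
# Sketch: score how strongly a reference expression profile mimics or reverses
# a query signature: compare the mean ranks of the up- and down-regulated
# signature genes within the reference profile.
import pandas as pd

# Hypothetical compound-induced reference profile: genes with fold changes.
profile = pd.Series({"KLK3": 2.1, "TMPRSS2": 1.8, "FKBP5": 1.2,
                     "CDKN1A": -0.9, "GADD45A": -1.5, "TP53I3": -0.4})
ranks = profile.rank(ascending=False)       # 1 = most up-regulated in the profile
n = len(profile)

up_genes = ["KLK3", "TMPRSS2"]              # up-regulated in the query (NGS) signature
down_genes = ["CDKN1A", "GADD45A"]          # down-regulated in the query signature

# Positive score: profile mimics the signature; negative: profile reverses it.
score = (ranks[down_genes].mean() - ranks[up_genes].mean()) / n
print(f"connectivity-like score: {score:.2f}")
```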
Quantifying Interannual Variability for Photovoltaic Systems in PVWatts
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ryberg, David Severin; Freeman, Janine; Blair, Nate
2015-10-01
The National Renewable Energy Laboratory's (NREL's) PVWatts is a relatively simple tool used by industry and individuals alike to easily estimate the amount of energy a photovoltaic (PV) system will produce over the course of a typical year. PVWatts Version 5 has previously been shown to reasonably represent an operating system's output when provided with concurrent weather data; however, this type of data is not available when estimating system output during future time frames. For this purpose PVWatts uses weather data from typical meteorological year (TMY) datasets, which are available on the NREL website. The TMY files represent a statistically 'typical' year which by definition excludes anomalous weather patterns, and as a result they may not provide sufficient quantification of project risk to the financial community. It was therefore desired to quantify the interannual variability associated with TMY files in order to improve the understanding of risk associated with these projects. To begin to understand the interannual variability of a PV project, we simulated two archetypal PV system designs, which are common in the PV industry, in PVWatts using the NSRDB's 1961-1990 historical dataset. This dataset contains measured hourly weather data spanning the thirty years from 1961 to 1990 for 239 locations in the United States. Notably, this historical dataset was used to compose the TMY2 dataset. Using the results of these simulations, we computed several statistical metrics that may be of interest to the financial community and normalized the results with respect to the TMY energy prediction at each location, so that these results could be easily translated to similar systems. This report briefly describes the simulation process used and the statistical methodology employed for this project, but otherwise focuses mainly on a sample of our results. A short discussion of these results is also provided. It is our hope that this quantification of the interannual variability of PV systems will provide a starting point for variability considerations in future PV system designs and investigations.
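A minimal sketch of the kind of normalization described above, assuming a vector of annual energy estimates from 30 historical weather years and a single TMY-based estimate (all numbers are hypothetical):

```python
# Sketch: express interannual variability of simulated PV output relative to
# the TMY prediction, and derive a simple exceedance metric (P90).
import numpy as np

annual_kwh = np.array([148, 152, 139, 161, 150, 144, 157, 149, 141, 155,
                       146, 153, 138, 160, 151, 147, 143, 158, 149, 154,
                       140, 156, 145, 150, 142, 159, 148, 152, 137, 162],
                      dtype=float) * 1e3          # hypothetical 30-year simulations
tmy_kwh = 150e3                                    # hypothetical TMY-based estimate

ratios = annual_kwh / tmy_kwh                      # normalize to the TMY prediction
print("mean ratio:", ratios.mean())
print("coefficient of variation:", ratios.std(ddof=1) / ratios.mean())
print("P90 (exceeded in 90% of years):", np.percentile(ratios, 10))
```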
NASA Astrophysics Data System (ADS)
Pedretti, Daniele; Beckie, Roger Daniel
2014-05-01
Missing data in hydrological time-series databases are ubiquitous in practical applications, yet it is of fundamental importance to make educated decisions in problems that require exhaustive time-series knowledge. This includes precipitation datasets, since recording or human failures can produce gaps in these time series. For some applications directly involving the ratio between precipitation and some other quantity, lack of complete information can result in poor understanding of basic physical and chemical dynamics involving precipitated water. For instance, the ratio between precipitation (recharge) and outflow rates at a discharge point of an aquifer (e.g. rivers, pumping wells, lysimeters) can be used to obtain aquifer parameters and thus to constrain model-based predictions. We tested a suite of methodologies to reconstruct missing information in rainfall datasets. The goal was to obtain a suitable and versatile method to reduce the errors caused by the lack of data in specific time windows. Our analyses included both a classical chronological pairing approach between rainfall stations and a probability-based approach, which accounted for the probability of exceedance of rain depths measured at two or multiple stations. Our analyses showed that it is not clear a priori which method performs best. Rather, the selection should be based on the specific statistical properties of the rainfall dataset. In this presentation, our emphasis is on discussing the effects of a few typical parametric distributions used to model the behavior of rainfall. Specifically, we analyzed the role of distributional "tails", which have an important control on the occurrence of extreme rainfall events. The latter strongly affect several hydrological applications, including recharge-discharge relationships. The heavy-tailed distributions we considered were the parametric Log-Normal, Generalized Pareto, Generalized Extreme Value, and Gamma distributions. The methods were first tested on synthetic examples, to have complete control over the impact of several variables such as the minimum amount of data required to obtain reliable statistical distributions from the selected parametric functions. Then, we applied the methodology to precipitation datasets collected in the Vancouver area and on a mining site in Peru.
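A minimal sketch of the probability-of-exceedance (quantile-mapping) idea for filling a gap at a target station from a nearby donor station, using Gamma distributions fitted to wet-day amounts; all data and parameters are synthetic, and this is only one of the distribution choices mentioned above:

```python
# Sketch: map a donor-station rainfall depth to the target station by matching
# non-exceedance probabilities under fitted Gamma distributions (wet days only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
donor_wet = rng.gamma(shape=0.8, scale=12.0, size=3000)    # synthetic donor wet-day depths (mm)
target_wet = rng.gamma(shape=0.9, scale=9.0, size=2500)    # synthetic target wet-day depths (mm)

# Fit Gamma distributions (location fixed at 0 for rainfall depths).
a_d, loc_d, sc_d = stats.gamma.fit(donor_wet, floc=0)
a_t, loc_t, sc_t = stats.gamma.fit(target_wet, floc=0)

donor_value = 35.0                                             # observed at donor during the gap
p = stats.gamma.cdf(donor_value, a_d, loc=loc_d, scale=sc_d)   # non-exceedance probability
filled = stats.gamma.ppf(p, a_t, loc=loc_t, scale=sc_t)        # mapped to the target distribution
print(f"filled target value: {filled:.1f} mm")
```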
A Two-Stream Deep Fusion Framework for High-Resolution Aerial Scene Classification
Yu, Yunlong; Liu, Fuxian
2018-01-01
One of the challenging problems in understanding high-resolution remote sensing images is aerial scene classification. A well-designed feature representation method and classifier can improve classification accuracy. In this paper, we construct a new two-stream deep architecture for aerial scene classification. First, we use two pretrained convolutional neural networks (CNNs) as feature extractors to learn deep features from the original aerial image and the processed aerial image obtained through saliency detection, respectively. Second, two feature fusion strategies are adopted to fuse the two different types of deep convolutional features extracted by the original RGB stream and the saliency stream. Finally, we use the extreme learning machine (ELM) classifier for final classification with the fused features. The effectiveness of the proposed architecture is tested on four challenging datasets: the UC-Merced dataset with 21 scene categories, the WHU-RS dataset with 19 scene categories, the AID dataset with 30 scene categories, and the NWPU-RESISC45 dataset with 45 challenging scene categories. The experimental results demonstrate that our architecture achieves a significant classification accuracy improvement over all state-of-the-art reference methods. PMID:29581722
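A minimal sketch of the final fusion-plus-ELM stage, assuming two deep feature vectors per image have already been extracted; random data stands in for the CNN features, and the hidden-layer size is illustrative:

```python
# Sketch: concatenate two feature streams and train an extreme learning machine
# (random hidden layer, least-squares output weights) for scene classification.
import numpy as np

rng = np.random.default_rng(5)
rgb_feats = rng.normal(size=(500, 256))      # hypothetical features from the RGB stream
sal_feats = rng.normal(size=(500, 256))      # hypothetical features from the saliency stream
labels = rng.integers(0, 21, size=500)       # e.g. 21 UC-Merced scene categories

X = np.hstack([rgb_feats, sal_feats])        # simple concatenation fusion
Y = np.eye(21)[labels]                       # one-hot targets

n_hidden = 1000
W = rng.normal(size=(X.shape[1], n_hidden))  # random, untrained input weights
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W + b)                       # hidden-layer activations
beta = np.linalg.pinv(H) @ Y                 # closed-form output weights

pred = np.argmax(H @ beta, axis=1)
print("training accuracy:", np.mean(pred == labels))
```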
ERIC Educational Resources Information Center
Wine, Jennifer; Bryan, Michael; Siegel, Peter
2013-01-01
The National Postsecondary Student Aid Study (NPSAS) helps fulfill the U.S. Department of Education's National Center for Education Statistics (NCES) mandate to collect, analyze, and publish statistics related to education. The purpose of NPSAS is to compile a comprehensive research dataset, based on student-level records, on financial aid…