Data reuse and the open data citation advantage
Vision, Todd J.
2013-01-01
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003. PMID:24109559
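To illustrate the kind of multivariate citation model described above, the following hedged Python sketch fits a negative binomial regression of citation counts on a data-availability indicator plus one covariate, using synthetic data; the coefficients, sample size, and variable names are illustrative assumptions, not the study's actual specification.

```python
# Illustrative sketch only: negative binomial regression of citation counts
# on data availability plus one covariate, on synthetic data. The actual
# study controlled for many more covariates (journal, authors, country, topic).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
data_available = rng.integers(0, 2, n)            # 1 if data archived
log_if = rng.normal(1.0, 0.5, n)                  # log journal impact factor
mu = np.exp(1.5 + 0.09 * data_available + 0.3 * log_if)
citations = rng.poisson(mu)                       # synthetic citation counts

X = sm.add_constant(np.column_stack([data_available, log_if]))
fit = sm.GLM(citations, X, family=sm.families.NegativeBinomial()).fit()
print(np.exp(fit.params[1]))   # multiplicative citation benefit, ~1.09 here
```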
DeWitt, Jessica D.; Chirico, Peter G.; Malpeli, Katherine C.
2015-11-18
This work represents the fourth installment of the series, and publishes a dataset of eight new AOIs and one subarea within Afghanistan. These areas include Dasht-e-Nawar, Farah, North Ghazni, South Ghazni, Chakhansur, Godzareh East, Godzareh West, and Namaksar-e-Herat AOIs and the Central Bamyan subarea of the South Bamyan AOI (datasets for South Bamyan were published previously in Casey and Chirico, 2013). For each AOI and subarea, this dataset collection consists of the areal extent boundaries, elevation contours at 25-, 50-, and 100-m intervals, and an enhanced DEM. Hydrographic datasets covering the extent of four AOIs and one subarea are also included in the collection. The resulting raster and vector layers are intended for use by government agencies, developmental organizations, and private companies in Afghanistan to support mineral assessments, monitoring, management, and investment.
Gan, Lin; Denecke, Bernd
2013-06-24
It came to our attention that a paper has recently been published concerning one of the GEO datasets (GSE34413) we cited in our published paper [1]. The original reference (reference 27) cited for this dataset leads to a paper about a similar study from the same research group [2]. In order to provide readers with exact citation information, we would like to update reference 27 in our previous paper to the newly published paper concerning GSE34413 [3]. The authors apologize for this inconvenience. [...].
The Lunar Source Disk: Old Lunar Datasets on a New CD-ROM
Hiesinger, H.
1998-01-01
A compilation of previously published datasets on CD-ROM is presented. This Lunar Source Disk is intended to be a first step in the improvement/expansion of the Lunar Consortium Disk, in order to create an "image-cube"-like data pool that can be easily accessed and might be useful for a variety of future lunar investigations. All datasets were transformed to a standard map projection that allows direct comparison of different types of information on a pixel-by-pixel basis. Lunar observations have a long history and have been important to mankind for centuries, notably since the work of Plutarch and Galileo. As a consequence of centuries of lunar investigations, knowledge of the characteristics and properties of the Moon has accumulated over time. However, a side effect of this accumulation is that it has become more and more complicated for scientists to review all the datasets obtained through different techniques, to interpret them properly, to recognize their weaknesses and strengths in detail, and to combine them synoptically in geologic interpretations. Such synoptic geologic interpretations are crucial for the study of planetary bodies through remote-sensing data in order to avoid misinterpretation. In addition, many of the modern datasets, derived from Earth-based telescopes as well as from spacecraft missions, are acquired at different geometric and radiometric conditions. These differences make it challenging to compare or combine datasets directly or to extract information from different datasets on a pixel-by-pixel basis. Also, as there is no convention for the presentation of lunar datasets, different authors choose different map projections, depending on the location of the investigated areas and their personal interests. Insufficient or incomplete information on the map parameters used by different authors further complicates the reprojection of these datasets to a standard geometry. The goal of our efforts was to transfer previously published lunar datasets to a selected standard geometry in order to create an "image-cube"-like data pool for further interpretation. The starting point was a number of datasets on a CD-ROM published by the Lunar Consortium. The task of creating a uniform data pool was further complicated by some missing or incorrect references and keys on the Lunar Consortium CD as well as erroneous reproduction of some datasets in the literature.
Sprague, Lori A.; Gronberg, Jo Ann M.
2013-01-01
Anthropogenic inputs of nitrogen and phosphorus to each county in the conterminous United States and to the watersheds of 495 surface-water sites studied as part of the U.S. Geological Survey National Water-Quality Assessment Program were quantified for the years 1992, 1997, and 2002. Estimates of inputs of nitrogen and phosphorus from biological fixation by crops (for nitrogen only), human consumption, crop production for human consumption, animal production for human consumption, animal consumption, and crop production for animal consumption for each county are provided in a tabular dataset. These county-level estimates were allocated to the watersheds of the surface-water sites to estimate watershed-level inputs from the same sources; these estimates also are provided in a tabular dataset, together with calculated estimates of net import of food and net import of feed and previously published estimates of inputs from atmospheric deposition, fertilizer, and recoverable manure. The previously published inputs are provided for each watershed so that final estimates of total anthropogenic nutrient inputs could be calculated. Estimates of total anthropogenic inputs are presented together with previously published estimates of riverine loads of total nitrogen and total phosphorus for reference.
Baudin, François; Martinez, Philippe; Dennielou, Bernard; Charlier, Karine; Marsset, Tania; Droz, Laurence; Rabouille, Christophe
2017-08-01
Geochemical data (total organic carbon (TOC) content, δ13Corg, C:N, Rock-Eval analyses) were obtained on 150 core tops from the Angola basin, with a special focus on the Congo deep-sea fan. Combined with the previously published data, the resulting dataset (322 stations) shows good spatial and bathymetric representativeness. TOC content and δ13Corg maps of the Angola basin were generated using this enhanced dataset. The main difference between our map and previously published ones is the high terrestrial organic matter content observed downslope along the active turbidite channel of the Congo deep-sea fan down to the distal lobe complex near 5000 m water depth. Interpretation of downslope trends in TOC content and organic matter composition indicates that lateral particle transport by turbidity currents is the primary mechanism controlling supply and burial of organic matter in the bathypelagic depths.
Xu, Lingyu; Xu, Yuancheng; Coulden, Richard; Sonnex, Emer; Hrybouski, Stanislau; Paterson, Ian; Butler, Craig
2018-05-11
Epicardial adipose tissue (EAT) volume derived from contrast-enhanced (CE) computed tomography (CT) scans is not well validated. We aim to establish a reliable threshold to accurately quantify EAT volume from CE datasets. We analyzed EAT volume on paired non-contrast (NC) and CE datasets from 25 patients to derive appropriate Hounsfield unit (HU) cutpoints to equalize the two EAT volume estimates. The gold standard threshold (-190HU, -30HU) was used to assess EAT volume on NC datasets. For CE datasets, EAT volumes were estimated using three previously reported thresholds: (-190HU, -30HU), (-190HU, -15HU), (-175HU, -15HU) and were analyzed by semi-automated 3D fat analysis software. Subsequently, we applied a threshold correction to (-190HU, -30HU) based on mean differences in radiodensity between NC and CE images (ΔEATrd = CE radiodensity - NC radiodensity). We then validated our findings on EAT threshold in 21 additional patients with paired CT datasets. EAT volume from CE datasets using previously published thresholds consistently underestimated EAT volume from the NC dataset standard by a magnitude of 8.2%-19.1%. Using our corrected threshold (-190HU, -3HU) in CE datasets yielded statistically identical EAT volume to NC EAT volume in the validation cohort (186.1 ± 80.3 vs. 185.5 ± 80.1 cm³, Δ = 0.6 cm³, 0.3%, p = 0.374). Estimating EAT volume from contrast-enhanced CT scans using a corrected threshold of -190HU, -3HU provided excellent agreement with EAT volume from non-contrast CT scans using a standard threshold of -190HU, -30HU. Copyright © 2018. Published by Elsevier B.V.
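A minimal sketch of the threshold-correction idea reported above, assuming synthetic voxel radiodensities: the upper bound of the fat HU window is shifted by the mean CE-NC radiodensity difference (ΔEATrd).

```python
# Hedged sketch: shift the upper HU bound of the fat window by the mean
# radiodensity difference between CE and NC images. Arrays are synthetic
# stand-ins, not real CT voxel data.
import numpy as np

def eat_volume(hu, voxel_ml, lo=-190.0, hi=-30.0):
    """Volume (ml) of voxels falling inside the fat HU window."""
    return np.count_nonzero((hu >= lo) & (hu <= hi)) * voxel_ml

rng = np.random.default_rng(1)
nc_hu = rng.normal(-80, 40, 100_000)      # stand-in NC fat-region densities
ce_hu = nc_hu + 27                        # contrast raises radiodensity here

delta = ce_hu.mean() - nc_hu.mean()       # ΔEATrd = CE - NC radiodensity
print(eat_volume(nc_hu, 0.5))                    # NC standard window (-190, -30)
print(eat_volume(ce_hu, 0.5, hi=-30 + delta))    # corrected CE window, ~(-190, -3)
```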
A single factor underlies the metabolic syndrome: a confirmatory factor analysis.
Pladevall, Manel; Singal, Bonita; Williams, L Keoki; Brotons, Carlos; Guyer, Heidi; Sadurni, Josep; Falces, Carles; Serrano-Rios, Manuel; Gabriel, Rafael; Shaw, Jonathan E; Zimmet, Paul Z; Haffner, Steven
2006-01-01
Confirmatory factor analysis (CFA) was used to test the hypothesis that the components of the metabolic syndrome are manifestations of a single common factor. Three different datasets were used to test and validate the model. The Spanish and Mauritian studies included 207 men and 203 women and 1,411 men and 1,650 women, respectively. A third analytical dataset including 847 men was obtained from a previously published CFA of a U.S. population. The one-factor model included the metabolic syndrome core components (central obesity, insulin resistance, blood pressure, and lipid measurements). We also tested an expanded one-factor model that included uric acid and leptin levels. Finally, we used CFA to compare the goodness of fit of one-factor models with the fit of two previously published four-factor models. The simplest one-factor model showed the best goodness-of-fit indexes (comparative fit index 1, root mean-square error of approximation 0.00). Comparisons of one-factor with four-factor models in the three datasets favored the one-factor model structure. The selection of variables to represent the different metabolic syndrome components and model specification explained why previous exploratory and confirmatory factor analysis, respectively, failed to identify a single factor for the metabolic syndrome. These analyses support the current clinical definition of the metabolic syndrome, as well as the existence of a single factor that links all of the core components.
Gravity, aeromagnetic and rock-property data of the central California Coast Ranges
Langenheim, V.E.
2014-01-01
Gravity, aeromagnetic, and rock-property data were collected to support geologic-mapping, water-resource, and seismic-hazard studies for the central California Coast Ranges. These data are combined with existing data to provide gravity, aeromagnetic, and physical-property datasets for this region. The gravity dataset consists of approximately 18,000 measurements. The aeromagnetic dataset consists of total-field anomaly values from several detailed surveys that have been merged and gridded at an interval of 200 m. The physical property dataset consists of approximately 800 density measurements and 1,100 magnetic-susceptibility measurements from rock samples, in addition to previously published borehole gravity surveys from Santa Maria Basin, density logs from Salinas Valley, and intensities of natural remanent magnetization.
Assessment of published models and prognostic variables in epithelial ovarian cancer at Mayo Clinic
Hendrickson, Andrea Wahner; Hawthorne, Kieran M.; Goode, Ellen L.; Kalli, Kimberly R.; Goergen, Krista M.; Bakkum-Gamez, Jamie N.; Cliby, William A.; Keeney, Gary L.; Visscher, Dan W.; Tarabishy, Yaman; Oberg, Ann L.; Hartmann, Lynn C.; Maurer, Matthew J.
2015-01-01
Objectives Epithelial ovarian cancer (EOC) is an aggressive disease in which first-line therapy consists of a surgical staging/debulking procedure and platinum-based chemotherapy. There is significant interest in clinically applicable, easy-to-use prognostic tools to estimate risk of recurrence and overall survival. In this study we used a large prospectively collected cohort of women with EOC to validate currently published models and assess prognostic variables. Methods Women with invasive ovarian, peritoneal, or fallopian tube cancer diagnosed between 2000-2011 and prospectively enrolled into the Mayo Clinic Ovarian Cancer registry were identified. Demographics and known prognostic markers as well as epidemiologic exposure variables were abstracted from the medical record and collected via questionnaire. Six previously published models of overall and recurrence-free survival were assessed for external validity. In addition, predictors of outcome were assessed in our dataset. Results Previously published models validated with a range of c-statistics (0.587-0.827), though application of models containing variables not part of routine practice was somewhat limited by missing data; utilization of all applicable models and comparison of results is suggested. Examination of prognostic variables identified only the presence of ascites and ASA score to be independent predictors of prognosis in our dataset, albeit with marginal gain in prognostic information, after accounting for stage and debulking. Conclusions Existing prognostic models for newly diagnosed EOC showed acceptable calibration in our cohort for clinical application. However, modeling of prospective variables in our dataset reiterates that stage and debulking remain the most important predictors of prognosis in this setting. PMID:25620544
Yang, Chihae; Barlow, Susan M; Muldoon Jacobs, Kristi L; Vitcheva, Vessela; Boobis, Alan R; Felter, Susan P; Arvidson, Kirk B; Keller, Detlef; Cronin, Mark T D; Enoch, Steven; Worth, Andrew; Hollnagel, Heli M
2017-11-01
A new dataset of cosmetics-related chemicals for the Threshold of Toxicological Concern (TTC) approach has been compiled, comprising 552 chemicals with 219, 40, and 293 chemicals in Cramer Classes I, II, and III, respectively. Data were integrated and curated to create a database of No-/Lowest-Observed-Adverse-Effect Level (NOAEL/LOAEL) values, from which the final COSMOS TTC dataset was developed. Criteria for study inclusion and NOAEL decisions were defined, and rigorous quality control was performed for study details and assignment of Cramer classes. From the final COSMOS TTC dataset, human exposure thresholds of 42 and 7.9 μg/kg-bw/day were derived for Cramer Classes I and III, respectively. The size of Cramer Class II was insufficient for derivation of a TTC value. The COSMOS TTC dataset was then federated with the dataset of Munro and colleagues, previously published in 1996, after updating the latter using the quality control processes for this project. This federated dataset expands the chemical space and provides more robust thresholds. The 966 substances in the federated database comprise 245, 49 and 672 chemicals in Cramer Classes I, II and III, respectively. The corresponding TTC values of 46, 6.2 and 2.3 μg/kg-bw/day are broadly similar to those of the original Munro dataset. Copyright © 2017 The Authors. Published by Elsevier Ltd. All rights reserved.
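For context, TTC values of this kind are conventionally derived by taking the 5th percentile of the NOAEL distribution for a Cramer class and applying a 100-fold safety factor; the sketch below illustrates that arithmetic with synthetic NOAELs, not the COSMOS data.

```python
# Hedged sketch of the conventional TTC derivation: 5th percentile of a
# (roughly log-normal) NOAEL distribution divided by a 100x safety factor.
# NOAELs below are synthetic, not the curated COSMOS values.
import numpy as np

rng = np.random.default_rng(2)
noael = rng.lognormal(np.log(50), 1.2, 293)   # synthetic NOAELs, mg/kg-bw/day
p5 = np.percentile(noael, 5)                  # 5th percentile of distribution
ttc_ug = p5 / 100 * 1000                      # /100 safety factor, to ug/kg-bw/day
print(round(p5, 2), round(ttc_ug, 1))
```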
Palmblad, Magnus; van der Burgt, Yuri E M; Dalebout, Hans; Derks, Rico J E; Schoenmaker, Bart; Deelder, André M
2009-05-02
Accurate mass determination enhances peptide identification in mass spectrometry-based proteomics. Here we describe the combination of two previously published open source software tools to improve mass measurement accuracy in Fourier transform ion cyclotron resonance mass spectrometry (FTICRMS). The first program, msalign, aligns one MS/MS dataset with one FTICRMS dataset. The second program, recal2, uses peptides identified from the MS/MS data for automated internal calibration of the FTICR spectra, resulting in sub-ppm mass measurement errors.
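The internal-calibration step can be pictured as fitting a recalibration function from identified peptides' measured versus theoretical m/z and applying it to the whole spectrum; the sketch below uses a simple linear fit on made-up values and is not recal2 itself.

```python
# Hedged sketch of internal calibration: fit a linear correction from
# identified calibrant peptides, apply it spectrum-wide. Values are made up.
import numpy as np

measured = np.array([500.2671, 785.8421, 1045.5642])     # hypothetical calibrants
theoretical = np.array([500.2656, 785.8398, 1045.5611])

a, b = np.polyfit(measured, theoretical, 1)   # m/z_true ~ a * m/z_meas + b
spectrum_mz = np.array([432.2101, 886.4812, 1204.6433])
recalibrated = a * spectrum_mz + b

ppm_before = (measured - theoretical) / theoretical * 1e6
print(ppm_before.round(2))        # pre-calibration errors, a few ppm
print(recalibrated.round(4))      # corrected m/z values
```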
Klein, Hans-Ulrich; Ruckert, Christian; Kohlmann, Alexander; Bullinger, Lars; Thiede, Christian; Haferlach, Torsten; Dugas, Martin
2009-12-15
Multiple gene expression signatures derived from microarray experiments have been published in the field of leukemia research. A comparison of these signatures with results from new experiments is useful for verification as well as for interpretation of the results obtained. Currently, the percentage of overlapping genes is frequently used to compare published gene signatures against a signature derived from a new experiment. However, it has been shown that the percentage of overlapping genes is of limited use for comparing two experiments due to the variability of gene signatures caused by different array platforms or assay-specific influencing parameters. Here, we present a robust approach for a systematic and quantitative comparison of published gene expression signatures with an exemplary query dataset. A database storing 138 leukemia-related published gene signatures was designed. Each gene signature was manually annotated with terms according to a leukemia-specific taxonomy. Two analysis steps are implemented to compare a new microarray dataset with the results from previous experiments stored and curated in the database. First, the global test method is applied to assess gene signatures and to constitute a ranking among them. In a subsequent analysis step, the focus is shifted from single gene signatures to chromosomal aberrations or molecular mutations as modeled in the taxonomy. Potentially interesting disease characteristics are detected based on the ranking of gene signatures associated with these aberrations stored in the database. Two example analyses are presented. An implementation of the approach is freely available as a web-based application. The presented approach helps researchers to systematically integrate the knowledge derived from numerous microarray experiments into the analysis of a new dataset. By means of example leukemia datasets we demonstrate that this approach detects related experiments as well as related molecular mutations and may help to interpret new microarray data.
A century of transitions in New York City's measles dynamics.
Hempel, Karsten; Earn, David J D
2015-05-06
Infectious diseases spreading in a human population occasionally exhibit sudden transitions in their qualitative dynamics. Previous work has successfully predicted such transitions in New York City's historical measles incidence using the seasonally forced susceptible-infectious-recovered (SIR) model. This work relied on a dataset spanning 45 years (1928-1973), which we have extended to 93 years (1891-1984). We identify additional dynamical transitions in the longer dataset and successfully explain them by analysing attractors and transients of the same mechanistic epidemiological model. © 2015 The Author(s) Published by the Royal Society. All rights reserved.
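The mechanistic model referred to is the seasonally forced SIR system; the sketch below integrates one common parameterization with illustrative rates, not the values fitted to the New York City series.

```python
# Hedged sketch of a seasonally forced SIR model:
# beta(t) = beta0 * (1 + alpha * cos(2*pi*t)), t in years. All parameter
# values are illustrative, not those estimated from the NYC measles data.
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta0=500.0, alpha=0.08, gamma=365 / 13, mu=0.02):
    S, I = y
    beta = beta0 * (1 + alpha * np.cos(2 * np.pi * t))   # seasonal forcing
    return [mu - beta * S * I - mu * S,                  # dS/dt
            beta * S * I - gamma * I - mu * I]           # dI/dt

sol = solve_ivp(sir, (0, 50), [0.06, 1e-4], max_step=0.005)
print(sol.y[1].max())   # peak infectious fraction over 50 simulated years
```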
Costa, Marta; Manton, James D; Ostrovsky, Aaron D; Prohaska, Steffen; Jefferis, Gregory S X E
2016-07-20
Neural circuit mapping is generating datasets of tens of thousands of labeled neurons. New computational tools are needed to search and organize these data. We present NBLAST, a sensitive and rapid algorithm for measuring pairwise neuronal similarity. NBLAST considers both position and local geometry, decomposing neurons into short segments; matched segments are scored using a probabilistic scoring matrix defined by statistics of matches and non-matches. We validated NBLAST on a published dataset of 16,129 single Drosophila neurons. NBLAST can distinguish neuronal types down to the finest level (single identified neurons) without a priori information. Cluster analysis of extensively studied neuronal classes identified new types and unreported topographical features. Fully automated clustering organized the validation dataset into 1,052 clusters, many of which map onto previously described neuronal types. NBLAST supports additional query types, including searching neurons against transgene expression patterns. Finally, we show that NBLAST is effective with data from other invertebrates and zebrafish. Copyright © 2016 MRC Laboratory of Molecular Biology. Published by Elsevier Inc. All rights reserved.
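The core NBLAST operation, scoring nearest-neighbour segment matches by distance and tangent alignment, can be sketched as follows; the Gaussian-times-dot-product score here is a stand-in for NBLAST's trained probabilistic scoring matrix.

```python
# Hedged NBLAST-like sketch: neurons as point clouds with unit tangent
# vectors; each query segment is matched to its nearest target segment and
# scored by distance and alignment. The score function is illustrative only.
import numpy as np
from scipy.spatial import cKDTree

def nblast_like(q_pts, q_vecs, t_pts, t_vecs, sigma=5.0):
    dist, idx = cKDTree(t_pts).query(q_pts)          # nearest target segment
    align = np.abs(np.sum(q_vecs * t_vecs[idx], axis=1))
    return float(np.sum(np.exp(-dist**2 / (2 * sigma**2)) * align))

rng = np.random.default_rng(3)

def unit_vecs(n):
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

a_pts, b_pts = rng.normal(size=(100, 3)), rng.normal(size=(120, 3))
print(nblast_like(a_pts, unit_vecs(100), b_pts, unit_vecs(120)))
```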
Forcino, Frank L; Leighton, Lindsey R; Twerdy, Pamela; Cahill, James F
2015-01-01
Community ecologists commonly perform multivariate techniques (e.g., ordination, cluster analysis) to assess patterns and gradients of taxonomic variation. A critical requirement for a meaningful statistical analysis is accurate information on the taxa found within an ecological sample. However, oversampling (too many individuals counted per sample) also comes at a cost, particularly for ecological systems in which identification and quantification are substantially more resource-consuming than the field expedition itself. In such systems, an increasingly large sample size will eventually result in diminishing returns in improving any pattern or gradient revealed by the data, but will also lead to continually increasing costs. Here, we examine 396 datasets: 44 previously published and 352 created datasets. Using meta-analytic and simulation-based approaches, the research within the present paper seeks (1) to determine minimal sample sizes required to produce robust multivariate statistical results when conducting abundance-based, community ecology research. Furthermore, we seek (2) to determine the dataset parameters (i.e., evenness, number of taxa, number of samples) that require larger sample sizes, regardless of resource availability. We found that in the 44 previously published and the 220 created datasets with randomly chosen abundances, a conservative estimate of a sample size of 58 produced the same multivariate results as all larger sample sizes. However, this minimal number varies as a function of evenness, where increased evenness resulted in increased minimal sample sizes. Sample sizes as small as 58 individuals are sufficient for a broad range of multivariate abundance-based research. In cases when resource availability is the limiting factor for conducting a project (e.g., small university, time to conduct the research project), statistically viable results can still be obtained with less of an investment.
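The underlying resampling logic can be sketched as follows: rarefy each sample to a fixed count and ask how well the subsampled Bray-Curtis distance structure correlates with the full-data structure. Data and the correlation criterion below are illustrative, not the paper's protocol.

```python
# Hedged sketch: subsample (rarefy) community counts and compare the
# Bray-Curtis distance structure to the full data. Counts are synthetic.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
counts = rng.poisson(5, size=(20, 30))        # 20 samples x 30 taxa

def rarefy(row, n):
    pool = np.repeat(np.arange(row.size), row)    # one entry per individual
    picked = rng.choice(pool, size=min(n, pool.size), replace=False)
    return np.bincount(picked, minlength=row.size)

full = pdist(counts, metric="braycurtis")
for n in (25, 58, 100):
    sub = np.array([rarefy(r, n) for r in counts])
    r, _ = pearsonr(full, pdist(sub, metric="braycurtis"))
    print(n, round(r, 3))   # correlation approaches 1 as n grows
```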
Parameterizing sorption isotherms using a hybrid global-local fitting procedure.
Matott, L Shawn; Singh, Anshuman; Rabideau, Alan J
2017-05-01
Predictive modeling of the transport and remediation of groundwater contaminants requires an accurate description of the sorption process, which is usually provided by fitting an isotherm model to site-specific laboratory data. Commonly used calibration procedures, listed in order of increasing sophistication, include: trial-and-error, linearization, non-linear regression, global search, and hybrid global-local search. Given the considerable variability in fitting procedures applied in published isotherm studies, we investigated the importance of algorithm selection through a series of numerical experiments involving 13 previously published sorption datasets. These datasets, considered representative of the state of the art for isotherm experiments, had been previously analyzed using trial-and-error, linearization, or non-linear regression methods. The isotherm expressions were re-fit using a 3-stage hybrid global-local search procedure (i.e., global search using particle swarm optimization followed by Powell's derivative-free local search method and Gauss-Marquardt-Levenberg non-linear regression). The re-fitted expressions were then compared to previously published fits in terms of the optimized weighted sum of squared residuals (WSSR) fitness function, the final estimated parameters, and the influence on contaminant transport predictions - where easily computed concentration-dependent contaminant retardation factors served as a surrogate measure of likely transport behavior. Results suggest that many of the previously published calibrated isotherm parameter sets were local minima. In some cases, the updated hybrid global-local search yielded order-of-magnitude reductions in the fitness function. In particular, of the candidate isotherms, the Polanyi-type models were most likely to benefit from the use of the hybrid fitting procedure. In some cases, improvements in fitness function were associated with slight (<10%) changes in parameter values, but in other cases significant (>50%) changes in parameter values were noted. Despite these differences, the influence of isotherm misspecification on contaminant transport predictions was quite variable and difficult to predict from inspection of the isotherms. Copyright © 2017 Elsevier B.V. All rights reserved.
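A minimal sketch of the hybrid global-local strategy, assuming a Freundlich isotherm q = Kf·C^n and synthetic data; SciPy's differential evolution stands in for the particle swarm stage, and least squares for the local refinement.

```python
# Hedged sketch of hybrid global-local isotherm fitting on synthetic data.
# differential_evolution is a stand-in for particle swarm optimization;
# least_squares stands in for the derivative-free/regression refinement.
import numpy as np
from scipy.optimize import differential_evolution, least_squares

C = np.array([0.1, 0.5, 1.0, 5.0, 10.0, 50.0])   # aqueous concentrations
q = 2.0 * C**0.7 + np.random.default_rng(5).normal(0, 0.2, C.size)

def wssr(p):                     # (unweighted) sum of squared residuals
    Kf, n = p
    return np.sum((q - Kf * C**n) ** 2)

glob = differential_evolution(wssr, bounds=[(1e-3, 100), (0.1, 1.5)], seed=5)
loc = least_squares(lambda p: q - p[0] * C**p[1], x0=glob.x)
print(glob.x, loc.x)             # global estimate, then local refinement
```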
Annual variation in the atmospheric radon concentration in Japan.
Kobayashi, Yuka; Yasuoka, Yumi; Omori, Yasutaka; Nagahama, Hiroyuki; Sanada, Tetsuya; Muto, Jun; Suzuki, Toshiyuki; Homma, Yoshimi; Ihara, Hayato; Kubota, Kazuhito; Mukai, Takahiro
2015-08-01
Anomalous atmospheric variations in radon related to earthquakes have been observed in hourly exhaust-monitoring data from radioisotope institutes in Japan. The extraction of seismic anomalous radon variations would be greatly aided by understanding the normal pattern of variation in radon concentrations. Using atmospheric daily minimum radon concentration data from five sampling sites, we show that a sinusoidal regression curve can be fitted to the data. In addition, we identify areas where the atmospheric radon variation is significantly affected by the variation in atmospheric turbulence and the onshore-offshore pattern of Asian monsoons. Furthermore, by comparing the sinusoidal regression curve for the normal annual (seasonal) variations at the five sites to the sinusoidal regression curve for a previously published dataset of radon values at the five Japanese prefectures, we can estimate the normal annual variation pattern. By fitting sinusoidal regression curves to the previously published dataset containing sites in all Japanese prefectures, we find that 72% of the Japanese prefectures satisfy the requirements of the sinusoidal regression curve pattern. Using the normal annual variation pattern of atmospheric daily minimum radon concentration data, these prefectures are suitable areas for obtaining anomalous radon variations related to earthquakes. Copyright © 2015 Elsevier Ltd. All rights reserved.
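The sinusoidal regression itself is straightforward to reproduce in outline: fit y(t) = A·sin(2πt/365.25 + φ) + C to the daily minima. The sketch below uses synthetic data, not the monitoring records.

```python
# Hedged sketch: fit an annual sinusoid to daily minimum radon values.
# The time series here is synthetic.
import numpy as np
from scipy.optimize import curve_fit

t = np.arange(0, 3 * 365)                       # days
y = 10 + 3 * np.sin(2 * np.pi * t / 365.25 + 1.0) \
    + np.random.default_rng(6).normal(0, 0.5, t.size)

def model(t, A, phi, C):
    return A * np.sin(2 * np.pi * t / 365.25 + phi) + C

(A, phi, C), _ = curve_fit(model, t, y, p0=[1.0, 0.0, y.mean()])
print(round(A, 2), round(phi, 2), round(C, 2))  # recovers ~(3, 1, 10)
```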
Full Life Cycle of Data Analysis with Climate Model Diagnostic Analyzer (CMDA)
Lee, S.; Zhai, C.; Pan, L.; Tang, B.; Zhang, J.; Bao, Q.; Malarout, N.
2017-12-01
We have developed a system that supports the full life cycle of a data analysis process, from data discovery, to data customization, to analysis, to reanalysis, to publication, and to reproduction. The system, called Climate Model Diagnostic Analyzer (CMDA), is designed to demonstrate that the full life cycle of data analysis can be supported within one integrated system for climate model diagnostic evaluation with global observational and reanalysis datasets. CMDA has four subsystems that are highly integrated to support the analysis life cycle. Data System manages datasets used by CMDA analysis tools, Analysis System manages CMDA analysis tools, which are all web services, Provenance System manages the metadata of CMDA datasets and the provenance of CMDA analysis history, and Recommendation System extracts knowledge from CMDA usage history and recommends datasets/analysis tools to users. These four subsystems are not only highly integrated but also easily expandable. New datasets can be easily added to Data System and scanned to be visible to the other subsystems. New analysis tools can be easily registered to be available in the Analysis System and Provenance System. With CMDA, a user can start a data analysis process by discovering datasets of relevance to their research topic using the Recommendation System. Next, the user can customize the discovered datasets for their scientific use (e.g., anomaly calculation, regridding) with tools in the Analysis System. Next, the user can do their analysis with the tools (e.g., conditional sampling, time averaging, spatial averaging) in the Analysis System. Next, the user can reanalyze the datasets based on the previously stored analysis provenance in the Provenance System. Further, they can publish their analysis process and result to the Provenance System to share with other users. Finally, any user can reproduce the published analysis process and results. By supporting the full life cycle of climate data analysis, CMDA improves the research productivity and collaboration level of its users.
Matott, L Shawn; Jiang, Zhengzheng; Rabideau, Alan J; Allen-King, Richelle M
2015-01-01
Numerous isotherm expressions have been developed for describing sorption of hydrophobic organic compounds (HOCs), including "dual-mode" approaches that combine nonlinear behavior with a linear partitioning component. Choosing among these alternative expressions for describing a given dataset is an important task that can significantly influence subsequent transport modeling and/or mechanistic interpretation. In this study, a series of numerical experiments were undertaken to identify "best-in-class" isotherms by refitting 10 alternative models to a suite of 13 previously published literature datasets. The corrected Akaike Information Criterion (AICc) was used for ranking these alternative fits and distinguishing between plausible and implausible isotherms for each dataset. The occurrence of multiple plausible isotherms was inversely correlated with dataset "richness", such that datasets with fewer observations and/or a narrow range of aqueous concentrations resulted in a greater number of plausible isotherms. Overall, only the Polanyi-partition dual-mode isotherm was classified as "plausible" across all 13 of the considered datasets, indicating substantial statistical support consistent with current advances in sorption theory. However, these findings are predicated on the use of the AICc measure as an unbiased ranking metric and the adoption of a subjective, but defensible, threshold for separating plausible and implausible isotherms. Copyright © 2015 Elsevier B.V. All rights reserved.
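For least-squares fits, a common form of the criterion is AICc = n·ln(SSR/n) + 2k + 2k(k+1)/(n−k−1), with n data points and k fitted parameters; the sketch below ranks hypothetical isotherm fits by ΔAICc and is illustrative only.

```python
# Hedged sketch of AICc-based model ranking for least-squares isotherm fits.
# The (SSR, k) pairs and n are hypothetical, not the paper's values.
import numpy as np

def aicc(ssr, n, k):
    """AICc for a least-squares fit: n observations, k fitted parameters."""
    return n * np.log(ssr / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

n = 12                                            # hypothetical observations
fits = {"Freundlich": (1.8, 2), "Langmuir": (2.1, 2),
        "Polanyi-partition": (1.2, 3)}            # hypothetical (SSR, k)
scores = {m: aicc(ssr, n, k) for m, (ssr, k) in fits.items()}
best = min(scores.values())
for m in sorted(scores, key=scores.get):
    print(m, round(scores[m] - best, 2))          # delta-AICc ranking
```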
Mahmood, Khalid; Jung, Chol-Hee; Philip, Gayle; Georgeson, Peter; Chung, Jessica; Pope, Bernard J; Park, Daniel J
2017-05-16
Genetic variant effect prediction algorithms are used extensively in clinical genomics and research to determine the likely consequences of amino acid substitutions on protein function. It is vital that we better understand their accuracies and limitations because published performance metrics are confounded by serious problems of circularity and error propagation. Here, we derive three independent, functionally determined human mutation datasets, UniFun, BRCA1-DMS and TP53-TA, and employ them, alongside previously described datasets, to assess the pre-eminent variant effect prediction tools. Apparent accuracies of variant effect prediction tools were influenced significantly by the benchmarking dataset. Benchmarking with the assay-determined datasets UniFun and BRCA1-DMS yielded areas under the receiver operating characteristic curves in the modest ranges of 0.52 to 0.63 and 0.54 to 0.75, respectively, considerably lower than observed for other, potentially more conflicted datasets. These results raise concerns about how such algorithms should be employed, particularly in a clinical setting. Contemporary variant effect prediction tools are unlikely to be as accurate at the general prediction of functional impacts on proteins as previously reported. Use of functional assay-based datasets that avoid prior dependencies promises to be valuable for the ongoing development and accurate benchmarking of such tools.
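The benchmarking step reduces to computing the area under the ROC curve for each tool's scores against assay-determined labels, as sketched below with synthetic scores.

```python
# Hedged sketch of the benchmarking operation: ROC AUC of predictor scores
# against assay-determined labels. Labels and scores are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
labels = rng.integers(0, 2, 200)                  # deleterious vs. neutral
scores = labels * 0.15 + rng.uniform(0, 1, 200)   # a weak predictor
print(round(roc_auc_score(labels, scores), 2))    # modest AUC, ~0.55-0.65
```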
Verhoef, Petra; Dötsch-Klerk, Mariska; Lathrop, Mark; Xu, Peng; Nordestgaard, Børge G.; Holm, Hilma; Hopewell, Jemma C.; Saleheen, Danish; Tanaka, Toshihiro; Anand, Sonia S.; Chambers, John C.; Kleber, Marcus E.; Ouwehand, Willem H.; Yamada, Yoshiji; Elbers, Clara; Peters, Bas; Stewart, Alexandre F. R.; Reilly, Muredach M.; Thorand, Barbara; Yusuf, Salim; Engert, James C.; Assimes, Themistocles L.; Kooner, Jaspal; Danesh, John; Watkins, Hugh; Samani, Nilesh J.
2012-01-01
Background Moderately elevated blood levels of homocysteine are weakly correlated with coronary heart disease (CHD) risk, but causality remains uncertain. When folate levels are low, the TT genotype of the common C677T polymorphism (rs1801133) of the methylene tetrahydrofolate reductase gene (MTHFR) appreciably increases homocysteine levels, so “Mendelian randomization” studies using this variant as an instrumental variable could help test causality. Methods and Findings Nineteen unpublished datasets were obtained (total 48,175 CHD cases and 67,961 controls) in which multiple genetic variants had been measured, including MTHFR C677T. These datasets did not include measurements of blood homocysteine, but homocysteine levels would be expected to be about 20% higher with TT than with CC genotype in the populations studied. In meta-analyses of these unpublished datasets, the case-control CHD odds ratio (OR) and 95% CI comparing TT versus CC homozygotes was 1.02 (0.98–1.07; p = 0.28) overall, and 1.01 (0.95–1.07) in unsupplemented low-folate populations. By contrast, in a slightly updated meta-analysis of the 86 published studies (28,617 CHD cases and 41,857 controls), the OR was 1.15 (1.09–1.21), significantly discrepant (p = 0.001) with the OR in the unpublished datasets. Within the meta-analysis of published studies, the OR was 1.12 (1.04–1.21) in the 14 larger studies (those with variance of log OR<0.05; total 13,119 cases) and 1.18 (1.09–1.28) in the 72 smaller ones (total 15,498 cases). Conclusions The CI for the overall result from large unpublished datasets shows lifelong moderate homocysteine elevation has little or no effect on CHD. The discrepant overall result from previously published studies reflects publication bias or methodological problems. PMID:22363213
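The pooled ORs quoted above come from standard inverse-variance meta-analysis machinery, sketched below on three illustrative (OR, 95% CI) inputs rather than the study's actual datasets.

```python
# Hedged sketch: fixed-effect inverse-variance meta-analysis of log odds
# ratios. The three (OR, lower, upper) inputs are illustrative only.
import numpy as np

or_ci = [(1.02, 0.98, 1.07), (1.01, 0.95, 1.07), (1.15, 1.09, 1.21)]
log_or = np.log([x[0] for x in or_ci])
se = (np.log([x[2] for x in or_ci]) - np.log([x[1] for x in or_ci])) / (2 * 1.96)

w = 1 / se**2                                   # inverse-variance weights
pooled = np.sum(w * log_or) / np.sum(w)
se_pooled = np.sqrt(1 / np.sum(w))
lo, hi = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
print(np.exp(pooled), np.exp(lo), np.exp(hi))   # pooled OR and 95% CI
```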
Fekete, Tibor; Rásó, Erzsébet; Pete, Imre; Tegze, Bálint; Liko, István; Munkácsy, Gyöngyi; Sipos, Norbert; Rigó, János; Györffy, Balázs
2012-07-01
Transcriptomic analysis of global gene expression in ovarian carcinoma can identify dysregulated genes capable of serving as molecular markers for histology subtypes and survival. The aim of our study was to validate previous candidate signatures in an independent setting and to identify single genes capable of serving as biomarkers for ovarian cancer progression. As several datasets are available in the GEO today, we were able to perform a true meta-analysis. First, 829 samples (11 datasets) were downloaded, and the predictive power of 16 previously published gene sets was assessed. Of these, eight were capable of discriminating histology subtypes, and none was capable of predicting survival. To overcome the differences in previous studies, we used the 829 samples to identify new predictors. Then, we collected 64 ovarian cancer samples (median relapse-free survival 24.5 months) and performed TaqMan real-time polymerase chain reaction (RT-PCR) analysis for the best 40 genes associated with histology subtypes and survival. Over 90% of subtype-associated genes were confirmed. Overall survival was effectively predicted by hormone receptors (PGR and ESR2) and by TSPAN8. Relapse-free survival was predicted by MAPT and SNCG. In summary, we successfully validated several gene sets in a meta-analysis in large datasets of ovarian samples. Additionally, several individual genes identified were validated in a clinical cohort. Copyright © 2011 UICC.
De-identification of patient notes with recurrent neural networks.
Dernoncourt, Franck; Lee, Ji Young; Uzuner, Ozlem; Szolovits, Peter
2017-05-01
Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information that needs to be removed to de-identify patient notes. Manual de-identification is impractical given the size of electronic health record databases, the limited number of researchers with access to non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value. We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and is twice as large as the i2b2 2014 dataset. Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 98.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.21. Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no manual feature engineering. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com
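As a quick consistency check, the reported F1-scores are the harmonic mean of the corresponding precision and recall:

```python
# Verifying the reported i2b2 2014 numbers: F1 = 2PR / (P + R).
p, r = 98.32, 97.38
print(round(2 * p * r / (p + r), 2))   # 97.85, matching the reported F1
```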
A wavelet method for modeling and despiking motion artifacts from resting-state fMRI time series.
Patel, Ameera X; Kundu, Prantik; Rubinov, Mikail; Jones, P Simon; Vértes, Petra E; Ersche, Karen D; Suckling, John; Bullmore, Edward T
2014-07-15
The impact of in-scanner head movement on functional magnetic resonance imaging (fMRI) signals has long been established as undesirable. These effects have been traditionally corrected by methods such as linear regression of head movement parameters. However, a number of recent independent studies have demonstrated that these techniques are insufficient to remove motion confounds, and that even small movements can spuriously bias estimates of functional connectivity. Here we propose a new data-driven, spatially-adaptive, wavelet-based method for identifying, modeling, and removing non-stationary events in fMRI time series, caused by head movement, without the need for data scrubbing. This method involves the addition of just one extra step, the Wavelet Despike, in standard pre-processing pipelines. With this method, we demonstrate robust removal of a range of different motion artifacts and motion-related biases including distance-dependent connectivity artifacts, at a group and single-subject level, using a range of previously published and new diagnostic measures. The Wavelet Despike is able to accommodate the substantial spatial and temporal heterogeneity of motion artifacts and can consequently remove a range of high- and low-frequency artifacts from fMRI time series that may be linearly or non-linearly related to physical movements. Our methods are demonstrated by the analysis of three cohorts of resting-state fMRI data, including two high-motion datasets: a previously published dataset on children (N=22) and a new dataset on adults with stimulant drug dependence (N=40). We conclude that there is a real risk of motion-related bias in connectivity analysis of fMRI data, but that this risk is generally manageable by effective time series denoising strategies designed to attenuate synchronized signal transients induced by abrupt head movements. The Wavelet Despiking software described in this article is freely available for download at www.brainwavelet.org. Copyright © 2014. Published by Elsevier Inc.
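Generic wavelet-domain despiking can be sketched as decomposing the series, suppressing outlying detail coefficients, and reconstructing; the MAD-threshold rule below is illustrative only and is not the published Wavelet Despike algorithm.

```python
# Hedged sketch of generic wavelet despiking (NOT the Wavelet Despike method):
# decompose, zero out detail coefficients that exceed a MAD-based threshold,
# reconstruct. The time series and spike are synthetic.
import numpy as np
import pywt

rng = np.random.default_rng(8)
ts = rng.normal(size=256)
ts[100:104] += 8                      # injected motion-like transient

coeffs = pywt.wavedec(ts, "db4", level=4)
for i, c in enumerate(coeffs[1:], start=1):
    mad = np.median(np.abs(c - np.median(c)))
    coeffs[i] = np.where(np.abs(c) > 6 * mad, 0.0, c)   # suppress spikes

despiked = pywt.waverec(coeffs, "db4")[: ts.size]
print(np.abs(ts).max(), np.abs(despiked).max())   # transient attenuated
```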
Moiseenko, Vitali; Wu, Jonn; Hovan, Allan; Saleh, Ziad; Apte, Aditya; Deasy, Joseph O; Harrow, Stephen; Rabuka, Carman; Muggli, Adam; Thompson, Anna
2012-03-01
The severe reduction of salivary function (xerostomia) is a common complication after radiation therapy for head-and-neck cancer. Consequently, guidelines to ensure adequate function based on parotid gland tolerance dose-volume parameters have been suggested by the QUANTEC group and by Ortholan et al. We performed a validation test of these guidelines against a prospectively collected dataset and compared the results with a previously published dataset. Whole-mouth stimulated salivary flow data from 66 head-and-neck cancer patients treated with radiotherapy at the British Columbia Cancer Agency (BCCA) were measured, and treatment planning data were abstracted. Flow measurements were collected from 50 patients at 3 months, and 60 patients at 12-month follow-up. Previously published data from a second institution, Washington University in St. Louis (WUSTL), were used for comparison. A logistic model was used to describe the incidence of Grade 4 xerostomia as a function of the mean dose of the spared parotid gland. The rate of correctly predicting the lack of xerostomia (negative predictive value [NPV]) was computed for both the QUANTEC constraints and the Ortholan et al. recommendation to constrain the total volume of both glands receiving more than 40 Gy to less than 33%. Both datasets showed a rate of xerostomia of less than 20% when the mean dose to the least-irradiated parotid gland is kept to less than 20 Gy. Logistic model parameters for the incidence of xerostomia at 12 months after therapy, based on the least-irradiated gland, were D(50) = 32.4 Gy and γ = 0.97. NPVs for the QUANTEC guideline were 94% (BCCA data) and 90% (WUSTL data). For the Ortholan et al. guideline, NPVs were 85% (BCCA) and 86% (WUSTL). These data confirm that the QUANTEC guideline effectively avoids xerostomia, and this is somewhat more effective than constraints on the volume receiving more than 40 Gy. Copyright © 2012 Elsevier Inc. All rights reserved.
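One common logistic parameterization consistent with the reported fit is NTCP(D) = 1/(1 + exp(4γ(1 − D/D50))); evaluating it with the quoted D50 and γ reproduces the sub-20% complication rate below 20 Gy. The exact functional form used in the paper may differ.

```python
# Hedged sketch of a common logistic dose-response form, evaluated with the
# fitted values quoted above (D50 = 32.4 Gy, gamma = 0.97). The paper's
# exact parameterization is an assumption here.
import numpy as np

def ntcp(dose, d50=32.4, gamma=0.97):
    return 1.0 / (1.0 + np.exp(4 * gamma * (1 - dose / d50)))

for d in (10, 20, 32.4, 40):
    print(d, round(float(ntcp(d)), 3))   # ~0.5 at D50; <0.2 below 20 Gy
```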
Shingrani, Rahul; Krenz, Gary; Molthen, Robert
2010-01-01
With advances in medical imaging scanners, it has become commonplace to generate large multidimensional datasets. These datasets require tools for a rapid, thorough analysis. To address this need, we have developed an automated algorithm for morphometric analysis incorporating A Visualization Workshop computational and image processing libraries for three-dimensional segmentation, vascular tree generation and structural hierarchical ordering with a two-stage numeric optimization procedure for estimating vessel diameters. We combine this new technique with our mathematical models of pulmonary vascular morphology to quantify structural and functional attributes of lung arterial trees. Our physiological studies require repeated measurements of vascular structure to determine differences in vessel biomechanical properties between animal models of pulmonary disease. Automation provides many advantages including significantly improved speed and minimized operator interaction and biasing. The results are validated by comparison with previously published rat pulmonary arterial micro-CT data analysis techniques, in which vessels were manually mapped and measured using intense operator intervention. Published by Elsevier Ireland Ltd.
T1-weighted in vivo human whole brain MRI dataset with an ultrahigh isotropic resolution of 250 μm.
Lüsebrink, Falk; Sciarra, Alessandro; Mattern, Hendrik; Yakupov, Renat; Speck, Oliver
2017-03-14
We present an ultrahigh resolution in vivo human brain magnetic resonance imaging (MRI) dataset. It consists of T1-weighted whole brain anatomical data acquired at 7 Tesla with a nominal isotropic resolution of 250 μm of a single young healthy Caucasian subject and was recorded using prospective motion correction. The raw data amounts to approximately 1.2 TB and was acquired in eight hours total scan time. The resolution of this dataset is far beyond any previously published in vivo structural whole brain dataset. Its potential use is to build an in vivo MR brain atlas. Methods for image reconstruction and image restoration can be improved as the raw data is made available. Pre-processing and segmentation procedures can possibly be enhanced for high magnetic field strength and ultrahigh resolution data. Furthermore, potential resolution induced changes in quantitative data analysis can be assessed, e.g., cortical thickness or volumetric measures, as high quality images with an isotropic resolution of 1 and 0.5 mm of the same subject are included in the repository as well.
2011-01-01
Background To date, nine Parkinson disease (PD) genome-wide association studies in North American, European and Asian populations have been published. The majority of studies have confirmed the association of the previously identified genetic risk factors, SNCA and MAPT, and two studies have identified three new PD susceptibility loci/genes (PARK16, BST1 and HLA-DRB5). In a recent meta-analysis of datasets from five of the published PD GWAS an additional 6 novel candidate genes (SYT11, ACMSD, STK39, MCCC1/LAMP3, GAK and CCDC62/HIP1R) were identified. Collectively the associations identified in these GWAS account for only a small proportion of the estimated total heritability of PD suggesting that an 'unknown' component of the genetic architecture of PD remains to be identified. Methods We applied a GWAS approach to a relatively homogeneous Ashkenazi Jewish (AJ) population from New York to search for both 'rare' and 'common' genetic variants that confer risk of PD by examining any SNPs with allele frequencies exceeding 2%. We have focused on a genetic isolate, the AJ population, as a discovery dataset since this cohort has a higher sharing of genetic background and historically experienced a significant bottleneck. We also conducted a replication study using two publicly available datasets from dbGaP. The joint analysis dataset had a combined sample size of 2,050 cases and 1,836 controls. Results We identified the top 57 SNPs showing the strongest evidence of association in the AJ dataset (p < 9.9 × 10⁻⁵). Six SNPs located within gene regions had positive signals in at least one other independent dbGaP dataset: LOC100505836 (Chr3p24), LOC153328/SLC25A48 (Chr5q31.1), UNC13B (9p13.3), SLCO3A1 (15q26.1), WNT3 (17q21.3) and NSF (17q21.3). We also replicated published associations for the gene regions SNCA (Chr4q21; rs3775442, p = 0.037), PARK16 (Chr1q32.1; rs823114 (NUCKS1), p = 6.12 × 10⁻⁴), BST1 (Chr4p15; rs12502586, p = 0.027), STK39 (Chr2q24.3; rs3754775, p = 0.005), and LAMP3 (Chr3; rs12493050, p = 0.005) in addition to the two most common PD susceptibility genes in the AJ population LRRK2 (Chr12q12; rs34637584, p = 1.56 × 10⁻⁴) and GBA (Chr1q21; rs2990245, p = 0.015). Conclusions We have demonstrated the utility of the AJ dataset in PD candidate gene and SNP discovery both by replication in dbGaP datasets with a larger sample size and by replicating association of previously identified PD susceptibility genes. Our GWAS study has identified candidate gene regions for PD that are implicated in neuronal signalling and the dopamine pathway. PMID:21812969
Validation of a Radiosensitivity Molecular Signature in Breast Cancer
Eschrich, Steven A.; Fulp, William J.; Pawitan, Yudi; Foekens, John A.; Smid, Marcel; Martens, John W. M.; Echevarria, Michelle; Kamath, Vidya; Lee, Ji-Hyun; Harris, Eleanor E.; Bergh, Jonas; Torres-Roca, Javier F.
2014-01-01
Purpose Previously, we developed a radiosensitivity molecular signature (RSI) that was clinically validated in three independent datasets (rectal, esophageal, head and neck) in 118 patients. Here, we test RSI in radiotherapy (RT) treated breast cancer patients. Experimental Design RSI was tested in two previously published breast cancer datasets. Patients were treated at the Karolinska University Hospital (n=159) and Erasmus Medical Center (n=344). RSI was applied as previously described. Results We tested RSI in RT-treated patients (Karolinska). Patients predicted to be radiosensitive (RS) had an improved 5-year relapse-free survival when compared with radioresistant (RR) patients (95% vs. 75%, p=0.0212) but there was no difference between RS/RR patients treated without RT (71% vs. 77%, p=0.6744), consistent with RSI being RT-specific (interaction term RSI×RT, p=0.05). Similarly, in the Erasmus dataset RT-treated RS patients had an improved 5-year distant-metastasis-free survival over RR patients (77% vs. 64%, p=0.0409) but no difference was observed in patients treated without RT (RS vs. RR, 80% vs. 81%, p=0.9425). Multivariable analysis showed RSI is the strongest variable in RT-treated patients (Karolinska, HR=5.53, p=0.0987, Erasmus, HR=1.64, p=0.0758) and in backward selection (removal alpha of 0.10) RSI was the only variable remaining in the final model. Finally, RSI is an independent predictor of outcome in RT-treated ER+ patients (Erasmus, multivariable analysis, HR=2.64, p=0.0085). Conclusions RSI is validated in two independent breast cancer datasets totaling 503 patients. Including prior data, RSI is validated in five independent cohorts (621 patients) and represents, to our knowledge, the most extensively validated molecular signature in radiation oncology. PMID:22832933
DNA methylation as a predictor of fetal alcohol spectrum disorder.
Lussier, Alexandre A; Morin, Alexander M; MacIsaac, Julia L; Salmon, Jenny; Weinberg, Joanne; Reynolds, James N; Pavlidis, Paul; Chudley, Albert E; Kobor, Michael S
2018-01-01
Fetal alcohol spectrum disorder (FASD) is a developmental disorder that manifests through a range of cognitive, adaptive, physiological, and neurobiological deficits resulting from prenatal alcohol exposure. Although the North American prevalence is currently estimated at 2-5%, FASD has proven difficult to identify in the absence of the overt physical features characteristic of fetal alcohol syndrome. As interventions may have the greatest impact at an early age, accurate biomarkers are needed to identify children at risk for FASD. Building on our previous work identifying distinct DNA methylation patterns in children and adolescents with FASD, we have attempted to validate these associations in a different clinical cohort and to use our DNA methylation signature to develop a possible epigenetic predictor of FASD. Genome-wide DNA methylation patterns were analyzed using the Illumina HumanMethylation450 array in the buccal epithelial cells of a cohort of 48 individuals aged 3.5-18 (24 FASD cases, 24 controls). The DNA methylation predictor of FASD was built using a stochastic gradient boosting model on our previously published dataset of FASD cases and controls (GSE80261). The predictor was tested on the current dataset and an independent dataset of 48 autism spectrum disorder cases and 48 controls (GSE50759). We validated findings from our previous study that identified a DNA methylation signature of FASD, replicating the altered DNA methylation levels of 161/648 CpGs in this independent cohort, which may represent a robust signature of FASD in the epigenome. We also generated a predictive model of FASD using machine learning in a subset of our previously published cohort of 179 samples (83 FASD cases, 96 controls), which was tested in this novel cohort of 48 samples and resulted in a moderately accurate predictor of FASD status. Upon testing the algorithm in an independent cohort of individuals with autism spectrum disorder, we did not detect any bias towards autism, sex, age, or ethnicity. These findings further support the association of FASD with distinct DNA methylation patterns, while providing a possible entry point towards the development of epigenetic biomarkers of FASD.
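The model family named above (stochastic gradient boosting) can be sketched with scikit-learn, where subsample < 1 gives the stochastic variant; the methylation matrix below is synthetic, so the cross-validated AUC lands near chance.

```python
# Hedged sketch of the model family described: stochastic gradient boosting
# on CpG methylation beta values. Features and labels are synthetic, so the
# cross-validated AUC here is ~0.5 by construction.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
X = rng.uniform(0, 1, size=(179, 648))   # 179 samples x 648 CpG beta values
y = rng.integers(0, 2, size=179)         # case/control labels (synthetic)

clf = GradientBoostingClassifier(subsample=0.8, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```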
Assessment of composite motif discovery methods.
Klepper, Kjetil; Sandve, Geir K; Abul, Osman; Johansen, Jostein; Drablos, Finn
2008-02-26
Computational discovery of regulatory elements is an important area of bioinformatics research and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery - discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so called composite motifs or cis-regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery. We have developed a benchmarking framework for composite motif discovery and used it to evaluate the performance of eight published module discovery tools. Benchmark datasets were constructed based on real genomic sequences containing experimentally verified regulatory modules, and the module discovery programs were asked both to predict the locations of these modules and to specify the single motifs involved. To aid the programs in their search, we provided position weight matrices corresponding to the binding motifs of the transcription factors involved. In addition, selections of decoy matrices were mixed with the genuine matrices on one dataset to test the response of programs to varying levels of noise. Although some of the methods tested tended to score somewhat better than others overall, there were still large variations between individual datasets and no single method performed consistently better than the rest in all situations. The variation in performance on individual datasets also shows that the new benchmark datasets represent a suitable variety of challenges to most methods for module discovery.
An Open-Access Modeled Passenger Flow Matrix for the Global Air Network in 2010
Huang, Zhuojie; Wu, Xiao; Garcia, Andres J.; Fik, Timothy J.; Tatem, Andrew J.
2013-01-01
The expanding global air network provides rapid and wide-reaching connections accelerating both domestic and international travel. To understand human movement patterns on the network and their socioeconomic, environmental and epidemiological implications, information on passenger flow is required. However, comprehensive data on global passenger flow remain difficult and expensive to obtain, prompting researchers to rely on scheduled flight seat capacity data or simple models of flow. This study describes the construction of an open-access modeled passenger flow matrix for all airports with a host city-population of more than 100,000 and within two transfers of air travel from various publicly available air travel datasets. Data on network characteristics, city population, and local area GDP amongst others are utilized as covariates in a spatial interaction framework to predict the air transportation flows between airports. Training datasets based on information from various transportation organizations in the United States, Canada and the European Union were assembled. A log-linear model controlling the random effects on origin, destination and the airport hierarchy was then built to predict passenger flows on the network, and compared to the results produced using previously published models. Validation analyses showed that the model presented here produced improved predictive power and accuracy compared to previously published models, yielding the highest successful prediction rate at the global scale. Based on this model, passenger flows between 1,491 airports on 644,406 unique routes were estimated in the prediction dataset. The airport node characteristics and estimated passenger flows are freely available as part of the Vector-Borne Disease Airline Importation Risk (VBD-Air) project at: www.vbd-air.com/data. PMID:23691194
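The core of the spatial interaction framework is a log-linear (Poisson) regression of flows on origin, destination, and route covariates. A minimal sketch under toy data follows; the published model additionally uses GDP, network covariates, and random effects on origin, destination, and airport hierarchy:

    # Log-linear spatial interaction sketch: predict passenger flow from
    # origin/destination populations and distance. All values are hypothetical.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 200
    pop_o = rng.uniform(1e5, 1e7, n)        # origin city population
    pop_d = rng.uniform(1e5, 1e7, n)        # destination city population
    dist = rng.uniform(100, 10000, n)       # route distance, km
    flow = rng.poisson(50, n)               # observed passengers (toy counts)

    X = sm.add_constant(np.column_stack([np.log(pop_o), np.log(pop_d), np.log(dist)]))
    model = sm.GLM(flow, X, family=sm.families.Poisson()).fit()
    predicted_flow = model.predict(X)       # fitted flows on the same routes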
Causes and Consequences of Genetic Background Effects Illuminated by Integrative Genomic Analysis
Chandler, Christopher H.; Chari, Sudarshan; Dworkin, Ian
2014-01-01
The phenotypic consequences of individual mutations are modulated by the wild-type genetic background in which they occur. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist or about the mechanisms underlying it. We also lack knowledge on how mutations interact with genetic background to influence gene expression and how this in turn mediates mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scalloped[E3] allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets from a mapping-by-introgression experiment and a tagged RNA gene expression dataset. In addition, we used whole genome resequencing of the parental lines—two commonly used laboratory strains—to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutation screen. By searching for genes showing a congruent signal across multiple datasets, we were able to identify a robust set of candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers previously reported are caused by higher-order epistasis, not quantitative noncomplementation. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well. PMID:24504186
Radiative effects of global MODIS cloud regimes
Oreopoulos, Lazaros; Cho, Nayeong; Lee, Dongmin; Kato, Seiji
2018-01-01
We update previously published MODIS global cloud regimes (CRs) using the latest MODIS cloud retrievals in the Collection 6 dataset. We implement a slightly different derivation method, investigate the composition of the regimes, and then proceed to examine several aspects of CR radiative appearance with the aid of various radiative flux datasets. Our results clearly show the CRs are radiatively distinct in terms of shortwave, longwave and their combined (total) cloud radiative effect. We show that we can clearly distinguish regimes based on whether they radiatively cool or warm the atmosphere, and thanks to radiative heating profiles to discern the vertical distribution of cooling and warming. Terra and Aqua comparisons provide information about the degree to which morning and afternoon occurrences of regimes affect the symmetry of CR radiative contribution. We examine how the radiative discrepancies among multiple irradiance datasets suffering from imperfect spatiotemporal matching depend on CR, and whether they are therefore related to the complexity of cloud structure, its interpretation by different observational systems, and its subsequent representation in radiative transfer calculations. PMID:29619289
De Hertogh, Benoît; De Meulder, Bertrand; Berger, Fabrice; Pierre, Michael; Bareke, Eric; Gaigneaux, Anthoula; Depiereux, Eric
2010-01-11
Recent reanalysis of spike-in datasets underscored the need for new and more accurate benchmark datasets for statistical microarray analysis. We present here a fresh method using biologically-relevant data to evaluate the performance of statistical methods. Our novel method ranks the probesets from a dataset composed of publicly-available biological microarray data and extracts subset matrices with precise information/noise ratios. Our method can be used to determine the capability of different methods to better estimate variance for a given number of replicates. The mean-variance and mean-fold change relationships of the matrices revealed a closer approximation of biological reality. Performance analysis refined the results from benchmarks published previously. We show that the Shrinkage t test (close to Limma) was the best of the methods tested, except when two replicates were examined, where the Regularized t test and the Window t test performed slightly better. The R scripts used for the analysis are available at http://urbm-cluster.urbm.fundp.ac.be/~bdemeulder/.
Weighted analysis of paired microarray experiments.
Kristiansson, Erik; Sjögren, Anders; Rudemo, Mats; Nerman, Olle
2005-01-01
In microarray experiments quality often varies, for example between samples and between arrays. The need for quality control is therefore strong. A statistical model and a corresponding analysis method are suggested for experiments with pairing, including designs with individuals observed before and after treatment and many experiments with two-colour spotted arrays. The model is of mixed type with some parameters estimated by an empirical Bayes method. Differences in quality are modelled by individual variances and correlations between repetitions. The method is applied to three real and several simulated datasets. Two of the real datasets are of Affymetrix type with patients profiled before and after treatment, and the third dataset is of two-colour spotted cDNA type. In all cases, the patients or arrays had different estimated variances, leading to distinctly unequal weights in the analysis. We also suggest plots that illustrate the variances and correlations affecting the weights computed by our analysis method. For simulated data the improvement relative to previously published methods without weighting is shown to be substantial.
Image Quality Ranking Method for Microscopy
Koho, Sami; Fazeli, Elnaz; Eriksson, John E.; Hänninen, Pekka E.
2016-01-01
Automated analysis of microscope images is necessitated by the increased need for high-resolution follow-up of events in time. Manually finding the right images to analyze, or to eliminate from data analysis, is a common day-to-day problem in microscopy research today, and the constantly growing size of image datasets does not help matters. We propose a simple method and a software tool for sorting images within a dataset according to their relative quality. We demonstrate the applicability of our method in finding good quality images in a STED microscope sample preparation optimization image dataset. The results are validated by comparisons to subjective opinion scores, as well as five state-of-the-art blind image quality assessment methods. We also show how our method can be applied to eliminate useless out-of-focus images in a High-Content-Screening experiment. We further evaluate the ability of our image quality ranking method to detect out-of-focus images, by extensive simulations, and by comparing its performance against previously published, well-established microscopy autofocus metrics. PMID:27364703
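As an illustration of ranking images by a single scalar score, the sketch below scores each image by the fraction of power-spectrum energy at high spatial frequencies, a crude proxy for focus; the published tool uses its own statistic, so this is only a stand-in:

    # Toy quality ranking: in-focus images retain more high-frequency power.
    import numpy as np

    def quality_score(image, cutoff=0.25):
        spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
        h, w = spectrum.shape
        yy, xx = np.ogrid[:h, :w]
        r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)  # normalized radius
        return spectrum[r > cutoff].sum() / spectrum.sum()

    rng = np.random.default_rng(2)
    images = [rng.normal(size=(64, 64)) for _ in range(10)]   # hypothetical dataset
    ranked = sorted(range(len(images)),
                    key=lambda i: quality_score(images[i]), reverse=True)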
Zhang, Mingming; Mu, Hongbo; Shang, Zhenwei; Kang, Kai; Lv, Hongchao; Duan, Lian; Li, Jin; Chen, Xinren; Teng, Yanbo; Jiang, Yongshuai; Zhang, Ruijie
2017-01-06
Parkinson's disease (PD) is the second most common neurodegenerative disease. It is generally believed to be influenced by both genetic and environmental factors, but the precise pathogenesis of PD is unknown to date. In this study, we performed a pathway analysis based on genome-wide association study (GWAS) data to detect risk pathways of PD in three GWAS datasets. We first mapped all SNP markers to autosomal genes in each GWAS dataset. Then, we evaluated gene risk values using the minimum P-value of the tagSNPs. We took a pathway as a unit to identify risk pathways based on the cumulative risks of the genes in the pathway. Finally, we combined the analysis results of the three datasets to detect the high risk pathways associated with PD. We found that five pathways were common to all three datasets. In addition, five pathways were shared between two datasets. Most of these pathways are associated with the nervous system. Five pathways had been reported to be PD-related pathways in the previous literature. Our findings also implied that there is a close association between immune response and PD. Continued investigation of these pathways will further help us explain the pathogenesis of PD. Copyright © 2016. Published by Elsevier Ltd.
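The gene- and pathway-scoring steps described above reduce to taking the minimum tagSNP P-value per gene and accumulating gene risks over each pathway. A toy sketch (all mappings and P-values are hypothetical):

    # Gene risk = min P-value of its tagSNPs; pathway risk = cumulative gene risk.
    import math

    snp_pvalues = {"rs1": 1e-5, "rs2": 0.03, "rs3": 0.2}
    snp_to_gene = {"rs1": "SNCA", "rs2": "SNCA", "rs3": "LRRK2"}
    pathways = {"dopamine_signaling": {"SNCA", "LRRK2"}}

    gene_risk = {}
    for snp, p in snp_pvalues.items():
        gene = snp_to_gene[snp]
        gene_risk[gene] = min(p, gene_risk.get(gene, 1.0))   # min P of tagSNPs

    # One simple cumulative statistic: sum of -log10(min P) over member genes.
    pathway_risk = {pw: sum(-math.log10(gene_risk[g]) for g in genes if g in gene_risk)
                    for pw, genes in pathways.items()}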
Hoffmann, Nils; Keck, Matthias; Neuweger, Heiko; Wilhelm, Mathias; Högy, Petra; Niehaus, Karsten; Stoye, Jens
2012-08-27
Modern analytical methods in biology and chemistry use separation techniques coupled to sensitive detectors, such as gas chromatography-mass spectrometry (GC-MS) and liquid chromatography-mass spectrometry (LC-MS). These hyphenated methods provide high-dimensional data. Comparing such data manually to find corresponding signals is a laborious task, as each experiment usually consists of thousands of individual scans, each containing hundreds or even thousands of distinct signals. In order to allow for successful identification of metabolites or proteins within such data, especially in the context of metabolomics and proteomics, an accurate alignment and matching of corresponding features between two or more experiments is required. Such a matching algorithm should capture fluctuations in the chromatographic system which lead to non-linear distortions on the time axis, as well as systematic changes in recorded intensities. Many different algorithms for the retention time alignment of GC-MS and LC-MS data have been proposed and published, but all of them focus either on aligning previously extracted peak features or on aligning and comparing the complete raw data containing all available features. In this paper we introduce two algorithms for retention time alignment of multiple GC-MS datasets: multiple alignment by bidirectional best hits peak assignment and cluster extension (BIPACE) and center-star multiple alignment by pairwise partitioned dynamic time warping (CeMAPP-DTW). We show how the similarity-based peak group matching method BIPACE may be used for multiple alignment calculation individually and how it can be used as a preprocessing step for the pairwise alignments performed by CeMAPP-DTW. We evaluate the algorithms individually and in combination on a previously published small GC-MS dataset studying the Leishmania parasite and on a larger GC-MS dataset studying grains of wheat (Triticum aestivum). We have shown that BIPACE achieves very high precision and recall and a very low number of false positive peak assignments on both evaluation datasets. CeMAPP-DTW finds a high number of true positives when executed on its own, but achieves even better results when BIPACE is used to constrain its search space. The source code of both algorithms is included in the OpenSource software framework Maltcms, which is available from http://maltcms.sf.net. The evaluation scripts of the present study are available from the same source.
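For readers unfamiliar with the pairwise building block, the sketch below shows the classic dynamic time warping recurrence on two one-dimensional chromatogram traces; CeMAPP-DTW partitions and constrains this computation (e.g. with BIPACE anchor peaks) rather than running it unconstrained as here:

    # Core DTW recurrence for pairwise retention-time alignment of two traces.
    import numpy as np

    def dtw_cost(a, b):
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])          # local dissimilarity
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    chrom1 = np.sin(np.linspace(0.0, 10.0, 200))          # toy chromatograms
    chrom2 = np.sin(np.linspace(0.5, 10.5, 200))          # time-shifted copy
    print(dtw_cost(chrom1, chrom2))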
RepExplore: addressing technical replicate variance in proteomics and metabolomics data analysis.
Glaab, Enrico; Schneider, Reinhard
2015-07-01
High-throughput omics datasets often contain technical replicates included to account for technical sources of noise in the measurement process. Although summarizing these replicate measurements by using robust averages may help to reduce the influence of noise on downstream data analysis, the information on the variance across the replicate measurements is lost in the averaging process and therefore typically disregarded in subsequent statistical analyses. We introduce RepExplore, a web-service dedicated to exploiting the information captured in the technical replicate variance to provide more reliable and informative differential expression and abundance statistics for omics datasets. The software builds on previously published statistical methods, which have been applied successfully to biomedical omics data but are difficult to use without prior experience in programming or scripting. RepExplore facilitates the analysis by providing fully automated data processing and interactive ranking tables, whisker plot, heat map and principal component analysis visualizations to interpret omics data and derived statistics. Freely available at http://www.repexplore.tk. Contact: enrico.glaab@uni.lu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
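The motivating idea, that the spread across technical replicates carries usable information, can be illustrated with a toy statistic that keeps per-feature replicate variance in the denominator instead of averaging it away. This illustrates the principle only, not the published method:

    # Compare two conditions per feature, weighting by technical-replicate spread.
    import numpy as np

    rng = np.random.default_rng(3)
    group_a = rng.normal(10, 1, size=(100, 3))   # 100 features x 3 technical replicates
    group_b = rng.normal(11, 1, size=(100, 3))

    mean_diff = group_b.mean(axis=1) - group_a.mean(axis=1)
    # Pool per-feature technical variances rather than discarding them by averaging.
    pooled_se = np.sqrt(group_a.var(axis=1, ddof=1) / 3 +
                        group_b.var(axis=1, ddof=1) / 3)
    stat = mean_diff / (pooled_se + 1e-9)        # large |stat| = more reliable change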
Microarray Data Mining for Potential Selenium Targets in Chemoprevention of Prostate Cancer
ZHANG, HAITAO; DONG, YAN; ZHAO, HONGJUAN; BROOKS, JAMES D.; HAWTHORN, LESLEYANN; NOWAK, NORMA; MARSHALL, JAMES R.; GAO, ALLEN C.; IP, CLEMENT
2008-01-01
Background A previous clinical trial showed that selenium supplementation significantly reduced the incidence of prostate cancer. We report here a bioinformatics approach to gain new insights into selenium molecular targets that might be relevant to prostate cancer chemoprevention. Materials and Methods We first performed a data mining analysis to identify genes which are consistently dysregulated in prostate cancer, using published datasets from gene expression profiling of clinical prostate specimens. We then devised a method to systematically analyze three selenium microarray datasets from LNCaP human prostate cancer cells, and to match the analysis to the cohort of genes implicated in prostate carcinogenesis. Moreover, we compared the selenium datasets with two datasets obtained from expression profiling of androgen-stimulated LNCaP cells. Results We found that selenium reverses the expression of genes implicated in prostate carcinogenesis. In addition, we found that selenium could counteract the effect of androgen on the expression of a subset of androgen-regulated genes. Conclusions The above information provides us with a wealth of new clues to investigate the mechanism of selenium chemoprevention of prostate cancer. Furthermore, these selenium target genes could also serve as biomarkers in future clinical trials to gauge the efficacy of selenium intervention. PMID:18548127
Junge relationships in measurement data for cyclic siloxanes in air.
MacLeod, Matthew; Kierkegaard, Amelie; Genualdi, Susie; Harner, Tom; Scheringer, Martin
2013-10-01
In 1974, Junge postulated a relationship between variability of concentrations of gases in air at remote locations and their atmospheric residence time, and this Junge relationship has subsequently been observed empirically for a range of trace gases. Here, we analyze two previously-published datasets of concentrations of cyclic volatile methyl siloxanes (cVMS) in air and find Junge relationships in both. The first dataset is a time series of concentrations of decamethylcyclopentasiloxane (D5) measured between January and June, 2009 at a rural site in southern Sweden that shows a Junge relationship in the temporal variability of the measurements. The second dataset consists of measurements of hexamethylcyclotrisiloxane (D3), octamethylcyclotetrasiloxane (D4) and D5 made simultaneously at 12 sites in the Global Atmospheric Passive Sampling (GAPS) network that shows a Junge relationship in the spatial variability of the three cVMS congeners. We use the Junge relationship for the GAPS dataset to estimate atmospheric lifetimes of dodecamethylcyclohexasiloxane (D6), 8:2-fluorotelomer alcohol and trichlorinated biphenyls that are within a factor of 3 of estimates based on degradation rate constants for reaction with hydroxyl radical determined in laboratory studies. Copyright © 2012 Elsevier Ltd. All rights reserved.
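Estimating a lifetime from a fitted Junge relationship amounts to a log-log regression on reference compounds followed by inversion of the fit. A toy sketch with made-up numbers (the study's fitted values differ):

    # Fit log10(variability) vs log10(lifetime), then invert for a new compound.
    import numpy as np

    lifetimes_days = np.array([5.0, 30.0, 120.0, 365.0])   # reference compounds (toy)
    rel_sd = np.array([0.9, 0.45, 0.18, 0.07])             # relative SD of concentrations

    slope, intercept = np.polyfit(np.log10(lifetimes_days), np.log10(rel_sd), 1)

    observed_rsd_d6 = 0.30                                 # hypothetical D6 variability
    est_lifetime = 10 ** ((np.log10(observed_rsd_d6) - intercept) / slope)
    print(f"estimated lifetime: {est_lifetime:.1f} days")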
Privacy-Preserving Data Exploration in Genome-Wide Association Studies.
Johnson, Aaron; Shmatikov, Vitaly
2013-08-01
Genome-wide association studies (GWAS) have become a popular method for analyzing sets of DNA sequences in order to discover the genetic basis of disease. Unfortunately, statistics published as the result of GWAS can be used to identify individuals participating in the study. To prevent privacy breaches, even previously published results have been removed from public databases, impeding researchers' access to the data and hindering collaborative research. Existing techniques for privacy-preserving GWAS focus on answering specific questions, such as correlations between a given pair of SNPs (DNA sequence variations). This does not fit the typical GWAS process, where the analyst may not know in advance which SNPs to consider and which statistical tests to use, how many SNPs are significant for a given dataset, etc. We present a set of practical, privacy-preserving data mining algorithms for GWAS datasets. Our framework supports exploratory data analysis, where the analyst does not know a priori how many and which SNPs to consider. We develop privacy-preserving algorithms for computing the number and location of SNPs that are significantly associated with the disease, the significance of any statistical test between a given SNP and the disease, any measure of correlation between SNPs, and the block structure of correlations. We evaluate our algorithms on real-world datasets and demonstrate that they produce significantly more accurate results than prior techniques while guaranteeing differential privacy.
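A building block for such releases is the Laplace mechanism, shown here applied to a single counting query (how many SNPs pass a significance threshold); the paper's algorithms compose more elaborate queries, so treat this only as the underlying primitive:

    # Epsilon-differentially-private count via the Laplace mechanism.
    import numpy as np

    def private_count(true_count, epsilon, sensitivity=1.0):
        # For a counting query, one participant changes the count by at most 1,
        # so Laplace(sensitivity/epsilon) noise yields epsilon-differential privacy.
        return true_count + np.random.default_rng().laplace(0.0, sensitivity / epsilon)

    print(private_count(true_count=37, epsilon=0.5))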
Improving the discoverability, accessibility, and citability of omics datasets: a case report.
Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J
2017-03-01
Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Analysis of copy number variations at 15 schizophrenia-associated loci.
Rees, Elliott; Walters, James T R; Georgieva, Lyudmila; Isles, Anthony R; Chambert, Kimberly D; Richards, Alexander L; Mahoney-Davies, Gerwyn; Legge, Sophie E; Moran, Jennifer L; McCarroll, Steven A; O'Donovan, Michael C; Owen, Michael J; Kirov, George
2014-02-01
A number of copy number variants (CNVs) have been suggested as susceptibility factors for schizophrenia. For some of these the data remain equivocal, and the frequency in individuals with schizophrenia is uncertain. We aimed to determine the contribution of CNVs at 15 schizophrenia-associated loci by (a) using a large new dataset of patients with schizophrenia (n = 6882) and controls (n = 6316), and (b) combining our results with those from previous studies. We used Illumina microarrays to analyse our data. Analyses were restricted to 520,766 probes common to all arrays used in the different datasets. We found higher rates in participants with schizophrenia than in controls for 13 of the 15 previously implicated CNVs. Six were nominally significantly associated (P<0.05) in this new dataset: deletions at 1q21.1, NRXN1, 15q11.2 and 22q11.2 and duplications at 16p11.2 and the Angelman/Prader-Willi Syndrome (AS/PWS) region. All eight AS/PWS duplications in patients were of maternal origin. When combined with published data, 11 of the 15 loci showed highly significant evidence for association with schizophrenia (P < 4.1×10^-4). We strengthen the support for the majority of the previously implicated CNVs in schizophrenia. About 2.5% of patients with schizophrenia and 0.9% of controls carry a large, detectable CNV at one of these loci. Routine CNV screening may be clinically appropriate given the high rate of known deleterious mutations in the disorder and the comorbidity associated with these heritable mutations.
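The basic locus-level case-control comparison in such analyses can be expressed as a 2×2 test on carrier counts. A toy sketch with illustrative counts (not the study's data):

    # Fisher's exact test on CNV carrier counts for one locus.
    from scipy.stats import fisher_exact

    carriers_cases, n_cases = 22, 6882          # hypothetical counts
    carriers_controls, n_controls = 4, 6316

    table = [[carriers_cases, n_cases - carriers_cases],
             [carriers_controls, n_controls - carriers_controls]]
    odds_ratio, p_value = fisher_exact(table, alternative='greater')
    print(odds_ratio, p_value)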
Meyers, Robin M.; Bryan, Jordan G.; McFarland, James M.; Weir, Barbara A.; Sizemore, Ann E.; Xu, Han; Dharia, Neekesh V.; Montgomery, Phillip G.; Cowley, Glenn S.; Pantel, Sasha; Goodale, Amy; Lee, Yenarae; Ali, Levi D.; Jiang, Guozhi; Lubonja, Rakela; Harrington, William F.; Strickland, Matthew; Wu, Ting; Hawes, Derek C.; Zhivich, Victor A.; Wyatt, Meghan R.; Kalani, Zohra; Chang, Jaime J.; Okamoto, Michael; Stegmaier, Kimberly; Golub, Todd R.; Boehm, Jesse S.; Vazquez, Francisca; Root, David E.; Hahn, William C.; Tsherniak, Aviad
2017-01-01
The CRISPR-Cas9 system has revolutionized gene editing both on single genes and in multiplexed loss-of-function screens, enabling precise genome-scale identification of genes essential to proliferation and survival of cancer cells [1,2]. However, previous studies reported that a gene-independent anti-proliferative effect of Cas9-mediated DNA cleavage confounds such measurement of genetic dependency, leading to false positive results in copy number amplified regions [3,4]. We developed CERES, a computational method to estimate gene dependency levels from CRISPR-Cas9 essentiality screens while accounting for the copy-number-specific effect. As part of our efforts to define a cancer dependency map, we performed genome-scale CRISPR-Cas9 essentiality screens across 342 cancer cell lines and applied CERES to this dataset. We found that CERES reduced false positive results and estimated sgRNA activity for both this dataset and previously published screens performed with different sgRNA libraries. Here, we demonstrate the utility of this collection of screens, upon CERES correction, in revealing cancer-type-specific vulnerabilities. PMID:29083409
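The confounder being corrected can be pictured as a copy-number-dependent trend in guide depletion scores. The sketch below removes a linear copy-number trend by regression on simulated data; CERES itself fits a richer joint model of gene effects, guide activity, and copy number, so this is only the intuition:

    # Regress depletion scores on copy number; keep the residual as dependency.
    import numpy as np

    rng = np.random.default_rng(4)
    copy_number = rng.uniform(1, 8, 500)                 # per-locus copy number (toy)
    true_effect = rng.normal(0, 1, 500)                  # gene-level dependency
    score = true_effect - 0.3 * copy_number + rng.normal(0, 0.1, 500)

    slope, intercept = np.polyfit(copy_number, score, 1)
    corrected = score - (slope * copy_number + intercept)   # residual dependency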
Zhang, Mingjing; Wen, Ming; Zhang, Zhi-Min; Lu, Hongmei; Liang, Yizeng; Zhan, Dejian
2015-03-01
Retention time shift is one of the most challenging problems in the preprocessing of massive chromatographic datasets. Here, an improved version of the moving window fast Fourier transform cross-correlation algorithm is presented to perform nonlinear and robust alignment of chromatograms by analyzing the shifts matrix generated by the moving window procedure. The shifts matrix in retention time can be estimated by fast Fourier transform cross-correlation with a moving window procedure. The refined shift of each scan point can be obtained by calculating the mode of the corresponding column of the shifts matrix. This version is simple, but more effective and robust than the previously published moving window fast Fourier transform cross-correlation method. It can handle nonlinear retention time shifts robustly if a proper window size is selected. The window size is the only parameter that needs to be adjusted and optimized. The properties of the proposed method are investigated by comparison with the previous moving window fast Fourier transform cross-correlation and recursive alignment by fast Fourier transform using chromatographic datasets. The pattern recognition results of a gas chromatography mass spectrometry dataset of metabolic syndrome can be improved significantly after preprocessing by this method. Furthermore, the proposed method is available as an open source package at https://github.com/zmzhang/MWFFT2. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
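A minimal sketch of the two steps described above, per-window shift estimation by FFT cross-correlation and a per-scan mode over the resulting shifts matrix, on toy signals (the released MWFFT2 package is the authoritative implementation):

    # Build the shifts matrix with a moving window, then take per-column modes.
    import numpy as np

    def window_shift(ref_seg, tgt_seg):
        # Circular cross-correlation via FFT; argmax gives the lag estimate.
        corr = np.fft.ifft(np.fft.fft(ref_seg) * np.conj(np.fft.fft(tgt_seg))).real
        lag = int(np.argmax(corr))
        return lag if lag <= len(ref_seg) // 2 else lag - len(ref_seg)

    rng = np.random.default_rng(5)
    ref = rng.normal(size=1000)                 # toy chromatogram
    target = np.roll(ref, 7)                    # shifted copy

    window, step = 64, 8
    rows = []
    for start in range(0, len(ref) - window, step):
        s = window_shift(ref[start:start + window], target[start:start + window])
        row = np.full(len(ref), np.nan)         # NaN outside this window
        row[start:start + window] = s
        rows.append(row)
    shifts = np.vstack(rows)                    # the shifts matrix

    def column_mode(col):
        vals = col[~np.isnan(col)]
        if vals.size == 0:
            return np.nan                       # scans not covered by any window
        uniq, counts = np.unique(vals, return_counts=True)
        return uniq[np.argmax(counts)]

    refined = np.array([column_mode(shifts[:, j]) for j in range(shifts.shape[1])])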
NASA Astrophysics Data System (ADS)
Stolper, Daniel A.; Eiler, John M.; Higgins, John A.
2018-04-01
The measurement of multiply isotopically substituted ('clumped isotope') carbonate groups provides a way to reconstruct past mineral formation temperatures. However, dissolution-reprecipitation (i.e., recrystallization) reactions, which commonly occur during sedimentary burial, can alter a sample's clumped-isotope composition such that it partially or wholly reflects deeper burial temperatures. Here we derive a quantitative model of diagenesis to explore how diagenesis alters carbonate clumped-isotope values. We apply the model to a new dataset from deep-sea sediments taken from Ocean Drilling Program site 807 in the equatorial Pacific. This dataset is used to ground-truth the model. We demonstrate that the use of the model with accompanying carbonate clumped-isotope and carbonate δ18O values provides new constraints on both the diagenetic history of deep-sea settings and past equatorial sea-surface temperatures. Specifically, the combination of the diagenetic model and data supports previous work indicating that equatorial sea-surface temperatures were warmer in the Paleogene than today. We then explore whether the model is applicable to shallow-water settings commonly preserved in the rock record. Using a previously published dataset from the Bahamas, we demonstrate that the model captures the main trends of the data as a function of burial depth and thus appears applicable to a range of depositional settings.
Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000
NASA Astrophysics Data System (ADS)
Reba, Meredith; Reitsma, Femke; Seto, Karen C.
2016-06-01
How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski. The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point. The dataset provides the first spatially explicit archive of the location and size of urban populations over the last 6,000 years and can contribute to an improved understanding of contemporary and historical urbanization trends.
Enrichment of Data Publications in Earth Sciences - Data Reports as a Missing Link
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Bertelmann, Roland; Haberland, Christian; Evans, Peter L.
2015-04-01
During the past decade, the relevance of research data stewardship has risen significantly. Preservation and publication of scientific data for long-term use, including storage in adequate repositories, has been identified as a key issue by the scientific community as well as by bodies such as research agencies. Essential for any kind of re-use is a proper description of the datasets. As a result of this increasing interest, data repositories have been developed, and the research data they include are accompanied by at least a minimum set of metadata. These metadata are useful for data discovery and give a first insight into the content of a dataset, but data re-use often needs more extensive information. Many datasets are accompanied by a small 'readme' file with basic information on the data structure, or by other accompanying documents. A source of additional information could be an article published in one of the newly emerging data journals (e.g. Copernicus's ESSD Earth System Science Data or Nature's Scientific Data). There is clearly an information gap between a 'readme' file that is only accessible after data download (which often leads to less usage of published datasets than if the information were available beforehand) and the much larger effort of preparing an article for a peer-reviewed data journal. For many years, the GFZ German Research Centre for Geosciences has published 'Scientific Technical Reports (STR)', a report series that is persistently available electronically and citable with assigned DOIs. This series was opened for the description of datasets published in parallel, as 'STR Data'. These are internally reviewed and offer a flexible publication format describing published data in depth, suitable for different datasets ranging from long-term monitoring time series of observatories to field data, (meta-)databases, and software publications. STR Data reports offer a full and consistent overview and description of all relevant parameters of a linked published dataset. These reports are readable and citable on their own, but are, of course, closely connected to the respective datasets. They therefore give full insight into the framework of the data before data download. This is especially relevant for large and often heterogeneous datasets, such as controlled-source seismic data gathered with instruments of the 'Geophysical Instrument Pool Potsdam (GIPP)'. Here, details of the instrumentation, data organization, data format, accuracy, geographical coordinates, timing and data completeness, etc. need to be documented. STR Data reports are also attractive for the publication of historic datasets, e.g. seismic experiments from 30-40 years ago. It is also possible for one STR Data report to describe several datasets, e.g. from multiple diverse instrument types or distinct regions of interest. The publication of DOI-assigned data reports is a helpful tool to fill the gap between basic metadata and restricted 'readme' information on the one hand and extended journal articles on the other. They open the way for informed re-use and, with their comprehensive data description, may act as an 'appetizer' for the re-use of published datasets.
A wavelet method for modeling and despiking motion artifacts from resting-state fMRI time series
Patel, Ameera X.; Kundu, Prantik; Rubinov, Mikail; Jones, P. Simon; Vértes, Petra E.; Ersche, Karen D.; Suckling, John; Bullmore, Edward T.
2014-01-01
The impact of in-scanner head movement on functional magnetic resonance imaging (fMRI) signals has long been established as undesirable. These effects have been traditionally corrected by methods such as linear regression of head movement parameters. However, a number of recent independent studies have demonstrated that these techniques are insufficient to remove motion confounds, and that even small movements can spuriously bias estimates of functional connectivity. Here we propose a new data-driven, spatially-adaptive, wavelet-based method for identifying, modeling, and removing non-stationary events in fMRI time series, caused by head movement, without the need for data scrubbing. This method involves the addition of just one extra step, the Wavelet Despike, in standard pre-processing pipelines. With this method, we demonstrate robust removal of a range of different motion artifacts and motion-related biases including distance-dependent connectivity artifacts, at a group and single-subject level, using a range of previously published and new diagnostic measures. The Wavelet Despike is able to accommodate the substantial spatial and temporal heterogeneity of motion artifacts and can consequently remove a range of high and low frequency artifacts from fMRI time series, that may be linearly or non-linearly related to physical movements. Our methods are demonstrated by the analysis of three cohorts of resting-state fMRI data, including two high-motion datasets: a previously published dataset on children (N = 22) and a new dataset on adults with stimulant drug dependence (N = 40). We conclude that there is a real risk of motion-related bias in connectivity analysis of fMRI data, but that this risk is generally manageable, by effective time series denoising strategies designed to attenuate synchronized signal transients induced by abrupt head movements. The Wavelet Despiking software described in this article is freely available for download at www.brainwavelet.org. PMID:24657353
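In outline, despiking of this kind decomposes each time series into wavelet coefficients, flags outlying coefficients produced by abrupt transients, and reconstructs the series without them. A minimal one-voxel sketch using the PyWavelets package; the published Wavelet Despike is spatially adaptive and chains coefficients across scales, so this is only the skeleton:

    # Decompose, zero outlying detail coefficients, reconstruct.
    import numpy as np
    import pywt

    rng = np.random.default_rng(6)
    ts = rng.normal(size=256)            # toy voxel time series
    ts[100:104] += 8.0                   # simulated motion transient

    coeffs = pywt.wavedec(ts, 'db4', level=4)
    cleaned = [coeffs[0]]                # keep approximation coefficients
    for c in coeffs[1:]:
        sigma = np.median(np.abs(c)) / 0.6745      # robust scale estimate
        c = c.copy()
        c[np.abs(c) > 4.0 * sigma] = 0.0           # drop spike-driven outliers
        cleaned.append(c)

    despiked = pywt.waverec(cleaned, 'db4')[:len(ts)]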
Enrichment in 13C of atmospheric CH4 during the Younger Dryas termination
NASA Astrophysics Data System (ADS)
Melton, J. R.; Schaefer, H.; Whiticar, M. J.
2012-07-01
The abrupt warming across the Younger Dryas termination (~11 600 yr before present) was marked by a large increase in the global atmospheric methane mixing ratio. The debate over sources responsible for the rise in methane centers on the roles of global wetlands, marine gas hydrates, and thermokarst lakes. We present a new, higher-precision methane stable carbon isotope ratio (δ13CH4) dataset from ice sampled at Påkitsoq, Greenland that shows distinct 13C-enrichment associated with this rise. We investigate the validity of this finding in the face of the known anomalous methane concentrations that occur at Påkitsoq. Comparison with previously published datasets to determine the robustness of our results indicates a similar trend in ice from both an Antarctic ice core and previously published Påkitsoq data measured using four different extraction and analytical techniques. The δ13CH4 trend suggests that 13C-enriched CH4 sources played an important role in the concentration increase. In a first attempt at quantifying the various contributions from our data, we apply a methane triple mass balance of stable carbon and hydrogen isotope ratios and radiocarbon. The mass balance results suggest biomass burning (42-66% of the total methane flux increase) and thermokarst lakes (27-59%) as the dominant contributing sources. Given the high uncertainty and low temporal resolution of the 14CH4 dataset used in the triple mass balance, we also performed a mass balance test using just δ13C and δD. These results further support biomass burning as a dominant source, but do not allow thermokarst lake contributions to be distinguished from boreal wetlands, aerobic plant methane, or termites. Our results in both mass balance tests do not suggest as large a role for tropical wetlands or marine gas hydrates as commonly proposed.
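A two-tracer version of such a mass balance is a small linear system: assumed source signatures weight unknown fractions that must reproduce the measured mixture and sum to one. The signatures and mixture below are illustrative placeholders, not the study's values:

    # Solve f_burning, f_thermokarst, f_wetlands from d13C, dD and closure (sum = 1).
    import numpy as np

    # Columns: biomass burning, thermokarst lakes, boreal wetlands (toy signatures)
    d13C = np.array([-24.0, -58.0, -60.0])
    dD = np.array([-210.0, -330.0, -320.0])

    A = np.vstack([d13C, dD, np.ones(3)])
    b = np.array([-50.0, -300.0, 1.0])   # measured mixture values + closure row

    fractions = np.linalg.solve(A, b)
    print(dict(zip(["burning", "thermokarst", "wetlands"], fractions.round(2))))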
Moiseenko, Vitali; Wu, Jonn; Hovan, Allan; Saleh, Ziad; Apte, Aditya; Deasy, Joseph O.; Harrow, Stephen; Rabuka, Carman; Muggli, Adam; Thompson, Anna
2011-01-01
Purpose The severe reduction of salivary function (xerostomia) is a common complication following radiation therapy for head and neck cancer. Consequently, guidelines to ensure adequate function based on parotid gland tolerance dose-volume parameters have been suggested by the QUANTEC group (1) and by Ortholan et al. (2). We performed a validation test of these guidelines against a prospectively collected dataset and compared the results to a previously published dataset. Method and Materials Whole-mouth stimulated salivary flow data from 66 head and neck cancer patients treated with radiotherapy at the British Columbia Cancer Agency (BCCA) were measured, and treatment planning data were abstracted. Flow measurements were collected from 50 patients at 3 months and from 60 patients at 12-month follow-up. Previously published data from a second institution (WUSTL) were used for comparison. A logistic model was used to describe the incidence of grade 4 xerostomia as a function of the mean dose of the spared parotid gland. The rate of correctly predicting the lack of xerostomia (negative predictive value, NPV) was computed for both the QUANTEC constraints and the Ortholan et al. (2) recommendation to constrain the total volume of both glands receiving more than 40 Gy to less than 33%. Results Both datasets showed a rate of xerostomia < 20% when the mean dose to the least-irradiated parotid gland is kept below 20 Gy. Logistic model parameters for the incidence of xerostomia at 12 months after therapy, based on the least-irradiated gland, were D50=32.4 Gy and γ=0.97. NPVs for the QUANTEC guideline were 94% (BCCA data) and 90% (WUSTL data). For the Ortholan et al. (2) guideline, NPVs were 85% (BCCA) and 86% (WUSTL). Conclusion This confirms that the QUANTEC guideline effectively avoids xerostomia, and it is somewhat more effective than constraints on the volume receiving more than 40 Gy. PMID:21640505
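For concreteness, one common logistic parameterization of such a dose-response curve, evaluated with the fitted values quoted above (D50 = 32.4 Gy, γ = 0.97). The paper's exact functional form is not given here, so the parameterization is an assumption, but it reproduces the quoted behaviour, e.g. roughly 13% incidence at a 20 Gy mean dose:

    # Logistic dose-response: incidence of xerostomia vs mean dose to the
    # least-irradiated (spared) parotid gland. Functional form is assumed.
    def xerostomia_risk(mean_dose_gy, d50=32.4, gamma=0.97):
        return 1.0 / (1.0 + (d50 / mean_dose_gy) ** (4.0 * gamma))

    for dose in (10.0, 20.0, 32.4, 45.0):
        print(f"{dose:5.1f} Gy -> {xerostomia_risk(dose):.2f}")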
Satellite-derived pan-Arctic melt onset dataset, 2000-2009
NASA Astrophysics Data System (ADS)
Wang, L.; Derksen, C.; Howell, S.; Wolken, G. J.; Sharp, M. J.; Markus, T.
2009-12-01
The SeaWinds Scatterometer on QuikSCAT (QS) has been in orbit for over a decade since its launch in June 1999. Due to its high sensitivity to the appearance of liquid water in snow and day/night all weather capability, QS data have been successfully used to detect melt onset and melt duration for various elements of the cryosphere. These melt datasets are especially useful in the polar regions where the application of imagery from optical sensors is hindered by polar nights and frequent cloud cover. In this study, we generate a pan-Arctic, pan-cryosphere melt onset dataset by combining estimates from previously published algorithms optimized for individual cryospheric elements and applied to QS and Special Sensor Microwave Imager (SSM/I) data for the northern high latitude land surface, ice caps, large lakes, and sea ice. Comparisons of melt onset along the boundaries between different components of the cryosphere show that in general the integrated dataset provides consistent and spatially coherent melt onset estimates across the pan-Arctic. We present the climatology and the anomaly patterns in melt onset during 2000-2009, and identify synoptic-scale linkages between atmospheric conditions and the observed patterns. We also investigate the possible trends in melt onset in the pan-Arctic during the 10-year period.
Diagnostics for generalized linear hierarchical models in network meta-analysis.
Zhao, Hong; Hodges, James S; Carlin, Bradley P
2017-09-01
Network meta-analysis (NMA) combines direct and indirect evidence comparing more than 2 treatments. Inconsistency arises when these 2 information sources differ. Previous work focuses on inconsistency detection, but little has been done on how to proceed after identifying inconsistency. The key issue is whether inconsistency changes an NMA's substantive conclusions. In this paper, we examine such discrepancies from a diagnostic point of view. Our methods seek to detect influential and outlying observations in NMA at a trial-by-arm level. These observations may have a large effect on the parameter estimates in NMA, or they may deviate markedly from other observations. We develop formal diagnostics for a Bayesian hierarchical model to check the effect of deleting any observation. Diagnostics are specified for generalized linear hierarchical NMA models and investigated for both published and simulated datasets. Results from our example dataset using either contrast- or arm-based models and from the simulated datasets indicate that the sources of inconsistency in NMA tend not to be influential, though results from the example dataset suggest that they are likely to be outliers. This mimics a familiar result from linear model theory, in which outliers with low leverage are not influential. Future extensions include incorporating baseline covariates and individual-level patient data. Copyright © 2017 John Wiley & Sons, Ltd.
Fast randomization of large genomic datasets while preserving alteration counts.
Gobbi, Andrea; Iorio, Francesco; Dawson, Kevin J; Wedge, David C; Tamborero, David; Alexandrov, Ludmil B; Lopez-Bigas, Nuria; Garnett, Mathew J; Jurman, Giuseppe; Saez-Rodriguez, Julio
2014-09-01
Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a 'mutually exclusive' manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive. We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in time requirement, with respect to existing implementations/bounds and equivalent P-value computations. Thus, we propose BiRewire to study statistical properties in genomic datasets, and other data that can be modeled as bipartite networks. BiRewire is available on BioConductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
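The switching-step itself is simple: each accepted step swaps the endpoints of two randomly chosen patient-gene edges while preserving every row and column sum of the mutation matrix. A compact sketch (not the optimized BiRewire implementation, which is in R/C):

    # Degree-preserving randomization of a patient-gene mutation network.
    import random

    def switching_algorithm(edges, n_steps, seed=0):
        rng = random.Random(seed)
        edges = list(edges)
        edge_set = set(edges)
        for _ in range(n_steps):
            i, j = rng.sample(range(len(edges)), 2)
            (p1, g1), (p2, g2) = edges[i], edges[j]
            if p1 == p2 or g1 == g2:
                continue                          # swap would be a no-op
            if (p1, g2) in edge_set or (p2, g1) in edge_set:
                continue                          # swap would duplicate an edge
            edge_set -= {(p1, g1), (p2, g2)}
            edge_set |= {(p1, g2), (p2, g1)}
            edges[i], edges[j] = (p1, g2), (p2, g1)
        return edges

    mutations = [("patientA", "TP53"), ("patientA", "KRAS"),
                 ("patientB", "TP53"), ("patientC", "EGFR")]
    print(switching_algorithm(mutations, n_steps=100))

Every patient keeps its mutation count and every gene keeps its alteration count, which is exactly the null model the abstract describes.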
An X-Ray Investigation of the NGC346 Field in the SMC (3): XMM-Newton Data
NASA Technical Reports Server (NTRS)
Naze, Yael; Manfroid, Jean; Corcoran, Michael F.; Stevens, Ian R.
2004-01-01
We present new XMM-Newton results on the field around the NGC346 star cluster in the SMC. This continues and extends previously published work on Chandra observations of the same field. The two XMM-Newton observations were obtained, respectively, six months before and six months after the previously published Chandra data. Of the 51 X-ray sources detected with XMM-Newton, 29 were already detected with Chandra. Comparing the properties of these X-ray sources in each of our three datasets has enabled us to investigate their variability on time scales of a year. Changes in the flux levels and/or spectral properties were observed for 21 of these sources. In addition, we discovered long-term variations in the X-ray properties of the peculiar system HD5980, a luminous blue variable star that is likely a colliding-wind binary system, which displayed its largest luminosity during the first XMM-Newton observation.
Brokering technologies to realize the hydrology scenario in NSF BCube
NASA Astrophysics Data System (ADS)
Boldrini, Enrico; Easton, Zachary; Fuka, Daniel; Pearlman, Jay; Nativi, Stefano
2015-04-01
In the National Science Foundation (NSF) BCube project, an international team composed of cyberinfrastructure experts, geoscientists, social scientists and educators is working together to explore the use of brokering technologies, initially focusing on four domains: hydrology, oceans, polar, and weather. In the hydrology domain, environmental models are fundamental to understanding the behaviour of hydrological systems. A specific model usually requires datasets coming from different disciplines for its initialization (e.g. elevation models from Earth observation, weather data from atmospheric sciences, etc.). Scientific datasets are usually available on heterogeneous publishing services, such as inventory and access services (e.g. OGC Web Coverage Service, THREDDS Data Server, etc.). Indeed, datasets are published according to different protocols, and they usually come in different formats, resolutions, and Coordinate Reference Systems (CRSs): in short, different grid environments depending on the original data and the publishing service processing capabilities. Scientists can thus be impeded by the burden of discovering, accessing, and normalizing the desired datasets to the grid environment required by the model. These technological tasks of course divert scientists from their main, scientific goals. The GI-axe brokering framework has been tested in a hydrology scenario where scientists needed to compare a particular hydrological model with two different input datasets (digital elevation models): - the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) dataset, v.2. - the Shuttle Radar Topography Mission (SRTM) dataset, v.3. These datasets were published by means of Hyrax Server technology, which can provide NetCDF files at their original resolution and CRS. Scientists had their model running on ArcGIS, so the main goal was to import the datasets using the available ArcPy library, with EPSG:4326 as the reference system and a common resolution grid, so that model outputs could be compared. ArcPy, however, is only able to access GeoTiff datasets published by an OGC Web Coverage Service (WCS). The GI-axe broker was then deployed between the client application and the data providers. It was configured to broker the two different Hyrax service endpoints and republish the data content through a WCS interface for use by the ArcPy library. Finally, scientists were able to easily run the model and to concentrate on comparing the different results obtained according to the selected input dataset. The use of a third-party broker to perform such technological tasks has also shown the potential advantage of increasing the repeatability of a study among different researchers.
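On the client side, data exposed through such a brokered WCS interface can be pulled with an ordinary coverage request. A sketch using the owslib package; the endpoint URL, coverage identifier, bounding box, and resolution are hypothetical placeholders:

    # Fetch a GeoTiff subset from a (hypothetical) brokered WCS 1.0.0 endpoint.
    from owslib.wcs import WebCoverageService

    wcs = WebCoverageService('http://example.org/broker/wcs', version='1.0.0')
    response = wcs.getCoverage(identifier='SRTM_v3',
                               bbox=(11.0, 46.0, 12.0, 47.0),  # lon/lat degrees
                               crs='EPSG:4326',
                               format='GeoTIFF',
                               resx=0.001, resy=0.001)
    with open('srtm_subset.tif', 'wb') as f:
        f.write(response.read())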
CrossCheck: an open-source web tool for high-throughput screen data analysis.
Najafov, Jamil; Najafov, Ayaz
2017-07-19
Modern high-throughput screening methods allow researchers to generate large datasets that potentially contain important biological information. However, oftentimes, picking relevant hits from such screens and generating testable hypotheses requires training in bioinformatics and the skills to efficiently perform database mining. There are currently no tools available to the general public that allow users to cross-reference their screen datasets with published screen datasets. To this end, we developed CrossCheck, an online platform for high-throughput screen data analysis. CrossCheck is a centralized database that allows effortless comparison of the user-entered list of gene symbols with 16,231 published datasets. These datasets include published data from genome-wide RNAi and CRISPR screens, interactome proteomics and phosphoproteomics screens, cancer mutation databases, low-throughput studies of major cell signaling mediators, such as kinases, E3 ubiquitin ligases and phosphatases, and gene ontological information. Moreover, CrossCheck includes a novel database of predicted protein kinase substrates, which was developed using proteome-wide consensus motif searches. CrossCheck dramatically simplifies high-throughput screen data analysis and enables researchers to dig deep into the published literature and streamline data-driven hypothesis generation. CrossCheck is freely accessible as a web-based application at http://proteinguru.com/crosscheck.
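The core operation CrossCheck automates can be pictured as a set intersection between a user's hit list and each published dataset. The sketch below only illustrates that idea, with made-up dataset names and gene lists; it is not CrossCheck's implementation.

```python
# Illustrative cross-referencing of a user gene list against published hit lists.
published = {
    "rnai_screen_A":   {"TP53", "RIPK1", "CASP8", "MTOR"},   # hypothetical hits
    "crispr_screen_B": {"RIPK1", "TBK1", "EGFR"},
}
user_hits = {"RIPK1", "CASP8", "AKT1"}

for name, genes in published.items():
    overlap = sorted(user_hits & genes)
    print(f"{name}: {len(overlap)} shared hit(s): {', '.join(overlap) or '-'}")
```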
White blood cells identification system based on convolutional deep neural learning networks.
Shahin, A I; Guo, Yanhui; Amin, K M; Sharawi, Amr A
2017-11-16
White blood cell (WBC) differential counting yields valuable information about human health and disease. Currently available automated cell morphology equipment performs differential counts based on blood smear image analysis. Previous identification systems for WBCs consist of successive dependent stages: pre-processing, segmentation, feature extraction, feature selection, and classification. There is a real need to employ deep learning methodologies so that the performance of such systems can be increased. Classifying small, limited datasets with deep learning systems is a major challenge and should be investigated. In this paper, we propose a novel identification system for WBCs based on deep convolutional neural networks. Two methodologies based on transfer learning are followed: transfer learning based on deep activation features, and fine-tuning of existing deep networks. Deep activation features are extracted from several pre-trained networks and employed in a traditional identification system. Moreover, a novel end-to-end convolutional deep architecture called "WBCsNet" is proposed and built from scratch. Finally, classification of a limited, balanced WBC dataset is performed using WBCsNet as a pre-trained network. In our experiments, three different public WBC datasets (2551 images) covering five healthy WBC types were used. The overall accuracy achieved by the proposed WBCsNet is 96.1%, which exceeds that of the different transfer learning approaches and of the previous traditional identification system. We also present visualizations of WBCsNet activations, which show a stronger response than those of the pre-trained networks. In summary, a novel WBC identification system based on deep learning is proposed, and the high-performance WBCsNet can be employed as a pre-trained network. Copyright © 2017. Published by Elsevier B.V.
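The first transfer-learning strategy, deep activation features feeding a traditional classifier, can be sketched as follows. This is a generic illustration under stated assumptions (a VGG16 backbone from torchvision >= 0.13 and an SVM classifier), not the paper's WBCsNet architecture or its exact pipeline.

```python
# Deep activation features + classical classifier: a hedged sketch.
import torch
import torchvision.models as models
from sklearn.svm import SVC

backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
backbone.classifier = backbone.classifier[:-1]  # drop last layer -> 4096-d output
backbone.eval()                                 # inference mode (disables dropout)

def activation_features(batch):
    """batch: float tensor of shape (N, 3, 224, 224), normalized WBC images."""
    with torch.no_grad():
        return backbone(batch).numpy()

# With images X (tensor) and labels y (array), the traditional classifier is
# trained on extracted activations rather than raw pixels:
# clf = SVC().fit(activation_features(X), y)
```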
Lindgren, Annie R; Anderson, Frank E
2018-01-01
Historically, deep-level relationships within the molluscan class Cephalopoda (squids, cuttlefishes, octopods and their relatives) have remained elusive due in part to the considerable morphological diversity of extant taxa, a limited fossil record for species that lack a calcareous shell, and difficulties in sampling open-ocean taxa. Many conflicts identified by morphologists in the early 1900s remain unresolved today in spite of advances in morphological, molecular and analytical methods. In this study we assess the utility of transcriptome data for resolving cephalopod phylogeny, with special focus on the orders of Decapodiformes (open-eye squids, bobtail squids, cuttlefishes and relatives). To do so, we took new and previously published transcriptome data and used a unique cephalopod core ortholog set to generate a dataset that was subjected to an array of filtering and analytical methods to assess the impacts of taxon sampling, ortholog number, compositional and rate heterogeneity, and incongruence across loci. Analyses indicated that datasets that maximized taxonomic coverage but included fewer orthologs were less stable than datasets that sacrificed taxon sampling to increase the number of orthologs. Clades recovered irrespective of dataset, filtering or analytical method included Octopodiformes (Vampyroteuthis infernalis + octopods), Decapodiformes (squids, cuttlefishes and their relatives), and orders Oegopsida (open-eyed squids) and Myopsida (e.g., loliginid squids). Ordinal-level relationships within Decapodiformes were the most susceptible to dataset perturbation, further emphasizing the challenges associated with uncovering relationships at deep nodes in the cephalopod tree of life. Copyright © 2017 Elsevier Inc. All rights reserved.
DOIs for Data: Progress in Data Citation and Publication in the Geosciences
NASA Astrophysics Data System (ADS)
Callaghan, S.; Murphy, F.; Tedds, J.; Allan, R.
2012-12-01
Identifiers for data are the bedrock on which data citation and publication rests. These, in their turn, are widely proposed as methods for encouraging researchers to share their datasets, and at the same time receive academic credit for their efforts in producing them. However, neither data citation nor publication can be properly achieved without a method of identifying clearly what is, and what isn't, part of the dataset. Once a dataset becomes part of the scientific record (either through formal data publication or through being cited) then issues such as dataset stability and permanence become vital to address. In the geosciences, several projects in the UK are concentrating on issues of dataset identification, citation and publication. The UK's Natural Environment Research Council's (NERC) Science Information Strategy data citation and publication project is addressing the issue of identifiers for data, stability, transparency, and credit for data producers through data citation. At a data publication level, 2012 has seen the launch of the new Wiley title Geoscience Data Journal and the PREPARDE (Peer Review for Publication & Accreditation of Research Data in the Earth sciences) project, both aiming to encourage data publication by addressing issues such as data paper submission workflows and the scientific peer-review of data. All of these initiatives work with a range of partners including academic institutions, learned societies, data centers and commercial publishers, both nationally and internationally, with a cross-project aim of developing the mechanisms so data can be identified, cited and published with confidence. This involves investigating barriers and drivers to data publishing and sharing, peer review, and re-use of geoscientific datasets, and specifically such topics as dataset requirements for citation, workflows for dataset ingestion into data centers and publishers, procedures and policies for editors, reviewers and authors of data publication, and assessing the trustworthiness of data archives. A key goal is to ensure that these projects reach out to, and are informed by, other related initiatives on a global basis, in particular anyone interested in developing long-term sustainable policies, processes, incentives and business models for managing and publishing research data. This presentation will give an overview of progress in the projects mentioned above, specifically focussing on the use of DOIs for datasets hosted in the NERC environmental data centers, and how DOIs are enabling formal data citation and publication in the geosciences.
Assessing the reproducibility of discriminant function analyses
Andrew, Rose L.; Albert, Arianne Y.K.; Renaut, Sebastien; Rennison, Diana J.; Bock, Dan G.
2015-01-01
Data are the foundation of empirical research, yet all too often the datasets underlying published papers are unavailable, incorrect, or poorly curated. This is a serious issue, because future researchers are then unable to validate published results or reuse data to explore new ideas and hypotheses. Even if data files are securely stored and accessible, they must also be accompanied by accurate labels and identifiers. To assess how often problems with metadata or data curation affect the reproducibility of published results, we attempted to reproduce Discriminant Function Analyses (DFAs) from the field of organismal biology. DFA is a commonly used statistical analysis that has changed little since its inception almost eight decades ago, and therefore provides an opportunity to test reproducibility among datasets of varying ages. Of the 100 papers we initially surveyed, fourteen were excluded because they did not present the common types of quantitative result from their DFA or gave insufficient details of their DFA. Of the remaining 86 datasets, there were 15 cases for which we were unable to confidently relate the dataset we received to the one used in the published analysis. The reasons included incomprehensible or absent variable labels, the DFA having been performed on an unspecified subset of the data, and the dataset we received being incomplete. We focused on reproducing three common summary statistics from DFAs: the percent variance explained, the percentage correctly assigned, and the largest discriminant function coefficient. The reproducibility of the first two was fairly high (20 of 26, and 44 of 60 datasets, respectively), whereas our success rate with the discriminant function coefficients was lower (15 of 26 datasets). When considering all three summary statistics, we were able to completely reproduce 46 (65%) of 71 datasets. While our results show that a majority of studies are reproducible, they highlight the fact that many studies still do not meet the standard of carefully curated research that the scientific community and the public expect. PMID:26290793
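For readers attempting similar checks, the three summary statistics can be recomputed with scikit-learn's linear discriminant analysis. The snippet below uses the iris data purely as a stand-in for a published dataset; it is a sketch, not the authors' analysis scripts.

```python
# Recomputing DFA summary statistics with sklearn (stand-in data).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, groups = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, groups)

pct_variance = lda.explained_variance_ratio_[0] * 100   # % variance, first DF
pct_correct = lda.score(X, groups) * 100                # % correctly assigned
largest_coef = np.abs(lda.scalings_[:, 0]).max()        # largest DF coefficient
print(pct_variance, pct_correct, largest_coef)
```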
The tragedy of the biodiversity data commons: a data impediment creeping nigher?
Galicia, David; Ariño, Arturo H
2018-01-01
Researchers are embracing the open access movement to facilitate unrestricted availability of scientific results. One sign of this willingness is the steady increase in data freely shared online, which has prompted a corresponding increase in the number of papers using such data. Publishing datasets is a time-consuming process that is often seen as a courtesy, rather than a necessary step in the research process. Making data accessible allows further research, provides basic information for decision-making and contributes to transparency in science. Nevertheless, the ease of access to heaps of data carries a perception of ‘free lunch for all’, and the work of data publishers is largely going unnoticed. Acknowledging such a significant effort involving the creation, management and publication of a dataset remains a flimsy, poorly established practice in the scientific community. In a meta-analysis of published literature, we observed various dataset citation practices, but most (92%) consisted of merely citing the data repository rather than the data publisher. Failing to recognize the work of data publishers might lead to a decrease in the number of quality datasets shared online, compromising potential research that is dependent on the availability of such data. We make an urgent appeal to raise awareness about this issue. PMID:29688384
GPI Spectra of HR8799 C, D, and E in H-K Bands with KLIP Forward Modeling
NASA Technical Reports Server (NTRS)
Greenbaum, Alexandra Z.; Pueyo, Laurent; Ruffio, Jean-Baptiste; Wang, Jason J.; De Rosa, Robert J.; Aguilar, Jonathan; Rameau, Julien; Barman, Travis; Marois, Christian; Marley, Mark S.;
2018-01-01
We demonstrate KLIP forward-modeling spectral extraction on Gemini Planet Imager coronagraphic data of HR8799, using PyKLIP. We report new and re-reduced spectrophotometry of HR8799 c, d, and e in H-K bands. We discuss a strategy for choosing optimal KLIP PSF subtraction parameters by injecting fake sources and recovering them over a range of parameters. The K1/K2 spectra for planets c and d are similar to previously published results from the same dataset. We also present a K-band spectrum of HR8799 e for the first time and show that our H-band spectra agree well with previously published spectra from the VLT/SPHERE instrument. We compare planets c, d, and e with M-, L-, and T-type field objects. All objects are consistent with low-gravity mid-to-late L dwarfs; however, a lack of standard spectra for low-gravity late-L objects leads to poor fits for gravity. We place our results in the context of atmospheric models presented in previous publications and discuss differences in the spectra of the three planets.
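For orientation, the core of KLIP (Karhunen-Loeve Image Projection) is a principal-component subtraction of a reference PSF library. The numpy sketch below shows that core step only; it is not PyKLIP, and it omits the forward modeling that propagates an injected planet PSF through the same projection to calibrate over-subtraction.

```python
# Bare-bones KLIP-style PSF subtraction (one flattened science frame).
import numpy as np

def klip_subtract(science, refs, K=5):
    """science: (npix,) frame; refs: (nref, npix) reference PSF library."""
    R = refs - refs.mean(axis=1, keepdims=True)     # mean-subtract each reference
    w, v = np.linalg.eigh(R @ R.T)                  # eigenpairs of reference covariance
    order = np.argsort(w)[::-1][:K]                 # keep K strongest modes
    Z = (v[:, order].T @ R) / np.sqrt(w[order])[:, None]  # orthonormal KL modes
    s = science - science.mean()
    return s - Z.T @ (Z @ s)                        # subtract projection onto modes

refs = np.random.rand(30, 4096)                     # stand-in reference library
science = np.random.rand(4096)                      # stand-in science frame
residual = klip_subtract(science, refs, K=5)
```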
Luo, Rutao; Piovoso, Michael J.; Martinez-Picado, Javier; Zurakowski, Ryan
2012-01-01
Mathematical models based on ordinary differential equations (ODE) have had significant impact on understanding HIV disease dynamics and optimizing patient treatment. A model that characterizes the essential disease dynamics can be used for prediction only if the model parameters are identifiable from clinical data. Most previous parameter identification studies for HIV have used sparsely sampled data from the decay phase following the introduction of therapy. In this paper, model parameters are identified from frequently sampled viral-load data taken from ten patients enrolled in the previously published AutoVac HAART interruption study, providing between 69 and 114 viral load measurements from 3–5 phases of viral decay and rebound for each patient. This dataset is considerably larger than those used in previously published parameter estimation studies. Furthermore, the measurements come from two separate experimental conditions, which allows for the direct estimation of drug efficacy and reservoir contribution rates, two parameters that cannot be identified from decay-phase data alone. A Markov-Chain Monte-Carlo method is used to estimate the model parameter values, with initial estimates obtained using nonlinear least-squares methods. The posterior distributions of the parameter estimates are reported and compared for all patients. PMID:22815727
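As a schematic of this estimation approach, the sketch below fits a standard three-compartment viral dynamics model to log viral load with a random-walk Metropolis sampler. The model structure, initial conditions, and noise level are illustrative textbook choices, not the paper's full model (which additionally includes drug efficacy and reservoir contribution terms and two experimental conditions).

```python
# ODE viral dynamics + Metropolis MCMC: a minimal sketch with assumed values.
import numpy as np
from scipy.integrate import odeint

def model(y, t, beta, delta, p, c):
    T, I, V = y                          # target cells, infected cells, virus
    return [1e4 - 0.01 * T - beta * T * V,
            beta * T * V - delta * I,
            p * I - c * V]

def loglik(theta, t_obs, logV_obs, sigma=0.3):
    beta, delta, p, c = np.exp(theta)    # parameters sampled on the log scale
    V = odeint(model, [1e6, 0.0, 50.0], t_obs, args=(beta, delta, p, c))[:, 2]
    resid = np.log10(np.clip(V, 1e-8, None)) - logV_obs
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

def metropolis(theta0, t_obs, logV_obs, n=5000, step=0.05):
    theta = np.array(theta0, dtype=float)
    ll = loglik(theta, t_obs, logV_obs)
    chain = []
    for _ in range(n):
        prop = theta + step * np.random.randn(theta.size)
        ll_prop = loglik(prop, t_obs, logV_obs)
        if np.log(np.random.rand()) < ll_prop - ll:   # accept/reject step
            theta, ll = prop, ll_prop
        chain.append(theta.copy())
    return np.array(chain)               # posterior samples (discard burn-in)
```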
Dataset used to improve liquid water absorption models in the microwave
Turner, David
2015-12-14
Two datasets, one a compilation of laboratory data and one a compilation from three field sites, are provided here. These datasets provide measurements of the real and imaginary refractive indices and absorption as a function of cloud temperature. These datasets were used in the development of the new liquid water absorption model that was published in Turner et al. 2015.
Ward, Keith W; Erhardt, Paul; Bachmann, Kenneth
2005-01-01
Previous publications from GlaxoSmithKline and University of Toledo laboratories convey our independent attempts to predict the half-lives of xenobiotics in humans using data obtained from rats. The present investigation was conducted to compare the performance of our published models against a common dataset obtained by merging the two sets of rat versus human half-life (hHL) data previously used by each laboratory. After combining data, mathematical analyses were undertaken by deploying both of our previous models, namely the use of an empirical algorithm based on a best-fit model and the use of rat-to-human liver blood flow ratios as a half-life correction factor. Both qualitative and quantitative analyses were performed, as well as evaluation of the impact of molecular properties on predictability. The merged dataset was remarkably diverse with respect to physicochemical and pharmacokinetic (PK) properties. Application of both models revealed similar predictability, depending upon the measure of stipulated accuracy. Certain molecular features, particularly rotatable bond count and pKa, appeared to influence the accuracy of prediction. This collaborative effort has resulted in an improved understanding and appreciation of the value of rats as a surrogate for the prediction of xenobiotic half-lives in humans when clinical pharmacokinetic studies are not possible or practicable.
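The liver blood flow correction named above amounts to a one-line scaling. The sketch below uses illustrative per-kilogram flow values of the kind tabulated in the PK literature; the exact figures used in the published model may differ.

```python
# Rat-to-human half-life scaling by liver blood flow ratio (illustrative values).
RAT_LBF = 55.2     # assumed rat liver blood flow, mL/min/kg
HUMAN_LBF = 20.7   # assumed human liver blood flow, mL/min/kg

def predict_human_half_life(rat_t_half_h):
    """Scale a rat half-life (hours) by the rat-to-human liver blood flow ratio."""
    return rat_t_half_h * (RAT_LBF / HUMAN_LBF)

print(predict_human_half_life(1.5))   # a 1.5 h rat half-life -> ~4.0 h in humans
```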
Ratz, Joan M.; Conk, Shannon J.
2014-01-01
The Gap Analysis Program (GAP) of the U.S. Geological Survey (USGS) produces geospatial datasets providing information on land cover, predicted species distributions, stewardship (ownership and conservation status), and an analysis dataset which synthesizes the other three datasets. The intent in providing these datasets is to support the conservation of biodiversity. The datasets are made available at no cost. The initial datasets were created at the state level. More recent datasets have been assembled at regional and national levels. GAP entered an agreement with the Policy Analysis and Science Assistance branch of the USGS to conduct an evaluation to describe the effect that using GAP data has on those who utilize the datasets (GAP users). The evaluation project included multiple components: a discussion regarding use of GAP data conducted with participants at a GAP conference, a literature review of publications that cited use of GAP data, and a survey of GAP users. The findings of the published literature search were used to identify topics to include on the survey. This report summarizes the literature search, the characteristics of the resulting set of publications, the emergent themes from statements made regarding GAP data, and a bibliometric analysis of the publications. We cannot claim that this list includes all publications that have used GAP data. Given the time lapse that is common in the publishing process, more recent datasets may be cited less frequently in this list of publications. Reports or products that used GAP data may be produced but never published in print or released online. In that case, our search strategies would not have located those reports. Authors may have used GAP data but failed to cite it in such a way that the search strategies we used would have located those publications. These are common issues when using a literature search as part of an evaluation project. Although the final list of publications we identified is not comprehensive, this set of publications can be considered a sufficient sample of those citing GAP data and suitable for the descriptive analyses we conducted.
Publishing datasets with eSciDoc and panMetaDocs
NASA Astrophysics Data System (ADS)
Ulbricht, D.; Klump, J.; Bertelmann, R.
2012-04-01
Currently, several research institutions worldwide are making considerable efforts to have their scientific datasets published and syndicated to data portals as extensively described objects identified by persistent identifiers. This is done to foster the reuse of data, to make scientific work more transparent, and to create a citable entity that can be referenced unambiguously in written publications. GFZ Potsdam established a publishing workflow for file-based research datasets. Key software components are an eSciDoc infrastructure [1] and multiple instances of the data curation tool panMetaDocs [2]. The eSciDoc repository holds data objects and their associated metadata in container objects, called eSciDoc items. A key metadata element in this context is the publication status of the referenced dataset. PanMetaDocs, which is based on PanMetaWorks [3], is a PHP-based web application that allows data to be described with any XML-based metadata schema. The metadata fields can be filled with static or dynamic content to reduce the number of fields requiring manual entry to a minimum, and to make use of contextual information in a project setting. Access rights can be applied to set the visibility of datasets to other project members, allowing collaboration on datasets, notification about them (RSS), and interaction with the internal messaging system inherited from panMetaWorks. When a dataset is to be published, panMetaDocs allows the publication status of the eSciDoc item to be changed from "private" to "submitted", preparing the dataset for verification by an external reviewer. After quality checks, the item publication status can be changed to "published", which makes the data and metadata available worldwide through the internet. PanMetaDocs is developed as an eSciDoc application. It is an easy-to-use graphical user interface to eSciDoc items, their data and metadata. It is also an application supporting a DOI publication agent during the process of publishing scientific datasets as electronic data supplements to research papers. Publication of research manuscripts has an already well-established workflow that shares junctures with other processes and involves several parties in the process of dataset publication. Activities of the author, the reviewer, the print publisher and the data publisher have to be coordinated into a common data publication workflow. The case of data publication at GFZ Potsdam displays some specifics, e.g. the DOIDB web service. The DOIDB is a proxy service at GFZ for DataCite [4] DOI registration and its metadata store. DOIDB provides a local summary of the dataset DOIs registered through GFZ as a publication agent. An additional use case for the DOIDB is its function to enrich the DataCite metadata with additional custom attributes, like a geographic reference in a DIF record. These attributes are at the moment not available in the DataCite metadata schema, but would be valuable elements for the compilation of data catalogues in the earth sciences and for dissemination of catalogue data via OAI-PMH. [1] http://www.escidoc.org , eSciDoc, FIZ Karlsruhe, Germany [2] http://panmetadocs.sf.net , panMetaDocs, GFZ Potsdam, Germany [3] http://metaworks.pangaea.de , panMetaWorks, Dr. R. Huber, MARUM, Univ. Bremen, Germany [4] http://www.datacite.org
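The status transitions described above ("private" to "submitted" to "published") form a small state machine. The toy sketch below models only that transition logic; it is not the eSciDoc or panMetaDocs API, and the identifier is a placeholder.

```python
# Toy model of the eSciDoc item publication workflow described in the text.
ALLOWED = {"private": {"submitted"}, "submitted": {"published"}, "published": set()}

class EsciDocItem:
    def __init__(self, dataset_id):
        self.dataset_id, self.status = dataset_id, "private"

    def transition(self, new_status):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"cannot move {self.status} -> {new_status}")
        self.status = new_status

item = EsciDocItem("example-dataset-001")  # placeholder identifier
item.transition("submitted")               # prepared for external review
item.transition("published")               # data and metadata now public
```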
Ingwersen, Peter; Chavan, Vishwas
2011-01-01
A professional recognition mechanism is required to encourage expedited publishing of an adequate volume of 'fit-for-use' biodiversity data. As a component of such a recognition mechanism, we propose the development of the Data Usage Index (DUI) to demonstrate to data publishers that their efforts of creating biodiversity datasets have impact by being accessed and used by a wide spectrum of user communities. We propose and give examples of a range of 14 absolute and normalized biodiversity dataset usage indicators for the development of a DUI based on search events and dataset download instances. The DUI is proposed to include relative as well as species profile weighted comparative indicators. We believe that in addition to the recognition to the data publisher and all players involved in the data life cycle, a DUI will also provide much needed yet novel insight into how users use primary biodiversity data. A DUI consisting of a range of usage indicators obtained from the GBIF network and other relevant access points is within reach. The usage of biodiversity datasets leads to the development of a family of indicators in line with well known citation-based measurements of recognition.
Aggarwal, M; Fisher, P; Hüser, A; Kluxen, F M; Parr-Dobrzanski, R; Soufi, M; Strupp, C; Wiemann, C; Billington, R
2015-06-01
Dermal absorption is a key parameter in non-dietary human safety assessments for agrochemicals. Conservative default values and other criteria in the EFSA guidance have substantially increased generation of product-specific in vitro data and in some cases, in vivo data. Therefore, data from 190 GLP- and OECD guideline-compliant human in vitro dermal absorption studies were published, suggesting EFSA defaults and criteria should be revised (Aggarwal et al., 2014). This follow-up article presents data from an additional 171 studies and also the combined dataset. Collectively, the data provide consistent and compelling evidence for revision of EFSA's guidance. This assessment covers 152 agrochemicals, 19 formulation types and representative ranges of spray concentrations. The analysis used EFSA's worst-case dermal absorption definition (i.e., an entire skin residue, except for surface layers of stratum corneum, is absorbed). It confirmed previously proposed default values of 6% for liquid and 2% for solid concentrates, irrespective of active substance loading, and 30% for all spray dilutions, irrespective of formulation type. For concentrates, absorption from solvent-based formulations provided reliable read-across for other formulation types, as did water-based products for solid concentrates. The combined dataset confirmed that absorption does not increase linearly beyond a 5-fold increase in dilution. Finally, despite using EFSA's worst-case definition for absorption, a rationale for routinely excluding the entire stratum corneum residue, and ideally the entire epidermal residue in in vitro studies, is presented. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Associations of Drug Lipophilicity and Extent of Metabolism with Drug-Induced Liver Injury.
McEuen, Kristin; Borlak, Jürgen; Tong, Weida; Chen, Minjun
2017-06-22
Drug-induced liver injury (DILI), although rare, is a frequent cause of adverse drug reactions resulting in warnings and withdrawals of numerous medications. Despite the research community's best efforts, current testing strategies aimed at identifying hepatotoxic drugs prior to human trials are not sufficiently powered to predict the complex mechanisms leading to DILI. In our previous studies, we demonstrated lipophilicity and dose to be associated with increased DILI risk, and in our latest work, we factored reactive metabolites into the algorithm to predict DILI. Given the inconsistency in determining the potential for drugs to cause DILI, the present study comprehensively assesses the relationship between DILI risk and lipophilicity and the extent of metabolism using a large published dataset of 1036 Food and Drug Administration (FDA)-approved drugs by considering five independent DILI annotations. We found that lipophilicity and the extent of metabolism alone were associated with increased risk for DILI. Moreover, when analyzed in combination with high daily dose (≥100 mg), lipophilicity was statistically significantly associated with the risk of DILI across all datasets (p < 0.05). Similarly, the combination of extensive hepatic metabolism (≥50%) and high daily dose (≥100 mg) was also strongly associated with an increased risk of DILI among all datasets analyzed (p < 0.05). Our results suggest that both lipophilicity and the extent of hepatic metabolism can be considered important risk factors for DILI in humans, and that this relationship to DILI risk is much stronger when considered in combination with dose. The proposed paradigm allows the convergence of different published annotations to a more uniform assessment.
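The two combinations tested above reduce to simple threshold rules. The sketch below encodes them with the dose and metabolism cutoffs stated in the abstract; the logP cutoff of 3 is an assumption borrowed from common usage of such lipophilicity rules, since the abstract does not state one.

```python
# Threshold flags for the two dose combinations analyzed in this study.
def dili_risk_flags(logp, daily_dose_mg, pct_hepatic_metabolism):
    high_dose = daily_dose_mg >= 100                             # stated cutoff
    return {
        "lipophilic_and_high_dose": logp >= 3 and high_dose,     # logP cutoff assumed
        "extensive_metabolism_and_high_dose":
            pct_hepatic_metabolism >= 50 and high_dose,          # stated cutoff
    }

print(dili_risk_flags(logp=3.5, daily_dose_mg=200, pct_hepatic_metabolism=70))
```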
Recalculating the quasar luminosity function of the extended Baryon Oscillation Spectroscopic Survey
NASA Astrophysics Data System (ADS)
Caditz, David M.
2017-12-01
Aims: The extended Baryon Oscillation Spectroscopic Survey (eBOSS) of the Sloan Digital Sky Survey provides a uniform sample of over 13 000 variability selected quasi-stellar objects (QSOs) in the redshift range 0.68
Mesoscale brain explorer, a flexible python-based image analysis and visualization tool.
Haupt, Dirk; Vanni, Matthieu P; Bolanos, Federico; Mitelut, Catalin; LeDue, Jeffrey M; Murphy, Tim H
2017-07-01
Imaging of mesoscale brain activity is used to map interactions between brain regions. This work has benefited from the pioneering studies of Grinvald et al., who employed optical methods to image brain function by exploiting the properties of intrinsic optical signals and small molecule voltage-sensitive dyes. Mesoscale interareal brain imaging techniques have been advanced by cell targeted and selective recombinant indicators of neuronal activity. Spontaneous resting state activity is often collected during mesoscale imaging to provide the basis for mapping of connectivity relationships using correlation. However, the information content of mesoscale datasets is vast and is only superficially presented in manuscripts given the need to constrain measurements to a fixed set of frequencies, regions of interest, and other parameters. We describe a new open source tool written in python, termed mesoscale brain explorer (MBE), which provides an interface to process and explore these large datasets. The platform supports automated image processing pipelines with the ability to assess multiple trials and combine data from different animals. The tool provides functions for temporal filtering, averaging, and visualization of functional connectivity relations using time-dependent correlation. Here, we describe the tool and show applications, where previously published datasets were reanalyzed using MBE.
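The connectivity mapping MBE supports is, at its core, a correlation of every pixel's time course against a seed trace. The numpy sketch below illustrates that operation on a stand-in movie; it is not MBE's own code, and the array shapes are assumptions.

```python
# Seed-pixel correlation mapping: a minimal sketch on synthetic data.
import numpy as np

def seed_correlation_map(movie, seed_rc):
    """movie: (time, height, width) array; seed_rc: (row, col) of the seed pixel."""
    t, h, w = movie.shape
    X = movie.reshape(t, -1)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # z-score each pixel trace
    seed = X[:, seed_rc[0] * w + seed_rc[1]]
    return (X.T @ seed / t).reshape(h, w)                # Pearson r against seed

movie = np.random.rand(100, 64, 64)                      # stand-in imaging movie
rmap = seed_correlation_map(movie, seed_rc=(32, 32))
```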
Jabaily, Rachel S; Shepherd, Kelly A; Michener, Pryce S; Bush, Caroline J; Rivero, Rodrigo; Gardner, Andrew G; Sessa, Emily B
2018-05-15
Goodeniaceae is a primarily Australian flowering plant family with a complex taxonomy and evolutionary history. Previous phylogenetic analyses have successfully resolved the backbone topology of the largest clade in the family, Goodenia s.l., but have failed to clarify relationships within the species-rich and enigmatic Goodenia clade C, a prerequisite for taxonomic revision of the group. We used genome skimming to retrieve sequences for chloroplast, mitochondrial, and nuclear markers for 24 taxa representing Goodenia s.l., with a particular focus on Goodenia clade C. We performed extensive hypothesis tests to explore incongruence in clade C and evaluate statistical support for clades within this group, using datasets from all three genomic compartments. The mitochondrial dataset is comparable to the chloroplast dataset in providing resolution within Goodenia clade C, though backbone support values within this clade remain low. The hypothesis tests provided an additional, complementary means of evaluating support for clades. We propose that the major subclades of Goodenia clade C (C1-C3 + Verreauxia) are the result of a rapid radiation, and each represents a distinct lineage. Copyright © 2018. Published by Elsevier Inc.
Outer region scaling using the freestream velocity for nonuniform open channel flow over gravel
NASA Astrophysics Data System (ADS)
Stewart, Robert L.; Fox, James F.
2017-06-01
The theoretical basis for outer region scaling using the freestream velocity for nonuniform open channel flows over gravel is derived and tested for the first time. Owing to the gradual expansion of the flow within the nonuniform case presented, it is hypothesized that the flow can be defined as an equilibrium turbulent boundary layer using the asymptotic invariance principle. The hypothesis is supported using similarity analysis to derive a solution, followed by further testing with experimental datasets. For the latter, 38 newly collected experimental velocity profiles across three nonuniform flows over gravel in a hydraulic flume are tested as are 43 velocity profiles previously published in seven peer-reviewed journal papers that focused on fluid mechanics of nonuniform open channel over gravel. The findings support the nonuniform flows as equilibrium defined by the asymptotic invariance principle, which is reflective of the consistency of the turbulent structure's form and function within the expanding flow. However, roughness impacts the flow structure when comparing across the published experimental datasets. As a secondary objective, we show how previously published mixed scales can be used to assist with freestream velocity scaling of the velocity deficit and thus empirically account for the roughness effects that extend into the outer region of the flow. One broader finding of this study is providing the theoretical context to relax the use of the elusive friction velocity when scaling nonuniform flows in gravel bed rivers; and instead to apply the freestream velocity. A second broader finding highlighted by our results is that scaling of nonuniform flow in gravel bed rivers is still not fully resolved theoretically since mixed scaling relies to some degree on empiricism. As researchers resolve the form and function of macroturbulence in the outer region, we hope to see the closing of this research gap.
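For reference, the two outer-region scalings contrasted in this abstract can be written as follows. This is the standard textbook formulation, stated here for clarity rather than quoted from the paper, with U_e the freestream velocity, u_tau the friction velocity, and delta the boundary layer thickness:

```latex
% Classical friction-velocity defect law versus freestream-velocity scaling.
\[
\frac{U_e - u}{u_\tau} = F\!\left(\frac{y}{\delta}\right)
\qquad \text{vs.} \qquad
\frac{U_e - u}{U_e} = f\!\left(\frac{y}{\delta}\right)
\]
```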
National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) scientists have just released a comprehensive dataset of the proteomic analysis of high grade serous ovarian tumor samples, previously genomically analyzed by The Cancer Genome Atlas (TCGA). This is one of the largest public datasets covering the proteome, phosphoproteome and glycoproteome with complementary deep genomic sequencing data on the same tumor.
Rantalainen, Timo; Chivers, Paola; Beck, Belinda R; Robertson, Sam; Hart, Nicolas H; Nimphius, Sophia; Weeks, Benjamin K; McIntyre, Fleur; Hands, Beth; Siafarikas, Aris
Most imaging methods, including peripheral quantitative computed tomography (pQCT), are susceptible to motion artifacts, particularly in fidgety pediatric populations. Methods currently used to address motion artifact include manual screening (visual inspection) and objective assessments of the scans. However, previously reported objective methods either cannot be applied to the reconstructed image or have not been tested for distal bone sites. Therefore, the purpose of the present study was to develop and validate motion artifact classifiers to quantify motion artifact in pQCT scans. We tested whether textural features could provide adequate motion artifact classification performance in two adolescent datasets comprising pQCT scans of tibial and radial diaphyses and epiphyses. The first dataset was split into training (66% of sample) and validation (33% of sample) datasets. Visual classification was used as the ground truth. Moderate-to-substantial classification performance (J48 classifier, kappa coefficients from 0.57 to 0.80) was observed in the validation dataset with the novel texture-based classifier. In applying the same classifier to the second cross-sectional dataset, slight-to-fair (κ = 0.01-0.39) classification performance was observed. Overall, this novel textural analysis-based classifier provided moderate-to-substantial classification of motion artifact when the classifier was specifically trained for the measurement device and population. Classification based on textural features may be used to prescreen obviously acceptable and unacceptable scans, with subsequent human-operated visual classification of any remaining scans. Copyright © 2017 The International Society for Clinical Densitometry. Published by Elsevier Inc. All rights reserved.
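In the spirit of the texture-based approach above, the sketch below pairs gray-level co-occurrence matrix (GLCM) features with a decision tree, substituting scikit-learn's tree for Weka's J48; the feature set and data handling are illustrative assumptions rather than the study's protocol.

```python
# Texture features (GLCM) + decision tree for motion artifact screening (sketch).
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.tree import DecisionTreeClassifier

def texture_features(img8):
    """img8: 2-D uint8 image (a pQCT slice rescaled to 0-255)."""
    glcm = graycomatrix(img8, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

# With a list of scans and visual-inspection labels as ground truth:
# X = np.array([texture_features(s) for s in scans])
# clf = DecisionTreeClassifier().fit(X, visual_labels)
```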
The Berkeley SuperNova Ia Program (BSNIP): Dataset and Initial Analysis
NASA Astrophysics Data System (ADS)
Silverman, Jeffrey; Ganeshalingam, M.; Kong, J.; Li, W.; Filippenko, A.
2012-01-01
I will present spectroscopic data from the Berkeley SuperNova Ia Program (BSNIP), their initial analysis, and the results of attempts to use spectral information to improve cosmological distance determinations to Type Ia supernovae (SNe Ia). The dataset consists of 1298 low-redshift (z < 0.2) optical spectra of 582 SNe Ia observed from 1989 through the end of 2008. Many of the SNe have well-calibrated light curves with measured distance moduli as well as spectra that have been corrected for host-galaxy contamination. I will also describe the spectral classification scheme employed (using the SuperNova Identification code, SNID; Blondin & Tonry 2007), which utilizes a newly constructed set of SNID spectral templates. The sheer size of the BSNIP dataset and the consistency of the observation and reduction methods make this sample unique among all other published SN Ia datasets. I will also discuss measurements of the spectral features of about one-third of the spectra which were obtained within 20 days of maximum light. I will briefly describe the adopted method of automated, robust spectral-feature definition and measurement which expands upon similar previous studies. Comparisons of these measurements of SN Ia spectral features to photometric observables will be presented with an eye toward using spectral information to calculate more accurate cosmological distances. Finally, I will comment on related projects which also utilize the BSNIP dataset that are planned for the near future. This research was supported by NSF grant AST-0908886 and the TABASGO Foundation. I am grateful to Marc J. Staley for a Graduate Fellowship.
Howard, B J; Wells, C; Barnett, C L; Howard, D C
2017-02-01
Under the International Atomic Energy Agency (IAEA) MODARIA (Modelling and Data for Radiological Impact Assessments) Programme, there has been an initiative to improve the derivation, provenance and transparency of transfer parameter values for radionuclides from feed to animal products that are for human consumption. The revised MODARIA 2016 cow milk dataset is described in this paper. As previously reported for the MODARIA goat milk dataset, quality control has led to the discounting of some references used in IAEA's Technical Report Series (TRS) report 472 (IAEA, 2010). The number of Concentration Ratio (CR) values has been considerably increased by (i) the inclusion of more literature from agricultural studies, which particularly enhanced the stable isotope data for both CR and Fm, and (ii) by estimating dry matter intake from assumed liveweight. In TRS 472, the data for cow milk were 714 transfer coefficient (Fm) values and 254 CR values describing 31 elements and 26 elements respectively. In the MODARIA 2016 cow milk dataset, Fm and CR values are now reported for 43 elements based upon 825 data values for Fm and 824 for CR. The MODARIA 2016 cow milk dataset Fm values are within an order of magnitude of those reported in TRS 472. Slightly bigger changes are seen in the CR values, but the increase in size of the dataset creates greater confidence in them. Data gaps that still remain are identified for elements with isotopes relevant to radiation protection. Copyright © 2016 The Authors. Published by Elsevier Ltd. All rights reserved.
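The expansion of the CR dataset rests on the standard relationship between the two quantities: Fm is milk activity per unit daily intake (d/L), CR is milk activity per unit feed concentration, so CR = Fm x dry matter intake. The sketch below illustrates this; the intake fraction of liveweight is an assumed round number, not the paper's value.

```python
# Converting a milk transfer coefficient (Fm) to a concentration ratio (CR).
def dry_matter_intake_kg_per_day(liveweight_kg, fraction=0.025):
    """Assumed intake of ~2.5% of liveweight per day (illustrative)."""
    return liveweight_kg * fraction

def concentration_ratio(fm_d_per_l, liveweight_kg):
    # CR [(Bq/L milk)/(Bq/kg feed DM)] = Fm [d/L] * DMI [kg DM/d]
    return fm_d_per_l * dry_matter_intake_kg_per_day(liveweight_kg)

print(concentration_ratio(fm_d_per_l=1.6e-2, liveweight_kg=600))
```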
Parks, Connie L; Richard, Adam H; Monson, Keith L
2014-04-01
Facial approximation is the technique of developing a representation of the face from the skull of an unknown individual. Facial approximation relies heavily on average craniofacial soft tissue depths. For more than a century, researchers have employed a broad array of tissue depth collection methodologies, a practice which has resulted in a lack of standardization in craniofacial soft tissue depth research. To combat such methodological inconsistencies, Stephan and Simpson 2008 [15] examined and synthesized a large number of previously published soft tissue depth studies. Their comprehensive meta-analysis produced a pooled dataset of averaged tissue depths and a simplified methodology, which the researchers suggest be utilized as a minimum standard protocol for future craniofacial soft tissue depth research. The authors of the present paper collected craniofacial soft tissue depths using three-dimensional models generated from computed tomography scans of living males and females of four self-identified ancestry groups from the United States ranging in age from 18 to 62 years. This paper assesses the differences between: (i) the pooled mean tissue depth values from the sample utilized in this paper and those published by Stephan 2012 [21] and (ii) the mean tissue depth values of two demographically similar subsets of the sample utilized in this paper and those published by Rhine and Moore 1984 [16]. Statistical test results indicate that the tissue depths collected from the sample evaluated in this paper are significantly and consistently larger than those published by Stephan 2012 [21]. Although a lack of published variance data by Rhine and Moore 1984 [16] precluded a direct statistical assessment, a substantive difference was also concluded. Further, the dataset presented in this study is representative of modern American adults and is, therefore, appropriate for use in constructing contemporary facial approximations. Published by Elsevier Ireland Ltd.
Riccomagno, Eva; Shayganpour, Amirreza; Salerno, Marco
2017-01-01
Anodic porous alumina is a known material based on an old industry, yet with emerging applications in nanoscience and nanotechnology. This is promising, but the nanostructured alumina should be fabricated from inexpensive raw material. We fabricated porous alumina from commercial aluminum food plate in 0.4 M aqueous phosphoric acid, aiming to design an effective manufacturing protocol for the material used as nanoporous filler in dental restorative composites, an application demonstrated previously by our group. We identified the critical input parameters of anodization voltage, bath temperature and anodization time, and the main output parameters of pore diameter, pore spacing and oxide thickness. Scanning electron microscopy and grain analysis allowed us to assess the nanostructured material, and the statistical design of experiments was used to optimize its fabrication. We analyzed a preliminary dataset, designed a second dataset aimed at clarifying the correlations between input and output parameters, and ran a confirmation dataset. Anodization conditions close to 125 V, 20 °C, and 7 h were identified as the best for obtaining, in the shortest possible time, pore diameters and spacing of 100–150 nm and 150–275 nm respectively, and thickness of 6–8 µm, which are desirable for the selected application according to previously published results. Our analysis confirmed the linear dependence of pore size on anodization voltage and of thickness on anodization time. The importance of proper control on the experiment was highlighted, since batch effects emerge when the experimental conditions are not exactly reproduced. PMID:28772776
Palmer, Cameron S; Davey, Tamzyn M; Mok, Meng Tuck; McClure, Rod J; Farrow, Nathan C; Gruen, Russell L; Pollard, Cliff W
2013-06-01
Trauma registries are central to the implementation of effective trauma systems. However, differences between trauma registry datasets make comparisons between trauma systems difficult. In 2005, the collaborative Australian and New Zealand National Trauma Registry Consortium began a process to develop a bi-national minimum dataset (BMDS) for use in Australasian trauma registries. This study aims to describe the steps taken in the development and preliminary evaluation of the BMDS. A working party comprising sixteen representatives from across Australasia identified and discussed the collectability and utility of potential BMDS fields. This included evaluating existing national and international trauma registry datasets, as well as reviewing all quality indicators and audit filters in use in Australasian trauma centres. After the working party activities concluded, this process was continued by a number of interested individuals, with broader feedback sought from the Australasian trauma community on a number of occasions. Once the BMDS had reached a suitable stage of development, an email survey was conducted across Australasian trauma centres to assess whether BMDS fields met an ideal minimum standard of field collectability. The BMDS was also compared with three prominent international datasets to assess the extent of dataset overlap. Following this, the BMDS was encapsulated in a data dictionary, which was introduced in late 2010. The finalised BMDS contained 67 data fields. Forty-seven of these fields met a previously published criterion of 80% collectability across respondent trauma institutions; the majority of the remaining fields either could be collected without any change in resources, or could be calculated from other data fields in the BMDS. However, comparability with international registry datasets was poor. Only nine BMDS fields had corresponding, directly comparable fields in all the national and international-level registry datasets evaluated. A draft BMDS has been developed for use in trauma registries across Australia and New Zealand. The email survey provided strong indications of the utility of the fields contained in the BMDS. The BMDS has been adopted as the dataset to be used by an ongoing Australian Trauma Quality Improvement Program. Copyright © 2012 Elsevier Ltd. All rights reserved.
Otegui, Javier; Ariño, Arturo H
2012-08-15
In any data quality workflow, data publishers must become aware of issues in their data so these can be corrected. User feedback mechanisms provide one avenue, while global assessments of datasets provide another. To date, there is no publicly available tool that allows both the biodiversity data institutions sharing their data through the Global Biodiversity Information Facility network and its potential users to assess datasets as a whole. To help bridge this gap for both publishers and users, we introduce the BIoDiversity DataSets Assessment Tool, an online tool that enables selected diagnostic visualizations of the content of data publishers and/or their individual collections. The online application is accessible at http://www.unav.es/unzyec/mzna/biddsat/ and is supported by all major browsers. The source code is licensed under the GNU GPLv3 license (http://www.gnu.org/licenses/gpl-3.0.txt) and is available at https://github.com/jotegui/BIDDSAT.
Dental age assessment of southern Chinese using the United Kingdom Caucasian reference dataset.
Jayaraman, Jayakumar; Roberts, Graham J; King, Nigel M; Wong, Hai Ming
2012-03-10
Dental age assessment is one of the most accurate methods for estimating the age of an unknown person. Demirjian's dataset on a French-Canadian population has been widely tested for its applicability to various ethnic groups, including southern Chinese. Following inaccurate results from these studies, investigators are now confronted with using alternate datasets for comparison. Testing the applicability of other reliable datasets which yield accurate findings might limit the need to develop population-specific standards. Recently, a Reference Data Set (RDS) similar to Demirjian's was prepared in the United Kingdom (UK) and has been subsequently validated. The advantages of the UK Caucasian RDS include its coverage of both the maxillary and mandibular dentitions, the wide age range of subjects evaluated, and the possibility of precise age estimation with the mathematical technique of meta-analysis. The aim of this study was to evaluate the applicability of the United Kingdom Caucasian RDS to southern Chinese subjects. Dental panoramic tomographs (DPT) of 266 subjects (133 males and 133 females) aged 2-21 years that were previously taken for clinical diagnostic purposes were selected and scored by a single calibrated examiner based on Demirjian's classification of tooth developmental stages (A-H). The ages corresponding to each tooth developmental stage were obtained from the UK dataset. Intra-examiner reproducibility was tested, and the Cohen kappa (0.88) showed that the level of agreement was 'almost perfect'. The estimated dental age was then compared with the chronological age using a paired t-test, with statistical significance set at p<0.01. The results showed that the UK dataset underestimated the age of southern Chinese subjects by 0.24 years, but the difference was not statistically significant. In conclusion, the UK Caucasian RDS may not be suitable for estimating the age of southern Chinese subjects, and there is a need for an ethnic-specific reference dataset for southern Chinese. Copyright © 2011. Published by Elsevier Ireland Ltd.
MSWEP V2 global 3-hourly 0.1° precipitation: methodology and quantitative appraisal
NASA Astrophysics Data System (ADS)
Beck, H.; Yang, L.; Pan, M.; Wood, E. F.; William, L.
2017-12-01
Here, we present Multi-Source Weighted-Ensemble Precipitation (MSWEP) V2, the first fully global gridded precipitation (P) dataset with a 0.1° spatial resolution. The dataset covers the period 1979-2016, has a 3-hourly temporal resolution, and was derived by optimally merging a wide range of data sources based on gauges (WorldClim, GHCN-D, GSOD, and others), satellites (CMORPH, GridSat, GSMaP, and TMPA 3B42RT), and reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR). MSWEP V2 implements some major improvements over V1, such as (i) the correction of distributional P biases using cumulative distribution function matching, (ii) increasing the spatial resolution from 0.25° to 0.1°, (iii) the inclusion of ocean areas, (iv) the addition of NCEP-CFSR P estimates, (v) the addition of thermal infrared-based P estimates for the pre-TRMM era, (vi) the addition of 0.1° daily interpolated gauge data, (vii) the use of a daily gauge correction scheme that accounts for regional differences in the 24-hour accumulation period of gauges, and (viii) extension of the data record to 2016. The gauge-based assessment of the reanalysis and satellite P datasets, necessary for establishing the merging weights, revealed that the reanalysis datasets strongly overestimate the P frequency for the entire globe, and that the satellite datasets consistently performed better at low latitudes, while the reanalysis datasets performed better at high latitudes. Compared to other state-of-the-art P datasets, MSWEP V2 exhibits more plausible global patterns in mean annual P, percentiles, and annual number of dry days, and better resolves the small-scale variability over topographically complex terrain. Other P datasets appear to consistently underestimate P amounts over mountainous regions. Long-term mean P estimates for the global, land, and ocean domains based on MSWEP V2 are 959, 796, and 1026 mm/yr, respectively, in close agreement with the best previously published estimates.
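Improvement (i), cumulative distribution function matching, maps each value in a biased series onto the value with the same empirical quantile in a reference series. The numpy sketch below shows the basic operation on synthetic data; MSWEP's actual implementation (gridded, gauge-referenced) is more elaborate.

```python
# CDF matching for distributional bias correction: a minimal sketch.
import numpy as np

def cdf_match(biased, reference):
    # Empirical quantile of each biased value within its own distribution...
    ranks = np.searchsorted(np.sort(biased), biased, side="right") / biased.size
    # ...mapped onto the same quantile of the reference distribution.
    return np.quantile(reference, np.clip(ranks, 0.0, 1.0))

biased = np.random.gamma(2.0, 2.0, 1000)      # stand-in satellite P series
reference = np.random.gamma(2.0, 3.0, 1000)   # stand-in gauge-based P series
corrected = cdf_match(biased, reference)
```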
Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study
2015-01-01
Objective This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are “invisible” or not deposited in a known repository. Methods We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article. Results About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects. Conclusion In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a “dataset,” determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets. PMID:26207759
The Role of Datasets on Scientific Influence within Conflict Research
Van Holt, Tracy; Johnson, Jeffery C.; Moates, Shiloh; Carley, Kathleen M.
2016-01-01
We inductively tested whether a coherent field of inquiry in human conflict research has emerged, through an analysis of published research involving “conflict” in the Web of Science (WoS) over a 66-year period (1945–2011). We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA), a specialized social network analysis, on this citation network (~1.5 million works) to highlight the main contributions in conflict research and to test whether research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA, suggesting a coherent field of inquiry, meaning that researchers in the field acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed, such as interpersonal conflict or conflict among pharmaceuticals, did not form their own CP. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This contrasts with a main path analysis of conflict from 1957–1971, in which ideas did not persist: multiple paths existed and died or emerged, reflecting a lack of scientific coherence (Carley, Hummon, and Harty, 1993). The critical path had a number of key features: 1) Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s–1990s and continued until 2011. More recent intrastate studies focused on inequalities emerged from interstate studies on the democracy of peace earlier on the path. 2) Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3) We used keyword analysis to independently show how the CP was topically linked (i.e., through democracy, modeling, resources, and geography). Publicly available conflict datasets developed early on helped shape the operationalization of conflict. In fact, 94% of the works on the CP that analyzed data either relied on publicly available datasets or generated a dataset and made it public. These datasets appear to be important in the development of conflict research, allowing for cross-case comparisons and comparisons to previous works. PMID:27124569
2011-01-01
Background A professional recognition mechanism is required to encourage expedited publishing of an adequate volume of 'fit-for-use' biodiversity data. As a component of such a recognition mechanism, we propose the development of the Data Usage Index (DUI) to demonstrate to data publishers that their efforts in creating biodiversity datasets have impact, by being accessed and used by a wide spectrum of user communities. Discussion We propose and give examples of a range of 14 absolute and normalized biodiversity dataset usage indicators for the development of a DUI, based on search events and dataset download instances. The DUI is proposed to include relative as well as species-profile-weighted comparative indicators. Conclusions We believe that, in addition to recognition for the data publisher and all players involved in the data life cycle, a DUI will also provide much needed yet novel insight into how users use primary biodiversity data. A DUI consisting of a range of usage indicators obtained from the GBIF network and other relevant access points is within reach. The usage of biodiversity datasets leads to the development of a family of indicators in line with well-known citation-based measurements of recognition. PMID:22373200
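The indicator families proposed above are easy to make concrete. The sketch below computes one absolute and one normalized usage indicator from search and download counts; the record structure, field names, and normalization are hypothetical illustrations, not the GBIF implementation.

```python
# Sketch of two of the proposed usage-indicator families, assuming a
# simple per-dataset record; names and normalization are illustrative.
from dataclasses import dataclass

@dataclass
class DatasetUsage:
    dataset_id: str
    records: int      # number of primary records in the dataset
    searches: int     # search events that returned the dataset
    downloads: int    # download instances

def absolute_usage(u: DatasetUsage) -> int:
    """Absolute indicator: total access events."""
    return u.searches + u.downloads

def normalized_usage(u: DatasetUsage) -> float:
    """Normalized indicator: access events per published record,
    so large datasets do not dominate by size alone."""
    return absolute_usage(u) / max(u.records, 1)

usage = DatasetUsage("gbif:12345", records=20_000, searches=1_500, downloads=300)
print(absolute_usage(usage), round(normalized_usage(usage), 4))
```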
Graham Reynolds, R; Niemiller, Matthew L; Revell, Liam J
2014-02-01
Snakes in the families Boidae and Pythonidae constitute some of the most spectacular reptiles and comprise an enormous diversity of morphology, behavior, and ecology. While many species of boas and pythons are familiar, taxonomy and evolutionary relationships within these families remain contentious and fluid. A major effort in evolutionary and conservation biology is to assemble a comprehensive Tree-of-Life, or a macro-scale phylogenetic hypothesis, for all known life on Earth. No previously published study has produced a species-level molecular phylogeny for more than 61% of boa species or 65% of python species. Using both novel and previously published sequence data, we have produced a species-level phylogeny for 84.5% of boid species and 82.5% of pythonid species, contextualized within a larger phylogeny of henophidian snakes. We obtained new sequence data for three boid, one pythonid, and two tropidophiid taxa which have never previously been included in a molecular study, in addition to generating novel sequences for seven genes across an additional 12 taxa. We compiled an 11-gene dataset for 127 taxa, consisting of the mitochondrial genes CYTB, 12S, and 16S, and the nuclear genes bdnf, bmp2, c-mos, gpr35, rag1, ntf3, odc, and slc30a1, totaling up to 7561 base pairs per taxon. We analyzed this dataset using both maximum likelihood and Bayesian inference and recovered a well-supported phylogeny for these species. We found significant evidence of discordance between taxonomy and evolutionary relationships in the genera Tropidophis, Morelia, Liasis, and Leiopython, and we found support for elevating two previously suggested boid species. We suggest a revised taxonomy for the boas (13 genera, 58 species) and pythons (8 genera, 40 species), review relationships between our study and the many other molecular phylogenetic studies of henophidian snakes, and present a taxonomic database and alignment which may be easily used and built upon by other researchers. Copyright © 2013 Elsevier Inc. All rights reserved.
Data publication, documentation and user friendly landing pages - improving data discovery and reuse
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland
2016-04-01
Research data are the basis for scientific research and often irreplaceable (e.g., observational data). Storage of such data in appropriate theme-specific or institutional repositories is an essential part of ensuring their long-term preservation and access. The free and open access to research data for reuse and scrutiny has been identified as a key issue by the scientific community as well as by research agencies and the public. To ensure that datasets are intelligible and usable by others, they must be accompanied by comprehensive data descriptions and standardized metadata for data discovery, and ideally should be published with a digital object identifier (DOI). DOIs make datasets citable, ensure their long-term accessibility, and are accepted in the reference lists of journal articles (http://www.copdess.org/statement-of-commitment/). The GFZ German Research Centre for Geosciences is the national laboratory for geosciences in Germany and part of the Helmholtz Association, Germany's largest scientific organization. The development and maintenance of data systems is a key component of 'GFZ Data Services' to support state-of-the-art research. The datasets archived in and published by the GFZ Data Repository cover all geoscientific disciplines and range from large dynamic datasets deriving from global monitoring seismic or geodetic networks with real-time data acquisition, to remotely sensed satellite products, to automatically generated data publications from a database of micrometeorological stations, to various model results, to geochemical and rock-mechanical analyses from various labs, and field observations. The user-friendly presentation of published datasets via a DOI landing page is as important for reuse as the storage itself, and the required information is highly specific to each scientific discipline. If dataset descriptions are too general, or require the download of a dataset before its suitability can be judged, many researchers decide not to reuse a published dataset. In contrast to large data repositories without thematic specification, theme-specific data repositories have deep expertise in data discovery and the opportunity to develop usable, discipline-specific formats and layouts for specific datasets, including consultation on different formats for the data description (e.g., via a data report or an article in a data journal) with full consideration of international metadata standards.
And, not or: Quality, quantity in scientific publishing
Allesina, Stefano
2017-01-01
Scientists often perceive a trade-off between quantity and quality in scientific publishing: finite amounts of time and effort can be spent to produce a few high-quality papers or subdivided to produce many papers of lower quality. Despite this perception, previous studies have indicated the opposite relationship, in which productivity (publishing more papers) is associated with increased paper quality (usually measured by citation accumulation). We examine this question in a novel way, comparing members of the National Academy of Sciences with themselves across years, and using a much larger dataset than previously analyzed. We find that a member's most highly cited paper in a given year has more citations in more productive years than in less productive years. Their lowest-cited paper each year, on the other hand, has fewer citations in more productive years. To disentangle the effect of the underlying distributions of citations and productivities, we repeat the analysis for hypothetical publication records generated by scrambling each author's citation counts among their publications. Surprisingly, these artificial histories re-create the above trends almost exactly. Put another way, the observed positive relationship between quantity and quality can be interpreted as a consequence of randomly drawing citation counts for each publication: more productive years yield higher-cited papers because they have more chances to draw a large value. This suggests that citation counts, and the rewards that have come to be associated with them, may be more stochastic than previously appreciated. PMID:28570567
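The scrambling analysis described above amounts to a simple permutation test, sketched below on synthetic data: each author's citation counts are shuffled among their own papers, keeping yearly productivity fixed, and the yearly maximum is recomputed. All names and numbers here are illustrative, not the study's data.

```python
# Permutation sketch: shuffle citation counts among one hypothetical
# author's papers and compare yearly maxima with the observed record.
import random

random.seed(1)

# papers[i] = (year, citations) for one synthetic author
papers = [(year, random.randint(0, 200))
          for year in range(2000, 2010) for _ in range(random.randint(1, 12))]

def yearly_max(records):
    out = {}
    for year, c in records:
        out[year] = max(out.get(year, 0), c)
    return out

def scramble(records):
    cites = [c for _, c in records]
    random.shuffle(cites)  # keep productivity per year, redraw citations
    return [(year, c) for (year, _), c in zip(records, cites)]

observed, null = yearly_max(papers), yearly_max(scramble(papers))
counts = {}
for year, _ in papers:
    counts[year] = counts.get(year, 0) + 1
for year in sorted(observed):   # productive years tend to larger maxima
    print(year, counts[year], observed[year], null[year])
```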
NASA Astrophysics Data System (ADS)
Yun, S.; Koketsu, K.; Aoki, Y.
2014-12-01
The September 4, 2010, Canterbury earthquake, with a moment magnitude (Mw) of 7.1, was a crustal earthquake in the South Island, New Zealand. The February 22, 2011, Christchurch earthquake (Mw = 6.3) was the largest aftershock of the 2010 Canterbury earthquake, located about 50 km east of the mainshock. Both earthquakes occurred on previously unrecognized faults. Field observations indicate that the rupture of the 2010 Canterbury earthquake reached the surface; the surface rupture, with a length of about 30 km, is located about 4 km south of the epicenter. Various data, including the aftershock distribution and strong-motion seismograms, also suggest a very complex rupture process. For these reasons it is useful to investigate the complex rupture process using multiple data with various sensitivities to the rupture process. While previously published source models are based on one or two datasets, here we infer the rupture process with three datasets: InSAR, strong-motion, and teleseismic data. We first performed point-source inversions to derive the focal mechanism of the 2010 Canterbury earthquake. Based on the focal mechanism, the aftershock distribution, the surface fault traces, and the SAR interferograms, we assigned several source faults. We then performed a joint inversion to determine the rupture process of the 2010 Canterbury earthquake most suitable for reproducing all the datasets. The obtained slip distribution is in good agreement with the surface fault traces. We also performed similar inversions to reveal the rupture process of the 2011 Christchurch earthquake. Our result indicates a steep dip and large up-dip slip, which reveals that the observed large vertical ground motion around the source region is due to the rupture process rather than the local subsurface structure. To investigate the effects of the 3-D velocity structure on the characteristic strong-motion seismograms of the two earthquakes, we plan to perform the inversion taking the 3-D velocity structure of this region into account.
Garrison, Virginia H.; Beets, Jim; Friedlander, Alan M.; Canty, Steven
2011-01-01
In order to estimate (1) the trapping pressure within Virgin Islands National Park (VINP) waters, (2) the effect of fish traps on park marine resources (both fishes and habitats), and (3) the effectiveness of park regulations in protecting marine resources, traps set by fishers were visually observed and contents censused in situ in 1992, 1993, and 1994, around St. John (U.S. Virgin Islands), within and outside of park waters. A total of 1,340 individual fish (56 species and 23 families) were identified and their lengths estimated for the 211 of 285 visually censused traps that contained fish. This dataset includes for each censused trap: location, depth, substrate/habitat, trap type and construction details, in or out of park waters, and species and estimated fork length (in centimeters) of each individual fish in a trap. Analysis and interpretation of this dataset are provided in previously published reports by the author.
Can We Train Machine Learning Methods to Outperform the High-dimensional Propensity Score Algorithm?
Karim, Mohammad Ehsanul; Pang, Menglan; Platt, Robert W
2018-03-01
The use of retrospective health care claims datasets is frequently criticized for the lack of complete information on potential confounders. By utilizing patients' health status-related information from claims datasets as surrogates or proxies for mismeasured and unobserved confounders, the high-dimensional propensity score algorithm enables us to reduce bias. Using a previously published cohort study of post-myocardial infarction statin use (1998-2012), we compare the performance of the algorithm with a number of popular machine learning approaches for confounder selection in high-dimensional covariate spaces: random forest, least absolute shrinkage and selection operator, and elastic net. Our results suggest that, when the data analysis is done with epidemiologic principles in mind, machine learning methods perform as well as the high-dimensional propensity score algorithm. Using a plasmode framework that mimicked the empirical data, we also showed that a hybrid of the machine learning and high-dimensional propensity score algorithms generally performs slightly better than both in terms of mean squared error, when a bias-based analysis is used.
A strategy for evaluating pathway analysis methods.
Yu, Chenggang; Woo, Hyung Jun; Yu, Xueping; Oyama, Tatsuya; Wallqvist, Anders; Reifman, Jaques
2017-10-13
Researchers have previously developed a multitude of methods designed to identify biological pathways associated with specific clinical or experimental conditions of interest, with the aim of facilitating biological interpretation of high-throughput data. Before practically applying such pathway analysis (PA) methods, we must first evaluate their performance and reliability, using datasets where the pathways perturbed by the conditions of interest have been well characterized in advance. However, such 'ground truths' (or gold standards) are often unavailable. Furthermore, previous evaluation strategies that have focused on defining 'true answers' are unable to systematically and objectively assess PA methods under a wide range of conditions. In this work, we propose a novel strategy for evaluating PA methods independently of any gold standard, either established or assumed. The strategy involves the use of two mutually complementary metrics, recall and discrimination. Recall measures the consistency between the perturbed pathways identified by applying a particular analysis method to an original large dataset and those identified by the same method applied to a sub-dataset of the original dataset. In contrast, discrimination measures specificity: the degree to which the perturbed pathways identified by a particular method applied to a dataset from one experiment differ from those identified by the same method applied to a dataset from a different experiment. We used these metrics and 24 datasets to evaluate six widely used PA methods. The results highlighted the common challenge in reliably identifying significant pathways from small datasets. Importantly, we confirmed the effectiveness of our proposed dual-metric strategy by showing that previous comparative studies corroborate the performance evaluations of the six methods obtained by our strategy. Unlike any previously proposed strategy for evaluating the performance of PA methods, our dual-metric strategy does not rely on any ground truth, either established or assumed, of the pathways perturbed by a specific clinical or experimental condition. As such, our strategy allows researchers to systematically and objectively evaluate pathway analysis methods by employing any number of datasets for a variety of conditions.
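A minimal reading of the two metrics, assuming each PA method's output can be reduced to a set of significant pathway identifiers, is sketched below; the paper's exact definitions may differ in detail.

```python
# Sketch of recall and discrimination over sets of significant
# pathway IDs; set contents are hypothetical.
def recall(full_result: set, subset_result: set) -> float:
    """Consistency: fraction of pathways found on the full dataset
    that are recovered on a sub-dataset of it."""
    return len(full_result & subset_result) / len(full_result) if full_result else 0.0

def discrimination(result_a: set, result_b: set) -> float:
    """Specificity: how different the calls are between two unrelated
    experiments (1 = disjoint, 0 = identical)."""
    union = result_a | result_b
    return 1 - len(result_a & result_b) / len(union) if union else 0.0

full = {"apoptosis", "p53 signaling", "cell cycle"}
sub = {"apoptosis", "cell cycle"}
other = {"oxidative phosphorylation", "cell cycle"}
print(round(recall(full, sub), 3))            # 0.667
print(round(discrimination(full, other), 3))  # 0.75
```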
Diossy, M; Reiniger, L; Sztupinszki, Z; Krzystanek, M; Timms, K M; Neff, C; Solimeno, C; Pruss, D; Eklund, A C; Tóth, E; Kiss, O; Rusz, O; Cserni, G; Zombori, T; Székely, B; Tímár, J; Csabai, I; Szallasi, Z
2018-06-18
Based on its mechanism of action, PARP inhibitor therapy is expected to benefit mainly tumor cases with homologous recombination deficiency (HRD). Therefore, identification of tumor types with increased HRD is important for the optimal use of this class of therapeutic agents. HRD levels can be estimated using various mutational signatures from next generation sequencing data and we used this approach to determine whether breast cancer brain metastases show altered levels of HRD scores relative to their corresponding primary tumor. We used a previously published next generation sequencing dataset of twenty-one matched primary breast cancer/brain metastasis pairs to derive the various mutational signatures/HRD scores strongly associated with HRD. We also performed the myChoice HRD analysis on an independent cohort of seventeen breast cancer patients with matched primary/brain metastasis pairs. All of the mutational signatures indicative of HRD showed a significant increase in the brain metastases relative to their matched primary tumor in the previously published whole exome sequencing dataset. In the independent validation cohort the myChoice HRD assay showed an increased level in 87.5% of the brain metastases relative to the primary tumor, with 56% of brain metastases being HRD positive according to the myChoice criteria. The consistent observation that brain metastases of breast cancer tend to have higher HRD measures may raise the possibility that brain metastases may be more sensitive to PARP inhibitor treatment. This observation warrants further investigation to assess whether this increase is common to other metastatic sites as well, and whether clinical trials should adjust their strategy in the application of HRD measures for the prioritization of patients for PARP inhibitor therapy.
This ScienceHub entry was developed for the published paper: Consoer et al., 2016, Toxicokinetics of perfluorooctane sulfonate in rainbow trout (Oncorhynchus mykiss), Environ. Toxicol. Chem. 35:717-727. Individual rainbow trout were exposed to PFOS by bolus injection (elimination studies) or by adding PFOS to incoming water (branchial uptake studies). The trout were fitted with indwelling catheters and urinary cannulae to permit periodic collection of blood and urine. Additional sampling was conducted to evaluate PFOS uptake from and elimination to respired water. Data obtained from each fish were evaluated using a clearance-volume pharmacokinetic model. Modeled kinetic parameters were then averaged to develop summary statistics, which were used as a basis for interpreting modeled results and making comparisons to a previous study of rainbow trout exposed to perfluorooctanoate (PFOA; Consoer et al., 2014, Aquat. Toxicol. 156:65-73). The results of this study, combined with those of the previous PFOA study, suggest that PFOA is a substrate for renal transporters in fish, while glomerular filtration alone may be sufficient to explain the observed renal elimination of PFOS. These findings demonstrate that models developed to predict the bioaccumulation of perfluoroalkyl acids by fish must account for differences in renal clearance of individual compounds. This dataset is associated with the following publication: Consoer, D., A. Hoffman, P. Fitzsimmons, P. Kosia
Minutes of the CD-ROM Workshop
NASA Technical Reports Server (NTRS)
King, Joseph H.; Grayzeck, Edwin J.
1989-01-01
The workshop described in this document had two goals: (1) to establish guidelines for the CD-ROM as a tool to distribute datasets; and (2) to evaluate current scientific CD-ROM projects as an archive. Workshop attendees were urged to coordinate with European groups to develop CD-ROM, which is already available at low cost in the U.S., as a distribution medium for astronomical datasets. It was noted that NASA has made the CD Publisher at the National Space Science Data Center (NSSDC) available to the scientific community when the Publisher is not needed for NASA work. NSSDC's goal is to provide the Publisher's user with the hardware and software tools needed to design a user's dataset for distribution. This includes producing a master CD and copies. The prerequisite premastering process is described, as well as guidelines for CD-ROM construction. The production of discs was evaluated. CD-ROM projects, guidelines, and problems of the technology were discussed.
Yang, Fang; Chia, Nicholas; White, Bryan A; Schook, Lawrence B
2013-04-23
Perturbations in intestinal microbiota composition have been associated with a variety of gastrointestinal tract-related diseases. The alleviation of symptoms has been achieved using treatments that alter the gastrointestinal tract microbiota toward that of healthy individuals. Identifying differences in microbiota composition through the use of 16S rRNA gene hypervariable tag sequencing has profound health implications. Current computational methods for comparing microbial communities are usually based on multiple alignments and phylogenetic inference, making them time consuming and requiring exceptional expertise and computational resources. As sequencing data rapidly grows in size, simpler analysis methods are needed to meet the growing computational burdens of microbiota comparisons. Thus, we have developed a simple, rapid, and accurate method, independent of multiple alignments and phylogenetic inference, to support microbiota comparisons. We create a metric, called compression-based distance (CBD), for quantifying the degree of similarity between microbial communities. CBD uses the repetitive nature of hypervariable tag datasets and well-established compression algorithms to approximate the total information shared between two datasets. Three published microbiota datasets were used as test cases for CBD as an applicable tool. Our study revealed that CBD recaptured 100% of the statistically significant conclusions reported in the previous studies, while achieving a decrease in computational time required when compared to similar tools without expert user intervention. CBD provides a simple, rapid, and accurate method for assessing distances between gastrointestinal tract microbiota 16S hypervariable tag datasets. PMID:23617892
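A rough illustration of the compression-based idea, using gzip and the well-known normalized compression distance (NCD) form as a stand-in for the paper's exact CBD formula, is given below on toy sequence data.

```python
# NCD-style compression distance on toy byte strings; the paper's
# exact CBD formulation may differ in detail.
import gzip

def compressed_size(data: bytes) -> int:
    """Size in bytes of the gzip-compressed input."""
    return len(gzip.compress(data, compresslevel=9))

def compression_distance(x: bytes, y: bytes) -> float:
    """Near 0 for highly similar inputs, near 1 for unrelated inputs."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

seq_a = b"ACGTACGTACGT" * 500   # toy stand-ins for 16S hypervariable tag sets
seq_b = b"TTGACCAGGTCA" * 500
print(round(compression_distance(seq_a, seq_a), 3))  # near 0: identical sets
print(round(compression_distance(seq_a, seq_b), 3))  # larger: divergent sets
```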
BiPACE 2D: graph-based multiple alignment for comprehensive 2D gas chromatography-mass spectrometry.
Hoffmann, Nils; Wilhelm, Mathias; Doebbe, Anja; Niehaus, Karsten; Stoye, Jens
2014-04-01
Comprehensive 2D gas chromatography-mass spectrometry is an established method for the analysis of complex mixtures in analytical chemistry and metabolomics. It produces large amounts of data that require semiautomatic, but preferably automatic handling. This involves the location of significant signals (peaks) and their matching and alignment across different measurements. To date, there exist only a few openly available algorithms for the retention time alignment of peaks originating from such experiments that scale well with increasing sample and peak numbers, while providing reliable alignment results. We describe BiPACE 2D, an automated algorithm for retention time alignment of peaks from 2D gas chromatography-mass spectrometry experiments and evaluate it on three previously published datasets against the mSPA, SWPA and Guineu algorithms. We also provide a fourth dataset from an experiment studying the H2 production of two different strains of Chlamydomonas reinhardtii that is available from the MetaboLights database together with the experimental protocol, peak-detection results and manually curated multiple peak alignment for future comparability with newly developed algorithms. BiPACE 2D is contained in the freely available Maltcms framework, version 1.3, hosted at http://maltcms.sf.net, under the terms of the L-GPL v3 or Eclipse Open Source licenses. The software used for the evaluation along with the underlying datasets is available at the same location. The C.reinhardtii dataset is freely available at http://www.ebi.ac.uk/metabolights/MTBLS37.
Percolation of binary disk systems: Modeling and theory
Meeks, Kelsey; Tencer, John; Pantoya, Michelle L.
2017-01-12
The dispersion and connectivity of particles with a high degree of polydispersity is relevant to problems involving composite material properties and reaction decomposition prediction, and has been the subject of much study in the literature. This paper utilizes Monte Carlo models to predict percolation thresholds for two-dimensional systems containing disks of two different radii. Monte Carlo simulations and spanning probability are used to extend prior models into regions of higher polydispersity than those previously considered. A correlation to predict the percolation threshold for binary disk systems is proposed based on the extended dataset presented in this work and compared to previously published correlations. Finally, a set of boundary conditions necessary for a good fit is presented, and a condition for maximizing the percolation threshold for binary disk systems is suggested.
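A minimal Monte Carlo sketch of the spanning-probability approach for a binary disk system is given below; the radii, number fraction, and disk count are illustrative parameters, not those used in the study.

```python
# Monte Carlo sketch: disks of two radii are placed uniformly at random
# in a unit square; the system "spans" when a chain of overlapping disks
# connects the left and right edges (tracked with union-find).
import random

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

def union(parent, a, b):
    parent[find(parent, a)] = find(parent, b)

def spans(n, r_small, r_large, frac_large, rng):
    radii = [r_large if rng.random() < frac_large else r_small for _ in range(n)]
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    parent = list(range(n + 2))  # virtual nodes: n = left edge, n+1 = right edge
    for i in range(n):
        x, y = pts[i]
        if x - radii[i] <= 0: union(parent, i, n)
        if x + radii[i] >= 1: union(parent, i, n + 1)
        for j in range(i):     # disks overlap if centers are close enough
            dx, dy = x - pts[j][0], y - pts[j][1]
            if dx * dx + dy * dy <= (radii[i] + radii[j]) ** 2:
                union(parent, i, j)
    return find(parent, n) == find(parent, n + 1)

rng = random.Random(0)
trials = 100
hits = sum(spans(400, 0.02, 0.06, 0.3, rng) for _ in range(trials))
print(f"spanning probability ~ {hits / trials:.2f}")
```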
NASA Astrophysics Data System (ADS)
Sottile, G. D.; Echeverria, M. E.; Mancini, M. V.; Bianchi, M. M.; Marcos, M. A.; Bamonte, F. P.
2015-06-01
The Southern Hemisphere Westerly Winds (SWW) constitute an important zonal circulation system that dominates the dynamics of Southern Hemisphere mid-latitude climate. Little is known about climatic changes in southern South America in comparison to the Northern Hemisphere, due to the low density of proxy records with adequate chronology and sampling resolution to address environmental changes of the last 2000 years. Since 2009, new pollen and charcoal records from bogs and lakes in northern and southern Patagonia on the east side of the Andes have been published, with an adequate calibration of pollen assemblages against modern vegetation and ecological behaviour. In this work we improve the chronological control of some previously published eastern Andean sequences and integrate the pollen and charcoal datasets available east of the Andes to interpret possible environmental and SWW variability at centennial time scales. Through the analysis of modern and past hydric balance dynamics, we compare these scenarios with other western Andean SWW-sensitive proxy records for the last 2000 years. Because of the distinct precipitation regimes between the Northern (40-45° S) and Southern Patagonia (48-52° S) pollen site locations, shifts in the latitude and strength of the SWW result in large changes in water availability in forest and steppe communities. We can therefore interpret the fossil pollen datasets as records of changes in paleohydric balance at each site, by constructing paleohydric indices and comparing them to charcoal records over the last 2000 cal yr BP. Our composite pollen-based Northern and Southern Patagonia indices can be interpreted as changes in the latitudinal position and intensity of the SWW, respectively. Dataset integration suggests poleward SWW between 2000 and 750 cal yr BP and northward, weaker SWW during the Little Ice Age (750-200 cal yr BP). These SWW variations are synchronous with major shifts in Patagonian fire activity. We found an in-phase fire regime (in terms of the timing of biomass burning) between the northern Patagonia Monte shrubland and Southern Patagonia steppe environments. Conversely, there is an antiphase fire regime between Northern and Southern Patagonia forest and forest-steppe ecotone environments. SWW variability may be associated with ENSO variability, especially during the last millennium. For the last 200 cal yr BP, we conclude that the SWW belt was more intense and poleward than in the previous interval. Our composite pollen-based SWW indices show the potential of pollen dataset integration to improve the understanding of paleohydric variability, especially for the last 2000 years in Patagonia.
Longo, S J; Faircloth, B C; Meyer, A; Westneat, M W; Alfaro, M E; Wainwright, P C
2017-08-01
Phylogenetics is undergoing a revolution as large-scale molecular datasets reveal unexpected but repeatable rearrangements of clades that were previously thought to be disparate lineages. One of the most unusual clades of fishes that has been found using large-scale molecular datasets is an expanded Syngnathiformes including traditional long-snouted syngnathiform lineages (Aulostomidae, Centriscidae, Fistulariidae, Solenostomidae, Syngnathidae), as well as a diverse set of largely benthic-associated fishes (Callionymoidei, Dactylopteridae, Mullidae, Pegasidae) that were previously dispersed across three orders. The monophyly of this surprising clade of fishes has been upheld by recent studies utilizing both nuclear and mitogenomic data, but the relationships among major lineages within Syngnathiformes remain ambiguous; previous analyses have inconsistent topologies and are plagued by low support at deep divergences between the major lineages. In this study, we use a dataset of ultraconserved elements (UCEs) to conduct the first phylogenomic study of Syngnathiformes. UCEs have been effective markers for resolving deep phylogenetic relationships in fishes and, combined with increased taxon sampling, we expected UCEs to resolve problematic syngnathiform relationships. Overall, UCEs were effective at resolving relationships within Syngnathiformes at a range of evolutionary timescales. We find consistent support for the monophyly of traditional long-snouted syngnathiform lineages (Aulostomidae, Centriscidae, Fistulariidae, Solenostomidae, Syngnathidae), which better agrees with morphological hypotheses than previously published topologies from molecular data. This result was supported by all Bayesian and maximum likelihood analyses, was robust to differences in matrix completeness and potential sources of bias, and was highly supported in coalescent-based analyses in ASTRAL when matrices were filtered to contain the most phylogenetically informative loci. While Bayesian and maximum likelihood analyses found support for a benthic-associated clade (Callionymidae, Dactylopteridae, Mullidae, and Pegasidae) as sister to the long-snouted clade, this result was not replicated in the ASTRAL analyses. The base of our phylogeny is characterized by short internodes separating major syngnathiform lineages and is consistent with the hypothesis of an ancient rapid radiation at the base of Syngnathiformes. Syngnathiformes therefore present an exciting opportunity to study patterns of morphological variation and functional innovation arising from rapid but ancient radiation. Copyright © 2017 Elsevier Inc. All rights reserved.
Graham, Jennifer L.; Foster, Guy M.; Williams, Thomas J.; Kramer, Ariele R.; Harris, Theodore D.
2017-03-31
Cheney Reservoir, located in south-central Kansas, is one of the primary drinking-water supplies for the city of Wichita and an important recreational resource. Since 1990, cyanobacterial blooms have been present occasionally in Cheney Reservoir, resulting in increased treatment costs and decreased recreational use. Cyanobacteria, the cyanotoxin microcystin, and the taste-and-odor compounds geosmin and 2-methylisoborneol have been measured in Cheney Reservoir by the U.S. Geological Survey, in cooperation with the city of Wichita, for about 16 years. The purpose of this report is to describe the occurrence of cyanobacteria, microcystin, and taste-and-odor compounds in Cheney Reservoir during May 2001 through June 2016 and to update previously published logistic regression models that used continuous water-quality data to estimate the probability of microcystin and geosmin occurrence above relevant thresholds.Cyanobacteria, microcystin, and geosmin were detected in about 84, 52, and 31 percent of samples collected in Cheney Reservoir during May 2001 through June 2016, respectively. 2-methylisoborneol was less common, detected in only 3 percent of samples. Microcystin and geosmin concentrations exceeded advisory values of concern more frequently than cyanobacterial abundance; therefore, cyanobacteria are not a good indicator of the presence of these taste-and-odor compounds in Cheney Reservoir. Broad seasonal patterns in cyanobacteria and microcystin were evident, though abundance and concentration varied by orders of magnitude across years. Cyanobacterial abundances generally peaked in late summer or early fall (August through October), and smaller peaks were observed in winter (January through February). In a typical year, microcystin was first detected in June or July, increased to its seasonal maxima in the summer (July through September), and then decreased. Seasonal patterns in geosmin were less consistent than cyanobacteria and microcystin, but geosmin typically had a small peak during winter (January through March) during most years and a large peak during summer (July through September) during some years. Though the relation between cyanobacterial abundance and microcystin and geosmin concentrations was positive, overall correlations were weak, likely because production is strain-specific and cyanobacterial strain composition may vary substantially over time. Microcystin often was present without taste-and-odor compounds. By comparison, where taste-and-odor compounds were present, microcystin frequently was detected. Taste-and-odor compounds, therefore, may be used as indicators that microcystin may be present; however, microcystin was present without taste-and-odor compounds, so taste or odor alone does not provide sufficient warning to ensure human-health protection.Logistic regression models that estimate the probability of microcystin occurrence at concentrations greater than or equal to 0.1 micrograms per liter and geosmin occurrence at concentrations greater than or equal to 5 nanograms per liter were developed. Models were developed using the complete dataset (January 2003 through June 2016 for microcystin [14-year dataset]; May 2001 through June 2016 for geosmin [16-year dataset]) and an abbreviated 4-year dataset (January 2013 through June 2016 for microcystin and geosmin). Performance of the newly developed models was compared with previously published models that were developed using data collected during May 2001 through December 2009. 
A seasonal component and chlorophyll fluorescence (a surrogate for algal biomass) were the explanatory variables for microcystin occurrence at concentrations greater than or equal to 0.1 micrograms per liter in all models. All models were relatively robust, though the previously published and 14-year models performed better over time; however, as a tool to estimate microcystin occurrence at concentrations greater than or equal to 0.1 micrograms per liter in a real-time notification system near the Cheney Dam, the 4-year model is most representative of recent (2013 through 2016) conditions. All models for geosmin occurrence at concentrations greater than or equal to 5 nanograms per liter had different explanatory variables and model forms. The previously published and 16-year models were not robust over time, likely because of changing environmental conditions and seasonal patterns in geosmin occurrence. By comparison, the abbreviated 4-year model may be a useful tool to estimate geosmin occurrence at concentrations greater than or equal to 5 nanograms per liter in a real-time notification system near the Cheney Dam. The better performance of the abbreviated 4-year geosmin model during 2013 through 2016 relative to the previously published and 16-year models demonstrates the need for continuous reevaluation of models estimating the probability of occurrence.
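A sketch of the general model form described above, a logistic regression of microcystin occurrence above the 0.1-microgram-per-liter threshold on a seasonal component plus chlorophyll fluorescence, is given below on synthetic data; the explanatory variables are named as in the abstract, but the data and fitted coefficients are invented.

```python
# Logistic-regression sketch: P(microcystin >= 0.1 ug/L) from a seasonal
# component (sine/cosine of day of year) plus chlorophyll fluorescence.
# All data here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
day = rng.integers(1, 366, n)
chl = rng.gamma(2.0, 2.0, n)                  # chlorophyll fluorescence (toy)
season_sin = np.sin(2 * np.pi * day / 365)    # seasonal component
season_cos = np.cos(2 * np.pi * day / 365)
logit = -3 + 0.4 * chl + 1.2 * season_sin     # invented "true" model
y = rng.random(n) < 1 / (1 + np.exp(-logit))  # occurrence above threshold

X = np.column_stack([chl, season_sin, season_cos])
model = LogisticRegression().fit(X, y)
new = np.array([[6.0, np.sin(2 * np.pi * 220 / 365), np.cos(2 * np.pi * 220 / 365)]])
print(f"P(microcystin >= 0.1 ug/L) ~ {model.predict_proba(new)[0, 1]:.2f}")
```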
Reduced changes in protein compared to mRNA levels across non-proliferating tissues.
Perl, Kobi; Ushakov, Kathy; Pozniak, Yair; Yizhar-Barnea, Ofer; Bhonker, Yoni; Shivatzki, Shaked; Geiger, Tamar; Avraham, Karen B; Shamir, Ron
2017-04-18
The quantitative relations between RNA and protein are fundamental to biology and are still not fully understood. Across taxa, it was demonstrated that the protein-to-mRNA ratio in steady state varies in a direction that lessens the change in protein levels as a result of changes in the transcript abundance. Evidence for this behavior in tissues is sparse. We tested this phenomenon in new data that we produced for the mouse auditory system, and in previously published tissue datasets. A joint analysis of the transcriptome and proteome was performed across four datasets: inner-ear mouse tissues, mouse organ tissues, lymphoblastoid primate samples and human cancer cell lines. We show that the protein levels are more conserved than the mRNA levels in all datasets, and that changes in transcription are associated with translational changes that exert opposite effects on the final protein level, in all tissues except cancer. Finally, we observe that some functions are enriched in the inner ear on the mRNA level but not in protein. We suggest that partial buffering between transcription and translation ensures that proteins can be made rapidly in response to a stimulus. Accounting for the buffering can improve the prediction of protein levels from mRNA levels.
The effect of leverage and/or influential on structure-activity relationships.
Bolboacă, Sorana D; Jäntschi, Lorentz
2013-05-01
In the spirit of reporting valid and reliable Quantitative Structure-Activity Relationship (QSAR) models, the aim of our research was to assess how the leverage (analysis with the hat matrix, h(i)) and influence (analysis with Cook's distance, D(i)) of QSAR models may reflect the models' reliability and characteristics. The datasets included in this research were collected from previously published papers. Seven datasets that met the imposed inclusion criteria were analyzed. Three models were obtained for each dataset (full model, h(i)-model, and D(i)-model) and several statistical validation criteria were applied to the models. In 5 out of 7 sets the correlation coefficient increased when compounds with either h(i) or D(i) higher than the threshold were removed. Withdrawn compounds varied from 2 to 4 for h(i)-models and from 1 to 13 for D(i)-models. Validation statistics showed that D(i)-models possess systematically better agreement than both full models and h(i)-models. Removal of influential compounds from the training set significantly improves the model and is recommended during the development of quantitative structure-activity relationships. The Cook's distance approach should be combined with hat matrix analysis in order to identify the compounds that are candidates for removal.
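The two diagnostics are standard and easy to reproduce for a linear QSAR model, as sketched below; the warning thresholds shown (3p/n for leverage, 4/n for Cook's distance) are common defaults and may differ from the paper's cutoffs, and the data are synthetic.

```python
# Hat-matrix leverage h_i and Cook's distance D_i for a toy linear
# QSAR fit; thresholds are common defaults, not the paper's.
import numpy as np

rng = np.random.default_rng(42)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # descriptors
beta = np.array([1.0, 2.0, -1.5])
y = X @ beta + rng.normal(scale=0.5, size=n)                    # toy activity

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
h = np.diag(H)                             # leverages h_i
resid = y - H @ y
s2 = resid @ resid / (n - p)               # residual variance
cooks = resid**2 / (p * s2) * h / (1 - h) ** 2

print("high-leverage compounds:", np.where(h > 3 * p / n)[0])
print("influential compounds:  ", np.where(cooks > 4 / n)[0])
```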
Bouhrara, Mustapha; Reiter, David A; Sexton, Kyle W; Bergeron, Christopher M; Zukley, Linda M; Spencer, Richard G
2017-11-01
We applied our recently introduced Bayesian analytic method to achieve clinically-feasible in-vivo mapping of the proteoglycan water fraction (PgWF) of human knee cartilage with improved spatial resolution and stability as compared to existing methods. Multicomponent driven equilibrium single-pulse observation of T1 and T2 (mcDESPOT) datasets were acquired from the knees of two healthy young subjects and one older subject with previous knee injury. Each dataset was processed using Bayesian Monte Carlo (BMC) analysis incorporating a two-component tissue model. We assessed the performance and reproducibility of BMC and of the conventional analysis of stochastic region contraction (SRC) in the estimation of PgWF. Stability of the BMC analysis of PgWF was tested by comparing independent high-resolution (HR) datasets from each of the two young subjects. Unlike SRC, the BMC-derived maps from the two HR datasets were essentially identical. Furthermore, SRC maps showed substantial random variation in estimated PgWF, and mean values that differed from those obtained using BMC. In addition, PgWF maps derived from conventional low-resolution (LR) datasets exhibited partial volume and magnetic susceptibility effects. These artifacts were absent in HR PgWF images. Finally, our analysis showed regional variation in PgWF estimates, and substantially higher values in the younger subjects as compared to the older subject. BMC-mcDESPOT permits HR in-vivo mapping of PgWF in human knee cartilage in a clinically-feasible acquisition time. HR mapping reduces the impact of partial volume and magnetic susceptibility artifacts compared to LR mapping. Finally, BMC-mcDESPOT demonstrated excellent reproducibility in the determination of PgWF. Published by Elsevier Inc.
Spatial distribution of pingos in Northern Asia
Grosse, G.; Jones, Benjamin M.
2010-01-01
Pingos are prominent periglacial landforms in vast regions of the Arctic and Subarctic. They are indicators of modern and past conditions of permafrost, surface geology, hydrology and climate. A first version of a detailed spatial geodatabase of more than 6000 pingo locations in a 3.5 × 10⁶ km² region of Northern Asia was assembled from topographic maps. A first-order analysis was carried out with respect to permafrost, landscape characteristics, surface geology, hydrology, climate, and elevation datasets using a Geographic Information System (GIS). Pingo heights in the dataset vary between 2 and 37 m, with a mean height of 4.8 m. About 64% of the pingos occur in continuous permafrost with high ice content and thick sediments; another 19% in continuous permafrost with moderate ice content and thick sediments. The majority of these pingos likely formed through closed-system freezing, typical of those located in drained thermokarst lake basins of northern lowlands with continuous permafrost. About 82% of the pingos are located in the tundra bioclimatic zone. Most pingos in the dataset are located in regions with mean annual ground temperatures between −3 and −11 °C and mean annual air temperatures between −7 and −18 °C. The dataset confirms that surface geology and hydrology are key factors for pingo formation and occurrence. Based on model predictions for near-future permafrost distribution, hundreds of pingos along the southern margins of permafrost will be located in regions with thawing permafrost by 2100, which ultimately may lead to increased occurrence of pingo collapse. Based on our dataset and previously published estimates of pingo numbers from other regions, we conclude that there are more than 11,000 pingos on Earth. © 2010 Author(s).
NASA Astrophysics Data System (ADS)
Minnett, R.; Koppers, A. A.; Tauxe, L.; Constable, C.; Jarboe, N. A.
2011-12-01
The Magnetics Information Consortium (MagIC) provides an archive for the wealth of rock- and paleomagnetic data and interpretations from studies on natural and synthetic samples. As with many fields, most peer-reviewed paleo- and rock magnetic publications only include high-level results. However, access to the raw data from which these results were derived is critical for compilation studies and when updating results based on new interpretation and analysis methods. MagIC provides a detailed metadata model with places for everything from raw measurements to their interpretations. Prior to MagIC, these raw data were extremely cumbersome to collect because they mostly existed in a lab's proprietary format on investigators' personal computers or undigitized in field notebooks. MagIC has developed a suite of offline and online tools to enable the paleomagnetic, rock magnetic, and affiliated scientific communities to easily contribute both their previously published data and data supporting an article undergoing peer review, to retrieve well-annotated published interpretations and raw data, and to analyze and visualize large collections of published data online. Here we present the technology we chose (including VBA in Excel spreadsheets, Python libraries, FastCGI JSON webservices, Oracle procedures, and jQuery user interfaces) and how we implemented it in order to serve the scientific community as seamlessly as possible. These tools are now in use in labs worldwide, have helped archive many valuable legacy studies and datasets, and routinely enable new contributions to the MagIC Database (http://earthref.org/MAGIC/).
Abràmoff, Michael David; Lou, Yiyue; Erginay, Ali; Clarida, Warren; Amelon, Ryan; Folk, James C; Niemeijer, Meindert
2016-10-01
To compare the performance of a deep-learning enhanced algorithm for automated detection of diabetic retinopathy (DR) to the previously published performance of that algorithm, the Iowa Detection Program (IDP), without deep learning components, on the same publicly available set of fundus images and the previously reported consensus reference standard set by three US Board-certified retinal specialists. We used the previously reported consensus reference standard of referable DR (rDR), defined as International Clinical Classification of Diabetic Retinopathy moderate, severe nonproliferative (NPDR), proliferative DR, and/or macular edema (ME). Neither the Messidor-2 images nor the three retinal specialists setting the Messidor-2 reference standard were used for training IDx-DR version X2.1. Sensitivity, specificity, negative predictive value, area under the curve (AUC), and their confidence intervals (CIs) were calculated. Sensitivity was 96.8% (95% CI: 93.3%-98.8%), specificity was 87.0% (95% CI: 84.2%-89.4%), with 6/874 false negatives, resulting in a negative predictive value of 99.0% (95% CI: 97.8%-99.6%). No cases of severe NPDR, PDR, or ME were missed. The AUC was 0.980 (95% CI: 0.968-0.992). Sensitivity was not statistically different from the published IDP sensitivity, which had a CI of 94.4% to 99.3%, but specificity was significantly better than the published IDP specificity CI of 55.7% to 63.0%. A deep-learning enhanced algorithm for the automated detection of DR achieves significantly better performance than a previously reported, otherwise essentially identical, algorithm that does not employ deep learning. Deep-learning enhanced algorithms have the potential to improve the efficiency of DR screening, and thereby to prevent visual loss and blindness from this devastating disease.
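For reference, the reported screening metrics are straightforward to compute from predictions and a reference standard, as in the sketch below on synthetic scores; none of the numbers reproduce the study's data.

```python
# Sensitivity, specificity, NPV, and AUC on a synthetic screening run;
# prevalence and score distributions are invented, not Messidor-2.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y_true = rng.random(874) < 0.22                           # toy rDR prevalence
scores = np.clip(y_true * 0.8 + rng.normal(0.1, 0.18, 874), 0, 1)
y_pred = scores >= 0.5

tp = np.sum(y_pred & y_true); fn = np.sum(~y_pred & y_true)
tn = np.sum(~y_pred & ~y_true); fp = np.sum(y_pred & ~y_true)
print(f"sensitivity {tp / (tp + fn):.3f}")
print(f"specificity {tn / (tn + fp):.3f}")
print(f"NPV         {tn / (tn + fn):.3f}")
print(f"AUC         {roc_auc_score(y_true, scores):.3f}")
```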
VIMS spectral mapping observations of Titan during the Cassini prime mission
Barnes, J.W.; Soderblom, J.M.; Brown, R.H.; Buratti, B.J.; Sotin, Christophe; Baines, K.H.; Clark, R.N.; Jaumann, R.; McCord, T.B.; Nelson, R.; Le Mouelic, S.; Rodriguez, S.; Griffith, C.; Penteado, P.; Tosi, F.; Pitman, K.M.; Soderblom, L.; Stephan, K.; Hayne, P.; Vixie, G.; Bibring, J.-P.; Bellucci, G.; Capaccioni, F.; Cerroni, P.; Coradini, A.; Cruikshank, D.P.; Drossart, P.; Formisano, V.; Langevin, Y.; Matson, D.L.; Nicholson, P.D.; Sicardy, B.
2009-01-01
This is a data paper designed to facilitate the use of and comparisons to Cassini/visual and infrared mapping spectrometer (VIMS) spectral mapping data of Saturn's moon Titan. We present thumbnail orthographic projections of flyby mosaics from each Titan encounter during the Cassini prime mission, 2004 July 1 through 2008 June 30. For each flyby we also describe the encounter geometry, and we discuss the studies that have previously been published using the VIMS dataset. The resulting compilation of metadata provides a complementary big-picture overview of the VIMS data in the public archive, and should be a useful reference for future Titan studies. © 2009 Elsevier Ltd.
NASA Astrophysics Data System (ADS)
Salamuniccar, G.; Loncaric, S.
2008-03-01
The catalogue from our previous work was merged with the data of Barlow, Rodionova, Boyce, and Kuzmin. The resulting ground-truth catalogue with 57,633 craters was registered, using MOLA data, with the THEMIS-DIR, MDIM, and MOC datasets.
National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) scientists have released a dataset of proteins and phosphopeptides identified through deep proteomic and phosphoproteomic analysis of breast tumor samples, previously genomically analyzed by The Cancer Genome Atlas (TCGA).
SPANG: a SPARQL client supporting generation and reuse of queries for distributed RDF databases.
Chiba, Hirokazu; Uchiyama, Ikuo
2017-02-08
Toward improved interoperability of distributed biological databases, an increasing number of datasets have been published in the standardized Resource Description Framework (RDF). Although the powerful SPARQL Protocol and RDF Query Language (SPARQL) provides a basis for exploiting RDF databases, writing SPARQL code is burdensome for users including bioinformaticians. Thus, an easy-to-use interface is necessary. We developed SPANG, a SPARQL client that has unique features for querying RDF datasets. SPANG dynamically generates typical SPARQL queries according to specified arguments. It can also call SPARQL template libraries constructed in a local system or published on the Web. Further, it enables combinatorial execution of multiple queries, each with a distinct target database. These features facilitate easy and effective access to RDF datasets and integrative analysis of distributed data. SPANG helps users to exploit RDF datasets by generation and reuse of SPARQL queries through a simple interface. This client will enhance integrative exploitation of biological RDF datasets distributed across the Web. This software package is freely available at http://purl.org/net/spang.
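SPANG builds on the standard SPARQL protocol, in which a query is sent over HTTP and results are returned as JSON; a minimal client along those lines is sketched below. The endpoint URL is only an example of a public SPARQL service, and the sketch is not SPANG's own code.

```python
# Minimal SPARQL-protocol client: HTTP GET with the query string,
# requesting JSON results. Endpoint is an example public service.
import requests

def sparql_select(endpoint: str, query: str) -> list:
    resp = requests.get(
        endpoint,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
for row in sparql_select("https://sparql.uniprot.org/sparql", query):
    print({k: v["value"] for k, v in row.items()})
```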
Pathway analysis of genome-wide association datasets of personality traits.
Kim, H-N; Kim, B-H; Cho, J; Ryu, S; Shin, H; Sung, J; Shin, C; Cho, N H; Sung, Y A; Choi, B-O; Kim, H-L
2015-04-01
Although several genome-wide association (GWA) studies of human personality have been recently published, genetic variants that are highly associated with certain personality traits remain unknown, due to difficulty reproducing results. To further investigate these genetic variants, we assessed biological pathways using GWA datasets. Pathway analysis using GWA data was performed on 1089 Korean women whose personality traits were measured with the Revised NEO Personality Inventory for the 5-factor model of personality. A total of 1042 pathways containing 8297 genes were included in our study. Of these, 14 pathways were highly enriched with association signals that were validated in 1490 independent samples. These pathways include association of: Neuroticism with axon guidance [L1 cell adhesion molecule (L1CAM) interactions]; Extraversion with neuronal system and voltage-gated potassium channels; Agreeableness with L1CAM interaction, neurotransmitter receptor binding and downstream transmission in postsynaptic cells; and Conscientiousness with the interferon-gamma and platelet-derived growth factor receptor beta polypeptide pathways. Several genes that contribute to top-ranked pathways in this study were previously identified in GWA studies or by pathway analysis in schizophrenia or other neuropsychiatric disorders. Here we report the first pathway analysis of all five personality traits. Importantly, our analysis identified novel pathways that contribute to understanding the etiology of personality traits. © 2015 The Authors. Genes, Brain and Behavior published by International Behavioural and Neural Genetics Society and John Wiley & Sons Ltd.
Species longevity in North American fossil mammals.
Prothero, Donald R
2014-08-01
Species longevity in the fossil record is related to many paleoecological variables and is important to macroevolutionary studies, yet there are very few reliable data on average species durations in Cenozoic fossil mammals. Many of the online databases (such as the Paleobiology Database) use only genera of North American Cenozoic mammals and there are severe problems because key groups (e.g. camels, oreodonts, pronghorns and proboscideans) have no reliable updated taxonomy, with many invalid genera and species and/or many undescribed genera and species. Most of the published datasets yield species duration estimates of approximately 2.3-4.3 Myr for larger mammals, with small mammals tending to have shorter species durations. My own compilation of all the valid species durations in families with updated taxonomy (39 families, containing 431 genera and 998 species, averaging 2.3 species per genus) yields a mean duration of 3.21 Myr for larger mammals. This breaks down to 4.10-4.39 Myr for artiodactyls, 3.14-3.31 Myr for perissodactyls and 2.63-2.95 Myr for carnivorous mammals (carnivorans plus creodonts). These averages are based on a much larger, more robust dataset than most previous estimates, so they should be more reliable for any studies that need species longevity to be accurately estimated. © 2013 International Society of Zoological Sciences, Institute of Zoology/Chinese Academy of Sciences and Wiley Publishing Asia Pty Ltd.
A comparison of public datasets for acceleration-based fall detection.
Igual, Raul; Medrano, Carlos; Plaza, Inmaculada
2015-09-01
Falls are one of the leading causes of mortality among the older population, and rapid detection of a fall is a key factor in mitigating its main adverse health consequences. In this context, several authors have conducted studies on acceleration-based fall detection using external accelerometers or smartphones. The published detection rates are diverse, sometimes close to a perfect detector. This divergence may be explained by the difficulty of comparing different fall detection studies on equal terms, since each study uses its own dataset obtained under different conditions. In this regard, several datasets have been made publicly available recently. This paper presents a comparison, to the best of our knowledge for the first time, of these public fall detection datasets in order to determine whether they have an influence on the declared performances. Using two different detection algorithms, the study shows that the performances of the fall detection techniques are affected, to a greater or lesser extent, by the specific datasets used to validate them. We have also found large differences in the generalization capability of a fall detector depending on the dataset used for training. In fact, the performance decreases dramatically when the algorithms are tested on a dataset different from the one used for training. Other characteristics of the datasets, such as the number of training samples, also have an influence on the performance, while the algorithms seem less sensitive to the sampling frequency or the acceleration range. Copyright © 2015 IPEM. Published by Elsevier Ltd. All rights reserved.
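Many of the algorithms compared on these datasets descend from simple threshold detectors on the acceleration magnitude, of the kind sketched below; the free-fall and impact thresholds, window length, and toy trace are all illustrative.

```python
# Threshold-style fall detector sketch: flag a fall when a free-fall
# dip in acceleration magnitude is followed closely by an impact peak.
import math

def detect_fall(samples, fs=50, low_g=0.6, high_g=2.5, window_s=1.0):
    """samples: list of (ax, ay, az) in units of g; fs: sampling rate in Hz."""
    mags = [math.sqrt(ax**2 + ay**2 + az**2) for ax, ay, az in samples]
    window = int(window_s * fs)
    for i, m in enumerate(mags):
        if m < low_g:                                        # free-fall phase
            if any(v > high_g for v in mags[i : i + window]):  # impact peak
                return True
    return False

# toy trace: rest (1 g), brief free fall, hard impact, rest
trace = ([(0, 0, 1.0)] * 100 + [(0, 0, 0.2)] * 10
         + [(0, 0, 3.1)] * 3 + [(0, 0, 1.0)] * 100)
print(detect_fall(trace))  # True
```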
ASSISTments Dataset from Multiple Randomized Controlled Experiments
ERIC Educational Resources Information Center
Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil
2016-01-01
In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…
CamMedNP: building the Cameroonian 3D structural natural products database for virtual screening.
Ntie-Kang, Fidele; Mbah, James A; Mbaze, Luc Meva'a; Lifongo, Lydia L; Scharfe, Michael; Hanna, Joelle Ngo; Cho-Ngwa, Fidelis; Onguéné, Pascal Amoa; Owono Owono, Luc C; Megnassan, Eugene; Sippl, Wolfgang; Efange, Simon M N
2013-04-16
Computer-aided drug design (CADD) often involves virtual screening (VS) of large compound datasets, and the availability of such datasets is vital for drug discovery protocols. We present CamMedNP, a new database beginning with more than 2,500 compounds of natural origin, along with some of their derivatives obtained through hemisynthesis. These are pure compounds that have been previously isolated and characterized using modern spectroscopic methods and published by several research teams spread across Cameroon. In the present study, 224 distinct medicinal plant species belonging to 55 plant families from the Cameroonian flora have been considered. About 80% of these have been previously published and/or referenced in internationally recognized journals. For each compound, the optimized 3D structure, drug-like properties, plant source, collection site and currently known biological activities are given, as well as literature references. We have evaluated the "drug-likeness" of this database using Lipinski's "Rule of Five". A diversity analysis has been carried out in comparison with the ChemBridge diverse database. CamMedNP could be highly useful for database screening and natural product lead generation programs.
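Lipinski's "Rule of Five" flags a compound as unlikely to be orally bioavailable when it breaks more than one of four property limits (molecular weight ≤ 500 Da, logP ≤ 5, ≤ 5 H-bond donors, ≤ 10 H-bond acceptors); a sketch of such a filter over precomputed molecular properties follows, with invented example values.

```python
# Rule-of-Five drug-likeness filter over precomputed properties;
# the example compounds and their values are invented.
from dataclasses import dataclass

@dataclass
class Molecule:
    name: str
    mol_weight: float   # Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors

def lipinski_violations(m: Molecule) -> int:
    return sum([
        m.mol_weight > 500,
        m.logp > 5,
        m.h_donors > 5,
        m.h_acceptors > 10,
    ])

compounds = [
    Molecule("candidate-1", 342.4, 2.1, 2, 5),
    Molecule("candidate-2", 612.8, 6.3, 4, 12),
]
for c in compounds:
    v = lipinski_violations(c)
    print(f"{c.name}: {v} violation(s) -> {'drug-like' if v <= 1 else 'flagged'}")
```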
Data Publication: A Partnership between Scientists, Data Managers and Librarians
NASA Astrophysics Data System (ADS)
Raymond, L.; Chandler, C.; Lowry, R.; Urban, E.; Moncoiffe, G.; Pissierssens, P.; Norton, C.; Miller, H.
2012-04-01
Current literature on the topic of data publication suggests that success is best achieved when there is a partnership between scientists, data managers, and librarians. The Marine Biological Laboratory/Woods Hole Oceanographic Institution (MBLWHOI) Library and the Biological and Chemical Oceanography Data Management Office (BCO-DMO) have developed tools and processes to automate the ingestion of metadata from BCO-DMO for deposit with datasets into the Institutional Repository (IR) Woods Hole Open Access Server (WHOAS). The system also incorporates functionality for BCO-DMO to request a Digital Object Identifier (DOI) from the Library. This partnership allows the Library to work with a trusted data repository to ensure high-quality data, while the data repository utilizes library services and is assured of a permanent archive of the copy of the data extracted from the repository database. The assignment of persistent identifiers enables accurate data citation. The Library can assign a DOI to appropriate datasets deposited in WHOAS. A primary activity is working with authors to deposit datasets associated with published articles. The DOI would ideally be assigned before submission and be included in the published paper so readers can link directly to the dataset, but DOIs are also being assigned to datasets related to articles after publication. WHOAS metadata records link the article to the datasets and the datasets to the article. The assignment of DOIs has enabled another important collaboration with Elsevier, publisher of educational and professional science journals. Elsevier can now link from articles in the Science Direct database to the datasets available from WHOAS that are related to that article. The data associated with the article are freely available from WHOAS and accompanied by a Dublin Core metadata record. In addition, the Library has worked with researchers to deposit datasets in WHOAS that are not appropriate for national, international, or domain-specific data repositories. These datasets currently include audio, text and image files. This research is being conducted by a team of librarians, data managers and scientists who are collaborating with representatives from the Scientific Committee on Oceanic Research (SCOR) and the International Oceanographic Data and Information Exchange (IODE) of the Intergovernmental Oceanographic Commission (IOC). The goal is to identify best practices for tracking data provenance and clearly attributing credit to data collectors/providers.
Translation of Genotype to Phenotype by a Hierarchy of Cell Subsystems.
Yu, Michael Ku; Kramer, Michael; Dutkowski, Janusz; Srivas, Rohith; Licon, Katherine; Kreisberg, Jason; Ng, Cherie T; Krogan, Nevan; Sharan, Roded; Ideker, Trey
2016-02-24
Accurately translating genotype to phenotype requires accounting for the functional impact of genetic variation at many biological scales. Here we present a strategy for genotype-phenotype reasoning based on existing knowledge of cellular subsystems. These subsystems and their hierarchical organization are defined by the Gene Ontology or a complementary ontology inferred directly from previously published datasets. Guided by the ontology's hierarchical structure, we organize genotype data into an "ontotype," that is, a hierarchy of perturbations representing the effects of genetic variation at multiple cellular scales. The ontotype is then interpreted using logical rules generated by machine learning to predict phenotype. This approach substantially outperforms previous, non-hierarchical methods for translating yeast genotype to cell growth phenotype, and it accurately predicts the growth outcomes of two new screens of 2,503 double gene knockouts impacting DNA repair or nuclear lumen. Ontotypes also generalize to larger knockout combinations, setting the stage for interpreting the complex genetics of disease.
Verhagen, Lilly M; Zomer, Aldert; Maes, Mailis; Villalba, Julian A; Del Nogal, Berenice; Eleveld, Marc; van Hijum, Sacha Aft; de Waard, Jacobus H; Hermans, Peter Wm
2013-02-01
Tuberculosis (TB) continues to cause a high toll of disease and death among children worldwide. The diagnosis of childhood TB is challenged by the paucibacillary nature of the disease and the difficulties in obtaining specimens. Whereas scientific and clinical research efforts to develop novel diagnostic tools have focused on TB in adults, childhood TB has been relatively neglected. Blood transcriptional profiling has improved our understanding of disease pathogenesis of adult TB and may offer future leads for diagnosis and treatment. No studies applying gene expression profiling of children with TB have been published so far. We identified a 116-gene signature set that showed an average prediction error of 11% for TB vs. latent TB infection (LTBI) and for TB vs. LTBI vs. healthy controls (HC) in our dataset. A minimal gene set of only 9 genes showed the same prediction error of 11% for TB vs. LTBI in our dataset. Furthermore, this minimal set showed a significant discriminatory value for TB vs. LTBI for all previously published adult studies using whole blood gene expression, with average prediction errors between 17% and 23%. In order to identify a robust representative gene set that would perform well in populations of different genetic backgrounds, we selected ten genes that were highly discriminative between TB, LTBI and HC in all literature datasets as well as in our dataset. Functional annotation of these genes highlights a possible role for genes involved in calcium signaling and calcium metabolism as biomarkers for active TB. These ten genes were validated by quantitative real-time polymerase chain reaction in an additional cohort of 54 Warao Amerindian children with LTBI, HC and non-TB pneumonia. Decision tree analysis indicated that five of the ten genes were sufficient to classify 78% of the TB cases correctly with no LTBI subjects wrongly classified as TB (100% specificity). Our data justify the further exploration of our signature set as biomarkers for potential childhood TB diagnosis. Because different biomarkers are evidently identified in ethnically distinct cohorts, it is important to cross-validate newly identified markers in all available cohorts.
Natural image sequences constrain dynamic receptive fields and imply a sparse code.
Häusler, Chris; Susemihl, Alex; Nawrot, Martin P
2013-11-06
In their natural environment, animals experience a complex and dynamic visual scenery. Under such natural stimulus conditions, neurons in the visual cortex employ a spatially and temporally sparse code. For the input scenario of natural still images, previous work demonstrated that unsupervised feature learning combined with the constraint of sparse coding can predict physiologically measured receptive fields of simple cells in the primary visual cortex. This convincingly indicated that the mammalian visual system is adapted to the natural spatial input statistics. Here, we extend this approach to the time domain in order to predict dynamic receptive fields that can account for both spatial and temporal sparse activation in biological neurons. We rely on temporal restricted Boltzmann machines and suggest a novel temporal autoencoding training procedure. When tested on a dynamic multivariate benchmark dataset, this method outperformed existing models of this class. Learning features on a large dataset of natural movies allowed us to model spatio-temporal receptive fields for single neurons. They resemble temporally smooth transformations of previously obtained static receptive fields and are thus consistent with existing theories. A neuronal spike response model demonstrates how the dynamic receptive field facilitates temporal and population sparseness. We discuss the potential mechanisms and benefits of a spatially and temporally sparse representation of natural visual input. Copyright © 2013 The Authors. Published by Elsevier B.V. All rights reserved.
Limb-Enhancer Genie: An accessible resource of accurate enhancer predictions in the developing limb
Monti, Remo; Barozzi, Iros; Osterwalder, Marco; ...
2017-08-21
Epigenomic mapping of enhancer-associated chromatin modifications facilitates the genome-wide discovery of tissue-specific enhancers in vivo. However, reliance on single chromatin marks leads to high rates of false-positive predictions. More sophisticated, integrative methods have been described, but commonly suffer from limited accessibility to the resulting predictions and reduced biological interpretability. Here we present the Limb-Enhancer Genie (LEG), a collection of highly accurate, genome-wide predictions of enhancers in the developing limb, available through a user-friendly online interface. We predict limb enhancers using a combination of > 50 published limb-specific datasets and clusters of evolutionarily conserved transcription factor binding sites, taking advantage of the patterns observed at previously in vivo validated elements. By combining different statistical models, our approach outperforms current state-of-the-art methods and provides interpretable measures of feature importance. Our results indicate that including a previously unappreciated score that quantifies tissue-specific nuclease accessibility significantly improves prediction performance. We demonstrate the utility of our approach through in vivo validation of newly predicted elements. Moreover, we describe general features that can guide the type of datasets to include when predicting tissue-specific enhancers genome-wide, while providing an accessible resource to the general biological community and facilitating the functional interpretation of genetic studies of limb malformations.
Phylogenomics provides strong evidence for relationships of butterflies and moths.
Kawahara, Akito Y; Breinholt, Jesse W
2014-08-07
Butterflies and moths constitute some of the most popular and charismatic insects. Lepidoptera include approximately 160 000 described species, many of which are important model organisms. Previous studies on the evolution of Lepidoptera did not confidently place butterflies, and many relationships among superfamilies in the megadiverse clade Ditrysia remain largely uncertain. We generated a molecular dataset with 46 taxa, combining 33 new transcriptomes with 13 available genomes, transcriptomes and expressed sequence tags (ESTs). Using HaMStR with a Lepidoptera-specific core-orthologue set of single copy loci, we identified 2696 genes for inclusion into the phylogenomic analysis. Nucleotides and amino acids of the all-gene, all-taxon dataset yielded nearly identical, well-supported trees. Monophyly of butterflies (Papilionoidea) was strongly supported, and the group included skippers (Hesperiidae) and the enigmatic butterfly-moths (Hedylidae). Butterflies were placed sister to the remaining obtectomeran Lepidoptera, and the latter was grouped with greater than or equal to 87% bootstrap support. Establishing confident relationships among the four most diverse macroheteroceran superfamilies was previously challenging, but we recovered 100% bootstrap support for the following relationships: ((Geometroidea, Noctuoidea), (Bombycoidea, Lasiocampoidea)). We present the first robust, transcriptome-based tree of Lepidoptera that strongly contradicts historical placement of butterflies, and provide an evolutionary framework for genomic, developmental and ecological studies on this diverse insect order. © 2014 The Author(s) Published by the Royal Society. All rights reserved.
Lu, Ruipeng; Mucaki, Eliseos J; Rogan, Peter K
2017-03-17
Data from ChIP-seq experiments can derive the genome-wide binding specificities of transcription factors (TFs) and other regulatory proteins. We analyzed 765 ENCODE ChIP-seq peak datasets of 207 human TFs with a novel motif discovery pipeline based on recursive, thresholded entropy minimization. This approach, while obviating the need to compensate for skewed nucleotide composition, distinguishes true binding motifs from noise, quantifies the strengths of individual binding sites based on computed affinity and detects adjacent cofactor binding sites that coordinate with the targets of primary, immunoprecipitated TFs. We obtained contiguous and bipartite information theory-based position weight matrices (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The reliability and accuracy of these iPWMs were determined via four independent validation methods, including the detection of experimentally proven binding sites, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. We also predict previously unreported TF coregulatory interactions (e.g. TF complexes). These iPWMs constitute a powerful tool for predicting the effects of sequence variants in known binding sites, performing mutation analysis on regulatory SNPs and predicting previously unrecognized binding sites and target genes. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
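To illustrate the basic mechanics behind scanning sequences with a position weight matrix, the minimal sketch below scores candidate sites as log-likelihood ratios against a uniform background (the matrix values, threshold, and sequence are invented for illustration; this is not the authors' entropy-minimization pipeline, and a true information-theoretic iPWM would additionally handle bipartite motifs and computed binding affinities):

    import math

    # Toy position weight matrix (PWM): one dict of base -> probability per position.
    # Background is assumed uniform (0.25); a floor avoids log(0) on zero entries.
    pwm = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    ]

    def site_score(site, pwm, bg=0.25):
        """Log2 likelihood ratio of a candidate site versus background."""
        return sum(math.log2(max(col[b], 1e-6) / bg) for b, col in zip(site, pwm))

    def scan(sequence, pwm, threshold=2.0):
        """Slide the PWM along the sequence; report positions scoring above threshold."""
        w = len(pwm)
        for i in range(len(sequence) - w + 1):
            s = site_score(sequence[i:i + w], pwm)
            if s >= threshold:
                yield i, round(s, 2)

    print(list(scan("TTAGCAGCTT", pwm)))   # hits at both AGC occurrences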
Hagopian, Louis P.; Rooker, Griffin W.; Zarcone, Jennifer R.; Bonner, Andrew C.; Arevalo, Alexander R.
2017-01-01
Hagopian, Rooker, and Zarcone (2015) evaluated a model for subtyping automatically reinforced self-injurious behavior (SIB) based on its sensitivity to changes in functional analysis conditions and the presence of self-restraint. The current study tested the generality of the model by applying it to all datasets of automatically reinforced SIB published from 1982 to 2015. We identified 49 datasets that included sufficient data to permit subtyping. Similar to the original study, Subtype-1 SIB was generally amenable to treatment using reinforcement alone, whereas Subtype-2 SIB was not. Conclusions could not be drawn about Subtype-3 SIB due to the small number of datasets. Nevertheless, the findings support the generality of the model and suggest that sensitivity of SIB to disruption by alternative reinforcement is an important dimension of automatically reinforced SIB. Findings also suggest that automatically reinforced SIB should no longer be considered a single category and that additional research is needed to better understand and treat Subtype-2 SIB. PMID:28032344
Jayaraman, Jayakumar; Wong, Hai Ming; King, Nigel M; Roberts, Graham J
2013-07-01
The age of an individual can be estimated by evaluating the pattern of dental development. A dataset for age estimation based on the dental maturity of a French-Canadian population was published over 35 years ago and has become the most widely accepted dataset. The applicability of this dataset has been tested on different population groups. The aim of this study was to estimate the observed differences between chronological age (CA) and dental age (DA) when the French-Canadian dataset was used to estimate the age of different population groups. A systematic literature search for papers utilizing the French-Canadian dataset for age estimation was performed. Articles in all languages from the PubMed, Embase and Cochrane databases were electronically searched for the terms 'Demirjian' and 'Dental age' published between January 1973 and December 2011. A hand search of articles was also conducted. A total of 274 studies were identified, from which 34 studies were included for qualitative analysis and 12 studies were included for quantitative assessment and meta-analysis. When synthesizing the estimation results from different population groups, on average, the Demirjian dataset overestimated the age of females by 0.65 years (-0.10 years to +2.82 years) and males by 0.60 years (-0.23 years to +3.04 years). The French-Canadian dataset overestimates the age of subjects by more than six months, and hence this dataset should be used only with considerable caution when estimating the age of groups of subjects from any global population. Copyright © 2013 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.
Hyde, Craig L.; Nagle, Mike W.; Tian, Chao; Chen, Xing; Paciga, Sara A.; Wendland, Jens R.; Tung, Joyce; Hinds, David A.; Perlis, Roy H.; Winslow, Ashley R.
2016-01-01
Despite strong evidence supporting the heritability of Major Depressive Disorder, previous genome-wide studies were unable to identify risk loci among individuals of European descent. We used self-reported data from 75,607 individuals reporting clinical diagnosis of depression and 231,747 reporting no history of depression through 23andMe, and meta-analyzed these results with published MDD GWAS results. We identified five independent variants from four regions associated with self-report of clinical diagnosis or treatment for depression. Loci with P < 1.0 × 10⁻⁵ in the meta-analysis were further analyzed in a replication dataset (45,773 cases and 106,354 controls) from 23andMe. A total of 17 independent SNPs from 15 regions reached genome-wide significance after joint-analysis over all three datasets. Some of these loci were also implicated in GWAS of related psychiatric traits. These studies provide evidence for large-scale consumer genomic data as a powerful and efficient complement to traditional means of ascertainment for neuropsychiatric disease genomics. PMID:27479909
Scalable metagenomic taxonomy classification using a reference genome database
Ames, Sasha K.; Hysom, David A.; Gardner, Shea N.; Lloyd, G. Scott; Gokhale, Maya B.; Allen, Jonathan E.
2013-01-01
Motivation: Deep metagenomic sequencing of biological samples has the potential to recover otherwise difficult-to-detect microorganisms and accurately characterize biological samples with limited prior knowledge of sample contents. Existing metagenomic taxonomic classification algorithms, however, do not scale well to analyze large metagenomic datasets, and balancing classification accuracy with computational efficiency presents a fundamental challenge. Results: A method is presented to shift computational costs to an off-line computation by creating a taxonomy/genome index that supports scalable metagenomic classification. Scalable performance is demonstrated on real and simulated data to show accurate classification in the presence of novel organisms on samples that include viruses, prokaryotes, fungi and protists. Taxonomic classification of the previously published 150 giga-base Tyrolean Iceman dataset was found to take <20 h on a single node 40 core large memory machine and provide new insights on the metagenomic contents of the sample. Availability: Software was implemented in C++ and is freely available at http://sourceforge.net/projects/lmat. Contact: allen99@llnl.gov. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23828782
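The idea of shifting cost to an off-line taxonomy/genome index can be made concrete with a toy sketch (our own simplification, not the LMAT implementation; the taxonomy, genomes, and k-mer size are invented): each k-mer is mapped off-line to the lowest common ancestor (LCA) of all taxa containing it, and reads are classified on-line by voting over their k-mers.

    from collections import Counter

    # Toy taxonomy: child -> parent; the root points to itself.
    PARENT = {"root": "root", "bacteria": "root", "virus": "root",
              "ecoli": "bacteria", "salmonella": "bacteria"}

    def lineage(t):
        path = [t]
        while PARENT[t] != t:
            t = PARENT[t]
            path.append(t)
        return path

    def lca(a, b):
        anc = set(lineage(a))
        for t in lineage(b):
            if t in anc:
                return t
        return "root"

    def build_index(genomes, k=4):
        """Off-line step: map each k-mer to the LCA of all taxa containing it."""
        index = {}
        for taxon, seq in genomes.items():
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
        return index

    def classify(read, index, k=4):
        """On-line step: vote over the taxa assigned to the read's k-mers."""
        hits = [index[read[i:i + k]] for i in range(len(read) - k + 1)
                if read[i:i + k] in index]
        return Counter(hits).most_common(1)[0][0] if hits else "unclassified"

    genomes = {"ecoli": "ACGTACGGA", "salmonella": "ACGTTTGGA"}
    idx = build_index(genomes)
    print(classify("ACGTACG", idx))   # -> ecoli; the shared ACGT k-mer maps to bacteria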
Taylor, Charles J.; Nelson, Hugh L.
2008-01-01
Geospatial data needed to visualize and evaluate the hydrogeologic framework and distribution of karst features in the Interior Low Plateaus physiographic region of the central United States were compiled during 2004-2007 as part of the Ground-Water Resources Program Karst Hydrology Initiative (KHI) project. Because of the potential usefulness to environmental and water-resources regulators, private consultants, academic researchers, and others, the geospatial data files created during the KHI project are being made available to the public as a provisional regional karst dataset. To enhance accessibility and visualization, the geospatial data files have been compiled as ESRI ArcReader data folders and user interactive Published Map Files (.pmf files), all of which are catalogued by the boundaries of surface watersheds using U.S. Geological Survey (USGS) eight-digit hydrologic unit codes (HUC-8s). Specific karst features included in the dataset include mapped sinkhole locations, sinking (or disappearing) streams, internally drained catchments, karst springs inventoried in the USGS National Water Information System (NWIS) database, relic stream valleys, and karst flow paths obtained from results of previously reported water-tracer tests.
Anonymizing 1:M microdata with high utility
Gong, Qiyuan; Luo, Junzhou; Yang, Ming; Ni, Weiwei; Li, Xiao-Bai
2016-01-01
Preserving privacy and utility during data publishing and data mining is essential for individuals, data providers and researchers. However, studies in this area typically assume that one individual has only one record in a dataset, which is unrealistic in many applications. Having multiple records for an individual leads to new privacy leakages. We call such a dataset a 1:M dataset. In this paper, we propose a novel privacy model called (k, l)-diversity that addresses disclosure risks in 1:M data publishing. Based on this model, we develop an efficient algorithm named 1:M-Generalization to preserve privacy and data utility, and compare it with alternative approaches. Extensive experiments on real-world data show that our approach outperforms the state-of-the-art technique, in terms of data utility and computational cost. PMID:28603388
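As a rough illustration only, the toy checker below assumes one plausible reading of (k, l)-diversity for 1:M data, namely that every quasi-identifier group must cover at least k distinct individuals and at least l distinct sensitive values; the paper's formal definition and its 1:M-Generalization algorithm are not reproduced here.

    from collections import defaultdict

    def kl_diverse(records, k, l):
        """Toy (k, l)-diversity check under the assumed semantics above.
        Each record is (individual_id, qi_tuple, sensitive_value)."""
        groups = defaultdict(list)
        for pid, qi, sv in records:
            groups[qi].append((pid, sv))
        for qi, rows in groups.items():
            people = {pid for pid, _ in rows}    # distinct individuals in group
            values = {sv for _, sv in rows}      # distinct sensitive values
            if len(people) < k or len(values) < l:
                return False, qi                 # this group violates the model
        return True, None

    records = [
        (1, ("40s", "M"), "flu"), (1, ("40s", "M"), "asthma"),
        (2, ("40s", "M"), "flu"), (3, ("40s", "M"), "diabetes"),
    ]
    print(kl_diverse(records, k=2, l=2))   # (True, None)
    print(kl_diverse(records, k=4, l=2))   # fails: the group covers only 3 people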
Symmetrical and overloaded effect of diffusion in information filtering
NASA Astrophysics Data System (ADS)
Zhu, Xuzhen; Tian, Hui; Chen, Guilin; Cai, Shimin
2017-10-01
In physical dynamics, mass diffusion theory has been applied to design effective information filtering models on bipartite networks. In previous works, researchers assumed that objects' similarities are determined by single-directional mass diffusion from the collected object to the uncollected one, while inadvertently ignoring the adverse influence of diffusion overload. This to some extent veils the essence of diffusion in physical dynamics and hurts recommendation accuracy and diversity. After careful investigation, we argue that symmetrical diffusion effectively discloses the essence of mass diffusion, and that high diffusion overload should be penalized. Accordingly, in this paper, we propose a symmetrical and overload-penalized diffusion-based model (SOPD), which shows excellent performance in extensive experiments on the benchmark datasets MovieLens and Netflix.
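For readers unfamiliar with the physics analogy, the sketch below implements the classic single-directional mass-diffusion kernel (ProbS) that the paper takes as its baseline; SOPD's symmetrization and overload penalty would modify the redistribution matrix W computed here. The user-object matrix is invented toy data.

    import numpy as np

    # Binary user-object adjacency: rows = users, columns = objects.
    A = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 1, 1]], dtype=float)

    def mass_diffusion(A):
        """Classic single-directional mass diffusion (ProbS) on a bipartite
        network: each object's unit resource flows to its users, then back to
        objects. Returns the object-object redistribution matrix W."""
        ku = A.sum(axis=1)                       # user degrees
        ko = A.sum(axis=0)                       # object degrees
        # W[i, j]: share of object j's resource that ends up on object i.
        return (A / ku[:, None]).T @ (A / ko[None, :])

    W = mass_diffusion(A)
    # Recommend for user 0: diffuse the user's collected objects, mask known ones.
    scores = W @ A[0]
    scores[A[0] > 0] = 0
    print(np.argsort(scores)[::-1])              # objects ranked for user 0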
Sethuraman, Sunantha; Thomas, Merin; Gay, Lauren A; Renne, Rolf
2018-05-29
Ribonomics experiments involving crosslinking and immuno-precipitation (CLIP) of Ago proteins have expanded the understanding of the miRNA targetome of several organisms. These techniques, collectively referred to as CLIP-seq, have been applied to identifying the mRNA targets of miRNAs expressed by Kaposi's Sarcoma-associated herpes virus (KSHV) and Epstein-Barr virus (EBV). However, these studies focused on identifying only those RNA targets of KSHV and EBV miRNAs that are known to encode proteins. Recent studies have demonstrated that long non-coding RNAs (lncRNAs) are also targeted by miRNAs. In this study, we performed a systematic re-analysis of published datasets from KSHV- and EBV-driven cancers. We used CLIP-seq data from lymphoma cells or EBV-transformed B cells, and a crosslinking, ligation and sequencing of hybrids dataset from KSHV-infected endothelial cells, to identify novel lncRNA targets of viral miRNAs. Here, we catalog the lncRNA targetome of KSHV and EBV miRNAs, and provide a detailed in silico analysis of lncRNA-miRNA binding interactions. Viral miRNAs target several hundred lncRNAs, including a subset previously shown to be aberrantly expressed in human malignancies. In addition, we identified thousands of lncRNAs to be putative targets of human miRNAs, suggesting that miRNA-lncRNA interactions broadly contribute to the regulation of gene expression.
Predicting novel substrates for enzymes with minimal experimental effort with active learning.
Pertusi, Dante A; Moura, Matthew E; Jeffryes, James G; Prabhu, Siddhant; Walters Biggs, Bradley; Tyo, Keith E J
2017-11-01
Enzymatic substrate promiscuity is more ubiquitous than previously thought, with significant consequences for understanding metabolism and its application to biocatalysis. This realization has given rise to the need for efficient characterization of enzyme promiscuity. Enzyme promiscuity is currently characterized with a limited number of human-selected compounds that may not be representative of the enzyme's versatility. While testing large numbers of compounds may be impractical, computational approaches can exploit existing data to determine the most informative substrates to test next, thereby more thoroughly exploring an enzyme's versatility. To demonstrate this, we used existing studies and tested compounds for four different enzymes, developed support vector machine (SVM) models using these datasets, and selected additional compounds for experiments using an active learning approach. SVMs trained on a chemically diverse set of compounds were discovered to achieve maximum accuracies of ~80% using ~33% fewer compounds than datasets based on all compounds tested in existing studies. Active learning-selected compounds for testing resolved apparent conflicts in the existing training data, while adding diversity to the dataset. The application of these algorithms to wide arrays of metabolic enzymes would result in a library of SVMs that can predict high-probability promiscuous enzymatic reactions and could prove a valuable resource for the design of novel metabolic pathways. Copyright © 2017 International Metabolic Engineering Society. Published by Elsevier Inc. All rights reserved.
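A minimal sketch of the pool-based active-learning loop described above, using uncertainty sampling with an SVM; the features and labels are synthetic stand-ins for compound fingerprints and assay outcomes (not the paper's data), and the loop assumes the initial labeled set happens to contain both classes.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Hypothetical stand-ins: 200 "compounds" with 16-dimensional features and
    # a surrogate substrate/non-substrate label hidden behind the "assay".
    X = rng.normal(size=(200, 16))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    labeled = list(range(10))                    # initially tested compounds
    pool = [i for i in range(200) if i not in labeled]

    for _ in range(5):
        clf = SVC(kernel="rbf", gamma="scale").fit(X[labeled], y[labeled])
        # Uncertainty sampling: query the pool compound nearest the boundary.
        margins = np.abs(clf.decision_function(X[pool]))
        query = pool.pop(int(np.argmin(margins)))
        labeled.append(query)                    # "run the assay" = reveal label

    print("accuracy on untested pool:", clf.score(X[pool], y[pool]))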
Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses
Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi
2018-01-01
Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625
Efficient sequential and parallel algorithms for record linkage.
Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar
2014-01-01
Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Our sequential and parallel algorithms have been tested on a real dataset of 1,083,878 records and synthetic datasets ranging in size from 50,000 to 9,000,000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
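Two of the key ideas, deduplicating on sorted attributes before any pairwise comparison and linking similar records into connected components, can be sketched as follows (a toy single-machine version; the similarity measure, threshold, and records are illustrative, not the paper's tuned edit-distance computation):

    from difflib import SequenceMatcher

    def find(parent, x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path compression
            x = parent[x]
        return x

    def union(parent, a, b):
        parent[find(parent, a)] = find(parent, b)

    records = [
        ("jon smith", "1980-01-02"), ("john smith", "1980-01-02"),
        ("mary jones", "1975-05-30"), ("mary jone", "1975-05-30"),
        ("john smith", "1980-01-02"),            # exact duplicate, dropped below
    ]

    # Step 1 (in the spirit of the paper's radix-sort idea): sort on attributes
    # and drop identical records before any pairwise comparison.
    unique = sorted(set(records))

    # Step 2: link similar records; connected components become entities.
    parent = list(range(len(unique)))
    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            name_sim = SequenceMatcher(None, unique[i][0], unique[j][0]).ratio()
            if unique[i][1] == unique[j][1] and name_sim > 0.8:
                union(parent, i, j)

    clusters = {}
    for i, rec in enumerate(unique):
        clusters.setdefault(find(parent, i), []).append(rec)
    print(list(clusters.values()))               # two linked entities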
Watson, Nathanial E; Parsons, Brendon A; Synovec, Robert E
2016-08-12
Performance of tile-based Fisher ratio (F-ratio) data analysis, recently developed for discovery-based studies using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC-TOFMS), is evaluated with a metabolomics dataset that had previously been analyzed in great detail using a brute-force approach. The previously analyzed data (referred to herein as the benchmark dataset) were intracellular extracts from Saccharomyces cerevisiae (yeast), either metabolizing glucose (repressed) or ethanol (derepressed), which define the two classes in the discovery-based analysis to find metabolites that are statistically different in concentration between the two classes. Beneficially, this previously analyzed dataset provides a concrete means to validate the tile-based F-ratio software. Herein, we demonstrate and validate the significant benefits of applying tile-based F-ratio analysis. The yeast metabolomics data are analyzed more rapidly, in about one week versus one year for the prior studies with this dataset. Furthermore, a null distribution analysis is implemented to statistically determine an adequate F-ratio threshold, whereby the variables with F-ratio values below the threshold can be ignored as not class distinguishing, which provides the analyst with confidence when analyzing the hit table. The new methodology discovered forty-six of the fifty-four benchmarked changing metabolites while consistently excluding all but one of the nineteen benchmarked false-positive metabolites previously identified. Copyright © 2016 Elsevier B.V. All rights reserved.
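A compact sketch of the statistical core, a per-tile Fisher ratio combined with a permutation-based null distribution to set the hit threshold, on synthetic stand-in data (tile integration and GC×GC preprocessing are omitted, and the array sizes and spiked tile are invented):

    import numpy as np

    rng = np.random.default_rng(1)

    # Stand-ins for tiled GCxGC-TOFMS signals: rows = runs, columns = tiles.
    repressed = rng.normal(10, 1, size=(6, 500))
    derepressed = rng.normal(10, 1, size=(6, 500))
    derepressed[:, 42] += 5                      # one genuinely class-distinguishing tile

    def fisher_ratio(a, b):
        """Per-tile F-ratio: between-class variance over within-class variance
        (between-class df is 1 for two classes, so no division is needed there)."""
        grand = np.vstack([a, b]).mean(axis=0)
        n1, n2 = len(a), len(b)
        between = n1 * (a.mean(0) - grand) ** 2 + n2 * (b.mean(0) - grand) ** 2
        within = (((a - a.mean(0)) ** 2).sum(0) + ((b - b.mean(0)) ** 2).sum(0))
        return between / (within / (n1 + n2 - 2))

    f = fisher_ratio(repressed, derepressed)

    # Null-distribution idea: permute class labels, keep the largest F-ratio
    # each time; real hits must beat that noise floor.
    pooled = np.vstack([repressed, derepressed])
    null_max = []
    for _ in range(200):
        idx = rng.permutation(len(pooled))
        null_max.append(fisher_ratio(pooled[idx[:6]], pooled[idx[6:]]).max())
    threshold = np.quantile(null_max, 0.95)
    print("hits:", np.where(f > threshold)[0])   # expected: tile 42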
NASA Astrophysics Data System (ADS)
Porritt, R. W.; Becker, T. W.; Auer, L.; Boschi, L.
2017-12-01
We present a whole-mantle, variable-resolution, shear-wave tomography model based on newly available and existing seismological datasets including regional body-wave delay times and multi-mode Rayleigh and Love wave phase delays. Our body wave dataset includes 160,000 S wave delays used in the DNA13 regional tomographic model focused on the western and central US, 86,000 S and SKS delays measured on stations in western South America (Porritt et al., in prep), and 3,900,000 S+ phases measured by correlation between data observed at stations in the IRIS global networks (IU, II) and stations in the contiguous US, against synthetic data generated with IRIS Syngine. The surface wave dataset includes fundamental mode and overtone Rayleigh wave data from Schaeffer and Lebedev (2014), ambient noise derived Rayleigh wave and Love wave measurements from Ekstrom (2013), newly computed fundamental mode ambient noise Rayleigh wave phase delays for the contiguous US up to July 2017, and other, previously published, measurements. These datasets, along with a data-adaptive parameterization utilized for the SAVANI model (Auer et al., 2014), should allow significantly finer-scale imaging than previous global models, rivaling that of regional-scale approaches, under the USArray footprint in the contiguous US, while seamlessly integrating into a global model. We parameterize the model for both vertically (vSV) and horizontally (vSH) polarized shear velocities by accounting for the different sensitivities of the various phases and wave types. The resulting, radially anisotropic model should allow for a range of new geodynamic analyses, including estimates of mantle flow induced topography or seismic anisotropy, without generating artifacts due to edge effects, or requiring assumptions about the structure of the region outside the well resolved model space. Our model shows a number of features, including indications of the effects of edge-driven convection in the Cordillera and along the eastern margin, and larger-scale convection due to the subduction of the Farallon slab and along the edge of the Laurentia cratonic margin.
Dataset from Dick et al published in Sawyer et al 2016
Dataset is a time course description of lindane disappearance in blood plasma after dermal exposure in human volunteers. This dataset is associated with the following publication: Sawyer, M.E., M.V. Evans, C. Wilson, L.J. Beesley, L. Leon, C. Eklund, E. Croom, and R. Pegram. Development of a Human Physiologically Based Pharmacokinetics (PBPK) Model For Dermal Permeability for Lindane. TOXICOLOGY LETTERS. Elsevier Science Ltd, New York, NY, USA, 14(245): pp. 106-109, (2016).
Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration were allowed to equilibrate with a known mass of nanoparticles. The mixture was then ultracentrifuged and sampled for analysis. This dataset is associated with the following publication: Sahle-Demessie, E., A. Zhao, C. Han, B. Hann, and H. Grecsek. Interaction of engineered nanomaterials with hydrophobic organic pollutants. Journal of Nanotechnology. Hindawi Publishing Corporation, New York, NY, USA, 27(28): 284003, (2016).
Data Basin: Expanding Access to Conservation Data, Tools, and People
NASA Astrophysics Data System (ADS)
Comendant, T.; Strittholt, J.; Frost, P.; Ward, B. C.; Bachelet, D. M.; Osborne-Gowey, J.
2009-12-01
Mapping and spatial analysis are a fundamental part of problem solving in conservation science, yet spatial data are widely scattered, difficult to locate, and often unavailable. Valuable time and resources are wasted locating and gaining access to important biological, cultural, and economic datasets, scientific analysis, and experts. As conservation problems become more serious and the demand to solve them grows more urgent, a new way to connect science and practice is needed. To meet this need, an open-access web tool called Data Basin (www.databasin.org) has been created by the Conservation Biology Institute in partnership with ESRI and the Wilburforce Foundation. Users of Data Basin can gain quick access to datasets, experts, groups, and tools to help solve real-world problems. Individuals and organizations can perform essential tasks such as exploring and downloading from a vast library of conservation datasets, uploading existing datasets, connecting to other external data sources, creating groups, and producing customized maps that can be easily shared. Data Basin encourages sharing and publishing, but also provides privacy and security for sensitive information when needed. Users can publish projects within Data Basin to tell more complete and rich stories of discovery and solutions. Projects are an ideal way to publish collections of datasets, maps and other information on the internet to reach wider audiences. Data Basin also houses individual centers that provide direct access to data, maps, and experts focused on specific geographic areas or conservation topics. Current centers being developed include the Boreal Information Centre, the Data Basin Climate Center, and proposed Aquatic and Forest Conservation Centers.
Enabling Open Research Data Discovery through a Recommender System
NASA Astrophysics Data System (ADS)
Devaraju, Anusuriya; Jayasinghe, Gaya; Klump, Jens; Hogan, Dominic
2017-04-01
Government agencies, universities, research and nonprofit organizations are increasingly publishing their datasets to promote transparency, induce new research and generate economic value through the development of new products or services. The datasets may be downloaded from various data portals (data repositories) which are general or domain-specific. The Registry of Research Data Repositories (re3data.org) lists more than 2500 such data repositories from around the globe. Data portals allow keyword search and faceted navigation to facilitate discovery of research datasets. However, the volume and variety of datasets have made finding relevant datasets more difficult. Common dataset search mechanisms may be time-consuming, may produce irrelevant results and are primarily suitable for users who are familiar with the general structure and contents of the respective database. Therefore, we need new approaches to support research data discovery. Recommender systems offer new possibilities for users to find datasets that are relevant to their research interests. This study presents a recommender system developed for the CSIRO Data Access Portal (DAP, http://data.csiro.au). The datasets hosted on the portal are diverse, published by researchers from 13 business units in the organisation. The goal of the study is not to replace the current search mechanisms on the data portal, but rather to extend data discovery through an exploratory search, in this case by building a recommender system. We adopted a hybrid recommendation approach, comprising content-based filtering and item-item collaborative filtering. The content-based filtering computes similarities between datasets based on metadata such as title, keywords, descriptions, fields of research, location, contributors, etc. The collaborative filtering utilizes user search behaviour and download patterns derived from the server logs to determine similar datasets. These similarities are then combined with different degrees of importance (weights) to determine the overall dataset similarity. We determined the similarity weights based on a survey involving 150 users of the portal. The recommender results for a given dataset are accessible programmatically via a RESTful web service. An offline evaluation involving data users demonstrates the ability of the recommender system to discover relevant and 'novel' datasets.
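The hybrid scheme can be sketched in a few lines: derive a content similarity from metadata text, an item-item collaborative similarity from download logs, and blend the two with a weight (the titles, download matrix, and weight of 0.6 are purely illustrative; the paper derived its weights from a user survey):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = [
        "soil moisture observations western australia",
        "soil carbon flux measurements",
        "marine plankton survey southern ocean",
        "ocean temperature profiles southern ocean",
    ]

    # Content-based similarity from metadata text (titles only, for brevity).
    content_sim = cosine_similarity(TfidfVectorizer().fit_transform(titles))

    # Item-item collaborative similarity from download logs: rows = users.
    downloads = np.array([[1, 1, 0, 0],
                          [1, 1, 0, 1],
                          [0, 0, 1, 1]], dtype=float)
    collab_sim = cosine_similarity(downloads.T)

    # Hybrid: weighted blend of the two similarity matrices.
    w = 0.6
    hybrid = w * content_sim + (1 - w) * collab_sim

    query = 0                                    # dataset currently being viewed
    ranked = np.argsort(hybrid[query])[::-1]
    print([titles[i] for i in ranked if i != query])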
Defining and identifying Sleeping Beauties in science
Ke, Qing; Ferrara, Emilio; Radicchi, Filippo; Flammini, Alessandro
2015-01-01
A Sleeping Beauty (SB) in science refers to a paper whose importance is not recognized for several years after publication. Its citation history exhibits a long hibernation period followed by a sudden spike of popularity. Previous studies suggest a relative scarcity of SBs. The reliability of this conclusion is, however, heavily dependent on identification methods based on arbitrary threshold parameters for sleeping time and number of citations, applied to small or monodisciplinary bibliographic datasets. Here we present a systematic, large-scale, and multidisciplinary analysis of the SB phenomenon in science. We introduce a parameter-free measure that quantifies the extent to which a specific paper can be considered an SB. We apply our method to 22 million scientific papers published in all disciplines of natural and social sciences over a time span longer than a century. Our results reveal that the SB phenomenon is not exceptional. There is a continuous spectrum of delayed recognition where both the hibernation period and the awakening intensity are taken into account. Although many cases of SBs can be identified by looking at monodisciplinary bibliographic data, the SB phenomenon becomes much more apparent with the analysis of multidisciplinary datasets, where we can observe many examples of papers achieving delayed yet exceptional importance in disciplines different from those where they were originally published. Our analysis emphasizes a complex feature of citation dynamics that so far has received little attention, and also provides empirical evidence against the use of short-term citation metrics in the quantification of scientific impact. PMID:26015563
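The parameter-free measure can be sketched as follows, assuming the beauty-coefficient form reported in the paper: the citation history is compared with the straight line drawn from the publication year to the peak-citation year, with each year's deficit normalized by max(1, c_t).

    def beauty_coefficient(citations):
        """Beauty coefficient B (as we read the published definition): sum the
        normalized gaps between the line from year 0 to the peak year and the
        actual citation counts c_t."""
        c0 = citations[0]
        peak = max(range(len(citations)), key=lambda t: citations[t])
        if peak == 0:
            return 0.0
        slope = (citations[peak] - c0) / peak
        return sum((slope * t + c0 - citations[t]) / max(1.0, citations[t])
                   for t in range(peak + 1))

    flat_then_spike = [0, 0, 1, 0, 1, 0, 2, 1, 0, 50]   # hibernation, then awakening
    steady_riser = [5, 10, 15, 20, 25, 30]
    print(beauty_coefficient(flat_then_spike), beauty_coefficient(steady_riser))

A steadily rising paper scores near zero, while a long hibernation followed by a spike scores high, matching the intended notion of delayed recognition.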
Sma3s: a three-step modular annotator for large sequence datasets.
Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J
2014-08-01
Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Pathway Activity Profiling (PAPi): from the metabolite profile to the metabolic pathway activity.
Aggio, Raphael B M; Ruggiero, Katya; Villas-Bôas, Silas Granato
2010-12-01
Metabolomics is one of the most recent omics technologies and uses robust analytical techniques to screen low molecular mass metabolites in biological samples. It has evolved very quickly during the last decade. However, metabolomics datasets are considered highly complex when used to relate metabolite levels to metabolic pathway activity. Despite recent developments in bioinformatics, which have improved the quality of metabolomics data, there is still no straightforward method capable of correlating metabolite level to the activity of different metabolic pathways operating within the cells. Thus, this kind of analysis still depends on extremely laborious and time-consuming processes. Here, we present a new algorithm, Pathway Activity Profiling (PAPi), with which we are able to compare metabolic pathway activities from metabolite profiles. The applicability and potential of PAPi was demonstrated using previously published data from the yeast Saccharomyces cerevisiae. PAPi was able to support the biological interpretations of the previously published observations and, in addition, generated new hypotheses in a straightforward manner. However, PAPi is time-consuming to perform manually. Thus, we also present here a new R software package (PAPi) which implements the PAPi algorithm and facilitates its usage to quickly compare metabolic pathway activities between different experimental conditions. Using the identified metabolites and their respective abundances as input, the PAPi package calculates pathways' Activity Scores, which represent the potential metabolic pathway activities and allow their comparison between conditions. PAPi also performs principal components analysis and analysis of variance or t-test to investigate differences in activity level between experimental conditions. In addition, PAPi generates comparative graphs highlighting up- and down-regulated pathway activity. These datasets are available at http://www.4shared.com/file/hTWyndYU/extra.html and http://www.4shared.com/file/VbQIIDeu/intra.html. The PAPi package is available at http://www.4shared.com/file/s0uIYWIg/PAPi_10.html. Contact: s.villas-boas@auckland.ac.nz. Supplementary data are available at Bioinformatics online.
NASA Astrophysics Data System (ADS)
O'Connell, D.; Ruan, D.; Thomas, D. H.; Dou, T. H.; Lewis, J. H.; Santhanam, A.; Lee, P.; Low, D. A.
2018-02-01
Breathing motion modeling requires observation of tissues at sufficiently distinct respiratory states for proper 4D characterization. This work proposes a method to improve sampling of the breathing cycle with limited imaging dose. We designed and tested a prospective free-breathing acquisition protocol with a simulation using datasets from five patients imaged with a model-based 4DCT technique. Each dataset contained 25 free-breathing fast helical CT scans with simultaneous breathing surrogate measurements. Tissue displacements were measured using deformable image registration. A correspondence model related tissue displacement to the surrogate. Model residual was computed by comparing predicted displacements to image registration results. To determine a stopping criteria for the prospective protocol, i.e. when the breathing cycle had been sufficiently sampled, subsets of N scans where 5 ⩽ N ⩽ 9 were used to fit reduced models for each patient. A previously published metric was employed to describe the phase coverage, or ‘spread’, of the respiratory trajectories of each subset. Minimum phase coverage necessary to achieve mean model residual within 0.5 mm of the full 25-scan model was determined and used as the stopping criteria. Using the patient breathing traces, a prospective acquisition protocol was simulated. In all patients, phase coverage greater than the threshold necessary for model accuracy within 0.5 mm of the 25 scan model was achieved in six or fewer scans. The prospectively selected respiratory trajectories ranked in the (97.5 ± 4.2)th percentile among subsets of the originally sampled scans on average. Simulation results suggest that the proposed prospective method provides an effective means to sample the breathing cycle with limited free-breathing scans. One application of the method is to reduce the imaging dose of a previously published model-based 4DCT protocol to 25% of its original value while achieving mean model residual within 0.5 mm.
Rogers, L J; Douglas, R R
1984-02-01
In this paper (the second in a series), we consider a (generic) pair of datasets, which have been analyzed by the techniques of the previous paper. Thus, their "stable subspaces" have been established by comparative factor analysis. The pair of datasets must satisfy two confirmable conditions. The first is the "Inclusion Condition," which requires that the stable subspace of one of the datasets is nearly identical to a subspace of the other dataset's stable subspace. On the basis of that, we have assumed the pair to have similar generating signals, with stochastically independent generators. The second verifiable condition is that the (presumed same) generating signals have distinct ratios of variances for the two datasets. Under these conditions a small elaboration of some elementary linear algebra reduces the rotation problem to several eigenvalue-eigenvector problems. Finally, we emphasize that an analysis of each dataset by the method of Douglas and Rogers (1983) is an essential prerequisite for the useful application of the techniques in this paper. Nonempirical methods of estimating the number of factors simply will not suffice, as confirmed by simulations reported in the previous paper.
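The linear-algebra core, as we read this abstract, is that two covariance matrices sharing an unknown mixing matrix but having distinct generator variance ratios can be jointly diagonalized via a generalized eigenproblem. The sketch below demonstrates that principle on synthetic data; it is our illustration, not the authors' exact procedure, and all matrices are invented.

    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(2)

    # Two datasets sharing the same mixing of independent generators, but with
    # distinct per-generator variance ratios (the paper's second condition).
    A = rng.normal(size=(3, 3))                      # unknown mixing matrix
    S1 = rng.normal(size=(3, 5000)) * np.array([[1.0], [2.0], [3.0]])
    S2 = rng.normal(size=(3, 5000)) * np.array([[3.0], [1.0], [0.5]])
    X1, X2 = A @ S1, A @ S2

    C1, C2 = np.cov(X1), np.cov(X2)
    # Generalized eigenvectors of (C1, C2) jointly diagonalize both covariances,
    # resolving the rotational ambiguity when the variance ratios are distinct.
    vals, W = eigh(C1, C2)

    # Each recovered component should match one true generator up to sign/scale.
    recovered = W.T @ X1
    corr = np.corrcoef(np.vstack([recovered, S1]))[:3, 3:]
    print(np.round(np.abs(corr), 2))                 # near a permutation matrix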
Drakesmith, M; Caeyenberghs, K; Dutt, A; Lewis, G; David, A S; Jones, D K
2015-09-01
Graph theory (GT) is a powerful framework for quantifying topological features of neuroimaging-derived functional and structural networks. However, false positive (FP) connections arise frequently and influence the inferred topology of networks. Thresholding is often used to overcome this problem, but an appropriate threshold often relies on a priori assumptions, which will alter inferred network topologies. Four common network metrics (global efficiency, mean clustering coefficient, mean betweenness and smallworldness) were tested using a model tractography dataset. It was found that all four network metrics were significantly affected even by just one FP. Results also show that thresholding effectively dampens the impact of FPs, but at the expense of adding significant bias to network metrics. In a larger number (n=248) of tractography datasets, statistics were computed across random group permutations for a range of thresholds, revealing that statistics for network metrics varied significantly more than for non-network metrics (i.e., number of streamlines and number of edges). Varying degrees of network atrophy were introduced artificially to half the datasets, to test sensitivity to genuine group differences. For some network metrics, this atrophy was detected as significant (p<0.05, determined using permutation testing) only across a limited range of thresholds. We propose a multi-threshold permutation correction (MTPC) method, based on the cluster-enhanced permutation correction approach, to identify sustained significant effects across clusters of thresholds. This approach minimises requirements to determine a single threshold a priori. We demonstrate improved sensitivity of MTPC-corrected metrics to genuine group effects compared to an existing approach and demonstrate the use of MTPC on a previously published network analysis of tractography data derived from a clinical population. In conclusion, we show that there are large biases and instability induced by thresholding, making statistical comparisons of network metrics difficult. However, by testing for effects across multiple thresholds using MTPC, true group differences can be robustly identified. Copyright © 2015. Published by Elsevier Inc.
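A schematic of the multi-threshold logic on synthetic stand-in data (the network metric here is plain edge density and the group difference is artificial; MTPC proper applies a cluster-enhanced correction across thresholds rather than the simple per-threshold test shown):

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical connectivity matrices: 20 controls, 20 "atrophied" subjects.
    controls = rng.random((20, 30, 30))
    patients = rng.random((20, 30, 30)) * 0.95       # mild global weakening

    def density(W, thr):
        """Stand-in network metric on the thresholded matrix: edge density."""
        return (W > thr).mean()

    thresholds = np.linspace(0.1, 0.9, 17)
    # Metric per subject per threshold, computed once up front.
    M = np.array([[density(W, t) for t in thresholds]
                  for W in np.concatenate([controls, patients])])
    labels = np.array([0] * 20 + [1] * 20)

    def group_diff(M, labels):
        return M[labels == 0].mean(0) - M[labels == 1].mean(0)

    obs = group_diff(M, labels)
    null = np.array([group_diff(M, rng.permutation(labels)) for _ in range(2000)])
    pvals = (null >= obs).mean(0)                    # one-sided permutation p-values

    # MTPC-style reading: trust effects sustained across a cluster of adjacent
    # thresholds rather than a lone significant threshold.
    print([f"{t:.2f}" for t, s in zip(thresholds, pvals < 0.05) if s])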
Wide-Open: Accelerating public data release by automating detection of overdue datasets.
Grechkin, Maxim; Poon, Hoifung; Howe, Bill
2017-06-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
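The text-mining half of the approach can be sketched as follows; the accession pattern is GEO-style and the repository lookup is stubbed out, since a real implementation would query the repository's public API (e.g. NCBI E-utilities), which we do not reproduce here:

    import re

    # Pull GEO-style accession numbers out of published text; checking whether
    # each dataset is public is a separate repository query, stubbed out below.
    ACCESSION = re.compile(r"\bGSE\d{3,6}\b")

    def referenced_accessions(article_text):
        return sorted(set(ACCESSION.findall(article_text)))

    def is_public(accession):
        """Placeholder: replace with a repository lookup; here, a canned answer."""
        return accession in {"GSE11111"}             # pretend only this one is live

    article = ("Microarray data are available from the Gene Expression Omnibus "
               "under accessions GSE11111 and GSE99999.")
    overdue = [a for a in referenced_accessions(article) if not is_public(a)]
    print(overdue)   # ['GSE99999'] -> cited in a paper but still private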
An Application of Hydraulic Tomography to a Large-Scale Fractured Granite Site, Mizunami, Japan.
Zha, Yuanyuan; Yeh, Tian-Chyi J; Illman, Walter A; Tanaka, Tatsuya; Bruines, Patrick; Onoe, Hironori; Saegusa, Hiromitsu; Mao, Deqiang; Takeuchi, Shinji; Wen, Jet-Chau
2016-11-01
While hydraulic tomography (HT) is a mature aquifer characterization technology, its applications to characterizing the hydrogeology of kilometer-scale fault and fracture zones are rare. This paper sequentially analyzes datasets from two new pumping tests as well as those from two previous pumping tests analyzed by Illman et al. (2009) at a fractured granite site in Mizunami, Japan. Results of this analysis show that datasets from two previous pumping tests at one side of a fault zone, as used in the previous study, led to inaccurate mapping of fracture and fault zones. Inclusion of the datasets from the two new pumping tests (one of which was conducted on the other side of the fault) yields locations of the fault zone consistent with those based on geological mapping. The new datasets also produce a detailed image of the irregular fault zone, which is not available from geological investigation alone or from the previous study. As a result, we conclude that if prior knowledge about geological structures at a field site is considered during the design of HT surveys, valuable non-redundant datasets about the fracture and fault zones can be collected. Only with such non-redundant datasets can HT be a viable and robust tool for delineating fracture and fault distributions over kilometer scales, even when only a limited number of boreholes are available. In essence, this paper proves that HT is a new tool for geologists, geophysicists, and engineers for mapping large-scale fracture and fault zone distributions. © 2016, National Ground Water Association.
Efficient sequential and parallel algorithms for record linkage
Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar
2014-01-01
Background and objective: Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Methods: Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Results: Our sequential and parallel algorithms have been tested on a real dataset of 1,083,878 records and synthetic datasets ranging in size from 50,000 to 9,000,000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). Conclusions: We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm. PMID:24154837
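The linkage recipe in this abstract (collapse exact duplicates, link similar records, report connected components) is straightforward to sketch. The following is a minimal illustrative Python version, not the authors' implementation: plain sorting stands in for radix sort, a naive quadratic comparison loop replaces their optimized processing, and the edit-distance threshold and toy records are assumptions.

    from itertools import combinations

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def link_records(records, max_dist=2):
        # Step 1: collapse exact duplicates (the paper uses radix sorting on
        # selected attributes for this) so later comparisons see unique records.
        unique = sorted(set(records))
        # Step 2: union-find over a similarity graph; an edge joins records
        # whose edit distance is within the threshold.
        parent = list(range(len(unique)))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for i, j in combinations(range(len(unique)), 2):  # naive quadratic pass
            if levenshtein(unique[i], unique[j]) <= max_dist:
                parent[find(i)] = find(j)
        # Step 3: connected components are the linked clusters.
        clusters = {}
        for i, rec in enumerate(unique):
            clusters.setdefault(find(i), []).append(rec)
        return list(clusters.values())

    print(link_records(["john smith", "jon smith", "john smith", "mary jones"]))
    # [['john smith', 'jon smith'], ['mary jones']]

The union-find structure keeps component merging near-linear; in the paper's setting, the expensive pairwise comparison pass is exactly what the radix-sort deduplication and parallelization are designed to tame.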
2011-01-01
Background: This article aims to update the existing systematic review evidence elicited by Mickenautsch et al. up to 18 January 2008 (published in the European Journal of Paediatric Dentistry in 2009) and addressing the review question of whether, in the same dentition and same cavity class, glass-ionomer cement (GIC) restored cavities show fewer recurrent carious lesions on cavity margins than cavities restored with amalgam. Methods: The systematic literature search was extended beyond the original search date and a further hand-search and reference check was done. The quality of accepted trials was assessed using updated quality criteria, and the risk of bias was investigated in more depth than previously reported. In addition, the focus of quantitative synthesis was shifted to single datasets extracted from the accepted trials. Results: The database search (up to 10 August 2010) identified 1 new trial, in addition to the 9 included in the original systematic review, and 11 further trials were included after a hand-search and reference check. Of these 21 trials, 11 were excluded and 10 were accepted for data extraction and quality assessment. Thirteen dichotomous datasets of primary outcomes and 4 datasets with secondary outcomes were extracted. Meta-analysis and cumulative meta-analysis were used in combining clinically homogeneous datasets. The overall results of the computed datasets suggest that GIC has a higher caries-preventive effect than amalgam for restorations in permanent teeth. No difference was found for restorations in the primary dentition. Conclusion: This outcome is in agreement with the conclusions of the original systematic review. Although the findings of the trials identified in this update may be considered less affected by attrition and publication bias, their risk of selection and detection/performance bias is high. Thus, verification of the currently available results requires further high-quality randomised controlled trials. PMID:21396097
Ismaili, Abd R A; Vestergaard, Mark B; Hansen, Adam E; Larsson, Henrik B W; Johannesen, Helle H; Law, Ian; Henriksen, Otto M
2018-01-01
The aim of the study was to investigate the components of day-to-day variability of repeated phase contrast mapping (PCM) magnetic resonance imaging measurements of global cerebral blood flow (gCBF). Two datasets were analyzed. In Dataset 1, duplicate PCM measurements of total brain flow were performed in 11 healthy young volunteers on two separate days using a strictly standardized setup. For comparison, PCM measurements from a previously published study (Dataset 2) were analyzed to assess long-term variability in an aged population in a less strictly controlled setup. Global CBF was calculated by normalizing total brain flow to brain volume. On each day, measurements of hemoglobin, caffeine and glucose were obtained. Linear mixed models were applied to estimate coefficients of variation (CV) for total (CVt), between-subject (CVb), within-subject day-to-day (CVw), and intra-session residual (CVr) variability. In Dataset 1, CVt, CVb, CVw and CVr were estimated to be 11%, 9.4%, 4% and 4.2%, respectively, and 8.8%, 7.2%, 2.7% and 4.3%, respectively, when adjusting for hemoglobin and plasma caffeine. In Dataset 2, CVt, CVb and CVw were estimated to be 25.4%, 19.2%, and 15.0%, respectively, and decreased to 16.6%, 8.2% and 12.5%, respectively, when adjusting for the same covariates. Our results suggest that short-term day-to-day variability of gCBF is relatively low compared to between-subject variability when studied under standardized conditions, whereas long-term variability in an aged population appears to be much larger when studied in a less standardized setup. The results further showed that 20% to 35% of the total variability in gCBF can be attributed to the effects of hemoglobin and caffeine.
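As a rough illustration of how such variance components translate into the CVs quoted above, here is a sketch using statsmodels mixed models on synthetic data. The design (11 subjects, two days, duplicate scans per day) mirrors Dataset 1, but the effect sizes, column names and model specification are assumptions, not the authors' code.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulate the design: 11 subjects, 2 days each, duplicate scans per day.
    rng = np.random.default_rng(0)
    rows = []
    for s in range(11):
        subj = rng.normal(0, 5)                    # between-subject effect
        for d in ("day1", "day2"):
            day = rng.normal(0, 2)                 # day-to-day within subject
            for _ in range(2):                     # duplicate scans per session
                rows.append({"subject": s, "day": d,
                             "gcbf": 50 + subj + day + rng.normal(0, 2)})
    df = pd.DataFrame(rows)

    # Random intercept per subject plus a day-within-subject variance
    # component; what is left over is the intra-session residual.
    m = smf.mixedlm("gcbf ~ 1", df, groups="subject",
                    re_formula="1", vc_formula={"day": "0 + C(day)"}).fit(reml=True)
    mean = m.params["Intercept"]
    var_b = float(m.cov_re.iloc[0, 0])             # between-subject variance
    var_w = float(m.vcomp[0])                      # within-subject day-to-day
    var_r = m.scale                                # intra-session residual
    cv = lambda v: 100 * np.sqrt(v) / mean
    print({k: round(cv(v), 1) for k, v in
           {"CVb": var_b, "CVw": var_w, "CVr": var_r,
            "CVt": var_b + var_w + var_r}.items()})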
Boyd, Philip W.; Rynearson, Tatiana A.; Armstrong, Evelyn A.; Fu, Feixue; Hayashi, Kendra; Hu, Zhangxi; Hutchins, David A.; Kudela, Raphael M.; Litchman, Elena; Mulholland, Margaret R.; Passow, Uta; Strzepek, Robert F.; Whittaker, Kerry A.; Yu, Elizabeth; Thomas, Mridul K.
2013-01-01
“It takes a village to finish (marine) science these days” Paraphrased from Curtis Huttenhower (the Human Microbiome project) The rapidity and complexity of climate change and its potential effects on ocean biota are challenging how ocean scientists conduct research. One way in which we can begin to better tackle these challenges is to conduct community-wide scientific studies. This study provides physiological datasets fundamental to understanding functional responses of phytoplankton growth rates to temperature. While physiological experiments are not new, our experiments were conducted in many laboratories using agreed upon protocols and 25 strains of eukaryotic and prokaryotic phytoplankton isolated across a wide range of marine environments from polar to tropical, and from nearshore waters to the open ocean. This community-wide approach provides both comprehensive and internally consistent datasets produced over considerably shorter time scales than conventional individual and often uncoordinated lab efforts. Such datasets can be used to parameterise global ocean model projections of environmental change and to provide initial insights into the magnitude of regional biogeographic change in ocean biota in the coming decades. Here, we compare our datasets with a compilation of literature data on phytoplankton growth responses to temperature. A comparison with prior published data suggests that the optimal temperatures of individual species and, to a lesser degree, thermal niches were similar across studies. However, a comparison of the maximum growth rate across studies revealed significant departures between this and previously collected datasets, which may be due to differences in the cultured isolates, temporal changes in the clonal isolates in cultures, and/or differences in culture conditions. Such methodological differences mean that using particular trait measurements from the prior literature might introduce unknown errors and bias into modelling projections. Using our community-wide approach we can reduce such protocol-driven variability in culture studies, and can begin to address more complex issues such as the effect of multiple environmental drivers on ocean biota. PMID:23704890
Retrospective analysis of natural products provides insights for future discovery trends.
Pye, Cameron R; Bertin, Matthew J; Lokey, R Scott; Gerwick, William H; Linington, Roger G
2017-05-30
Understanding of the capacity of the natural world to produce secondary metabolites is important to a broad range of fields, including drug discovery, ecology, biosynthesis, and chemical biology, among others. Both the absolute number and the rate of discovery of natural products have increased significantly in recent years. However, there is a perception and concern that the fundamental novelty of these discoveries is decreasing relative to previously known natural products. This study presents a quantitative examination of the field from the perspective of both number of compounds and compound novelty using a dataset of all published microbial and marine-derived natural products. This analysis aimed to explore a number of key questions, such as how the rate of discovery of new natural products has changed over the past decades, how the average natural product structural novelty has changed as a function of time, whether exploring novel taxonomic space affords an advantage in terms of novel compound discovery, and whether it is possible to estimate how close we are to having described all of the chemical space covered by natural products. Our analyses demonstrate that most natural products being published today bear structural similarity to previously published compounds, and that the range of scaffolds readily accessible from nature is limited. However, the analysis also shows that the field continues to discover appreciable numbers of natural products with no structural precedent. Together, these results suggest that the development of innovative discovery methods will continue to yield compounds with unique structural and biological properties.
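The novelty analysis described here rests on pairwise structural similarity to previously published compounds. As a hedged illustration (not the authors' pipeline, whose descriptor and threshold choices are not given in the abstract), a Morgan-fingerprint Tanimoto score computed with RDKit can rank a compound's resemblance to prior structures; the SMILES strings below are toy stand-ins for natural products.

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    smiles = ["CCO", "CCN", "c1ccccc1O"]  # toy stand-ins, not real natural products
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

    # A new compound's "novelty" can be scored as one minus its maximum
    # Tanimoto similarity to everything published before it.
    query = fps[-1]
    max_sim = max(DataStructs.TanimotoSimilarity(query, fp) for fp in fps[:-1])
    print(f"max Tanimoto to prior compounds: {max_sim:.2f} -> novelty {1 - max_sim:.2f}")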
Privacy preserving data publishing of categorical data through k-anonymity and feature selection.
Aristodimou, Aristos; Antoniades, Athos; Pattichis, Constantinos S
2016-03-01
In healthcare, there is a vast amount of patients' data, which can lead to important discoveries if combined. Due to legal and ethical issues, such data cannot be shared and hence such information is underused. A new area of research has emerged, called privacy preserving data publishing (PPDP), which aims to share data in a way that preserves privacy while keeping the information loss to a minimum. In this Letter, a new anonymisation algorithm for PPDP is proposed, which is based on k-anonymity through pattern-based multidimensional suppression (kPB-MS). The algorithm uses feature selection for reducing the data dimensionality and then combines attribute and record suppression for obtaining k-anonymity. Five datasets from different areas of the life sciences [RETINOPATHY, single-photon emission computed tomography imaging, gene sequencing and drug discovery (two datasets)] were anonymised with kPB-MS. The produced anonymised datasets were evaluated using four different classifiers and, in 74% of the test cases, they produced similar or better accuracies than using the full datasets.
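The record-suppression half of this idea is easy to show concretely. Below is a hedged pandas sketch, not the kPB-MS algorithm itself: it only checks and enforces k-anonymity by dropping rare quasi-identifier combinations, and the column names and k value are invented.

    import pandas as pd

    def enforce_k_anonymity(df: pd.DataFrame, quasi_identifiers, k: int) -> pd.DataFrame:
        # Drop records whose quasi-identifier combination occurs fewer than k times.
        sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
        return df[sizes >= k]

    df = pd.DataFrame({
        "age_band":  ["30-39", "30-39", "30-39", "40-49"],
        "zip3":      ["100", "100", "100", "112"],
        "diagnosis": ["A", "B", "A", "C"],   # sensitive attribute, untouched
    })
    print(enforce_k_anonymity(df, ["age_band", "zip3"], k=2))
    # the single 40-49/112 record is suppressed; the 30-39/100 group (size 3) survives

The full algorithm additionally applies feature selection and attribute (column) suppression before resorting to record suppression, which is what keeps the information loss low.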
Szabolcsi, Zoltán; Farkas, Zsuzsa; Borbély, Andrea; Bárány, Gusztáv; Varga, Dániel; Heinrich, Attila; Völgyi, Antónia; Pamjav, Horolma
2015-11-01
When the DNA profile from a crime scene matches that of a suspect, the weight of the DNA evidence depends on unbiased estimation of the match probability of the profiles. For this reason, it is necessary to establish and expand databases that reflect the actual allele frequencies in the population concerned. A total of 21,473 complete DNA profiles from Databank samples were used to establish the allele frequency database representing the population of Hungarian suspects. We used fifteen STR loci (PowerPlex ESI16), including five new ESS loci. The aim was to calculate the statistical, forensic efficiency parameters for the Databank samples and compare the newly collected data to the earlier report. The population substructure caused by relatedness may influence the estimated frequency of profiles. As our Databank profiles were considered non-random samples, possible relationships between the suspects can be assumed. Therefore, the population inbreeding effect was estimated using the FIS calculation. The overall inbreeding parameter was found to be 0.0106. Furthermore, we tested the impact of the two allele frequency datasets on 101 randomly chosen STR profiles, including full and partial profiles. The 95% confidence interval estimates for the profile frequencies (pM) resulted in a tighter range when we used the new dataset compared to the previously published ones. We found that the FIS had less effect on frequency values in the 21,473 samples than the application of minimum allele frequency. No genetic substructure was detected by STRUCTURE analysis. Due to the low level of inbreeding effect and the high number of samples, the new dataset provides unbiased and precise estimates of LR for statistical interpretation of forensic casework and allows us to use lower allele frequencies.
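For readers unfamiliar with how such allele-frequency tables feed into casework statistics, here is a back-of-envelope sketch. The single-F inbreeding adjustment shown (P_AA = p^2 + p(1-p)F, P_AB = 2pq(1-F)) is the textbook formulation, not necessarily the exact formulae used in the study; the allele frequencies and loci values are invented, and only the F of 0.0106 comes from the abstract.

    FIS = 0.0106  # overall inbreeding coefficient reported for the databank

    freqs = {  # locus -> {allele: frequency}; illustrative numbers only
        "D3S1358": {"15": 0.26, "16": 0.24},
        "TH01":    {"6": 0.23, "9.3": 0.30},
    }
    profile = {"D3S1358": ("15", "16"), "TH01": ("9.3", "9.3")}

    def genotype_freq(a1, a2, p, q, f):
        # Inbreeding-adjusted genotype frequency for one locus.
        if a1 == a2:
            return p * p + p * (1 - p) * f
        return 2 * p * q * (1 - f)

    rmp = 1.0
    for locus, (a1, a2) in profile.items():
        p, q = freqs[locus][a1], freqs[locus][a2]
        rmp *= genotype_freq(a1, a2, p, q, FIS)  # product across independent loci
    print(f"random match probability: {rmp:.3e}")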
Hourly mass and snow energy balance measurements from Mammoth Mountain, CA USA, 2011-2017
NASA Astrophysics Data System (ADS)
Bair, Edward H.; Davis, Robert E.; Dozier, Jeff
2018-03-01
The mass and energy balance of the snowpack govern its evolution. Direct measurement of these fluxes is essential for modeling the snowpack, yet there are few sites where all the relevant measurements are taken. Mammoth Mountain, CA USA, is home to the Cold Regions Research and Engineering Laboratory and University of California - Santa Barbara Energy Site (CUES), one of five energy balance monitoring sites in the western US. There is a ski patrol study site on Mammoth Mountain, called the Sesame Street Snow Study Plot, with automated snow and meteorological instruments where new snow is hand-weighed to measure its water content. There is also a site at Mammoth Pass with automated precipitation instruments. For this dataset, we present a clean and continuous hourly record of selected measurements from the three sites covering the 2011-2017 water years. Then, we model the snow mass balance at CUES and compare model runs to snow pillow measurements. The 2011-2017 period was marked by exceptional variability in precipitation, even for an area that has high year-to-year variability. The driest year on record, and one of the wettest years, occurred during this time period, making it ideal for studying climatic extremes. This dataset complements a previously published dataset from CUES containing a smaller subset of daily measurements. In addition to the hand-weighed SWE, novel measurements include hourly broadband snow albedo corrected for terrain and other measurement biases. This dataset is available with a digital object identifier: https://doi.org/10.21424/R4159Q.
The interfacial character of antibody paratopes: analysis of antibody-antigen structures.
Nguyen, Minh N; Pradhan, Mohan R; Verma, Chandra; Zhong, Pingyu
2017-10-01
In this study, computational methods are applied to investigate the general properties of the antigen-engaging residues of a paratope, using a non-redundant dataset of 403 antibody-antigen complexes, to dissect the contributions of hydrogen bonds, hydrophobic and van der Waals contacts and ionic interactions, as well as the role of water molecules at the antibody-antigen interface. Consistent with previous reports using smaller datasets, we found that Tyr, Trp, Ser, Asn, Asp, Thr, Arg, Gly and His contribute substantially to the interactions between antibody and antigen. Furthermore, antibody-antigen interactions can be mediated by interfacial waters. However, there has been no comprehensive analysis of the large number of structured waters that engage in higher-ordered structures at the antibody-antigen interface. In our dataset, we found interfacial waters in 242 complexes. We present evidence suggesting a compelling role for these interfacial waters in the interactions of antibodies with a range of antigens differing in shape complementarity. Finally, we carried out 296,835 pairwise 3D structure comparisons of 771 structures of antibody contact residues together with their interfacial water molecules using the CLICK method. A heuristic clustering algorithm was used to obtain unique structural similarities, which separated into 368 distinct clusters. These clusters are used to identify structural motifs of antibody contact residues for epitope binding. This clustering database of contact residues is freely accessible at http://mspc.bii.a-star.edu.sg/minhn/pclick.html.
Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia.
Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan
2016-07-01
Transcription factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for the genome coexpression changes observed in schizophrenia (SZ). This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs and the differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using a curated database. We detected a differential coexpression network associated with SZ and found that 9 of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA-TF-gene network relevant to SZ, including an EGR1-miR-124-3p-SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated overexpression of miR-124-3p and underexpression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases was reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action of TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease.
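A differential coexpression analysis of the kind described here boils down to testing whether a gene pair's correlation differs between cases and controls. A minimal sketch using Fisher's z on synthetic data follows; it is illustrative only and not the authors' DCG pipeline.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Toy expression values for two genes: correlated in cases, not in controls.
    case = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], 60)
    ctrl = rng.multivariate_normal([0, 0], [[1, 0.0], [0.0, 1]], 60)

    def fisher_z_diff(x1, y1, x2, y2):
        # Compare two Pearson correlations via Fisher's z transform.
        r1, r2 = np.corrcoef(x1, y1)[0, 1], np.corrcoef(x2, y2)[0, 1]
        se = np.sqrt(1 / (len(x1) - 3) + 1 / (len(x2) - 3))
        z = (np.arctanh(r1) - np.arctanh(r2)) / se
        return r1, r2, 2 * stats.norm.sf(abs(z))  # two-sided p-value

    r_case, r_ctrl, p = fisher_z_diff(case[:, 0], case[:, 1], ctrl[:, 0], ctrl[:, 1])
    print(f"r(case)={r_case:.2f}, r(control)={r_ctrl:.2f}, p={p:.2g}")

Gene pairs with significant correlation differences form the edges of the differential coexpression network; hub regulators of that network are then candidates for the TF and miRNA analysis described above.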
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stelpflug, Scott C.; Sekhon, Rajandeep S.; Vaillancourt, Brieanne; ...
2015-12-30
Comprehensive and systematic transcriptome profiling provides valuable insight into biological and developmental processes that occur throughout the life cycle of a plant. We have enhanced our previously published microarray-based gene atlas of maize (Zea mays L.) inbred B73 to now include 79 distinct replicated samples that have been interrogated using RNA sequencing (RNA-seq). The current version of the atlas includes 50 original array-based gene atlas samples, a time-course of 12 stalk and leaf samples postflowering, and an additional set of 17 samples from the maize seedling and adult root system. The entire dataset contains 4.6 billion mapped reads, with an average of 20.5 million mapped reads per biological replicate, allowing for detection of genes with lower transcript abundance. As the new root samples represent key additions to the previously examined tissues, we highlight insights into the root transcriptome, which is represented by 28,894 (73.2%) annotated genes in maize. Additionally, we observed remarkable expression differences across both the longitudinal (four zones) and radial gradients (cortical parenchyma and stele) of the primary root supported by fourfold differential expression of 9353 and 4728 genes, respectively. Among the latter were 1110 genes that encode transcription factors, some of which are orthologs of previously characterized transcription factors known to regulate root development in Arabidopsis thaliana (L.) Heynh., while most are novel, and represent attractive targets for reverse genetics approaches to determine their roles in this important organ. As a result, this comprehensive transcriptome dataset is a powerful tool toward understanding maize development, physiology, and phenotypic diversity.
Ulmer, Megan; Li, Jun; Yaspan, Brian L; Ozel, Ayse Bilge; Richards, Julia E; Moroi, Sayoko E; Hawthorne, Felicia; Budenz, Donald L; Friedman, David S; Gaasterland, Douglas; Haines, Jonathan; Kang, Jae H; Lee, Richard; Lichter, Paul; Liu, Yutao; Pasquale, Louis R; Pericak-Vance, Margaret; Realini, Anthony; Schuman, Joel S; Singh, Kuldev; Vollrath, Douglas; Weinreb, Robert; Wollstein, Gadi; Zack, Donald J; Zhang, Kang; Young, Terri; Allingham, R Rand; Wiggs, Janey L; Ashley-Koch, Allison; Hauser, Michael A
2012-07-03
To investigate the effects of central corneal thickness (CCT)-associated variants on primary open-angle glaucoma (POAG) risk using single nucleotide polymorphisms (SNP) data from the Glaucoma Genes and Environment (GLAUGEN) and National Eye Institute (NEI) Glaucoma Human Genetics Collaboration (NEIGHBOR) consortia. A replication analysis of previously reported CCT SNPs was performed in a CCT dataset (n = 1117) and these SNPs were then tested for association with POAG using a larger POAG dataset (n = 6470). Then a CCT genome-wide association study (GWAS) was performed. Top SNPs from this analysis were selected and tested for association with POAG. cDNA libraries from fetal and adult brain and ocular tissue samples were generated and used for candidate gene expression analysis. Association with one of 20 previously published CCT SNPs was replicated: rs12447690, near the ZNF469 gene (P = 0.001; β = -5.08 μm/allele). None of these SNPs were significantly associated with POAG. In the CCT GWAS, no SNPs reached genome-wide significance. After testing 50 candidate SNPs for association with POAG, one SNP was identified, rs7481514 within the neurotrimin (NTM) gene, that was significantly associated with POAG in a low-tension subset (P = 0.00099; Odds Ratio [OR] = 1.28). Additionally, SNPs in the CNTNAP4 gene showed suggestive association with POAG (top SNP = rs1428758; P = 0.018; OR = 0.84). NTM and CNTNAP4 were shown to be expressed in ocular tissues. The results suggest previously reported CCT loci are not significantly associated with POAG susceptibility. By performing a quantitative analysis of CCT and a subsequent analysis of POAG, SNPs in two cell adhesion molecules, NTM and CNTNAP4, were identified and may increase POAG susceptibility in a subset of cases.
Fast Construction of Near Parsimonious Hybridization Networks for Multiple Phylogenetic Trees.
Mirzaei, Sajad; Wu, Yufeng
2016-01-01
Hybridization networks represent plausible evolutionary histories of species that are affected by reticulate evolutionary processes. An established computational problem on hybridization networks is constructing the most parsimonious hybridization network such that each of the given phylogenetic trees (called gene trees) is "displayed" in the network. There have been several previous approaches, including an exact method and several heuristics, for this NP-hard problem. However, the exact method is only applicable to a limited range of data, and heuristic methods can be less accurate and sometimes slow. In this paper, we develop a new algorithm for constructing near-parsimonious networks for multiple binary gene trees. This method is more efficient for large numbers of gene trees than previous heuristics. It also produces more parsimonious results than a previous method on many simulated datasets as well as on a real biological dataset. We also show that our method produces topologically more accurate networks for many datasets.
Similarity of markers identified from cancer gene expression studies: observations from GEO.
Shi, Xingjie; Shen, Shihao; Liu, Jin; Huang, Jian; Zhou, Yong; Ma, Shuangge
2014-09-01
Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data.
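Quantifying how few genes two marker lists share, as this study does across GEO datasets, reduces to simple set arithmetic. A small sketch with the Jaccard index on invented gene lists:

    def jaccard(a: set, b: set) -> float:
        # Size of the intersection divided by size of the union.
        return len(a & b) / len(a | b) if a | b else 0.0

    markers = {  # dataset id -> genes flagged as cancer-associated (toy lists)
        "GSE_A": {"TP53", "MYC", "EGFR", "KRAS"},
        "GSE_B": {"TP53", "BRCA1", "PTEN"},
        "GSE_C": {"MYC", "EGFR", "VEGFA"},
    }
    ids = sorted(markers)
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            print(f"{x} vs {y}: Jaccard = {jaccard(markers[x], markers[y]):.2f}")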
Giesbrecht, Melissa; Crooks, Valorie A; Castleden, Heather; Schuurman, Nadine; Skinner, Mark W; Williams, Allison M
2016-09-01
In 2010, Castleden and colleagues published a paper in this journal using the concept of 'place' as an analytic tool to understand the nature of palliative care provision in a rural region in British Columbia, Canada. This publication was based upon pilot data collected for a larger research project that has since been completed. With the addition of 40 semi-structured interviews with users and providers of palliative care in four other rural communities located across Canada, we revisit Castleden and colleagues' (2010) original framework. Applying the concept of place to the full dataset confirmed the previously published findings, but also revealed two new place-based dimensions related to experiences of rural palliative care in Canada: (1) borders and boundaries; and (2) 'making' place for palliative care progress. These new findings offer a refined understanding of the complex interconnections between various dimensions of place and palliative care in rural Canada.
Takata, Atsushi; Miyake, Noriko; Tsurusaki, Yoshinori; Fukai, Ryoko; Miyatake, Satoko; Koshimizu, Eriko; Kushima, Itaru; Okada, Takashi; Morikawa, Mako; Uno, Yota; Ishizuka, Kanako; Nakamura, Kazuhiko; Tsujii, Masatsugu; Yoshikawa, Takeo; Toyota, Tomoko; Okamoto, Nobuhiko; Hiraki, Yoko; Hashimoto, Ryota; Yasuda, Yuka; Saitoh, Shinji; Ohashi, Kei; Sakai, Yasunari; Ohga, Shouichi; Hara, Toshiro; Kato, Mitsuhiro; Nakamura, Kazuyuki; Ito, Aiko; Seiwa, Chizuru; Shirahata, Emi; Osaka, Hitoshi; Matsumoto, Ayumi; Takeshita, Saoko; Tohyama, Jun; Saikusa, Tomoko; Matsuishi, Toyojiro; Nakamura, Takumi; Tsuboi, Takashi; Kato, Tadafumi; Suzuki, Toshifumi; Saitsu, Hirotomo; Nakashima, Mitsuko; Mizuguchi, Takeshi; Tanaka, Fumiaki; Mori, Norio; Ozaki, Norio; Matsumoto, Naomichi
2018-01-16
Recent studies have established important roles of de novo mutations (DNMs) in autism spectrum disorders (ASDs). Here, we analyze DNMs in 262 ASD probands of Japanese origin and confirm the "de novo paradigm" of ASDs across ethnicities. Based on this consistency, we combine the lists of damaging DNMs in our and published ASD cohorts (total number of trios, 4,244) and perform integrative bioinformatics analyses. Besides replicating the findings of previous studies, our analyses highlight ATP-binding genes and fetal cerebellar/striatal circuits. Analysis of individual genes identified 61 genes enriched for damaging DNMs, including ten genes for which our dataset now contributes to statistical significance. Screening of compounds altering the expression of genes hit by damaging DNMs reveals a global downregulating effect of valproic acid, a known risk factor for ASDs, whereas cardiac glycosides upregulate these genes. Collectively, our integrative approach provides deeper biological and potential medical insights into ASDs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rawle, Rachel A.; Hamerly, Timothy; Tripet, Brian P.
Studies of interspecies interactions are inherently difficult due to the complex mechanisms which enable these relationships. A model system for studying interspecies interactions is the marine hyperthermophiles Ignicoccus hospitalis and Nanoarchaeum equitans. Recent independently-conducted 'omics' analyses have generated insights into the molecular factors modulating this association. However, significant questions remain about the nature of the interactions between these archaea. We jointly analyzed multiple levels of omics datasets obtained from published, independent transcriptomics, proteomics, and metabolomics analyses. DAVID identified functionally-related groups enriched when I. hospitalis is grown alone or in co-culture with N. equitans. Enriched molecular pathways were subsequently visualized using interaction maps generated using STRING. Key findings of our multi-level omics analysis indicated that I. hospitalis provides precursors to N. equitans for energy metabolism. Analysis indicated an overall reduction in diversity of metabolic precursors in the I. hospitalis–N. equitans co-culture, which has been connected to the differential use of ribosomal subunits and was previously unnoticed. We also identified differences in precursors linked to amino acid metabolism, NADH metabolism, and carbon fixation, providing new insights into the metabolic adaptions of I. hospitalis enabling the growth of N. equitans. In conclusion, this multi-omics analysis builds upon previously identified cellular patterns while offering new insights into mechanisms that enable the I. hospitalis–N. equitans association. This study applies statistical and visualization techniques to a mixed-source omics dataset to yield a more global insight into a complex system that was not readily discernible from separate omics studies.
Phenotypic Association Analyses With Copy Number Variation in Recurrent Depressive Disorder.
Rucker, James J H; Tansey, Katherine E; Rivera, Margarita; Pinto, Dalila; Cohen-Woods, Sarah; Uher, Rudolf; Aitchison, Katherine J; Craddock, Nick; Owen, Michael J; Jones, Lisa; Jones, Ian; Korszun, Ania; Barnes, Michael R; Preisig, Martin; Mors, Ole; Maier, Wolfgang; Rice, John; Rietschel, Marcella; Holsboer, Florian; Farmer, Anne E; Craig, Ian W; Scherer, Stephen W; McGuffin, Peter; Breen, Gerome
2016-02-15
Defining the molecular genomic basis of the likelihood of developing depressive disorder is a considerable challenge. We previously associated rare, exonic deletion copy number variants (CNV) with recurrent depressive disorder (RDD). Sex chromosome abnormalities also have been observed to co-occur with RDD. In this reanalysis of our RDD dataset (N = 3106 cases; 459 screened control samples and 2699 population control samples), we further investigated the role of larger CNVs and chromosomal abnormalities in RDD and performed association analyses with clinical data derived from this dataset. We found an enrichment of Turner's syndrome among cases of depression compared with the frequency observed in a large population sample (N = 34,910) of live-born infants collected in Denmark (two-sided p = .023, odds ratio = 7.76 [95% confidence interval = 1.79-33.6]), a case of diploid/triploid mosaicism, and several cases of uniparental isodisomy. In contrast to our previous analysis, large deletion CNVs were no more frequent in cases than control samples, although deletion CNVs in cases contained more genes than control samples (two-sided p = .0002). After statistical correction for multiple comparisons, our data do not support a substantial role for CNVs in RDD, although (as has been observed in similar samples) occasional cases may harbor large variants with etiological significance. Genetic pleiotropy and sample heterogeneity suggest that very large sample sizes are required to study conclusively the role of genetic variation in mood disorders.
Identification of DNA-Binding Proteins Using Structural, Electrostatic and Evolutionary Features
Nimrod, Guy; Szilágyi, András; Leslie, Christina; Ben-Tal, Nir
2009-01-01
DNA-binding proteins (DBPs) often take part in various crucial processes of the cell's life cycle. Therefore, the identification and characterization of these proteins are of great importance. We present here a random forests classifier for identifying DBPs among proteins with known three-dimensional structures. First, clusters of evolutionarily conserved regions (patches) on the protein's surface are detected using the PatchFinder algorithm; previous studies showed that these regions are typically the protein's functionally important regions. Next, we train a classifier using features such as the electrostatic potential, cluster-based amino acid conservation patterns and the secondary structure content of the patches, as well as features of the whole protein, including its dipole moment. Using 10-fold cross-validation on a dataset of 138 DNA-binding proteins and 110 proteins that do not bind DNA, the classifier achieved a sensitivity and a specificity of 0.90, which is overall better than the performance of previously published methods. Furthermore, when we tested 5 different methods on 11 new DBPs that did not appear in the original dataset, only our method annotated all of them correctly. The resulting classifier was applied to a collection of 757 proteins of known structure and unknown function. Of these proteins, 218 were predicted to bind DNA, and we anticipate that some of them interact with DNA using new structural motifs. The use of complementary computational tools supports the notion that at least some of them do bind DNA. PMID:19233205
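The classification setup described here maps naturally onto scikit-learn. The sketch below mirrors the 10-fold cross-validation design on a 138/110 split, but the features are random placeholders for the paper's patch-based electrostatic, conservation and secondary-structure descriptors, so it illustrates the workflow rather than reproducing the result.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(42)
    n_proteins, n_features = 248, 12               # 138 DBPs + 110 non-DBPs
    X = rng.normal(size=(n_proteins, n_features))  # stand-ins for patch features
    y = np.concatenate([np.ones(138), np.zeros(110)])  # 1 = binds DNA

    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    scores = cross_val_score(clf, X, y, cv=10)     # 10-fold CV, as in the paper
    print(f"mean CV accuracy on random features: {scores.mean():.2f}")

With real, informative features in X, the same two calls produce the cross-validated sensitivity/specificity estimates the abstract reports.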
Arnold, L. Rick
2010-01-01
These datasets were compiled in support of U.S. Geological Survey Scientific-Investigations Report 2010-5082-Hydrogeology and Steady-State Numerical Simulation of Groundwater Flow in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. The datasets were developed by the U.S. Geological Survey in cooperation with the Lost Creek Ground Water Management District and the Colorado Geological Survey. The four datasets are described as follows and methods used to develop the datasets are further described in Scientific-Investigations Report 2010-5082: (1) ds507_regolith_data: This point dataset contains geologic information concerning regolith (unconsolidated sediment) thickness and top-of-bedrock altitude at selected well and test-hole locations in and near the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Data were compiled from published reports, consultant reports, and from lithologic logs of wells and test holes on file with the U.S. Geological Survey Colorado Water Science Center and the Colorado Division of Water Resources. (2) ds507_regthick_contours: This dataset consists of contours showing generalized lines of equal regolith thickness overlying bedrock in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Regolith thickness was contoured manually on the basis of information provided in the dataset ds507_regolith_data. (3) ds507_regthick_grid: This dataset consists of raster-based generalized thickness of regolith overlying bedrock in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Regolith thickness in this dataset was derived from contours presented in the dataset ds507_regthick_contours. (4) ds507_welltest_data: This point dataset contains estimates of aquifer transmissivity and hydraulic conductivity at selected well locations in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. This dataset also contains hydrologic information used to estimate transmissivity from specific capacity at selected well locations. Data were compiled from published reports, consultant reports, and from well-test records on file with the U.S. Geological Survey Colorado Water Science Center and the Colorado Division of Water Resources.
Schure, Mark R; Davis, Joe M
2017-11-10
Orthogonality metrics (OMs) for three and higher dimensional separations are proposed as extensions of previously developed OMs, which were used to evaluate the zone utilization of two-dimensional (2D) separations. These OMs include correlation coefficients, dimensionality, information theory metrics and convex-hull metrics. In a number of these cases, lower-dimensional subspace metrics exist and can be readily calculated. The metrics are used to interpret previously generated experimental data. The experimental datasets are derived from Gilar's peptide data, now modified to be three-dimensional (3D), and a comprehensive 3D chromatogram from Moore and Jorgenson. The Moore and Jorgenson chromatogram, which has 25 identifiable 3D volume elements or peaks, displayed good orthogonality values over all dimensions. However, OMs based on discretization of the 3D space changed substantially with changes in binning parameters. This example highlights the importance in higher dimensions of having an abundant number of retention times as data points, especially for methods that use discretization. The Gilar data, which in a previous study produced 21 2D datasets by the pairing of 7 one-dimensional separations, were reinterpreted to produce 35 3D datasets. These datasets show a number of interesting properties, one of which is that geometric and harmonic means of lower-dimensional subspace (i.e., 2D) OMs correlate well with the higher-dimensional (i.e., 3D) OMs. The space utilization of the Gilar 3D datasets was ranked using OMs, with the retention times of the datasets having the largest and smallest OMs presented as graphs. A discussion concerning the orthogonality of higher-dimensional techniques is given, with emphasis on molecular diversity in chromatographic separations. In the information theory work, an inconsistency is found in previous studies of orthogonality using the 2D metric often identified as %O. A new choice of metric is proposed, extended to higher dimensions, characterized by mixes of ordered and random retention times, and applied to the experimental datasets. In 2D, the new metric always equals or exceeds the original one. However, results from both the original and new methods are given.
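As one concrete example of a convex-hull orthogonality metric of the kind discussed here, the fraction of the normalized 2D retention space covered by the hull of the data points can be computed directly; this sketch uses random points and is not tied to the Gilar or Moore and Jorgenson datasets.

    import numpy as np
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(3)
    rt = rng.random((50, 2))          # normalized (t1, t2) retention-time pairs

    hull = ConvexHull(rt)
    coverage = hull.volume            # for 2D points, .volume is the hull area
    print(f"fraction of 2D separation space used: {coverage:.2f}")

Because the retention times are normalized to the unit square, the hull area equals the covered fraction directly; in 3D the same ConvexHull call applies, with .volume then measuring the hull volume within the unit cube.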
Publishing NASA Metadata as Linked Open Data for Semantic Mashups
NASA Astrophysics Data System (ADS)
Wilson, Brian; Manipon, Gerald; Hua, Hook
2014-05-01
Data providers are now publishing more metadata in more interoperable forms, e.g. Atom or RSS 'casts', as Linked Open Data (LOD), or as ISO Metadata records. A major effort on the part of NASA's Earth Science Data and Information System (ESDIS) project is the aggregation of metadata that enables greater data interoperability among scientific datasets regardless of source or application. Both the Earth Observing System (EOS) ClearingHOuse (ECHO) and the Global Change Master Directory (GCMD) repositories contain metadata records for NASA (and other) datasets and provided services. These records contain typical fields for each dataset (or software service) such as the source, creation date, cognizant institution, related access URLs, and domain and variable keywords to enable discovery. Under a NASA ACCESS grant, we demonstrated how to publish the ECHO and GCMD dataset and services metadata as LOD in the RDF format. Both sets of metadata are now queryable at SPARQL endpoints and available for integration into "semantic mashups" in the browser. It is straightforward to reformat sets of XML metadata, including ISO, into simple RDF and then later refine and improve the RDF predicates by reusing known namespaces such as Dublin Core, GeoRSS, etc. All scientific metadata should be part of the LOD world. In addition, we developed an "instant" drill-down and browse interface that provides faceted navigation so that the user can discover and explore the 25,000 datasets and 3000 services. The available facets and the free-text search box appear in the left panel, and the instantly updated results for the dataset search appear in the right panel. The user can constrain the value of a metadata facet simply by clicking on a word (or phrase) in the "word cloud" of values for each facet. The display section for each dataset includes the important metadata fields, a full description of the dataset, potentially some related URLs, and a "search" button that points to an OpenSearch GUI that is pre-configured to search for granules within the dataset. We will present our experiences with converting NASA metadata into LOD, discuss the challenges, illustrate some of the enabled mashups, and demonstrate the latest version of the "instant browse" interface for navigating multiple metadata collections.
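Once such metadata sit behind a SPARQL endpoint, queries like the following become possible. This is a hedged sketch: the endpoint URL is hypothetical and the dc:title predicate is one plausible mapping, since the abstract does not spell out the vocabulary actually used.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical endpoint; substitute the real SPARQL service URL.
    sparql = SPARQLWrapper("https://example.org/nasa-metadata/sparql")
    sparql.setQuery("""
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        SELECT ?dataset ?title WHERE {
            ?dataset dc:title ?title .
            FILTER regex(?title, "aerosol", "i")
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["dataset"]["value"], "-", row["title"]["value"])

Reusing common namespaces such as Dublin Core is what makes queries like this portable across metadata collections, which is the point of the LOD conversion described above.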
NASA Astrophysics Data System (ADS)
Copas, K.; Legind, J. K.; Hahn, A.; Braak, K.; Høftt, M.; Noesgaard, D.; Robertson, T.; Méndez Hernández, F.; Schigel, D.; Ko, C.
2017-12-01
GBIF—the Global Biodiversity Information Facility—has recently demonstrated a system that tracks publications back to individual datasets, giving data providers demonstrable evidence of the benefit and utility of sharing data to support an array of scholarly topics and practical applications. GBIF is an open-data network and research infrastructure funded by the world's governments. Its community consists of more than 90 formal participants and almost 1,000 data-publishing institutions, which currently make tens of thousands of datasets containing nearly 800 million species occurrence records freely and publicly available for discovery, use and reuse across a wide range of biodiversity-related research and policy investigations. Starting in 2015 with the help of DataONE, GBIF introduced DOIs as persistent identifiers for the datasets shared through its network. This enhancement soon extended to the assignment of DOIs to user downloads from GBIF.org, which typically filter the available records with a variety of taxonomic, geographic, temporal and other search terms. Despite the lack of widely accepted standards for citing data among researchers and publications, this technical infrastructure is beginning to take hold and support open, transparent, persistent and repeatable use and reuse of species occurrence data. These "download DOIs" provide canonical references for the search results researchers process and use in peer-reviewed articles—a practice GBIF encourages by confirming new DOIs with each download and offering guidelines on citation. GBIF has recently started linking these citation results back to dataset and publisher pages, offering more consistent, traceable evidence of the value of sharing data to support others' research. GBIF's experience may be a useful model for other repositories to follow.
Rescuing Paleomagnetic Data from Deep-Sea Cores Through the IEDA-CCNY Data Internship Program
NASA Astrophysics Data System (ADS)
Ismail, A.; Randel, C.; Palumbo, R. V.; Carter, M.; Cai, Y.; Kent, D. V.; Lehnert, K.; Block, K. A.
2016-12-01
Paleomagnetic data provide essential information for evaluating the chronostratigraphy of sedimentary cores. Lamont research vessels Vema and Robert Conrad collected over 10,000 deep-sea sediment cores around the world from 1953 to 1989. About 10% of these cores have been sampled for paleomagnetic analyses at Lamont. Over the years, only 10% of these paleomagnetic records have been published. Moreover, data listings were only rarely made available in older publications because electronic appendices were not available and cyberinfrastructure was not in place for publishing and preserving these data. As a result, the majority of these datasets exist only as fading computer printouts in binders on the investigators' bookshelves. This summer, undergraduate students from the NSF-funded IEDA-CCNY Data Internship Program started digitizing this enormous dataset under the supervision of Dennis Kent, the current custodian of the data, one of the investigators who oversaw some of the data collection, and an active leader in the field. Undergraduate students worked on digitizing paper records, proof-reading and organizing the data sheets for future integration into an appropriate repository. Through observing and plotting the data, the students learned how sediment cores and paleomagnetic data are collected and used in research, and learned best practices in data publishing and preservation from IEDA (Interdisciplinary Earth Data Alliance) team members. The students also compared different optical character recognition (OCR) software packages and established an efficient workflow for digitizing these datasets. These datasets will eventually be incorporated into the Magnetics Information Consortium (MagIC), so that they can be easily compared with similar datasets and have the potential to generate new findings. Through this data rescue project, the students had the opportunity to learn about an important field of scientific research and to interact with world-class scientists.
Inadequate Reference Datasets Biased toward Short Non-epitopes Confound B-cell Epitope Prediction*
Rahman, Kh. Shamsur; Chowdhury, Erfan Ullah; Sachse, Konrad; Kaltenboeck, Bernhard
2016-01-01
X-ray crystallography has shown that an antibody paratope typically binds 15–22 amino acids (aa) of an epitope, of which 2–5 randomly distributed amino acids contribute most of the binding energy. In contrast, researchers typically choose short peptide antigens for B-cell epitope mapping in antibody binding assays. Furthermore, short 6–11-aa epitopes, and in particular non-epitopes, are over-represented in published B-cell epitope datasets that are commonly used for development of B-cell epitope prediction approaches from protein antigen sequences. We hypothesized that such suboptimal-length peptides result in weak antibody binding and cause false-negative results. We tested the influence of peptide antigen length on antibody binding by analyzing data on more than 900 peptides used for B-cell epitope mapping of immunodominant proteins of Chlamydia spp. We demonstrate that short 7–12-aa peptides of B-cell epitopes bind antibodies poorly; thus, epitope mapping with short peptide antigens falsely classifies many B-cell epitopes as non-epitopes. We also show, in published datasets of confirmed epitopes and non-epitopes, a direct correlation between the length of peptide antigens and antibody binding. Elimination of short, ≤11-aa epitope/non-epitope sequences improved datasets for evaluation of in silico B-cell epitope prediction. Achieving up to 86% accuracy, protein disorder tendency is the best indicator of B-cell epitope regions for chlamydial and published datasets. For B-cell epitope prediction, the most effective approach is plotting disorder of protein sequences with the IUPred-L scale, followed by antibody reactivity testing of 16–30-aa peptides from peak regions. This strategy overcomes the well-known inaccuracy of in silico B-cell epitope prediction from primary protein sequences. PMID:27189949
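The final recommendation here (scan a disorder profile for peak regions, then cut 16–30-aa peptides for testing) can be sketched in a few lines. The scores below are invented stand-ins for parsed IUPred-L output, and the 0.5 threshold is an illustrative choice, not a value from the paper.

    import random

    def peak_regions(scores, threshold=0.5, min_len=16, max_len=30):
        # Find runs of residues with disorder score at or above the threshold,
        # keeping only runs long enough to yield a testable peptide.
        regions, start = [], None
        for i, s in enumerate(scores + [0.0]):     # sentinel closes a trailing run
            if s >= threshold and start is None:
                start = i
            elif s < threshold and start is not None:
                if i - start >= min_len:
                    regions.append((start, min(i, start + max_len)))
                start = None
        return regions

    random.seed(7)
    scores = ([0.2] * 40 + [0.8 + random.uniform(-0.1, 0.1) for _ in range(20)]
              + [0.3] * 40)
    sequence = "M" * 100                           # placeholder protein sequence
    for a, b in peak_regions(scores):
        print(f"candidate peptide, residues {a + 1}-{b}: {sequence[a:b]}")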
Observational evidence of seasonality in the timing of loop current eddy separation
NASA Astrophysics Data System (ADS)
Hall, Cody A.; Leben, Robert R.
2016-12-01
Observational datasets, reports and analyses over the time period from 1978 through 1992 are reviewed to derive pre-altimetry Loop Current (LC) eddy separation dates. The reanalysis identified 20 separation events in the 15-year record. Separation dates are estimated to be accurate to approximately ± 1.5 months and sufficient to detect statistically significant LC eddy separation seasonality, which was not the case for previously published records because of the misidentification of separation events and their timing. The reanalysis indicates that previously reported LC eddy separation dates, determined for the time period before the advent of continuous altimetric monitoring in the early 1990s, are inaccurate because of extensive reliance on satellite sea surface temperature (SST) imagery. Automated LC tracking techniques are used to derive LC eddy separation dates in three different altimetry-based sea surface height (SSH) datasets over the time period from 1993 through 2012. A total of 28-30 LC eddy separation events were identified in the 20-year record. Variations in the number and dates of eddy separation events are attributed to the different mean sea surfaces and objective-analysis smoothing procedures used to produce the SSH datasets. Significance tests on various altimetry and pre-altimetry/altimetry combined date lists consistently show that the seasonal distribution of separation events is not uniform at the 95% confidence level. Randomization tests further show that the seasonal peak in LC eddy separation events in August and September is highly unlikely to have occurred by chance. The other seasonal peak in February and March is less significant, but possibly indicates two seasons of enhanced probability of eddy separation centered near the spring and fall equinoxes. This is further quantified by objectively dividing the seasonal distribution into two seasons using circular statistical techniques and a k-means clustering algorithm. The estimated spring and fall centers are March 2nd and August 23rd, respectively, with season boundaries in May and December.
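The circular-statistics step mentioned above treats each separation date as an angle on the annual cycle. A minimal sketch, with illustrative day-of-year values rather than the reanalysis record:

    import numpy as np

    days = np.array([235, 242, 250, 60, 245, 238, 65, 255])  # toy day-of-year values
    theta = 2 * np.pi * days / 365.25

    # The mean resultant vector: its angle is the circular mean date, and its
    # length R measures concentration (R near 1 = strongly seasonal dates).
    C, S = np.cos(theta).mean(), np.sin(theta).mean()
    mean_day = (np.arctan2(S, C) % (2 * np.pi)) * 365.25 / (2 * np.pi)
    R = np.hypot(C, S)
    print(f"circular mean day-of-year: {mean_day:.0f}, concentration R: {R:.2f}")

For a bimodal record like the one described (late-summer and late-winter peaks), R stays low for the pooled data, which is why the study splits the events into two seasons with k-means before computing per-season statistics.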
Arnold, David T; Rowen, Donna; Versteegh, Matthijs M; Morley, Anna; Hooper, Clare E; Maskell, Nicholas A
2015-01-23
For cancer studies in which the EQ-5D was not administered, utilities can be estimated from the EORTC QLQ-C30 using existing mapping algorithms. Several mapping algorithms exist for this transformation; however, they tend to lose accuracy for patients in poor health states. The aim of this study was to test all existing mapping algorithms of the QLQ-C30 onto the EQ-5D in a dataset of patients with malignant pleural mesothelioma, an invariably fatal malignancy for which no previous mapping estimate has been published. Health-related quality of life (HRQoL) data in which both the EQ-5D and QLQ-C30 were administered simultaneously were obtained from the UK-based prospective observational SWAMP (South West Area Mesothelioma and Pemetrexed) trial. In the original trial, 73 patients with pleural mesothelioma were offered palliative chemotherapy and their HRQoL was assessed across five time points. These data were used to test the nine available mapping algorithms found in the literature, comparing predicted against observed EQ-5D values. The ability of the algorithms to predict the mean, minimise error and detect clinically significant differences was assessed. The dataset comprised a total of 250 observations across 5 time points. The linear regression mapping algorithms tested generally performed poorly, over-estimating predicted relative to observed EQ-5D values, especially when the observed EQ-5D was below 0.5. The best performing algorithm used a response mapping method and predicted the mean EQ-5D accurately, with an average root mean squared error of 0.17 (standard deviation: 0.22). This algorithm reliably discriminated between clinically distinct subgroups seen in the primary dataset. This study tested mapping algorithms in a population with poor health states, where they have previously been shown to perform poorly. Further research into EQ-5D estimation should be directed at response mapping methods, given their superior performance in this study.
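The algorithm comparison described here hinges on simple error summaries of predicted versus observed utilities. A toy sketch (numbers invented; real predictions would come from the published algorithms):

    import numpy as np

    observed = np.array([0.71, 0.45, 0.12, -0.02, 0.60])  # observed EQ-5D utilities
    predictions = {
        # a linear model that over-predicts poor health states, as reported above
        "linear_A": np.array([0.78, 0.60, 0.35, 0.30, 0.66]),
        "response_mapping": np.array([0.70, 0.48, 0.18, 0.05, 0.58]),
    }
    for name, pred in predictions.items():
        rmse = np.sqrt(np.mean((pred - observed) ** 2))
        print(f"{name}: RMSE = {rmse:.3f}")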
Can we set a global threshold age to define mature forests?
Martin, Philip; Jung, Martin; Brearley, Francis Q; Ribbons, Relena R; Lines, Emily R; Jacob, Aerin L
2016-01-01
Globally, mature forests appear to be increasing in biomass density (BD). There is disagreement whether these increases are the result of increases in atmospheric CO2 concentrations or a legacy effect of previous land-use. Recently, it was suggested that a threshold of 450 years should be used to define mature forests and that many forests increasing in BD may be younger than this. However, the study making these suggestions failed to account for the interactions between forest age and climate. Here we revisit the issue to identify: (1) how climate and forest age control global forest BD and (2) whether we can set a threshold age for mature forests. Using data from previously published studies we modelled the impacts of forest age and climate on BD using linear mixed effects models. We examined the potential biases in the dataset by comparing how representative it was of global mature forests in terms of its distribution, the climate space it occupied, and the ages of the forests used. BD increased with forest age, mean annual temperature and annual precipitation. Importantly, the effect of forest age increased with increasing temperature, but the effect of precipitation decreased with increasing temperatures. The dataset was biased towards northern hemisphere forests in relatively dry, cold climates. The dataset was also clearly biased towards forests <250 years of age. Our analysis suggests that there is not a single threshold age for forest maturity. Since climate interacts with forest age to determine BD, a threshold age at which they reach equilibrium can only be determined locally. We caution against using BD as the only determinant of forest maturity since this ignores forest biodiversity and tree size structure which may take longer to recover. Future research should address the utility and cost-effectiveness of different methods for determining whether forests should be classified as mature.
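The model structure described here (biomass density as a function of forest age interacting with climate, with study-level random effects) can be written compactly with statsmodels. The sketch below simulates data with assumed coefficients purely to show the specification; it is not the paper's fitted model, and all column names are invented.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(11)
    n = 300
    df = pd.DataFrame({
        "age": rng.uniform(20, 500, n),
        "mat": rng.uniform(-5, 25, n),          # mean annual temperature, deg C
        "map_mm": rng.uniform(300, 3000, n),    # annual precipitation, mm
        "study": rng.integers(0, 25, n),        # source publication id
    })
    # Assumed data-generating coefficients, including an age x temperature
    # interaction like the one the study reports.
    df["bd"] = (50 + 0.3 * df.age + 2 * df.mat + 0.01 * df.map_mm
                + 0.02 * df.age * df.mat + rng.normal(0, 30, n))

    # Random intercept per source study absorbs between-study differences.
    m = smf.mixedlm("bd ~ age * mat + map_mm", df, groups="study").fit()
    print(m.summary())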
Evaluation of a Traffic Sign Detector by Synthetic Image Data for Advanced Driver Assistance Systems
NASA Astrophysics Data System (ADS)
Hanel, A.; Kreuzpaintner, D.; Stilla, U.
2018-05-01
Recently, several synthetic image datasets of street scenes have been published. These datasets contain various traffic signs and can therefore be used to train and test machine learning-based traffic sign detectors. In this contribution, selected datasets are compared regarding their applicability for traffic sign detection. The comparison covers the process used to produce the synthetic images, the virtual worlds needed to produce them and their environmental conditions, as well as variations in the appearance of traffic signs and the labeling strategies used for the datasets. A deep learning traffic sign detector is trained on multiple training datasets with different ratios of synthetic to real training samples to evaluate the synthetic SYNTHIA dataset. A test of the detector on real samples only has shown that an overall accuracy and ROC AUC of more than 95 % can be achieved for both small and large rates of synthetic samples in the training dataset.
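A sketch of how training sets with different synthetic-to-real ratios might be composed; the function, tuple format and ratio sweep are illustrative assumptions, not the authors' pipeline.

```python
import random

def build_training_set(real_samples, synthetic_samples, synthetic_ratio, size):
    """Compose a training set with a given fraction of synthetic samples.

    real_samples / synthetic_samples: lists of (image_path, label) tuples.
    synthetic_ratio: fraction of the final set drawn from the synthetic pool.
    Both pools must be at least as large as the share requested from them.
    """
    n_syn = int(size * synthetic_ratio)
    batch = (random.sample(synthetic_samples, n_syn)
             + random.sample(real_samples, size - n_syn))
    random.shuffle(batch)
    return batch

# e.g. sweep ratios and retrain the detector at each one, as in the evaluation:
# for ratio in (0.1, 0.25, 0.5, 0.75, 0.9):
#     train_detector(build_training_set(real, synthetic, ratio, size=10_000))
```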
ENSO activity during the last climate cycle using IFA
NASA Astrophysics Data System (ADS)
Leduc, Guillaume; Vidal, Laurence; Thirumalai, Kaustubh
2017-04-01
The El Niño / Southern Oscillation (ENSO) is the principal mode of interannual climate variability and affects key climate parameters such as low-latitude rainfall variability. Anticipating future ENSO variability under anthropogenic forcing is vital due to its profound socioeconomic impact. Fossil corals suggest that 20th century ENSO variance is particularly high compared to other time periods of the Holocene (Cobb et al., 2013, Science), the Last Glacial Maximum (Ford et al., 2015, Science) and the last glacial period (Tudhope et al., 2001, Science). Yet, recent climate modeling experiments suggest an increase in the frequency of both El Niño (Cai et al., 2014, Nature Climate Change) and La Niña (Cai et al., 2015, Nature Climate Change) events. We have expanded an Individual Foraminifera Analysis (IFA) dataset using the thermocline-dwelling N. dutertrei in a marine core collected in the Panama Basin (Leduc et al., 2009, Paleoceanography), an approach that has proven to be a skillful way to reconstruct ENSO (Thirumalai et al., 2013, Paleoceanography). Our new IFA dataset comprehensively covers the Holocene, the last deglaciation and the Termination II (MIS5/6) time windows. We will also use previously published data from Marine Isotope Stage 3 (MIS3). Our dataset confirms variable ENSO intensity during the Holocene and weaker activity during the LGM than during the Holocene. As a next step, ENSO activity will be discussed with respect to the contrasting climatic backgrounds of the analysed time windows (millennial-scale variability, Terminations).
Phylogeny of the bears (Ursidae) based on nuclear and mitochondrial genes.
Yu, Li; Li, Qing-wei; Ryder, O A; Zhang, Ya-ping
2004-08-01
The taxonomic classification and phylogenetic relationships within the bear family have remained contentious subjects in recent years. Prior investigations have concentrated on different mitochondrial (mt) sequence data; here we employ two nuclear single-copy gene segments, the partial exon 1 of the gene encoding interphotoreceptor retinoid binding protein (IRBP) and the complete intron 1 of the transthyretin (TTR) gene, in conjunction with previously published mt data, to clarify these enigmatic problems. The combined analyses of the nuclear IRBP and TTR datasets not only corroborated prior hypotheses, positioning the spectacled bear most basally and grouping the brown and polar bear together, but also provided new insights into bear phylogeny, suggesting a sister-taxa association of the sloth bear and sun bear with strong support. Analyses based on the combination of nuclear and mt genes differed from the nuclear analysis in recognizing the sloth bear as the earliest diverging species among the ursine representatives, while the exact placement of the sun bear was not resolved. The Asiatic and American black bears clustered as a sister group in all analyses, with moderate bootstrap support and high posterior probabilities. Comparisons between the nuclear and mtDNA findings suggest that our combined nuclear dataset has resolving power comparable to the mtDNA dataset for phylogenetic interpretation of the bear family. As the present study shows, a unanimous phylogeny for this recently derived family has still not been produced, and additional independent genetic markers are needed.
Intelligent Noninvasive Diagnosis of Aneuploidy: Raw Values and Highly Imbalanced Dataset.
Neocleous, Andreas C; Nicolaides, Kypros H; Schizas, Christos N
2017-09-01
The objective of this paper is to introduce a noninvasive diagnosis procedure for aneuploidy and to minimize the social and financial cost of the prenatal diagnosis tests performed for fetal aneuploidies in an early stage of pregnancy. We propose a method using artificial neural networks trained with data from singleton pregnancy cases undergoing first trimester screening. Three different datasets with a total of 122,362 euploid and 967 aneuploid cases were used in this study. The data for each case contained markers collected from the mother and the fetus. This study differs from previous studies published by the authors on a similar problem in three basic principles: 1) the artificial neural networks are trained using the markers' values in their raw (unprocessed) form, 2) a balanced training dataset is created and used by selecting only a representative number of euploids for the training phase, and 3) emphasis is given to the financial aspects, suggesting a hierarchy and the necessity of the available tests. The proposed artificial neural network models were optimized to reach a minimum false positive rate while securing a 100% detection rate for Trisomy 21. These systems correctly identify other aneuploidies (Trisomies 13 and 18, Turner, and Triploid syndromes) at a detection rate greater than 80%. In conclusion, we demonstrate that artificial neural network systems can contribute to providing noninvasive, effective early screening for fetal aneuploidies, with results that compare favorably to other existing methods.
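Principle 2 above, a balanced training set built by selecting only a representative number of euploids, can be illustrated with simple majority-class undersampling; this sketch is a generic stand-in, not the authors' selection procedure.

```python
import numpy as np

def balance_by_undersampling(X, y, rng=np.random.default_rng(0)):
    """Undersample the euploid majority class to match the aneuploid minority.

    X: (n_samples, n_markers) raw marker values; y: 0 = euploid, 1 = aneuploid.
    """
    minority = np.flatnonzero(y == 1)   # aneuploid cases, all kept
    majority = np.flatnonzero(y == 0)   # euploid cases, subsampled
    keep = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```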
Spherical: an iterative workflow for assembling metagenomic datasets.
Hitch, Thomas C A; Creevey, Christopher J
2018-01-24
The consensus emerging from the study of microbiomes is that they are far more complex than previously thought, requiring better assemblies and increasingly deeper sequencing. However, current metagenomic assembly techniques regularly fail to incorporate all, or in some cases even the majority, of the sequence information generated for many microbiomes, negating this effort. This can bias the information gathered and, in particular, the perceived importance of the minor taxa in a microbiome. We propose a simple but effective approach, implemented in Python, to address this problem. Based on an iterative methodology, our workflow (called Spherical) carries out successive rounds of assembly with the sequencing reads not yet utilised. This approach also allows the user to reduce the resources required for very large datasets by assembling random subsets of the whole in a "divide and conquer" manner. We demonstrate the accuracy of Spherical using simulated data based on completely sequenced genomes, and the effectiveness of the workflow at retrieving lost information for taxa in three published metagenomics studies of varying sizes. Our results show that Spherical increased the number of reads utilised in the assembly by up to 109% compared to the base assembly. The additional contigs assembled by the Spherical workflow resulted in significant (P < 0.05) changes in the predicted taxonomic profile of all datasets analysed. Spherical is implemented in Python 2.7 and freely available under the MIT license. Source code and documentation are hosted publicly at: https://github.com/thh32/Spherical .
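The iterative idea, assemble, map the reads back, and re-assemble whatever did not map, can be sketched as below. `assemble` and `map_reads` are hypothetical caller-supplied wrappers (e.g. around an assembler and an aligner), not Spherical's actual interface.

```python
def iterative_assembly(reads, assemble, map_reads, max_rounds=5):
    """Iterative assembly in the spirit of Spherical (a sketch, not the tool).

    assemble(reads) -> list of contigs.
    map_reads(reads, contigs) -> the subset of reads that did NOT map.
    Each round assembles only the reads not yet represented by earlier contigs.
    """
    all_contigs = []
    for _ in range(max_rounds):
        contigs = assemble(reads)
        all_contigs.extend(contigs)
        unmapped = map_reads(reads, contigs)
        if not unmapped or len(unmapped) == len(reads):
            break  # everything incorporated, or no progress this round
        reads = unmapped
    return all_contigs
```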
Chase, Mark W.; Kim, Joo-Hwan
2013-01-01
Phylogenetic analysis aims to produce a bifurcating tree, which disregards conflicting signals and displays only those present in a large proportion of the data. However, any character (or tree) conflict in a dataset allows the exploration of support for various evolutionary hypotheses. Although data-display network approaches exist, biologists cannot easily and routinely use them to compute rooted phylogenetic networks on real datasets containing hundreds of taxa. Here, we constructed an original neighbour-net for a large dataset of Asparagales to highlight the aspects of the resulting network that are important for interpreting the phylogeny. The analyses were largely conducted with new data collected for the same loci as in previous studies, but from different species accessions, and in many cases with greater sampling than in published analyses. The network summarised the majority data pattern in the plastid sequence characters before tree building, which largely confirmed the currently recognised phylogenetic relationships. Most conflicting signals lie at the base of each group along the Asparagales backbone, which helps us to set expectations and advance our understanding of some difficult taxonomic relationships. The network method should play a greater role in phylogenetic analyses than it has in the past. To advance understanding of the evolutionary history of Asparagales, the largest order of monocots, absolute diversification times were estimated for family-level clades using relaxed molecular clock analyses. PMID:23544071
An update on sORFs.org: a repository of small ORFs identified by ribosome profiling.
Olexiouk, Volodimir; Van Criekinge, Wim; Menschaert, Gerben
2018-01-04
sORFs.org (http://www.sorfs.org) is a public repository of small open reading frames (sORFs) identified by ribosome profiling (RIBO-seq). This update elaborates on the major improvements implemented since its initial release. sORFs.org now additionally supports three more species (zebrafish, rat and Caenorhabditis elegans) and currently includes 78 RIBO-seq datasets, a vast increase compared to the three processed in the initial release. To enable this, a novel pipeline was constructed that also supports sORF detection in RIBO-seq datasets comprising solely elongating RIBO-seq data, whereas previously matching initiating RIBO-seq data was necessary to delineate the sORFs. Furthermore, a novel noise filtering algorithm was designed that distinguishes sORFs with true ribosomal activity from simulated noise, consequently reducing the false positive identification rate. The inclusion of other species also led to the development of an internal BLAST pipeline assessing sequence similarity between sORFs in the repository. Building on the proof-of-concept model in the initial release of sORFs.org, a full PRIDE-ReSpin pipeline has now been released, reprocessing publicly available MS-based proteomics PRIDE datasets and reporting on true translation events. Next to reporting those identified peptides, sORFs.org allows visual inspection of the annotated spectra within the Lorikeet MS/MS viewer, enabling detailed manual inspection and interpretation. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
An Integrated Systems Genetics and Omics Toolkit to Probe Gene Function.
Li, Hao; Wang, Xu; Rukina, Daria; Huang, Qingyao; Lin, Tao; Sorrentino, Vincenzo; Zhang, Hongbo; Bou Sleiman, Maroun; Arends, Danny; McDaid, Aaron; Luan, Peiling; Ziari, Naveed; Velázquez-Villegas, Laura A; Gariani, Karim; Kutalik, Zoltan; Schoonjans, Kristina; Radcliffe, Richard A; Prins, Pjotr; Morgenthaler, Stephan; Williams, Robert W; Auwerx, Johan
2018-01-24
Identifying genetic and environmental factors that impact complex traits and common diseases is a high biomedical priority. Here, we developed, validated, and implemented a series of multi-layered systems approaches, including (expression-based) phenome-wide association, transcriptome-/proteome-wide association, and (reverse-) mediation analysis, in an open-access web server (systems-genetics.org) to expedite the systems dissection of gene function. We applied these approaches to multi-omics datasets from the BXD mouse genetic reference population, and identified and validated associations between genes and clinical and molecular phenotypes, including previously unreported links between Rpl26 and body weight, and Cpt1a and lipid metabolism. Furthermore, through mediation and reverse-mediation analysis we established regulatory relations between genes, such as the co-regulation of BCKDHA and BCKDHB protein levels, and identified targets of transcription factors E2F6, ZFP277, and ZKSCAN1. Our multifaceted toolkit enabled the identification of gene-gene and gene-phenotype links that are robust and that translate well across populations and species, and can be universally applied to any populations with multi-omics datasets. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Geological applications of machine learning on hyperspectral remote sensing data
NASA Astrophysics Data System (ADS)
Tse, C. H.; Li, Yi-liang; Lam, Edmund Y.
2015-02-01
The CRISM imaging spectrometer orbiting Mars has been producing a vast amount of data in the visible to infrared wavelengths in the form of hyperspectral data cubes. These data, compared with those obtained from previous remote sensing techniques, yield an unprecedented level of spectral detail in addition to an ever-increasing level of spatial information. A major challenge brought about by these data is the burden of processing and interpreting the datasets and extracting the relevant information from them. This research approaches the challenge by exploring machine learning methods, especially unsupervised learning, to achieve cluster density estimation and classification, and ultimately to devise an efficient means of identifying minerals. A set of software tools was constructed in Python to access and experiment with CRISM hyperspectral cubes selected from two specific Mars locations. A machine learning pipeline is proposed and unsupervised learning methods were applied to pre-processed datasets. The resulting data clusters are compared with the published ASTER spectral library and browse data products from the Planetary Data System (PDS). The results demonstrate that this approach is capable of processing the huge volume of hyperspectral data and can potentially guide scientists towards more detailed studies.
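A minimal sketch of the unsupervised-clustering step on a hyperspectral cube, using scikit-learn k-means on a synthetic stand-in cube; the actual pre-processing, cluster model and number of clusters are not specified by the abstract and are assumed here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cube = rng.random((64, 64, 128))      # stand-in for a CRISM cube: (rows, cols, bands)
rows, cols, bands = cube.shape
pixels = cube.reshape(-1, bands)      # one spectrum per pixel

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
cluster_map = kmeans.labels_.reshape(rows, cols)   # per-pixel cluster assignment

# Mean cluster spectra can then be matched against library spectra (e.g. ASTER)
mean_spectra = np.array([pixels[kmeans.labels_ == k].mean(axis=0) for k in range(8)])
```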
Atkinson, Jonathan A; Lobet, Guillaume; Noll, Manuel; Meyer, Patrick E; Griffiths, Marcus; Wells, Darren M
2017-10-01
Genetic analyses of plant root systems require large datasets of extracted architectural traits. To quantify such traits from images of root systems, researchers often have to choose between automated tools (that are prone to error and extract only a limited number of architectural traits) or semi-automated ones (that are highly time consuming). We trained a Random Forest algorithm to infer architectural traits from automatically extracted image descriptors. The training was performed on a subset of the dataset, then applied to its entirety. This strategy allowed us to (i) decrease the image analysis time by 73% and (ii) extract meaningful architectural traits based on image descriptors. We also show that these traits are sufficient to identify the quantitative trait loci that had previously been discovered using a semi-automated method. We have shown that combining semi-automated image analysis with machine learning algorithms has the power to increase the throughput of large-scale root studies. We expect that such an approach will enable the quantification of more complex root systems for genetic studies. We also believe that our approach could be extended to other areas of plant phenotyping. © The Authors 2017. Published by Oxford University Press.
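A sketch of the strategy described above: train a Random Forest on the annotated subset of image descriptors, then apply it to the rest of the dataset. Data here are synthetic stand-ins; the real descriptors, traits and model settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))   # automatically extracted image descriptors (stand-in)
y = X[:, :5] @ rng.normal(size=(5, 3)) + rng.normal(scale=0.1, size=(300, 3))  # 3 traits

# Train on the semi-automatically annotated subset, hold some out for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
print(rf.score(X_test, y_test))   # R^2 on held-out images
# rf.predict(...) would then infer traits for the unannotated remainder
```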
Automated quantification of surface water inundation in wetlands using optical satellite imagery
DeVries, Ben; Huang, Chengquan; Lang, Megan W.; Jones, John W.; Huang, Wenli; Creed, Irena F.; Carroll, Mark L.
2017-01-01
We present a fully automated and scalable algorithm for quantifying surface water inundation in wetlands. Requiring no external training data, our algorithm estimates sub-pixel water fraction (SWF) over large areas and long time periods using Landsat data. We tested our SWF algorithm over three wetland sites across North America, including the Prairie Pothole Region, the Delmarva Peninsula and the Everglades, representing a gradient of inundation and vegetation conditions. We estimated SWF at 30-m resolution with accuracies ranging from a normalized root-mean-square-error of 0.11 to 0.19 when compared with various high-resolution ground and airborne datasets. SWF estimates were more sensitive to subtle inundated features compared to previously published surface water datasets, accurately depicting water bodies, large heterogeneously inundated surfaces, narrow water courses and canopy-covered water features. Despite this enhanced sensitivity, several sources of errors affected SWF estimates, including emergent or floating vegetation and forest canopies, shadows from topographic features, urban structures and unmasked clouds. The automated algorithm described in this article allows for the production of high temporal resolution wetland inundation data products to support a broad range of applications.
Antibiotic Resistome: Improving Detection and Quantification Accuracy for Comparative Metagenomics.
Elbehery, Ali H A; Aziz, Ramy K; Siam, Rania
2016-04-01
The unprecedented rise of life-threatening antibiotic resistance (AR), combined with the unparalleled advances in DNA sequencing of genomes and metagenomes, has intensified the need for in silico detection of the resistance potential of clinical and environmental metagenomic samples through the quantification of AR genes (i.e., genes conferring antibiotic resistance). Therefore, determining an optimal methodology to quantitatively and accurately assess AR genes in a given environment is pivotal. Here, we optimized and improved existing AR detection methodologies from metagenomic datasets to properly consider AR-generating mutations in antibiotic target genes. Through comparative metagenomic analysis of previously published AR gene abundance in three publicly available metagenomes, we illustrate how mutation-generated resistance genes are either falsely assigned or neglected, which alters the detection and quantitation of the antibiotic resistome. In addition, we inspected factors influencing the outcome of AR gene quantification using metagenome simulation experiments, and identified that genome size, AR gene length, the total number of metagenomic reads and the sequencing platform selected had pronounced effects on the level of detected AR. In conclusion, our proposed improvements in the current methodologies for accurate AR detection and resistome assessment show reliable results when tested on real and simulated metagenomic datasets.
NASA Astrophysics Data System (ADS)
Aldeen Yousra, S.; Mazleena, Salleh
2018-05-01
Recent advancements in Information and Communication Technologies (ICT) have increased the demand for cloud services that share users' private data. Data from various organizations are a vital information source for analysis and research. Generally, this sensitive or private information involves medical, census, voter registration, social network, and customer services data. A primary concern of cloud service providers in data publishing is to hide the sensitive information of individuals. One of the cloud services that addresses these confidentiality concerns is Privacy Preserving Data Mining (PPDM). The PPDM service in Cloud Computing (CC) enables data publishing with minimized distortion and absolute privacy. In this method, datasets are anonymized via generalization to meet the privacy requirements. However, the well-known privacy preserving data mining technique K-anonymity suffers from several limitations. To surmount those shortcomings, we propose a new heuristic anonymization framework for preserving the privacy of sensitive datasets when publishing on the cloud. The advantages of the K-anonymity, L-diversity and (α, k)-anonymity methods for efficient information utilization and privacy protection are emphasized. Experimental results revealed that the developed technique outperforms the K-anonymity, L-diversity, and (α, k)-anonymity measures.
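The k-anonymity property that generalization is meant to achieve can be checked directly; this sketch (with a hypothetical toy table and a simple age-banding generalization) shows the test itself, not the proposed heuristic framework.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers, k: int) -> bool:
    """True if every combination of quasi-identifier values occurs >= k times."""
    return df.groupby(quasi_identifiers).size().min() >= k

# Hypothetical records: generalize exact ages into 10-year bands
df = pd.DataFrame({"age": [23, 24, 27, 31, 33, 35],
                   "zip": ["1201", "1201", "1201", "1305", "1305", "1305"]})
df["age_band"] = (df["age"] // 10) * 10   # generalization step
print(is_k_anonymous(df, ["age_band", "zip"], k=3))  # True for this toy table
```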
panMetaDocs and DataSync - providing a convenient way to share and publish research data
NASA Astrophysics Data System (ADS)
Ulbricht, D.; Klump, J. F.
2013-12-01
In recent years, research institutions, geological surveys and funding organizations have started to build infrastructures to facilitate the re-use of research data from previous work. At present, several intermeshed activities are coordinated to make data systems of the earth sciences interoperable and recorded data discoverable. Driven by governmental authorities, ISO19115/19139 emerged as metadata standards for the discovery of data and services. Established metadata transport protocols like OAI-PMH and OGC-CSW are used to disseminate metadata to data portals. With persistent identifiers like DOI and IGSN, research data and corresponding physical samples can be given unambiguous names and thus become citable. In summary, these activities focus primarily on 'ready to give away' data, already stored in an institutional repository and described with appropriate metadata. Many datasets are not 'born' in this state but are produced in small and federated research projects. To make access and reuse of these 'small data' easier, the data should be centrally stored and version controlled from the very beginning of activities. We developed DataSync [1] as a supplemental application to the panMetaDocs [2] data exchange platform and as a data management tool for small science projects. DataSync is a Java application that runs on a local computer and synchronizes directory trees into an eSciDoc repository [3] by creating eSciDoc objects via eSciDoc's REST API. DataSync can be installed on multiple computers and is thus able to synchronize the files of a research team over the internet. XML metadata can be added as separate files that are managed together with data files as versioned eSciDoc objects. A project-customized instance of panMetaDocs provides a web-based overview of the previously uploaded file collection and allows further annotation with metadata inside the eSciDoc repository. panMetaDocs is a PHP-based web application that assists the creation of metadata in any XML-based metadata schema. To reduce manual entry of metadata to a minimum and make use of contextual information in a project setting, metadata fields can be populated with static or dynamic content. Access rights can be defined to control visibility of and access to stored objects. Notifications about recently updated datasets are available by RSS and e-mail, and the entire inventory can be harvested via OAI-PMH. panMetaDocs is optimized to be harvested by panFMP [4]. panMetaDocs is able to mint dataset DOIs through DataCite and uses eSciDoc's REST API to transfer eSciDoc objects from a non-public 'pending' status to the published status 'released', which makes the data and metadata of the published object available worldwide through the internet. The application scenario presented here shows the adoption of open source applications for data sharing and publication. An eSciDoc repository is used as storage for data and metadata. DataSync serves as a file ingester and distributor, whereas panMetaDocs' main function is to annotate the dataset files with metadata to make them ready for publication and sharing with one's own team or with the scientific community.
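A heavily simplified sketch of a DataSync-style synchronization loop: walk a directory tree, skip unchanged files by checksum, upload the rest. The endpoint URL is hypothetical, and this is not eSciDoc's actual REST API, only the general pattern.

```python
import hashlib
from pathlib import Path
import requests

REPO_URL = "https://repository.example.org/objects"  # hypothetical endpoint

def sync_tree(root: Path, known_hashes: dict) -> dict:
    """Upload new or changed files from a directory tree (sketch only)."""
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if known_hashes.get(str(path)) == digest:
            continue  # unchanged since the last sync round
        # A real client would create a versioned repository object here
        requests.put(f"{REPO_URL}/{path.name}", data=path.read_bytes())
        known_hashes[str(path)] = digest
    return known_hashes
```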
NASA Astrophysics Data System (ADS)
Collins, M. S.; Hertzberg, J. E.; Mekik, F.; Schmidt, M. W.
2017-12-01
Based on the thermodynamics of solid-solution substitution of Mg for Ca in biogenic calcite, magnesium to calcium ratios in planktonic foraminifera have been proposed as a means by which variations in habitat water temperatures can be reconstructed. Doing this accurately has been a problem, however: we demonstrate that various calibration equations provide disparate temperature estimates from the same Mg/Ca dataset. We examined both new and published data to derive a globally applicable temperature-Mg/Ca relationship and, from this relationship, to accurately predict habitat depth for Neogloboquadrina dutertrei, a deep chlorophyll maximum dweller. N. dutertrei samples collected from Atlantic core tops were analyzed for trace element compositions at Texas A&M University, and the measured Mg/Ca ratios were used to predict habitat temperatures using multiple pre-existing calibration equations. When combining the Atlantic and previously published Pacific Mg/Ca datasets for N. dutertrei, a notable dissolution effect was evident. To overcome this issue, we used the G. menardii Fragmentation Index (MFI) to account for dissolution and generated a multi-basin temperature equation using multiple linear regression to predict habitat temperature. However, the correlations between Mg/Ca and temperature, as well as the calculated MFI percent dissolved, suggest that N. dutertrei Mg/Ca ratios are affected equally by both variables. While correcting for dissolution makes habitat depth estimation more accurate, the lack of a definitively strong correlation between Mg/Ca and temperature is likely an effect of variable habitat depth for this species, because most calibration equations have assumed a uniform habitat depth for this taxon.
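For orientation, planktonic foraminiferal Mg/Ca palaeothermometry commonly assumes an exponential calibration of the form below; the constants A and B are species- and study-specific, which is precisely why different calibration equations yield disparate temperatures from the same measurements.

```latex
% Commonly used exponential Mg/Ca calibration (A, B illustrative, study-specific)
\[
  \mathrm{Mg/Ca} = B \, e^{A T}
  \quad\Longrightarrow\quad
  T = \frac{1}{A} \ln\!\left(\frac{\mathrm{Mg/Ca}}{B}\right)
\]
```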
Pharmacists' interventions on intravenous to oral conversion for potassium.
Charpiat, B; Bedouch, P; Conort, O; Juste, M; Rose, F X; Roubille, R; Allenet, B
2014-06-01
Guidelines recommend use of the oral route whenever possible to treat or prevent hypokalemia. Although a myriad of papers have been published on intravenous to oral (IV to PO) therapy conversion programs and on clinical pharmacy services provided in hospitals, little is known about the role of hospital pharmacists in promoting the oral route for potassium administration. The aim of this work was to describe the frequency of interventions related to IV to PO potassium therapy conversions performed by hospital pharmacists. Setting: French hospitals recording pharmacists' interventions with the web-based tool of the French Society of Clinical Pharmacy. From the recorded pharmacists' interventions (PI) dataset, we extracted all interventions related to potassium IV to PO conversion, assessed their acceptance rate by prescribers, and analysed the additional free-text information in the dataset. The IV to PO potassium therapy conversions related to potassium chloride. From January 2007 to December 2010, 87 hospitals recorded 1,868 PIs concerning IV to PO therapy conversion. Among these, 16 (<1 %) concerned potassium chloride, recorded by four hospitals (4.6 %) with 12, 2, 1 and 1 PIs respectively. Six PIs were accepted by physicians and the prescriptions were modified. PIs to promote the administration of potassium by the oral route are thus extremely rare. Our results, together with the scarce previously published data, reveal that this field of practice remains almost unexplored. These findings highlight an important gap in intravenous to oral therapy programs. This situation must be regarded as unsatisfactory and should lead to setting up more education and research programs.
Evolutionary Wavelet Neural Network ensembles for breast cancer and Parkinson's disease prediction.
Khan, Maryam Mahsal; Mendes, Alexandre; Chalup, Stephan K
2018-01-01
Wavelet Neural Networks are a combination of neural networks and wavelets and have been mostly used in the area of time-series prediction and control. Recently, Evolutionary Wavelet Neural Networks have been employed to develop cancer prediction models. The present study proposes to use ensembles of Evolutionary Wavelet Neural Networks. The search for a high quality ensemble is directed by a fitness function that incorporates the accuracy of the classifiers both independently and as part of the ensemble itself. The ensemble approach is tested on three publicly available biomedical benchmark datasets, one on Breast Cancer and two on Parkinson's disease, using a 10-fold cross-validation strategy. Our experimental results show that, for the first dataset, the performance was similar to previous studies reported in literature. On the second dataset, the Evolutionary Wavelet Neural Network ensembles performed better than all previous methods. The third dataset is relatively new and this study is the first to report benchmark results.
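One plausible form of the fitness function described above, rewarding members that are accurate both individually and as a voting ensemble, is sketched here; the weighting scheme is an assumption, not the paper's exact formula.

```python
import numpy as np

def ensemble_fitness(member_preds, y_true, alpha=0.5):
    """Fitness combining individual accuracy and majority-vote ensemble accuracy.

    member_preds: (n_members, n_samples) array of 0/1 class predictions.
    alpha weights mean individual accuracy against ensemble accuracy (assumed).
    """
    individual = (member_preds == y_true).mean(axis=1)   # per-member accuracy
    votes = member_preds.mean(axis=0) >= 0.5             # majority vote
    ensemble = (votes.astype(int) == y_true).mean()
    return alpha * individual.mean() + (1 - alpha) * ensemble

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
preds = rng.integers(0, 2, size=(5, 100))   # stand-in predictions of 5 members
print(ensemble_fitness(preds, y))
```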
Bovendorp, Ricardo S; Villar, Nacho; de Abreu-Junior, Edson F; Bello, Carolina; Regolin, André L; Percequillo, Alexandre R; Galetti, Mauro
2017-08-01
The contribution of small mammal ecology to the understanding of macroecological patterns of biodiversity, population dynamics, and community assembly has been hindered by the absence of large datasets of small mammal communities from tropical regions. Here we compile the largest dataset of inventories of small mammal communities for the Neotropical region. The dataset reviews small mammal communities from the Atlantic forest of South America, one of the regions with the highest diversity of small mammals and a global biodiversity hotspot, though currently covering less than 12% of its original area due to anthropogenic pressures. The dataset comprises 136 references from 300 locations covering seven vegetation types of tropical and subtropical Atlantic forests of South America, and presents data on species composition, richness, and relative abundance (captures/trap-nights). One paper was published more than 70 yr ago, but 80% of them were published after 2000. The dataset comprises 53,518 individuals of 124 species of small mammals, including 30 species of marsupials and 94 species of rodents. Species richness averaged 8.2 species (1-21) per site. Only two species occurred in more than 50% of the sites (the common opossum, Didelphis aurita and black-footed pigmy rice rat Oligoryzomys nigripes). Mean species abundance varied 430-fold, from 4.3 to 0.01 individuals/trap-night. The dataset also revealed a hyper-dominance of 22 species that comprised 78.29% of all individuals captured, with only seven species representing 44% of all captures. The information contained on this dataset can be applied in the study of macroecological patterns of biodiversity, communities, and populations, but also to evaluate the ecological consequences of fragmentation and defaunation, and predict disease outbreaks, trophic interactions and community dynamics in this biodiversity hotspot. © 2017 by the Ecological Society of America.
Flux Tower Eddy Covariance and Meteorological Measurements for Barrow, Alaska: 2012-2016
Dengel, Sigrid; Torn, Margaret; Billesbach, David
2017-08-24
The dataset contains half-hourly eddy covariance flux measurements and determinations, companion meteorological measurements, and ancillary data from the flux tower (US-NGB) on the Barrow Environmental Observatory at Barrow (Utqiagvik), Alaska for the period 2012 through 2016. Data have been processed using EddyPro software and screened by the contributor. The flux tower sits in an Arctic coastal tundra ecosystem. This dataset updates a previous dataset by reprocessing a longer period of record in the same manner. Related dataset "Eddy-Covariance and auxiliary measurements, NGEE-Barrow, 2012-2013" DOI:10.5440/1124200.
Retrospective analysis of natural products provides insights for future discovery trends
Pye, Cameron R.; Bertin, Matthew J.; Lokey, R. Scott; Gerwick, William H.
2017-01-01
Understanding of the capacity of the natural world to produce secondary metabolites is important to a broad range of fields, including drug discovery, ecology, biosynthesis, and chemical biology, among others. Both the absolute number and the rate of discovery of natural products have increased significantly in recent years. However, there is a perception and concern that the fundamental novelty of these discoveries is decreasing relative to previously known natural products. This study presents a quantitative examination of the field from the perspective of both number of compounds and compound novelty using a dataset of all published microbial and marine-derived natural products. This analysis aimed to explore a number of key questions, such as how the rate of discovery of new natural products has changed over the past decades, how the average natural product structural novelty has changed as a function of time, whether exploring novel taxonomic space affords an advantage in terms of novel compound discovery, and whether it is possible to estimate how close we are to having described all of the chemical space covered by natural products. Our analyses demonstrate that most natural products being published today bear structural similarity to previously published compounds, and that the range of scaffolds readily accessible from nature is limited. However, the analysis also shows that the field continues to discover appreciable numbers of natural products with no structural precedent. Together, these results suggest that the development of innovative discovery methods will continue to yield compounds with unique structural and biological properties. PMID:28461474
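A common way to quantify the structural novelty discussed above is the maximum fingerprint similarity of a new compound to previously published ones; this RDKit sketch illustrates that proxy (the measure actually used in the study may differ).

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def max_similarity_to_prior(smiles, prior_fps):
    """Novelty proxy: highest Tanimoto similarity to earlier published compounds.

    Values near 1 indicate a close structural precedent; low values, a novel scaffold.
    """
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)
    return max(DataStructs.TanimotoSimilarity(fp, p) for p in prior_fps)

# Toy prior library of two published structures (hypothetical SMILES)
prior = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
         for s in ["CCO", "c1ccccc1O"]]
print(max_similarity_to_prior("c1ccccc1CO", prior))
```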
Differential privacy based on importance weighting
Ji, Zhanglong
2014-01-01
This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights are importance sampling weights, but to protect privacy, they are regularized and have noise added. The weights allow statistical queries to be answered approximately while provably guaranteeing differential privacy. We derive an expression for the asymptotic variance of the approximate answers. Experiments show that the new mechanism performs well even when the privacy budget is small, and when the public and private datasets are drawn from different populations. PMID:24482559
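A rough sketch of the mechanism's ingredients as described: density-ratio importance weights from a probabilistic classifier, regularization by clipping, and added Laplace noise. The noise scale and budget split are assumptions, and a full treatment would also need to privatize the classifier fit itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def private_importance_weights(public_X, private_X, epsilon, clip=10.0,
                               rng=np.random.default_rng(0)):
    """Sketch: weights that make public data resemble the private data."""
    X = np.vstack([public_X, private_X])
    y = np.r_[np.zeros(len(public_X)), np.ones(len(private_X))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)  # not itself privatized here
    p = clf.predict_proba(public_X)[:, 1]
    w = np.clip(p / np.maximum(1 - p, 1e-12), 0, clip)  # regularized weights
    # Clipping bounds each weight's sensitivity; Laplace noise gives privacy
    return w + rng.laplace(scale=clip / epsilon, size=w.size)

pub = np.random.default_rng(1).normal(size=(500, 4))
priv = pub[:200] + 0.1   # hypothetical private sample from a shifted population
print(private_importance_weights(pub, priv, epsilon=1.0)[:5])
```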
Prevalence of deleterious ATM germline mutations in gastric cancer patients.
Huang, Dong-Sheng; Tao, Hou-Quan; He, Xu-Jun; Long, Ming; Yu, Sheng; Xia, Ying-Jie; Wei, Zhang; Xiong, Zikai; Jones, Sian; He, Yiping; Yan, Hai; Wang, Xiaoyue
2015-12-01
Besides CDH1, few hereditary gastric cancer predisposition genes have been reported previously. In this study, we discovered two germline ATM mutations (p.Y1203fs and p.N1223S) in a Chinese family with a history of gastric cancer by screening 83 cancer susceptibility genes. Using a published exome sequencing dataset, we found deleterious germline mutations of ATM in 2.7% of 335 gastric cancer patients of different ethnic origins. The frequency of deleterious ATM mutations in gastric cancer patients is significantly higher than that in the general population (p=0.0000435), suggesting an association of ATM mutations with gastric cancer predisposition. We also observed biallelic inactivation of ATM in the tumors of two gastric cancer patients. Further evaluation of ATM mutations in hereditary gastric cancer will facilitate genetic testing and risk assessment.
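The reported enrichment can be illustrated with a one-sided binomial test; the background carrier frequency below is a hypothetical placeholder (the paper's population comparison values are not given here), so the printed p-value is not expected to reproduce theirs.

```python
from scipy.stats import binomtest

# 2.7% of 335 patients is ~9 carriers; the background rate is an assumed placeholder
k, n = 9, 335
background_rate = 0.003  # hypothetical general-population frequency, illustration only
print(binomtest(k, n, background_rate, alternative="greater").pvalue)
```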
Caritat, Patrice de; Reimann, Clemens; Smith, David; Wang, Xueqiu
2017-01-01
During the last 10-20 years, Geological Surveys around the world have undertaken a major effort towards delivering fully harmonized and tightly quality-controlled low-density multi-element soil geochemical maps and datasets of vast regions including up to whole continents. Concentrations of between 45 and 60 elements commonly have been determined in a variety of different regolith types (e.g., sediment, soil). The multi-element datasets are published as complete geochemical atlases and made available to the general public. Several other geochemical datasets covering smaller areas but generally at a higher spatial density are also available. These datasets may, however, not be found by superficial internet-based searches because the elements are not mentioned individually either in the title or in the keyword lists of the original references. This publication attempts to increase the visibility and discoverability of these fundamental background datasets covering large areas up to whole continents.
Toward a complete dataset of drug-drug interaction information from publicly available sources.
Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D
2015-06-01
Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources, including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and by specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. Future work will focus on improving the methods for mapping between PDDI information sources, identifying methods to improve the use of the merged dataset in PDDI NLP algorithms, integrating high-quality PDDI information from the merged dataset into Wikidata, and making the combined dataset accessible as Semantic Web Linked Data. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
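The overlap analysis described above amounts to comparing sets of normalized drug-pair keys across sources; a minimal Jaccard sketch follows (the source dictionary is a hypothetical stand-in, not the merged dataset's schema).

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two sources' sets of normalized drug-pair keys."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pairwise_overlap(pddi_sources: dict) -> dict:
    """Jaccard overlap for every pair of sources, keyed by (source_x, source_y)."""
    names = sorted(pddi_sources)
    return {(x, y): jaccard(pddi_sources[x], pddi_sources[y])
            for i, x in enumerate(names) for y in names[i + 1:]}

# Toy example: frozensets make drug pairs order-independent
sources = {"A": {frozenset({"warfarin", "aspirin"})},
           "B": {frozenset({"warfarin", "aspirin"}), frozenset({"warfarin", "amiodarone"})}}
print(pairwise_overlap(sources))   # {('A', 'B'): 0.5}
```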
The phylogeny and evolutionary history of tyrannosauroid dinosaurs.
Brusatte, Stephen L; Carr, Thomas D
2016-02-02
Tyrannosauroids--the group of carnivores including Tyrannosaurus rex--are some of the most familiar dinosaurs of all. A surge of recent discoveries has helped clarify some aspects of their evolution, but competing phylogenetic hypotheses raise questions about their relationships, biogeography, and fossil record quality. We present a new phylogenetic dataset, which merges published datasets and incorporates recently discovered taxa. We analyze it with parsimony and, for the first time for a tyrannosauroid dataset, Bayesian techniques. The parsimony and Bayesian results are highly congruent, and provide a framework for interpreting the biogeography and evolutionary history of tyrannosauroids. Our phylogenies illustrate that the body plan of the colossal species evolved piecemeal, imply no clear division between northern and southern species in western North America as had been argued, and suggest that T. rex may have been an Asian migrant to North America. Over-reliance on cranial shape characters may explain why published parsimony studies have diverged, and filling three major gaps in the fossil record holds the most promise for future work.
The metagenomic data life-cycle: standards and best practices
ten Hoopen, Petra; Finn, Robert D.; Bongo, Lars Ailo; Corre, Erwan; Meyer, Folker; Mitchell, Alex; Pelletier, Eric; Pesole, Graziano; Santamaria, Monica; Willassen, Nils Peder
2017-01-01
Metagenomics data analyses from independent studies can only be compared if the analysis workflows are described in a harmonized way. In this overview, we have mapped the landscape of data standards available for the description of essential steps in metagenomics: (i) material sampling, (ii) material sequencing, (iii) data analysis, and (iv) data archiving and publishing. Taking examples from marine research, we summarize essential variables used to describe material sampling processes and sequencing procedures in a metagenomics experiment. These aspects of metagenomics dataset generation have been to some extent addressed by the scientific community, but greater awareness and adoption is still needed. We emphasize the lack of standards relating to reporting how metagenomics datasets are analysed and how the metagenomics data analysis outputs should be archived and published. We propose best practice as a foundation for a community standard to enable reproducibility and better sharing of metagenomics datasets, leading ultimately to greater metagenomics data reuse and repurposing. PMID:28637310
Struwe, Weston B; Agravat, Sanjay; Aoki-Kinoshita, Kiyoko F; Campbell, Matthew P; Costello, Catherine E; Dell, Anne; Ten Feizi; Haslam, Stuart M; Karlsson, Niclas G; Khoo, Kay-Hooi; Kolarich, Daniel; Liu, Yan; McBride, Ryan; Novotny, Milos V; Packer, Nicolle H; Paulson, James C; Rapp, Erdmann; Ranzinger, Rene; Rudd, Pauline M; Smith, David F; Tiemeyer, Michael; Wells, Lance; York, William S; Zaia, Joseph; Kettner, Carsten
2016-09-01
The minimum information required for a glycomics experiment (MIRAGE) project was established in 2011 to provide guidelines to aid in data reporting from all types of experiments in glycomics research including mass spectrometry (MS), liquid chromatography, glycan arrays, data handling and sample preparation. MIRAGE is a concerted effort of the wider glycomics community that considers the adaptation of reporting guidelines as an important step towards critical evaluation and dissemination of datasets as well as broadening of experimental techniques worldwide. The MIRAGE Commission published reporting guidelines for MS data and here we outline guidelines for sample preparation. The sample preparation guidelines include all aspects of sample generation, purification and modification from biological and/or synthetic carbohydrate material. The application of MIRAGE sample preparation guidelines will lead to improved recording of experimental protocols and reporting of understandable and reproducible glycomics datasets. © The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Juicebox.js Provides a Cloud-Based Visualization System for Hi-C Data.
Robinson, James T; Turner, Douglass; Durand, Neva C; Thorvaldsdóttir, Helga; Mesirov, Jill P; Aiden, Erez Lieberman
2018-02-28
Contact mapping experiments such as Hi-C explore how genomes fold in 3D. Here, we introduce Juicebox.js, a cloud-based web application for exploring the resulting datasets. Like the original Juicebox application, Juicebox.js allows users to zoom in and out of such datasets using an interface similar to Google Earth. Juicebox.js also has many features designed to facilitate data reproducibility and sharing. Furthermore, Juicebox.js encodes the exact state of the browser in a shareable URL. Creating a public browser for a new Hi-C dataset does not require coding and can be accomplished in under a minute. The web app also makes it possible to create interactive figures online that can complement or replace ordinary journal figures. When combined with Juicer, this makes the entire process of data analysis transparent, insofar as every step from raw reads to published figure is publicly available as open source code. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.
Exploring Global Exposure Factors Resources URLs
The dataset is a compilation of hyperlinks (URLs) for resources (databases, compendia, published articles, etc.) useful for exposure assessment specific to consumer product use. This dataset is associated with the following publication: Zaleski, R., P. Egeghy, and P. Hakkinen. Exploring Global Exposure Factors Resources for Use in Consumer Exposure Assessments. International Journal of Environmental Research and Public Health. Molecular Diversity Preservation International, Basel, SWITZERLAND, 13(7): 744, (2016).
Concentrations of different polyaromatic hydrocarbons in water before and after interaction with nanomaterials. The results show the capacity of engineered nanomaterials for adsorbing different organic pollutants. This dataset is associated with the following publication: Sahle-Demessie, E., A. Zhao, C. Han, B. Hann, and H. Grecsek. Interaction of engineered nanomaterials with hydrophobic organic pollutants. Journal of Nanotechnology. Hindawi Publishing Corporation, New York, NY, USA, 27(28): 284003, (2016).
Loftus, Stacie K
2018-05-01
The number of melanocyte- and melanoma-derived next generation sequence genome-scale datasets have rapidly expanded over the past several years. This resource guide provides a summary of publicly available sources of melanocyte cell derived whole genome, exome, mRNA and miRNA transcriptome, chromatin accessibility and epigenetic datasets. Also highlighted are bioinformatic resources and tools for visualization and data queries which allow researchers a genome-scale view of the melanocyte. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.
McCarron, David A; Kazaks, Alexandra G; Geerling, Joel C; Stern, Judith S; Graudal, Niels A
2013-10-01
The recommendation to restrict dietary sodium for the management of hypertensive cardiovascular disease assumes that sodium intake exceeds physiologic need, that it can be significantly reduced, and that the reduction can be maintained over time. In contrast, neuroscientists have identified neural circuits in vertebrate animals that regulate sodium appetite within a narrow physiologic range. This study further validates our previous report that, consistent with the neuroscience, sodium intake tracks within a narrow range that is stable over time and across cultures. Peer-reviewed publications reporting 24-hour urinary sodium excretion (UNaV) in a defined population that were not included in our 2009 publication were identified from the medical literature. These datasets were combined with those in our previous report of worldwide dietary sodium consumption. The new data included 129 surveys, representing 50,060 participants. The mean value and range of 24-hour UNaV in each of these datasets were within 1 SD of our previous estimate. The combined mean and normal range of sodium intake across the 129 datasets were nearly identical to those we previously reported (mean = 158.3±22.5 vs. 162.4±22.4 mmol/d). Merging the previous and new datasets (n = 190) yielded a sodium consumption of 159.4±22.3 mmol/d (range = 114-210 mmol/d; 2,622-4,830 mg/d). Human sodium intake, as defined by 24-hour UNaV, is characterized by a narrow range that is remarkably reproducible over at least 5 decades and across 45 countries. As documented here, this range is determined by physiologic needs rather than environmental factors. Future guidelines should be based on this biologically determined range.
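Merging per-survey means as described is, at its simplest, a participant-weighted average; the survey tuples below are made-up stand-ins, and the paper's exact pooling method may differ.

```python
import numpy as np

# Hypothetical per-survey summaries: (mean UNaV in mmol/d, number of participants)
surveys = [(162.4, 1200), (155.1, 840), (160.8, 2300)]

means = np.array([m for m, _ in surveys])
sizes = np.array([n for _, n in surveys])
pooled_mean = (means * sizes).sum() / sizes.sum()   # participant-weighted mean
print(round(pooled_mean, 1))
```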
Rawle, Rachel A; Hamerly, Timothy; Tripet, Brian P; Giannone, Richard J; Wurch, Louie; Hettich, Robert L; Podar, Mircea; Copié, Valerie; Bothner, Brian
2017-09-01
Studies of interspecies interactions are inherently difficult due to the complex mechanisms which enable these relationships. A model system for studying interspecies interactions is the marine hyperthermophiles Ignicoccus hospitalis and Nanoarchaeum equitans. Recent independently-conducted 'omics' analyses have generated insights into the molecular factors modulating this association. However, significant questions remain about the nature of the interactions between these archaea. We jointly analyzed multiple levels of omics datasets obtained from published, independent transcriptomics, proteomics, and metabolomics analyses. DAVID identified functionally-related groups enriched when I. hospitalis is grown alone or in co-culture with N. equitans. Enriched molecular pathways were subsequently visualized using interaction maps generated using STRING. Key findings of our multi-level omics analysis indicated that I. hospitalis provides precursors to N. equitans for energy metabolism. Analysis indicated an overall reduction in diversity of metabolic precursors in the I. hospitalis-N. equitans co-culture, which has been connected to the differential use of ribosomal subunits and was previously unnoticed. We also identified differences in precursors linked to amino acid metabolism, NADH metabolism, and carbon fixation, providing new insights into the metabolic adaptions of I. hospitalis enabling the growth of N. equitans. This multi-omics analysis builds upon previously identified cellular patterns while offering new insights into mechanisms that enable the I. hospitalis-N. equitans association. Our study applies statistical and visualization techniques to a mixed-source omics dataset to yield a more global insight into a complex system that was not readily discernible from separate omics studies. Copyright © 2017 Elsevier B.V. All rights reserved.
Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset
NASA Astrophysics Data System (ADS)
Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.
2017-12-01
Here we present the first national-scale flood risk analyses using high-resolution Facebook Connectivity Lab population data and data from a hyper-resolution flood hazard model. In recent years the field of large-scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodeled territories and at continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data-poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that, for robust flood risk analysis, both hazard and exposure data should sufficiently resolve local-scale features. Global flood frameworks now enable flood hazard data to be produced at 90 m resolution, resulting in a mismatch with available population datasets, which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook and provides gridded population data at 5 m resolution, a resolution increase of multiple orders of magnitude over previous countrywide datasets. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
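The integration step described above reduces, at its simplest, to overlaying a water-depth grid on a co-registered population grid and summing the population in cells flooded above some depth. The sketch below illustrates this with hypothetical numpy arrays; the depth threshold, grid values and co-registration are assumptions for illustration, not values from the study.

```python
import numpy as np

# Hypothetical, co-registered grids: water depth (m) from a hazard model
# and population counts per cell from a gridded population dataset.
depth = np.array([[0.0, 0.3, 1.2],
                  [0.0, 0.0, 0.8],
                  [2.1, 0.4, 0.0]])
population = np.array([[120,  80,  15],
                       [200, 310,  42],
                       [ 10,  95, 400]])

threshold_m = 0.5  # assumed depth above which a cell's population is exposed
exposed = population[depth >= threshold_m].sum()
print(f"Exposed population: {exposed}")  # 15 + 42 + 10 = 67
```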
Evolving hard problems: Generating human genetics datasets with a complex etiology.
Himmelstein, Daniel S; Greene, Casey S; Moore, Jason H
2011-07-07
A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Here we develop and evaluate a model-free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model-free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate 800 Pareto fronts, one for each independent run of our algorithm. In each run the predictiveness of single genetic variants and pairs of genetic variants has been minimized, while the predictiveness of third-, fourth-, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive fourth- or fifth-order interactions and minimized lower-level effects. This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
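The competing objectives (low predictiveness for single variants and pairs, high predictiveness for higher-order combinations) can be made concrete with a per-order score such as the mean mutual information between k-variant genotype combinations and disease status. The sketch below is one plausible way to compute such a score; the function name, the use of mutual information and the two-objective combination are illustrative assumptions, since the paper's actual fitness is evaluated within a Pareto-based evolution strategy.

```python
import itertools
import numpy as np
from sklearn.metrics import mutual_info_score

def order_k_predictiveness(genotypes, status, k):
    """Mean mutual information between every k-SNP genotype combination
    and disease status (genotypes: n_samples x n_snps array of 0/1/2)."""
    scores = []
    for combo in itertools.combinations(range(genotypes.shape[1]), k):
        # Encode each k-tuple of genotypes as one categorical label.
        _, codes = np.unique(genotypes[:, combo], axis=0, return_inverse=True)
        scores.append(mutual_info_score(codes, status))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(200, 6))   # 200 individuals, 6 SNPs
status = rng.integers(0, 2, size=200)      # case/control labels
# One plausible fitness component: reward third-order signal while
# penalizing single-variant signal (the real method is Pareto-based).
fitness = (order_k_predictiveness(geno, status, 3)
           - order_k_predictiveness(geno, status, 1))
print(round(fitness, 4))
```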
DigOut: viewing differential expression genes as outliers.
Yu, Hui; Tu, Kang; Xie, Lu; Li, Yuan-Yuan
2010-12-01
For well-replicated two-condition microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-condition microarray datasets with limited or no replication the same task has not been properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-condition microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-condition expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm such as DigOut is particularly useful for selecting DE genes from non-replicated multi-condition gene expression datasets.
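A generic form of the idea: treat each gene's expression vector across conditions as a point in a multivariate space and flag genes far from the bulk as DE candidates. The sketch below uses the Mahalanobis distance with a chi-squared cutoff; DigOut's actual algorithm differs in detail, so the threshold and distributional assumption here are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2

rng = np.random.default_rng(1)
# Toy expression matrix: rows = genes, columns = conditions, no replicates.
X = rng.normal(size=(500, 4))
X[:5] += 4.0  # spike in a few genes that shift across all conditions

mu = X.mean(axis=0)
VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance of conditions
d2 = np.array([mahalanobis(row, mu, VI) ** 2 for row in X])

# Under approximate multivariate normality, d2 ~ chi2(df = n_conditions);
# genes in the extreme tail are candidate DE outliers.
cutoff = chi2.ppf(0.999, df=X.shape[1])
print(np.flatnonzero(d2 > cutoff))  # should recover the spiked genes
```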
The Transition from Paper to Digital: Lessons for Medical Specialty Societies
Miller, Donald W.
2008-01-01
Medical specialty societies often serve their membership by publishing paper forms that may simultaneously include practice guidelines, dataset specifications, and suggested layouts. Many times these forms become de facto standards for the specialty but transform poorly to the logic, structure, preciseness, and flexibility needed in modern electronic medical records. This paper analyzes one such form, a prenatal record published by the American College of Obstetricians and Gynecologists, with the intent to elucidate lessons for other specialty societies who might craft their recommendations to be effectively incorporated within modern electronic medical records. Lessons learned include separating datasets from guidelines and recommendations; specifying, codifying, and qualifying atomic data elements; and leaving graphic design to professionals. PMID:18998856
Reference datasets for 2-treatment, 2-sequence, 2-period bioequivalence studies.
Schütz, Helmut; Labes, Detlew; Fuglsang, Anders
2014-11-01
It is difficult to validate statistical software used to assess bioequivalence, since very few datasets with known results are in the public domain, and the few that are published are of moderate size and balanced. The purpose of this paper is therefore to introduce reference datasets of varying complexity in terms of dataset size and characteristics (balance, range, outlier presence, residual error distribution) for 2-treatment, 2-sequence, 2-period bioequivalence studies, and to report their point estimates and 90% confidence intervals, which companies can use to validate their installations. The results for these datasets were calculated using the commercial packages EquivTest, Kinetica, SAS and WinNonlin, and the non-commercial package R. The results of three of these packages mostly agree, but imbalance between sequences seems to provoke questionable results with one package, which illustrates well the need for proper software validation.
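The quantity being validated is the standard one: the test/reference geometric mean ratio with its 90% confidence interval on the log scale. The sketch below shows the arithmetic on hypothetical paired AUC values; a real 2x2x2 analysis also models sequence and period effects in a linear model, so this paired simplification and the example numbers are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical AUC values for the same eight subjects under Test and
# Reference; a full 2x2x2 analysis also models sequence and period effects.
auc_test = np.array([812, 640, 1021, 770, 955, 602, 888, 731], float)
auc_ref  = np.array([795, 702,  980, 812, 901, 655, 910, 745], float)

d = np.log(auc_test) - np.log(auc_ref)     # log-scale differences
se = d.std(ddof=1) / np.sqrt(d.size)
t_crit = stats.t.ppf(0.95, df=d.size - 1)  # 90% CI (two one-sided tests)

gmr = np.exp(d.mean())                     # geometric mean ratio
lo, hi = np.exp(d.mean() - t_crit * se), np.exp(d.mean() + t_crit * se)
print(f"GMR = {gmr:.3f}, 90% CI = ({lo:.3f}, {hi:.3f})")
```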
NASA Astrophysics Data System (ADS)
Déau, Estelle; Flandes, Alberto; Spilker, Linda J.; Petazzoni, Jérôme
2013-11-01
Typical variations in the opposition effect morphology of laboratory samples at optical wavelengths are investigated to probe the role of the textural properties of the surface (roughness, porosity and grain size). A previously published dataset of 34 laboratory phase curves is re-analyzed and fit with several morphological models. The retrieved parameters characterizing the opposition surge (amplitude A, angular width HWHM and slope S) are correlated with the single scattering albedo, the roughness, the porosity and the grain size of the samples. To test the universality of the laboratory samples' trends, we use previously published phase curves of planetary surfaces, including the Moon, satellites and rings of the giant planets. The morphological parameters of the surge (A and HWHM) for planetary surfaces are found to have a non-monotonic variation with the single scattering albedo, similar to that observed in asteroids (Belskaya, I.N., Shevchenko, V.G. [2000]. Icarus 147, 94-105), which remains unexplained so far. The morphological parameters of the surge (A and HWHM) for laboratory samples seem to exhibit the same non-monotonic variation with single scattering albedo. While the non-monotonic variation with albedo was already observed by Nelson et al. (Nelson, R.M., Hapke, B.W., Smythe, W.D., Hale, A.S., Piatek, J.L. [2004]. Planetary regolith microstructure: An unexpected opposition effect result. In: Mackwell, S., Stansbery, E. (Eds.), Proc. Lunar Sci. Conf. 35, p. 1089), we report here the same variation for the angular width.
Li, Kai; Rollins, Jason; Yan, Erjia
2018-01-01
Clarivate Analytics's Web of Science (WoS) is the world's leading scientific citation search and analytical information platform. It is used both as a research tool supporting a broad array of scientific tasks across diverse knowledge domains and as a dataset for large-scale data-intensive studies. WoS has been used in thousands of published academic studies over the past 20 years. It is also the most enduring commercial legacy of Eugene Garfield. Despite the central position WoS holds in contemporary research, the quantitative impact of WoS has not been previously examined by rigorous scientific studies. To better understand how this key piece of Eugene Garfield's heritage has contributed to science, we investigated the ways in which WoS (and associated products and features) is mentioned in a sample of 19,478 English-language research and review papers published between 1997 and 2017, as indexed in WoS databases. We offered descriptive analyses of the distribution of the papers across countries, institutions and knowledge domains. We also used natural language processing techniques to identify the verbs and nouns in the abstracts of these papers that are grammatically connected to WoS-related phrases. This is the first study to empirically investigate the documentation of the use of the WoS platform in published academic papers in both scientometric and linguistic terms.
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong
2018-01-04
Translation is a key regulatory step linking the transcriptome and the proteome. Two major methods of translatome investigation are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built TranslatomeDB (http://www.translatomedb.net/), a comprehensive database which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of the same species and type, on both the transcriptome and translatome levels. Translation indices (translation ratio, elongation velocity index and translational efficiency) can be calculated to quantitatively evaluate translation initiation efficiency and elongation velocity. All datasets were analyzed using a unified, robust, accurate and experimentally verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyses. TranslatomeDB also allows users to upload their own datasets and apply the identical unified pipeline to analyze their data. We believe that TranslatomeDB is a comprehensive platform and knowledgebase for translatome and proteome research, freeing biologists from the complex searching, analysis and comparison of huge sequencing datasets without the need for local computational power.
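Of the indices mentioned, translational efficiency is conceptually the simplest: a ratio of translatome abundance to transcriptome abundance per gene. The toy sketch below shows that ratio on made-up RPKM values; the gene names and numbers are hypothetical, and TranslatomeDB's pipeline computes its indices with its own definitions.

```python
# Hypothetical per-gene abundances; translational efficiency is commonly
# the ratio of translatome (RNC-seq) to transcriptome (mRNA-seq) levels.
rnc_rpkm  = {"GENE1": 120.0, "GENE2": 8.5}
mrna_rpkm = {"GENE1":  60.0, "GENE2": 17.0}

te = {gene: rnc_rpkm[gene] / mrna_rpkm[gene] for gene in rnc_rpkm}
print(te)  # {'GENE1': 2.0, 'GENE2': 0.5}
```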
Dimitriadis, Stavros I; Salis, Christos; Linden, David
2018-04-01
Limitations of the manual scoring of polysomnograms, which include data from electroencephalogram (EEG), electro-oculogram (EOG), electrocardiogram (ECG) and electromyogram (EMG) channels, have long been recognized. Manual staging is resource intensive and time consuming, and considerable effort must be spent to ensure inter-rater reliability. As a result, there is great interest in techniques based on signal processing and machine learning for completely Automatic Sleep Stage Classification (ASSC). In this paper, we present a single-EEG-sensor ASSC technique based on the dynamic reconfiguration of different aspects of cross-frequency coupling (CFC) estimated between predefined frequency pairs over 5 s epoch lengths. The proposed analytic scheme is demonstrated using the PhysioNet Sleep European Data Format (EDF) Database with repeat recordings from 20 healthy young adults, and validated in a second sleep dataset. We achieved very high classification sensitivity, specificity and accuracy of 96.2 ± 2.2%, 94.2 ± 2.3%, and 94.4 ± 2.2% across 20 folds, respectively, and a high mean F1 score (92%, range 90-94%) when a multi-class Naive Bayes classifier was applied. High classification performance was also achieved in the second sleep dataset. Our method outperformed the accuracy of previous studies not only on different datasets but also on the same database. Single-sensor ASSC makes the entire methodology appropriate for longitudinal monitoring using wearable EEG in real-world and laboratory-oriented environments.
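Cross-frequency coupling can be estimated in several ways; one common variant is phase-amplitude coupling via the Hilbert transform, sketched below for a single epoch. The frequency bands, the mean-vector-length estimator and the sampling rate are illustrative assumptions; the paper's exact CFC estimators may differ.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def pac_mvl(x, fs, phase_band, amp_band):
    """Mean-vector-length phase-amplitude coupling for one EEG epoch: how
    strongly the amplitude in amp_band locks to the phase in phase_band."""
    def bandpass(sig, lo, hi):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="bandpass")
        return filtfilt(b, a, sig)
    phase = np.angle(hilbert(bandpass(x, *phase_band)))
    amp = np.abs(hilbert(bandpass(x, *amp_band)))
    return np.abs(np.mean(amp * np.exp(1j * phase))) / np.mean(amp)

fs = 100                                              # Hz (as in Sleep-EDF)
epoch = np.random.default_rng(2).normal(size=5 * fs)  # one 5-s epoch
feature = pac_mvl(epoch, fs, phase_band=(0.5, 4), amp_band=(8, 12))
# Features from several frequency pairs per epoch would then feed a
# multi-class Naive Bayes classifier (e.g., sklearn's GaussianNB).
print(round(feature, 3))
```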
Mao, Hongliang; Wang, Hao
2017-03-01
Short Interspersed Nuclear Elements (SINEs) are transposable elements (TEs) that amplify through a copy-and-paste mode via RNA intermediates. The computational identification of new SINEs is challenging because of their weak structural signals and rapid diversification in sequences. Here we report SINE_Scan, a highly efficient program to predict SINE elements in genomic DNA sequences. SINE_Scan integrates hallmarks of SINE transposition, copy number and structural signals to identify a SINE element. SINE_Scan outperforms the previously published de novo SINE discovery program, showing high sensitivity and specificity in 19 plant and animal genome assemblies whose sizes vary from 120 Mb to 3.5 Gb. It identifies numerous new families and substantially increases the estimation of the abundance of SINEs in these genomes. The code of SINE_Scan is freely available at http://github.com/maohlzj/SINE_Scan, implemented in PERL and supported on Linux. Contact: wangh8@fudan.edu.cn. Supplementary data are available at Bioinformatics online.
NASA Technical Reports Server (NTRS)
Kattan, Michael W.; Hess, Kenneth R.; Kattan, Michael W.
1998-01-01
New computationally intensive tools for medical survival analyses include recursive partitioning (also called CART) and artificial neural networks. A challenge that remains is to better understand the behavior of these techniques in an effort to know when they will be effective tools. Theoretically they may overcome limitations of the traditional multivariable survival technique, the Cox proportional hazards regression model. Experiments were designed to test whether the new tools would, in practice, overcome these limitations. Two datasets in which theory suggests CART and the neural network should outperform the Cox model were selected. The first was a published leukemia dataset manipulated to have a strong interaction that CART should detect. The second was a published cirrhosis dataset with pronounced nonlinear effects that a neural network should fit. Repeated sampling of 50 training and testing subsets was applied to each technique. The concordance index C was calculated as a measure of predictive accuracy by each technique on the testing dataset. In the interaction dataset, CART outperformed Cox (P < 0.05) with a C improvement of 0.1 (95% CI, 0.08 to 0.12). In the nonlinear dataset, the neural network outperformed the Cox model (P < 0.05), but by a very slight amount (0.015). As predicted by theory, CART and the neural network were able to overcome limitations of the Cox model. Experiments like these are important to increase our understanding of when one of these new techniques will outperform the standard Cox model. Further research is necessary to predict which technique will do best a priori and to assess the magnitude of superiority.
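The concordance index C used here scores how often, among comparable pairs of patients, the one who fails sooner was assigned the higher predicted risk. A simplified sketch (Harrell-style, with basic censoring handling, on made-up data) follows.

```python
import itertools

def concordance_index(times, events, risk):
    """Simplified Harrell's C: among usable pairs, the fraction where the
    subject failing earlier carries the higher predicted risk."""
    concordant, usable = 0.0, 0
    for (t1, e1, r1), (t2, e2, r2) in itertools.combinations(
            zip(times, events, risk), 2):
        if t1 == t2:
            continue                   # skip tied times (simplification)
        # Usable only if the earlier of the two times is an observed event.
        if (t1 < t2 and e1) or (t2 < t1 and e2):
            usable += 1
            early, late = (r1, r2) if t1 < t2 else (r2, r1)
            concordant += 1.0 if early > late else 0.5 if early == late else 0.0
    return concordant / usable

times  = [5, 8, 3, 12, 7]
events = [1, 0, 1, 1, 1]               # 1 = event observed, 0 = censored
risk   = [0.8, 0.3, 0.9, 0.1, 0.5]
print(concordance_index(times, events, risk))  # 1.0: risks match failure order
```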
2013-01-01
Background. Decades of research strongly suggest that the genetic etiology of autism spectrum disorders (ASDs) is heterogeneous. However, most published studies focus on group differences between cases and controls. In contrast, we hypothesized that the heterogeneity of the disorder could be characterized by identifying pathways for which individuals are outliers rather than pathways representative of shared group differences of the ASD diagnosis. Methods. Two previously published blood gene expression data sets – the Translational Genetics Research Institute (TGen) dataset (70 cases and 60 unrelated controls) and the Simons Simplex Consortium (Simons) dataset (221 probands and 191 unaffected family members) – were analyzed. All individuals of each dataset were projected to biological pathways, and each sample's Mahalanobis distance from a pooled centroid was calculated to compare the number of case and control outliers for each pathway. Results. Analysis of a set of blood gene expression profiles from 70 ASD and 60 unrelated controls revealed three pathways whose outliers were significantly overrepresented in the ASD cases: neuron development including axonogenesis and neurite development (29% of ASD, 3% of control), nitric oxide signaling (29%, 3%), and skeletal development (27%, 3%). Overall, 50% of cases and 8% of controls were outliers in one of these three pathways, which could not be identified using group comparison or gene-level outlier methods. In an independently collected data set consisting of 221 ASD and 191 unaffected family members, outliers in the neurogenesis pathway were heavily biased towards cases (20.8% of ASD, 12.0% of control). Interestingly, neurogenesis outliers were more common among unaffected family members (Simons) than unrelated controls (TGen), but the statistical significance of this effect was marginal (Chi squared P < 0.09). Conclusions. Unlike group difference approaches, our analysis identified the samples within the case and control groups that manifested each expression signal, and showed that outlier groups were distinct for each implicated pathway. Moreover, our results suggest that by seeking heterogeneity, pathway-based outlier analysis can reveal expression signals that are not apparent when considering only shared group differences. PMID:24063311
Improving Fall Detection Using an On-Wrist Wearable Accelerometer
Chira, Camelia; González, Víctor M.; de la Cal, Enrique
2018-01-01
Fall detection is an important challenge that affects both elderly people and their carers, and improvements in fall detection would reduce the aid response time. This research focuses on a method for fall detection with a sensor placed on the wrist. Falls are detected using a published threshold-based solution, although a study on threshold tuning has been carried out. The feature extraction is extended in order to balance the dataset with respect to the minority class. Alternative models have been analyzed to reduce the computational constraints so the solution can be embedded in smartphones or smart wristbands. Several published datasets were used in the Materials and Methods section. Although these datasets do not include data from real falls of elderly people, a complete comparison study of fall-related datasets shows statistical differences between simulated falls and real falls from participants suffering from impairment diseases. Given the obtained results, rule-based systems represent a promising research line, as they perform similarly to neural networks but with a reduced computational cost. Furthermore, support vector machines performed with a high specificity. However, further research to validate the proposal in real on-line scenarios is needed, and a slight improvement should be made to reduce the number of false alarms. PMID:29701721
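A threshold-based detector of the kind tuned in this study can be reduced to a test on the acceleration magnitude. The sketch below flags samples whose magnitude exceeds a threshold in g; the sampling rate, threshold value and synthetic signal are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

def detect_falls(acc_xyz, fs, threshold_g=2.5):
    """Flag candidate falls where the acceleration magnitude exceeds a
    tuned threshold (a simple thresholding scheme, for illustration)."""
    magnitude = np.linalg.norm(acc_xyz, axis=1)   # in g
    return np.flatnonzero(magnitude > threshold_g) / fs  # impact times (s)

fs = 50                                       # Hz, typical for wrist sensors
n = 10 * fs
acc = np.tile([0.0, 0.0, 1.0], (n, 1))        # gravity only (magnitude 1 g)
acc[250] = [1.8, 1.2, 2.4]                    # simulated impact spike, |a|~3.2 g
print(detect_falls(acc, fs))                  # [5.0]
```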
The international primary ciliary dyskinesia cohort (iPCD Cohort): methods and first results.
Goutaki, Myrofora; Maurer, Elisabeth; Halbeisen, Florian S; Amirav, Israel; Barbato, Angelo; Behan, Laura; Boon, Mieke; Casaulta, Carmen; Clement, Annick; Crowley, Suzanne; Haarman, Eric; Hogg, Claire; Karadag, Bulent; Koerner-Rettberg, Cordula; Leigh, Margaret W; Loebinger, Michael R; Mazurek, Henryk; Morgan, Lucy; Nielsen, Kim G; Omran, Heymut; Schwerk, Nicolaus; Scigliano, Sergio; Werner, Claudius; Yiallouros, Panayiotis; Zivkovic, Zorica; Lucas, Jane S; Kuehni, Claudia E
2017-01-01
Data on primary ciliary dyskinesia (PCD) epidemiology is scarce and published studies are characterised by low numbers. In the framework of the European Union project BESTCILIA we aimed to combine all available datasets in a retrospective international PCD cohort (iPCD Cohort). We identified eligible datasets by performing a systematic review of published studies containing clinical information on PCD, and by contacting members of past and current European Respiratory Society Task Forces on PCD. We compared the contents of the datasets, clarified definitions and pooled them in a standardised format. As of April 2016 the iPCD Cohort includes data on 3013 patients from 18 countries. It includes data on diagnostic evaluations, symptoms, lung function, growth and treatments. Longitudinal data are currently available for 542 patients. The extent of clinical details per patient varies between centres. More than 50% of patients have a definite PCD diagnosis based on recent guidelines. Children aged 10-19 years are the largest age group, followed by younger children (≤9 years) and young adults (20-29 years). This is the largest observational PCD dataset available to date. It will allow us to answer pertinent questions on clinical phenotype, disease severity, prognosis and effect of treatments, and to investigate genotype-phenotype correlations.
Trippi, Michael H.; Kinney, Scott A.; Gunther, Gregory; Ryder, Robert T.; Ruppert, Leslie F.; Ruppert, Leslie F.; Ryder, Robert T.
2014-01-01
Metadata for these datasets are available in HTML and XML formats. Metadata files contain information about the sources of data used to create the dataset, the creation process steps, the data quality, the geographic coordinate system and horizontal datum used for the dataset, the values of attributes used in the dataset table, information about the publication and the publishing organization, and other information that may be useful to the reader. All links in the metadata were valid at the time of compilation. Some of these links may no longer be valid. No attempt has been made to determine the new online location (if one exists) for the data.
The experience of linking Victorian emergency medical service trauma data
Boyle, Malcolm J
2008-01-01
Background. The linking of a large Emergency Medical Service (EMS) dataset with the Victorian Department of Human Services (DHS) hospital datasets and the Victorian State Trauma Outcome Registry and Monitoring (VSTORM) dataset to determine patient outcomes has not previously been undertaken in Victoria. The objective of this study was to identify the linkage rate of a large EMS trauma dataset with the DHS hospital datasets and the VSTORM dataset. Methods. The linking of an EMS trauma dataset to the hospital datasets utilised deterministic and probabilistic matching. The linking of three EMS trauma datasets to the VSTORM dataset utilised deterministic, probabilistic and manual matching. Results. 66.7% of patients from the EMS dataset were located in the VEMD. 96% of patients defined in the VEMD as being admitted to hospital were located in the VAED. 3.7% of patients located in the VAED could not be found in the VEMD due to hospitals not reporting to the VEMD. For the EMS datasets, there was a 146% increase in successful links with the trauma profile dataset, a 221% increase in successful links with the mechanism-of-injury-only dataset, and a 46% increase with the sudden deterioration dataset, to VSTORM when using manual compared to deterministic matching. Conclusion. This study has demonstrated that EMS data can be successfully linked to other health-related datasets using deterministic and probabilistic matching, with varying levels of success. The quality of EMS data needs to be improved to ensure better linkage success rates with other health-related datasets. PMID:19014622
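Deterministic matching requires exact agreement on a set of key fields, while probabilistic matching scores partial agreement and tolerates entry errors. The sketch below contrasts the two on hypothetical records; the fields, weights and threshold are illustrative assumptions rather than the study's actual linkage keys.

```python
from difflib import SequenceMatcher

ems = {"surname": "Smyth", "dob": "1962-03-14", "postcode": "3051"}
hosp = [
    {"surname": "Smith", "dob": "1962-03-14", "postcode": "3051"},
    {"surname": "Jones", "dob": "1980-07-02", "postcode": "3000"},
]

def deterministic(a, b):
    # Exact agreement on all key fields.
    return all(a[k] == b[k] for k in ("surname", "dob", "postcode"))

def probabilistic(a, b, threshold=0.8):
    # Weighted agreement score that tolerates typos in the surname.
    name_sim = SequenceMatcher(None, a["surname"], b["surname"]).ratio()
    score = (0.5 * name_sim + 0.3 * (a["dob"] == b["dob"])
             + 0.2 * (a["postcode"] == b["postcode"]))
    return score >= threshold

print([deterministic(ems, r) for r in hosp])   # [False, False]
print([probabilistic(ems, r) for r in hosp])   # [True, False]
```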
James, Eric P.; Benjamin, Stanley G.; Marquis, Melinda
2016-10-28
A new gridded dataset for wind and solar resource estimation over the contiguous United States has been derived from hourly updated 1-h forecasts from the National Oceanic and Atmospheric Administration High-Resolution Rapid Refresh (HRRR) 3-km model composited over a three-year period (approximately 22 000 forecast model runs). The unique dataset features hourly data assimilation, and provides physically consistent wind and solar estimates for the renewable energy industry. The wind resource dataset shows strong similarity to that previously provided by a Department of Energy-funded study, and it includes estimates in southern Canada and northern Mexico. The solar resource dataset represents an initial step towards application-specific fields such as global horizontal and direct normal irradiance. This combined dataset will continue to be augmented with new forecast data from the advanced HRRR atmospheric/land-surface model.
The Graduate Outcome Project: Using Data from the Integrated Data Infrastructure Project
ERIC Educational Resources Information Center
Rees, Malcolm
2014-01-01
This paper reports on progress to date with a project underway in New Zealand involving the extraction of data from multiple government agencies that is then combined into one comprehensive longitudinal integrated dataset and made available to trial participants in a way never previously thought possible. The dataset includes school leaver…
Improved group-specific primers based on the full SILVA 16S rRNA gene reference database.
Pfeiffer, Stefan; Pastar, Milica; Mitter, Birgit; Lippert, Kathrin; Hackl, Evelyn; Lojan, Paul; Oswald, Andreas; Sessitsch, Angela
2014-08-01
Quantitative PCR (qPCR) and community fingerprinting methods, such as Terminal Restriction Fragment Length Polymorphism (T-RFLP) analysis, are well-suited techniques for the examination of microbial community structures. The use of phylum- and class-specific primers can provide enhanced sensitivity and phylogenetic resolution as compared with domain-specific primers. To date, several phylum- and class-specific primers targeting the 16S ribosomal RNA gene have been published. However, many of these primers exhibit low discriminatory power against non-target bacteria in PCR. In this study, we evaluated the precision of certain published primers in silico and via specific PCR. We designed new qPCR and T-RFLP primer pairs (for the classes Alphaproteobacteria and Betaproteobacteria, and the phyla Bacteroidetes, Firmicutes and Actinobacteria) by combining the sequence information from a public dataset (SILVA SSU Ref 102 NR) with manual primer design. We evaluated the primer pairs via PCR using isolates of the above-mentioned groups and via screening of clone libraries from environmental soil samples and human faecal samples. As observed through theoretical and practical evaluation, the primers developed in this study showed a higher level of precision than previously published primers, thus allowing a deeper insight into microbial community dynamics.
Kinkar, Liina; Laurimäe, Teivi; Acosta-Jamett, Gerardo; Andresiuk, Vanessa; Balkaya, Ibrahim; Casulli, Adriano; Gasser, Robin B; van der Giessen, Joke; González, Luis Miguel; Haag, Karen L; Zait, Houria; Irshadullah, Malik; Jabbar, Abdul; Jenkins, David J; Kia, Eshrat Beigom; Manfredi, Maria Teresa; Mirhendi, Hossein; M'rad, Selim; Rostami-Nejad, Mohammad; Oudni-M'rad, Myriam; Pierangeli, Nora Beatriz; Ponce-Gordo, Francisco; Rehbein, Steffen; Sharbatkhori, Mitra; Simsek, Sami; Soriano, Silvia Viviana; Sprong, Hein; Šnábel, Viliam; Umhang, Gérald; Varcasia, Antonio; Saarma, Urmas
2018-05-19
Echinococcus granulosus sensu stricto (s.s.) is the major cause of human cystic echinococcosis worldwide and is listed among the most severe parasitic diseases of humans. To date, numerous studies have investigated the genetic diversity and population structure of E. granulosus s.s. in various geographic regions. However, there has been no global study. Recently, using mitochondrial DNA, it was shown that E. granulosus s.s. G1 and G3 are distinct genotypes, but a larger dataset is required to confirm the distinction of these genotypes. The objectives of this study were to: (i) investigate the distinction of genotypes G1 and G3 using a large global dataset; and (ii) analyse the genetic diversity and phylogeography of genotype G1 on a global scale using near-complete mitogenome sequences. For this study, 222 globally distributed E. granulosus s.s. samples were used, of which 212 belonged to genotype G1 and 10 to G3. Using a total sequence length of 11,682 bp, we inferred phylogenetic networks for three datasets: E. granulosus s.s. (n = 222), G1 (n = 212) and human G1 samples (n = 41). In addition, the Bayesian phylogenetic and phylogeographic analyses were performed. The latter yielded several strongly supported diffusion routes of genotype G1 originating from Turkey, Tunisia and Argentina. We conclude that: (i) using a considerably larger dataset than employed previously, E. granulosus s.s. G1 and G3 are indeed distinct mitochondrial genotypes; (ii) the genetic diversity of E. granulosus s.s. G1 is high globally, with lower values in South America; and (iii) the complex phylogeographic patterns emerging from the phylogenetic and geographic analyses suggest that the current distribution of genotype G1 has been shaped by intensive animal trade.
De Anda, Valerie; Zapata-Peñasco, Icoquih; Poot-Hernandez, Augusto Cesar; Eguiarte, Luis E; Contreras-Moreira, Bruno; Souza, Valeria
2017-11-01
The increasing number of metagenomic and genomic sequences has dramatically improved our understanding of microbial diversity, yet our ability to infer metabolic capabilities in such datasets remains challenging. We describe the Multigenomic Entropy Based Score pipeline (MEBS), a software platform designed to evaluate, compare, and infer complex metabolic pathways in large "omic" datasets, including entire biogeochemical cycles. MEBS is open source and available through https://github.com/eead-csic-compbio/metagenome_Pfam_score. To demonstrate its use, we modeled the sulfur cycle by exhaustively curating the molecular and ecological elements involved (compounds, genes, metabolic pathways, and microbial taxa). This information was reduced to a collection of 112 characteristic Pfam protein domains and a list of complete-sequenced sulfur genomes. Using the mathematical framework of relative entropy (H′), we quantitatively measured the enrichment of these domains among sulfur genomes. The entropy of each domain was used both to build a final score that indicates whether a (meta)genomic sample contains the metabolic machinery of interest and to propose marker domains in metagenomic sequences, such as DsrC (PF04358). MEBS was benchmarked with a dataset of 2107 non-redundant microbial genomes from RefSeq and 935 metagenomes from MG-RAST. Its performance, reproducibility, and robustness were evaluated using several approaches, including random sampling, linear regression models, receiver operating characteristic (ROC) plots, and the area under the curve (AUC) metric. Our results support the broad applicability of this algorithm to accurately classify (AUC = 0.985) hard-to-culture genomes (e.g., Candidatus Desulforudis audaxviator), previously characterized ones, and metagenomic environments such as hydrothermal vents or deep-sea sediment. Our benchmark indicates that an entropy-based score can capture the metabolic machinery of interest and can be used to efficiently classify large genomic and metagenomic datasets, including uncultivated/unexplored taxa.
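The core scoring idea can be illustrated with a two-outcome relative entropy: a domain is informative when its frequency among sulfur genomes diverges from its background frequency among all genomes. The numbers and the exact formula below are illustrative assumptions; MEBS's published H′ formulation should be consulted for the real definition.

```python
import numpy as np

# Toy frequencies for a protein domain: p = fraction of sulfur-cycle
# genomes containing it, q = fraction among all genomes (background).
domains = {"DsrC": (0.90, 0.05), "ABC_tran": (0.95, 0.93)}

def domain_relative_entropy(p, q):
    """Kullback-Leibler divergence for the presence/absence of one domain."""
    return (p * np.log2(p / q)
            + (1 - p) * np.log2((1 - p) / (1 - q)))

for name, (p, q) in domains.items():
    print(name, round(domain_relative_entropy(p, q), 3))
# DsrC scores high (informative marker); ABC_tran scores near zero.
```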
A dictionary learning approach for human sperm heads classification.
Shaker, Fariba; Monadjemi, S Amirhassan; Alirezaie, Javad; Naghsh-Nilchi, Ahmad Reza
2017-12-01
To diagnose infertility in men, semen analysis is conducted, in which sperm morphology is one of the factors evaluated. Since manual assessment of sperm morphology is time-consuming and subjective, automatic classification methods are being developed. Automatic classification of sperm heads is a complicated task due to the intra-class differences and inter-class similarities of class objects. In this research, a Dictionary Learning (DL) technique is utilized to construct a dictionary of sperm head shapes, which is used to classify sperm heads into four different classes. Square patches are extracted from the sperm head images, and columnized patches from each class are used to learn class-specific dictionaries. The patches from a test image are reconstructed using each class-specific dictionary, and the overall reconstruction error for each class is used to select the best matching class. Average accuracy, precision, recall, and F-score are used to evaluate the classification method on two publicly available datasets of human sperm head shapes. The proposed DL-based method achieved an average accuracy of 92.2% on the HuSHeM dataset, and an average recall of 62% on the SCIAN-MorphoSpermGS dataset. The results show a significant improvement over a previously published shape-feature-based method, and the proposed approach offers a more balanced classifier in which all four classes are recognized with high precision and recall. Overall, this work shows that a Dictionary Learning approach is far more effective in classifying human sperm heads than classifiers using shape-based features. A dataset of human sperm head shapes is also introduced to facilitate future research.
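The classification rule is: learn one dictionary per class from that class's training patches, then assign a test sample to the class whose dictionary reconstructs its patches with the lowest error. A sketch using scikit-learn's MiniBatchDictionaryLearning on synthetic patch matrices follows; the patch size, number of atoms, sparsity level and data are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(3)
# Hypothetical columnized patches (n_patches x patch_pixels) per class;
# real inputs would be patches cropped from stained sperm head images.
train = {c: rng.normal(loc=c, size=(200, 64)) for c in range(4)}
test_patches = rng.normal(loc=2, size=(50, 64))  # patches from one test image

models = {
    c: MiniBatchDictionaryLearning(n_components=16,
                                   transform_algorithm="omp",
                                   transform_n_nonzero_coefs=3,
                                   random_state=0).fit(X)
    for c, X in train.items()
}

def reconstruction_error(dl, patches):
    codes = dl.transform(patches)       # sparse codes per patch
    recon = codes @ dl.components_      # rebuild patches from class atoms
    return float(np.mean(np.sum((patches - recon) ** 2, axis=1)))

pred = min(models, key=lambda c: reconstruction_error(models[c], test_patches))
print("predicted class:", pred)         # lowest overall error is expected: 2
```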
NASA Astrophysics Data System (ADS)
O'Connell, Dylan; Thomas, David H.; Lamb, James M.; Lewis, John H.; Dou, Tai; Sieren, Jered P.; Saylor, Melissa; Hofmann, Christian; Hoffman, Eric A.; Lee, Percy P.; Low, Daniel A.
2018-02-01
To determine if the parameters relating lung tissue displacement to a breathing surrogate signal in a previously published respiratory motion model vary with the rate of breathing during image acquisition, an anesthetized pig was imaged using multiple fast helical scans to sample the breathing cycle with simultaneous surrogate monitoring. Three datasets were collected while the animal was mechanically ventilated at different respiratory rates: 12 bpm (breaths per minute), 17 bpm, and 24 bpm. Three sets of motion model parameters describing the correspondences between surrogate signals and tissue displacements were determined. The model error was calculated individually for each dataset, as well as for pairs of parameters and surrogate signals from different experiments. The values of one model parameter, a vector field denoted α which relates tissue displacement to surrogate amplitude, were compared across experiments. The mean model error of the three datasets was 1.00 ± 0.36 mm with a 95th percentile value of 1.69 mm. The mean error computed from all combinations of parameters and surrogate signals from different datasets was 1.14 ± 0.42 mm with a 95th percentile of 1.95 mm. The mean difference in α over all pairs of experiments was 4.7% ± 5.4%, and the 95th percentile was 16.8%. The mean angle between pairs of α was 5.0 ± 4.0 degrees, with a 95th percentile of 13.2 degrees. The motion model parameters were largely unaffected by changes in the breathing rate during image acquisition: the mean error associated with mismatched sets of parameters and surrogate signals was only 0.14 mm greater than the error achieved when using parameters and surrogate signals acquired with the same breathing rate, while maximum respiratory motion was 23.23 mm on average.
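Motion models of this family are typically linear in the surrogate signal and its derivative, with per-voxel coefficient fields (here, α relates displacement to surrogate amplitude). The least-squares sketch below fits such a linear model to synthetic data; the functional form, signals and noise level are assumptions for illustration, not the published model's exact specification.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 60, 600)                     # 60 s of monitoring
v = np.sin(2 * np.pi * t / 4.0)                 # breathing surrogate amplitude
f = np.gradient(v, t)                           # surrogate flow (derivative)
d = 3.0 * v + 0.5 * f + rng.normal(0, 0.05, t.size)  # one voxel's motion (mm)

# Least-squares fit of the per-voxel coefficients (alpha for amplitude,
# beta for flow); repeating this per voxel yields vector fields like alpha.
A = np.column_stack([v, f])
coef, *_ = np.linalg.lstsq(A, d, rcond=None)
alpha, beta = coef
residual = d - A @ coef
print(alpha, beta, np.percentile(np.abs(residual), 95))
```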
Extended Kd distributions for freshwater environment.
Boyer, Patrick; Wells, Claire; Howard, Brenda
2018-06-18
Many of the freshwater Kd values required for quantifying radionuclide transfer in the environment (e.g. in the ERICA Tool or the Symbiose modelling platform) are either poorly reported in the literature or not available. To partially address this deficiency, Working Group 4 of the IAEA program MODARIA (2012-2015) has completed an update of the freshwater Kd databases and Kd distributions given in TRS 472 (IAEA, 2010). Over 2300 new values for 27 new elements were added to the dataset and 270 new Kd values were added for the 25 elements already included in TRS 472 (IAEA, 2010). For 49 chemical elements, the Kd values have been classified according to three solid-liquid exchange conditions (adsorption, desorption and field), as was previously carried out in TRS 472. Additionally, the Kd values were classified into two environmental components (suspended and deposited sediments). Each combination (radionuclide × component × condition) was associated with a log-normal distribution when there were at least ten Kd values in the dataset, and with a geometric mean when there were fewer than ten values. The enhanced Kd dataset shows that Kd values for suspended sediments are significantly higher than for deposited sediments and that the variability of the Kd distributions is higher for deposited than for suspended sediments. For suspended sediments in field conditions, the variability of the Kd distributions can be significantly reduced as a function of the suspended load, which explains more than 50% of the variability of the Kd datasets of U, Si, Mo, Pb, S, Se, Cd, Ca, B, K, Ra and Po. The distinction between adsorption and desorption conditions is justified for deterministic calculations because the geometric means are systematically greater in desorption conditions. Conversely, this distinction is less relevant for probabilistic calculations due to systematic overlapping between the Kd distributions of these two conditions.
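The summarization rule in the text (a log-normal distribution when at least ten values exist, a geometric mean otherwise) is easy to express directly. The sketch below applies it to made-up Kd values; the numbers are illustrative only.

```python
import numpy as np
from scipy import stats

def summarize_kd(values):
    """Fit a log-normal distribution when >= 10 Kd values exist,
    otherwise fall back to the geometric mean (the rule in the text)."""
    values = np.asarray(values, dtype=float)
    if values.size >= 10:
        mu, sigma = np.log(values).mean(), np.log(values).std(ddof=1)
        return {"model": "lognormal", "gm": np.exp(mu), "gsd": np.exp(sigma)}
    return {"model": "geometric_mean", "gm": stats.gmean(values)}

print(summarize_kd([120, 300, 95, 410, 230, 180, 75, 520, 260, 140]))
print(summarize_kd([1000, 2500, 800]))
```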
Saeed, Mohammad
2017-05-01
Systemic lupus erythematosus (SLE) is a complex disorder. Genetic association studies of complex disorders suffer from three major issues: phenotypic heterogeneity, false positive results (type I error), and false negative results (type II error). Hence, genes with low to moderate effects are missed in standard analyses, especially after statistical corrections. OASIS is a novel linkage disequilibrium clustering algorithm that can potentially address false positives and negatives in genome-wide association studies (GWAS) of complex disorders such as SLE. OASIS was applied to two SLE dbGAP GWAS datasets (6077 subjects; ∼0.75 million single-nucleotide polymorphisms). OASIS identified three known SLE genes, viz. IFIH1, TNIP1, and CD44, not previously reported using these GWAS datasets. In addition, 22 novel loci for SLE were identified and the 5 SLE genes previously reported using these datasets were verified. The OASIS methodology was validated using single-variant replication and gene-based analysis with GATES, which led to the verification of 60% of OASIS loci. New SLE genes that OASIS identified and that were further verified include TNFAIP6, DNAJB3, TTF1, GRIN2B, MON2, LATS2, SNX6, RBFOX1, NCOA3, and CHAF1B. This study presents the OASIS algorithm and software, along with the meta-analyses of two publicly available SLE GWAS datasets and the novel SLE genes identified. Hence, OASIS is a linkage disequilibrium clustering method that can be universally applied to existing GWAS datasets for the identification of new genes.
Highlights of the Version 8 SBUV and TOMS Datasets Released at this Symposium
NASA Technical Reports Server (NTRS)
Bhartia, Pawan K.; McPeters, Richard D.; Flynn, Lawrence E.; Wellemeyer, Charles G.
2004-01-01
Last October was the 25th anniversary of the launch of the SBUV and TOMS instruments on NASA's Nimbus-7 satellite. Total ozone and ozone profile datasets produced by these and following instruments have produced a quarter-century-long record. Over time we have released several versions of these datasets to incorporate advances in UV radiative transfer, inverse modeling, and instrument characterization. In this meeting we are releasing datasets produced from the version 8 algorithms. They replace the previous versions (V6 SBUV and V7 TOMS) released about a decade ago. About a dozen companion papers in this meeting provide details of the new algorithms and intercomparison of the new data with external data. In this paper we present key features of the new algorithms and discuss how the new results differ from those released previously. We show that the new datasets have better internal consistency and also agree better with external datasets. A key feature of the V8 SBUV algorithm is that the climatology has no influence on inter-annual variability and trends; it only affects the mean values and, to a limited extent, the seasonal dependence. By contrast, climatology does have some influence on TOMS total O3 trends, particularly at large solar zenith angles. For this reason, and also because the TOMS record has gaps and EP/TOMS is suffering from data quality problems, we recommend using SBUV total ozone data for applications where the high spatial resolution of TOMS is not essential.
Unified Ecoregions of Alaska: 2001
Nowacki, Gregory J.; Spencer, Page; Fleming, Michael; Brock, Terry; Jorgenson, Torre
2003-01-01
Major ecosystems have been mapped and described for the State of Alaska and nearby areas. Ecoregion units are based on newly available datasets and the field experience of ecologists, biologists, geologists and regional experts. Recently derived datasets for Alaska included climate parameters, vegetation, surficial geology and topography. Additional datasets incorporated in the mapping process were lithology, soils, permafrost, hydrography, fire regime and glaciation. Thirty-two units are mapped using a combination of the approaches of Bailey (hierarchical) and Omernick (integrated). The ecoregions are grouped into two higher levels using a 'tri-archy' based on climate parameters, vegetation response and disturbance processes. The ecoregions are described with text, photos and tables on the published map.
Dataset of anomalies and malicious acts in a cyber-physical subsystem.
Laso, Pedro Merino; Brosset, David; Puentes, John
2017-10-01
This article presents a dataset produced to investigate how data and information quality estimations enable the detection of anomalies and malicious acts in cyber-physical systems. Data were acquired making use of a cyber-physical subsystem consisting of liquid containers for fuel or water, along with its automated control and data acquisition infrastructure. The described data consist of temporal series representing five operational scenarios - normal, anomalies, breakdown, sabotages, and cyber-attacks - corresponding to 15 different real situations. The dataset is publicly available in the .zip file published with the article, to investigate and compare faulty operation detection and characterization methods for cyber-physical systems.
Emerging trends and new developments in regenerative medicine: a scientometric update (2000 - 2014).
Chen, Chaomei; Dubin, Rachael; Kim, Meen Chul
2014-09-01
Our previous scientometric review of regenerative medicine provides a snapshot of the fast-growing field up to the end of 2011. The new review identifies emerging trends and new developments appearing in the literature of regenerative medicine based on relevant articles and reviews published between 2000 and the first month of 2014. Multiple datasets of publications relevant to regenerative medicine are constructed through topic search and citation expansion to ensure adequate coverage of the field. Networks of co-cited references representing the literature of regenerative medicine are constructed and visualized based on a combined dataset of 71,393 articles published between 2000 and 2014. Structural and temporal dynamics are identified in terms of most active topical areas and cited references. New developments are identified in terms of newly emerged clusters and research areas. Disciplinary-level patterns are visualized in dual-map overlays. While research in induced pluripotent stem cells remains the most prominent area in the field of regenerative medicine, research related to clinical and therapeutic applications in regenerative medicine has experienced a considerable growth. In addition, clinical and therapeutic developments in regenerative medicine have demonstrated profound connections with the induced pluripotent stem cell research and stem cell research in general. A rapid adaptation of graphene-based nanomaterials in regenerative medicine is evident. Both basic research represented by stem cell research and application-oriented research typically found in tissue engineering are now increasingly integrated in the scientometric landscape of regenerative medicine. Tissue engineering is an interdisciplinary field in its own right. Advances in multiple disciplines such as stem cell research and graphene research have strengthened the connections between tissue engineering and regenerative medicine.
Haakensen, Vilde D; Lingjaerde, Ole Christian; Lüders, Torben; Riis, Margit; Prat, Aleix; Troester, Melissa A; Holmen, Marit M; Frantzen, Jan Ole; Romundstad, Linda; Navjord, Dina; Bukholm, Ida K; Johannesen, Tom B; Perou, Charles M; Ursin, Giske; Kristensen, Vessela N; Børresen-Dale, Anne-Lise; Helland, Aslaug
2011-11-01
Increased understanding of the variability in normal breast biology will enable us to identify mechanisms of breast cancer initiation and the origin of different subtypes, and to better predict breast cancer risk. Gene expression patterns in breast biopsies from 79 healthy women referred to breast diagnostic centers in Norway were explored by unsupervised hierarchical clustering and supervised analyses, such as gene set enrichment analysis and gene ontology analysis, and by comparison with previously published gene lists and independent datasets. Unsupervised hierarchical clustering identified two separate clusters of normal breast tissue based on gene expression profiling, regardless of the clustering algorithm and gene filtering used. Comparison of the expression profiles of the two clusters with several published gene lists describing breast cells revealed that the samples in cluster 1 share characteristics with stromal cells and stem cells, and to a certain degree with mesenchymal cells and myoepithelial cells. The samples in cluster 1 also share many features with the newly identified claudin-low breast cancer intrinsic subtype, which also shows characteristics of stromal and stem cells. More women belonging to cluster 1 have a family history of breast cancer, and there is a slight overrepresentation of nulliparous women in cluster 1. Similar findings were seen in a separate dataset consisting of histologically normal tissue from breasts harboring breast cancer and from mammoplasty reductions. This is the first study to explore the variability of gene expression patterns in whole biopsies from normal breasts, and it identified distinct subtypes of normal breast tissue. Further studies are needed to determine the specific cell contribution to the variation in the biology of normal breasts, how the clusters identified relate to breast cancer risk, and their possible link to the origin of the different molecular subtypes of breast cancer.
Cats of the Pharaohs: Genetic Comparison of Egyptian Cat Mummies to their Feline Contemporaries
Kurushima, Jennifer D.; Ikram, Salima; Knudsen, Joan; Bleiberg, Edward; Grahn, Robert A.; Lyons, Leslie A.
2012-01-01
The ancient Egyptians mummified an abundance of cats during the Late Period (664-332 BC). The overlapping morphology and sizes of developing wildcats and domestic cats confound the identity of mummified cat species. Genetic analyses should support mummy identification, and were conducted on two long bones and a mandible from three cats that were mummified by the ancient Egyptians. The mummy DNA was extracted in a dedicated ancient DNA laboratory at the University of California, Davis, followed by direct sequencing of between 246 and 402 bp of the mtDNA control region from each bone. When compared to a dataset of wildcats (Felis silvestris silvestris, F. s. tristrami, and F. chaus) as well as a previously published worldwide dataset of modern domestic cat samples, including Egypt, the DNA evidence suggests the three mummies represent common contemporary domestic cat mitotypes prevalent in modern Egypt and the Middle East. Divergence estimates date the origin of the mummies' mitotypes to between 2,000 and 7,500 years prior to their mummification, likely before or during the Egyptian Predynastic and Early Dynastic Periods. These data are the first genetic evidence supporting that the ancient Egyptians used domesticated cats, F. s. catus, for votive mummies, and likely imply that cats were domesticated prior to the extensive mummification of cats. PMID:22923880
W Boson Mass Measurement at CDF
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kotwal, Ashutosh V.
2017-03-27
This is the closeout report for the grant for experimental research at the energy frontier in high energy physics. The report describes the precise measurement of the W boson mass at the CDF experiment at Fermilab, with an uncertainty of ≈12 MeV, using the full dataset of ≈9 fb⁻¹ collected by the experiment up to the shutdown of the Tevatron in 2011. In this analysis, the statistical and most of the experimental systematic uncertainties have been reduced by a factor of two compared to the previous measurement with 2.2 fb⁻¹ of CDF data. This research has been the culmination of the PI's track record of producing world-leading measurements of the W boson mass from the Tevatron. The PI performed the first and only measurement to date of the W boson mass using high-rapidity leptons, using the D0 endcap calorimeters in Run 1. He has led this measurement in Run 2 at CDF, publishing two world-leading measurements in 2007 and 2012 with total uncertainties of 48 MeV and 19 MeV respectively. The analysis of the final dataset is currently under internal review in CDF. Upon approval of the internal review, the result will be available for public release.
Rosser, Gabriel; Baker, Ruth E.; Armitage, Judith P.; Fletcher, Alexander G.
2014-01-01
Most free-swimming bacteria move in approximately straight lines, interspersed with random reorientation phases. A key open question concerns varying mechanisms by which reorientation occurs. We combine mathematical modelling with analysis of a large tracking dataset to study the poorly understood reorientation mechanism in the monoflagellate species Rhodobacter sphaeroides. The flagellum on this species rotates counterclockwise to propel the bacterium, periodically ceasing rotation to enable reorientation. When rotation restarts the cell body usually points in a new direction. It has been assumed that the new direction is simply the result of Brownian rotation. We consider three variants of a self-propelled particle model of bacterial motility. The first considers rotational diffusion only, corresponding to a non-chemotactic mutant strain. Two further models incorporate stochastic reorientations, describing ‘run-and-tumble’ motility. We derive expressions for key summary statistics and simulate each model using a stochastic computational algorithm. We also discuss the effect of cell geometry on rotational diffusion. Working with a previously published tracking dataset, we compare predictions of the models with data on individual stopping events in R. sphaeroides. This provides strong evidence that this species undergoes some form of active reorientation rather than simple reorientation by Brownian rotation. PMID:24872500
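The self-propelled particle models compared in this study alternate straight runs with stochastic reorientation events. A minimal two-dimensional sketch of that class of model follows (uniform reorientation angles, exponentially distributed run times); the speed, rate and time step are illustrative assumptions, and the paper's variants differ in how the new direction is drawn.

```python
import numpy as np

rng = np.random.default_rng(5)
v, mean_run, dt, steps = 20.0, 1.0, 0.01, 5000   # um/s, s, s, samples

theta = rng.uniform(0, 2 * np.pi)                # initial heading
pos = np.zeros((steps, 2))
for i in range(1, steps):
    if rng.random() < dt / mean_run:             # Poisson reorientation event
        theta = rng.uniform(0, 2 * np.pi)        # 'tumble': fresh direction
    pos[i] = pos[i - 1] + v * dt * np.array([np.cos(theta), np.sin(theta)])

# Summary statistic comparable to tracking data: mean squared displacement.
msd = np.mean(np.sum((pos - pos[0]) ** 2, axis=1))
print(round(msd, 1))
```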
Status and trends of land change in selected U.S. ecoregions - 2000 to 2011
Sayler, Kristi L.; Acevedo, William; Taylor, Janis
2016-01-01
U.S. Geological Survey scientists developed a dataset of 2006 and 2011 land-use and land-cover (LULC) information for selected 100-km2 sample blocks within 29 U.S. Environmental Protection Agency (EPA) Level III ecoregions across the conterminous United States. The data can be used with the previously published Land Cover Trends Dataset: 1973 to 2000 to assess land-use/land-cover change across a 37-year study period. Results from analysis of these data include ecoregion-based statistical estimates of the amount of LULC change per time period, ranking of the most common types of conversions, rates of change, and percent composition. The overall estimated amount of change per ecoregion from 2001 to 2011 ranged from a low of 370 km2 in the Northern Basin and Range Ecoregion to a high of 78,782 km2 in the Southeastern Plains Ecoregion. The Southeastern Plains continues to encompass one of the most intense forest harvesting and regrowth regions in the country, with 16.6 percent of the ecoregion changing between 2001 and 2011. These LULC change statistics provide a new, valuable resource that complements other reference data and field-verified LULC data. Researchers can use this resource to independently validate other land change products or to conduct regional land change assessments.
Broad-Enrich: functional interpretation of large sets of broad genomic regions.
Cavalcante, Raymond G; Lee, Chee; Welch, Ryan P; Patil, Snehal; Weymouth, Terry; Scott, Laura J; Sartor, Maureen A
2014-09-01
Functional enrichment testing facilitates the interpretation of chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) data in terms of pathways and other biological contexts. Previous methods developed and used to test for key gene sets affected in ChIP-seq experiments treat peaks as points, and are based on the number of peaks associated with a gene or a binary score for each gene. These approaches work well for transcription factors, but histone modifications often occur over broad domains and across multiple genes. To incorporate the unique properties of broad domains into functional enrichment testing, we developed Broad-Enrich, a method that uses the proportion of each gene's locus covered by a peak. We show that our method has a well-calibrated false-positive rate, performing well with ChIP-seq data having broad domains compared with alternative approaches. We illustrate Broad-Enrich with 55 ENCODE ChIP-seq datasets using different methods to define gene loci. Broad-Enrich can also be applied to other datasets consisting of broad genomic domains, such as copy number variations. Availability: http://broad-enrich.med.umich.edu (Web version and R package). Supplementary data are available at Bioinformatics online.
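The per-gene statistic that distinguishes Broad-Enrich from peak-counting methods is the fraction of a gene's locus covered by peaks. A short sketch of that computation follows; the coordinates are made up, and a real peak set would first need overlapping intervals merged.

```python
def covered_proportion(locus, peaks):
    """Fraction of a gene locus (start, end) covered by peak intervals,
    the per-gene statistic underlying Broad-Enrich's enrichment test.
    Assumes the peaks do not overlap one another."""
    start, end = locus
    covered = 0
    for p_start, p_end in sorted(peaks):
        lo, hi = max(start, p_start), min(end, p_end)
        if lo < hi:
            covered += hi - lo
    return covered / (end - start)

# A 10 kb locus with two broad domains covering 4.5 kb of it in total.
print(covered_proportion((100_000, 110_000),
                         [(99_000, 102_500), (106_000, 108_000)]))  # 0.45
```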
Photosynthetic parameters in the Beaufort Sea in relation to the phytoplankton community structure
NASA Astrophysics Data System (ADS)
Huot, Y.; Babin, M.; Bruyant, F.
2013-05-01
To model phytoplankton primary production from remotely sensed data, a method to estimate photosynthetic parameters describing the photosynthetic rates per unit biomass is required. Variability in these parameters must be related to environmental variables that are measurable remotely. In the Arctic, a limited number of measurements of photosynthetic parameters have been carried out with the concurrent environmental variables needed. Such measurements and their relationship to environmental variables will be required to improve the accuracy of remotely sensed estimates of phytoplankton primary production and our ability to predict future changes. During the MALINA cruise, a large dataset of these parameters was obtained. Together with previously published datasets, we use environmental and trophic variables to provide functional relationships for these parameters. In particular, we describe several specific aspects: the maximum rate of photosynthesis (Pmaxchl) normalized to chlorophyll decreases with depth and is higher for communities composed of large cells; the saturation parameter (Ek) decreases with depth but is independent of the community structure; and the initial slope of the photosynthesis versus irradiance curve (αchl) normalized to chlorophyll is independent of depth but is higher for communities composed of larger cells. The photosynthetic parameters were not influenced by temperature over the range encountered during the cruise (-2 to 8 °C).
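For context, the three parameters discussed above are linked through the standard photosynthesis-irradiance curve; a sketch assuming the common Jassby-Platt hyperbolic-tangent form (the study's exact fitting function may differ), in which Ek = Pmax/α:

```python
import numpy as np

def pe_curve(E, P_max, alpha):
    """Chlorophyll-normalized photosynthetic rate versus irradiance E,
    Jassby & Platt (1976) hyperbolic-tangent form."""
    return P_max * np.tanh(alpha * E / P_max)

P_max, alpha = 4.0, 0.05   # illustrative values only
E_k = P_max / alpha        # saturation parameter reported in the study
P = pe_curve(np.linspace(0, 400, 5), P_max, alpha)
```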
Nayak, Deepak Ranjan; Dash, Ratnakar; Majhi, Banshidhar
2017-01-01
This paper presents an automatic classification system for segregating pathological brains from normal brains in magnetic resonance (MR) imaging. The proposed system employs a contrast-limited adaptive histogram equalization scheme to enhance the diseased region in brain MR images. A two-dimensional stationary wavelet transform (SWT) is harnessed to extract features from the preprocessed images. The feature vector is constructed using the energy and entropy values computed from the level-2 SWT coefficients. Then, the relevant and uncorrelated features are selected using a symmetric uncertainty ranking filter. Subsequently, the selected features are given as input to the proposed AdaBoost with support vector machine classifier, where the SVM is used as the base classifier of the AdaBoost algorithm. To validate the proposed system, three standard MR image datasets, Dataset-66, Dataset-160, and Dataset-255, have been utilized. Results from five runs of k-fold stratified cross-validation indicate that the suggested scheme offers better performance than other existing schemes in terms of accuracy and number of features. The proposed system achieves ideal classification on Dataset-66 and Dataset-160, whereas for Dataset-255 an accuracy of 99.45% is achieved. Copyright © Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
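A sketch of the feature-extraction and classification pipeline described above, using PyWavelets for the level-2 SWT and scikit-learn's AdaBoost with an SVM base learner; the symmetric-uncertainty feature filter is omitted, and the images, wavelet choice, and labels are placeholders:

```python
import numpy as np
import pywt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

def swt_features(img):
    """Energy and Shannon entropy of level-2 SWT subband coefficients."""
    feats = []
    coeffs = pywt.swt2(img, "haar", level=2)  # sides must be divisible by 4
    for cA, (cH, cV, cD) in coeffs:
        for band in (cA, cH, cV, cD):
            sq = band.ravel() ** 2
            feats.append(sq.mean())                        # energy
            p = sq / (sq.sum() + 1e-12)
            feats.append(-(p * np.log2(p + 1e-12)).sum())  # entropy
    return np.array(feats)

# toy stand-ins for preprocessed (CLAHE-enhanced) MR slices
rng = np.random.default_rng(1)
X = np.array([swt_features(rng.random((64, 64))) for _ in range(40)])
y = rng.integers(0, 2, 40)   # 0 = normal, 1 = pathological (random labels)

# AdaBoost with an SVM base learner (sklearn >= 1.2 names it `estimator`)
clf = AdaBoostClassifier(estimator=SVC(probability=True, kernel="rbf"),
                         n_estimators=10).fit(X, y)
```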
McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr
2016-01-01
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of filesystem structure within Dataverse—which is essential for both in-place computation and supporting non-http data transfers. PMID:27862010
Omicseq: a web-based search engine for exploring omics datasets.
Sun, Xiaobo; Pittard, William S; Xu, Tianlei; Chen, Li; Zwick, Michael E; Jiang, Xiaoqian; Wang, Fusheng; Qin, Zhaohui S
2017-07-03
The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long-standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable, elastic NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Schmidt, B. Christian; Layberry, Ross A.
2016-01-01
Abstract The identity of Celastrina species in eastern Canada is reviewed based on larval host plants, phenology, adult phenotypes, mtDNA barcodes and re-assessment of published data. The status of the Cherry Gall Azure (Celastrina serotina Pavulaan & Wright) as a distinct species in Canada is not supported by any dataset, and this taxon is therefore removed from the Canadian fauna. Previous records of this taxon are re-identified as Celastrina lucia (Kirby) and Celastrina neglecta (Edwards). Evidence is presented that both Celastrina lucia and Celastrina neglecta have a second, summer-flying generation in parts of Canada. The summer generation of Celastrina lucia has previously been misidentified as Celastrina neglecta, which differs in phenology, adult phenotype and larval hosts from summer Celastrina lucia. DNA barcodes are highly conserved among at least three North American Celastrina species, and provide no taxonomic information. Celastrina neglecta has a Canadian distribution restricted to southern Ontario, Manitoba, Saskatchewan and easternmost Alberta. The discovery of museum specimens of Celastrina ladon (Cramer) from southernmost Ontario represents a new species for the Canadian butterfly fauna, which is in need of conservation status assessment. PMID:27199600
Historian: accurate reconstruction of ancestral sequences and evolutionary rates.
Holmes, Ian H
2017-04-15
Reconstruction of ancestral sequence histories, and estimation of parameters like indel rates, are improved by using explicit evolutionary models and summing over uncertain alignments. The previous best tool for this purpose (according to simulation benchmarks) was ProtPal, but this tool was too slow for practical use. Historian combines an efficient reimplementation of the ProtPal algorithm with performance-improving heuristics from other alignment tools. Simulation results on fidelity of rate estimation via ancestral reconstruction, along with evaluations on the structurally informed alignment dataset BAliBase 3.0, recommend Historian over other alignment tools for evolutionary applications. Historian is available at https://github.com/evoldoers/historian under the Creative Commons Attribution 3.0 US license. ihholmes+historian@gmail.com. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Varret, C; Beronius, A; Bodin, L; Bokkers, B G H; Boon, P E; Burger, M; De Wit-Bos, L; Fischer, A; Hanberg, A; Litens-Karlsson, S; Slob, W; Wolterink, G; Zilliacus, J; Beausoleil, C; Rousselle, C
2018-01-15
This study aims to evaluate the evidence for the existence of non-monotonic dose-responses (NMDRs) of substances in the area of food safety. This review was performed following the systematic review methodology with the aim to identify in vivo studies published between January 2002 and February 2015 containing evidence for potential NMDRs. Inclusion and reliability criteria were defined and used to select relevant and reliable studies. A set of six checkpoints was developed to establish the likelihood that the data retrieved contained evidence for NMDR. In this review, 49 in vivo studies were identified as relevant and reliable, of which 42 were used for dose-response analysis. These studies contained 179 in vivo dose-response datasets with at least five dose groups (and a control group), as fewer doses cannot provide evidence for NMDR. These datasets were extracted and analyzed using the PROAST software package. The resulting dose-response relationships were evaluated for possible evidence of NMDRs by applying the six checkpoints. In total, 10 out of the 179 in vivo datasets fulfilled all six checkpoints. While these datasets could be considered as providing evidence for NMDR, replicated studies would still be needed to check whether the results can be reproduced, to rule out that the non-monotonicity was caused by incidental anomalies in that specific study. This approach, combining a systematic review with a set of checkpoints, is new and appears useful for future evaluations of dose-response datasets regarding evidence of non-monotonicity. Published by Elsevier Inc.
Stature estimation equations for South Asian skeletons based on DXA scans of contemporary adults.
Pomeroy, Emma; Mushrif-Tripathy, Veena; Wells, Jonathan C K; Kulkarni, Bharati; Kinra, Sanjay; Stock, Jay T
2018-05-03
Stature estimation from the skeleton is a classic anthropological problem, and recent years have seen the proliferation of population-specific regression equations. Many rely on the anatomical reconstruction of stature from archaeological skeletons to derive regression equations based on long bone lengths, but this requires a collection with very good preservation. In some regions, for example, South Asia, typical environmental conditions preclude the sufficient preservation of skeletal remains. Large-scale epidemiological studies that include medical imaging of the skeleton by techniques such as dual-energy X-ray absorptiometry (DXA) offer new potential datasets for developing such equations. We derived estimation equations based on known height and bone lengths measured from DXA scans from the Andhra Pradesh Children and Parents Study (Hyderabad, India). Given debates on the most appropriate regression model to use, multiple methods were compared, and the performance of the equations was tested on a published skeletal dataset of individuals with known stature. The equations have standard errors of estimates and prediction errors similar to those derived using anatomical reconstruction or from cadaveric datasets. As measured by the number of significant differences between true and estimated stature, and the prediction errors, the new equations perform as well as, and generally better than, published equations commonly used on South Asian skeletons or based on Indian cadaveric datasets. This study demonstrates the utility of DXA scans as a data source for developing stature estimation equations and offers a new set of equations for use with South Asian datasets. © 2018 Wiley Periodicals, Inc.
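A minimal sketch of deriving one such regression equation, assuming hypothetical femur lengths measured from DXA scans; the standard error of estimate (SEE) is the accuracy figure by which such equations are typically judged:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical training data: DXA-measured femur length (cm), known stature (cm)
femur   = np.array([41.2, 44.5, 46.1, 43.0, 47.8, 45.3, 42.6, 48.9])
stature = np.array([152.0, 161.5, 166.2, 157.1, 171.0, 164.0, 155.8, 174.3])

model = LinearRegression().fit(femur.reshape(-1, 1), stature)
pred = model.predict(femur.reshape(-1, 1))
# standard error of estimate, with n - 2 degrees of freedom for simple regression
see = np.sqrt(((stature - pred) ** 2).sum() / (len(stature) - 2))
print(f"stature = {model.coef_[0]:.2f} * femur + {model.intercept_:.1f} (SEE {see:.2f} cm)")
```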
Evolving Deep Networks Using HPC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Young, Steven R.; Rose, Derek C.; Johnston, Travis
While a large number of deep learning networks have been studied and published that produce outstanding results on natural image datasets, these datasets only make up a fraction of those to which deep learning can be applied. These datasets include text data, audio data, and arrays of sensors that have very different characteristics than natural images. As these “best” networks for natural images have been largely discovered through experimentation and cannot be proven optimal on some theoretical basis, there is no reason to believe that they are the optimal network for these drastically different datasets. Hyperparameter search is thus often a very important process when applying deep learning to a new problem. In this work we present an evolutionary approach to searching the possible space of network hyperparameters and construction that can scale to 18,000 nodes. This approach is applied to datasets of varying types and characteristics where we demonstrate the ability to rapidly find best hyperparameters in order to enable practitioners to quickly iterate between idea and result.
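A toy sketch of the evolutionary hyperparameter search idea (population, truncation selection, mutation); the search space and stand-in fitness function are illustrative, and in the paper's setting each fitness evaluation would be a full network training farmed out to an HPC node:

```python
import random
random.seed(0)

SPACE = {"layers": range(2, 9), "width": (32, 64, 128, 256),
         "lr": (1e-4, 3e-4, 1e-3, 3e-3)}

def fitness(ind):
    """Stand-in for validation accuracy after training the encoded network;
    on an HPC system each call would run on a separate node."""
    return -abs(ind["layers"] - 5) - abs(ind["lr"] - 1e-3) * 1000

def mutate(ind):
    child = dict(ind)
    key = random.choice(list(SPACE))
    child[key] = random.choice(list(SPACE[key]))
    return child

pop = [{k: random.choice(list(v)) for k, v in SPACE.items()} for _ in range(20)]
for gen in range(15):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]                                   # truncation selection
    pop = parents + [mutate(random.choice(parents)) for _ in range(15)]
print(max(pop, key=fitness))
```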
A collection of Australian Drosophila datasets on climate adaptation and species distributions.
Hangartner, Sandra B; Hoffmann, Ary A; Smith, Ailie; Griffin, Philippa C
2015-11-24
The Australian Drosophila Ecology and Evolution Resource (ADEER) collates Australian datasets on drosophilid flies, which are aimed at investigating questions around climate adaptation, species distribution limits and population genetics. Australian drosophilid species are diverse in climatic tolerance, geographic distribution and behaviour. Many species are restricted to the tropics, a few are temperate specialists, and some have broad distributions across climatic regions. Whereas some species show adaptability to climate changes through genetic and plastic changes, other species have limited adaptive capacity. This knowledge has been used to identify traits and genetic polymorphisms involved in climate change adaptation and build predictive models of responses to climate change. ADEER brings together 103 datasets from 39 studies published between 1982 and 2013 in a single online resource. All datasets can be downloaded freely in full, along with maps and other visualisations. These historical datasets are preserved for future studies, which will be especially useful for assessing climate-related changes over time.
Antarctic ice shelf thickness from CryoSat-2 radar altimetry
NASA Astrophysics Data System (ADS)
Chuter, Stephen; Bamber, Jonathan
2016-04-01
The Antarctic ice shelves provide buttressing to the inland grounded ice sheet, and therefore play a controlling role in regulating ice dynamics and mass imbalance. Accurate knowledge of ice shelf thickness is essential for input-output method mass balance calculations, sub-ice shelf ocean models and buttressing parameterisations in ice sheet models. Ice shelf thickness has previously been inferred from satellite altimetry elevation measurements using the assumption of hydrostatic equilibrium, as direct measurements of ice thickness do not provide the spatial coverage necessary for these applications. The sensor limitations of previous radar altimeters have led to poor data coverage and a lack of accuracy, particularly in the grounding zone, where a break in slope exists. We present a new ice shelf thickness dataset using four years (2011-2014) of CryoSat-2 elevation measurements, with its dual-antenna SARIn mode of operation alleviating the issues affecting previous sensors. These improvements and the dense across-track spacing of the satellite have resulted in ~92% coverage of the ice shelves, with substantial improvements, for example, of over 50% across the Venable and Totten Ice Shelves in comparison to the previous dataset. Significant improvements in coverage and accuracy are also seen south of 81.5° for the Ross and Filchner-Ronne Ice Shelves. Validation of the surface elevation measurements, used to derive ice thickness, against NASA ICESat laser altimetry data shows a mean bias of less than 1 m (equivalent to less than 9 m in ice thickness) and a fourfold decrease in standard deviation in comparison to the previous continental dataset. Importantly, the most substantial improvements are found in the grounding zone. Validation of the derived thickness data has been carried out using multiple Radio Echo Sounding (RES) campaigns across the continent. Over the Amery Ice Shelf, where extensive RES measurements exist, the mean difference between the datasets is 3.3% and 4.7% across the whole shelf and within 10 km of the grounding line, respectively. These figures represent a two- to threefold improvement in accuracy when compared to the previous data product. The impact of these improvements on input-output estimates of mass balance is illustrated for the Abbot Ice Shelf. Our new product shows a mean reduction of 29% in thickness at the grounding line when compared to the previous dataset, as well as the elimination of non-physical 'data spikes' that were prevalent in the previous product in areas of complex terrain. The reduction in grounding line thickness equates to a change in mass balance for the area from -14±9 Gt yr-1 to -4±9 Gt yr-1, an updated estimate that is more consistent with the positive surface elevation rate in this region obtained from satellite altimetry. We also show examples from other sectors, including the Getz and George VI Ice Shelves. The new thickness dataset will greatly reduce the uncertainty in input-output estimates of mass balance for the ~30% of the grounding line of Antarctica where direct ice thickness measurements do not exist.
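For context, the hydrostatic inversion mentioned above converts surface elevation to thickness roughly as follows; the densities and firn-air correction here are typical assumed values, not those used in the study:

```python
RHO_W = 1028.0   # sea-water density, kg m^-3 (assumed)
RHO_I = 917.0    # meteoric ice density, kg m^-3 (assumed)

def shelf_thickness(elevation_asl, firn_air=15.0):
    """Ice-shelf thickness from surface elevation above sea level (m),
    assuming hydrostatic equilibrium:
        H = (h - d_firn) * rho_w / (rho_w - rho_i)
    where d_firn is the firn air content expressed as an elevation correction."""
    return (elevation_asl - firn_air) * RHO_W / (RHO_W - RHO_I)

# the multiplier rho_w / (rho_w - rho_i) is ~9.3, which is why a <1 m
# elevation bias maps to <9 m of thickness, as quoted in the abstract
print(shelf_thickness(60.0))   # ~417 m
```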
Baldewijns, Greet; Debard, Glen; Mertes, Gert; Vanrumste, Bart; Croonenborghs, Tom
2016-03-01
Fall incidents are an important health hazard for older adults. Automatic fall detection systems can reduce the consequences of a fall incident by ensuring that timely aid is given. The development of these systems is therefore receiving a lot of research attention. Real-life data that could help evaluate the results of this research are, however, sparse. Moreover, research groups that have this type of data are not at liberty to share it. Most research groups thus use simulated datasets. These simulation datasets, however, often do not incorporate the challenges the fall detection system will face when implemented in real life. In this Letter, a more realistic simulation dataset is presented to fill this gap between real-life data and currently available datasets. It was recorded while re-enacting real-life falls recorded during previous studies, and it incorporates the challenges faced by fall detection algorithms in real life. A fall detection algorithm from Debard et al. was evaluated on this dataset. This evaluation showed that the dataset poses extra challenges compared with other publicly available datasets. In this Letter, the dataset is discussed, as well as the results of this preliminary evaluation of the fall detection algorithm. The dataset can be downloaded from www.kuleuven.be/advise/datasets.
Orlek, Alex; Phan, Hang; Sheppard, Anna E; Doumith, Michel; Ellington, Matthew; Peto, Tim; Crook, Derrick; Walker, A Sarah; Woodford, Neil; Anjum, Muna F; Stoesser, Nicole
2017-05-01
Plasmid typing can provide insights into the epidemiology and transmission of plasmid-mediated antibiotic resistance. The principal plasmid typing schemes are replicon typing and MOB typing, which utilize variation in replication loci and relaxase proteins respectively. Previous studies investigating the proportion of plasmids assigned a type by these schemes ('typeability') have yielded conflicting results; moreover, thousands of plasmid sequences have been added to NCBI in recent years, without consistent annotation to indicate which sequences represent complete plasmids. Here, a curated dataset of complete Enterobacteriaceae plasmids from NCBI was compiled, and used to assess the typeability and concordance of in silico replicon and MOB typing schemes. Concordance was assessed at hierarchical replicon type resolutions, from replicon family-level to plasmid multilocus sequence type (pMLST)-level, where available. We found that 85% and 65% of the curated plasmids could be replicon and MOB typed, respectively. Overall, plasmid size and the number of resistance genes were significant independent predictors of replicon and MOB typing success. We found some degree of non-concordance between replicon families and MOB types, which was only partly resolved when partitioning plasmids into finer-resolution groups (replicon and pMLST types). In some cases, non-concordance was attributed to ambiguous boundaries between MOBP and MOBQ types; in other cases, backbone mosaicism was considered a more plausible explanation. β-lactamase resistance genes tended not to show fidelity to a particular plasmid type, though some previously reported associations were supported. Overall, replicon and MOB typing schemes are likely to continue playing an important role in plasmid analysis, but their performance is constrained by the diverse and dynamic nature of plasmid genomes. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
DTMiner: identification of potential disease targets through biomedical literature mining.
Xu, Dong; Zhang, Meizhuo; Xie, Yanping; Wang, Fan; Chen, Ming; Zhu, Kenny Q; Wei, Jia
2016-12-01
Biomedical researchers often search through massive catalogues of literature to look for potential relationships between genes and diseases. Given the rapid growth of biomedical literature, automatic relation extraction, a crucial technology in biomedical literature mining, has shown great potential to support research of gene-related diseases. Existing work in this field has produced datasets that are limited both in scale and accuracy. In this study, we propose a reliable and efficient framework that takes large biomedical literature repositories as inputs, identifies credible relationships between diseases and genes, and presents possible genes related to a given disease and possible diseases related to a given gene. The framework incorporates named entity recognition (NER), which identifies occurrences of genes and diseases in texts; association detection, whereby we extract and evaluate features from gene-disease pairs; and ranking algorithms that estimate how closely the pairs are related. The F1-score of the NER phase is 0.87, which is higher than existing studies. The association detection phase takes drastically less time than previous work while maintaining a comparable F1-score of 0.86. The end-to-end result achieves a 0.259 F1-score for the top 50 genes associated with a disease, which performs better than previous work. In addition, we released a web service for public use of the dataset. The implementation of the proposed algorithms is publicly available at http://gdr-web.rwebox.com/public_html/index.php?page=download.php, and the web service is available at http://gdr-web.rwebox.com/public_html/index.php. Contact: jenny.wei@astrazeneca.com or kzhu@cs.sjtu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Gallegos, Tanya J.; Varela, Brian A.
2015-01-01
Hydraulic fracturing is presently the primary stimulation technique for oil and gas production in low-permeability, unconventional reservoirs. Comprehensive, published, and publicly available information regarding the extent, location, and character of hydraulic fracturing in the United States is scarce. This national spatial and temporal analysis of data on nearly 1 million hydraulically fractured wells and 1.8 million fracturing treatment records from 1947 through 2010 (aggregated in Data Series 868) is used to identify hydraulic fracturing trends in drilling methods and use of proppants, treatment fluids, additives, and water in the United States. These trends are compared to the literature in an effort to establish a common understanding of the differences in drilling methods, treatment fluids, and chemical additives and of how the newer technology has affected the water use volumes and areal distribution of hydraulic fracturing. Historically, Texas has had the highest number of records of hydraulic fracturing treatments and associated wells in the United States documented in the datasets described herein. Water-intensive horizontal/directional drilling has also increased from 6 percent of new hydraulically fractured wells drilled in the United States in 2000 to 42 percent of new wells drilled in 2010. Increases in horizontal drilling also coincided with the emergence of water-based “slick water” fracturing fluids. As such, the most current hydraulic fracturing materials and methods are notably different from those used in previous decades and have contributed to the development of previously inaccessible unconventional oil and gas production target areas, namely in shale and tight-sand reservoirs. Publicly available derivative datasets and locations developed from these analyses are described.
Predicting Response to Histone Deacetylase Inhibitors Using High-Throughput Genomics.
Geeleher, Paul; Loboda, Andrey; Lenkala, Divya; Wang, Fan; LaCroix, Bonnie; Karovic, Sanja; Wang, Jacqueline; Nebozhyn, Michael; Chisamore, Michael; Hardwick, James; Maitland, Michael L; Huang, R Stephanie
2015-11-01
Many disparate biomarkers have been proposed as predictors of response to histone deacetylase inhibitors (HDI); however, all have failed when applied clinically. Rather than this being entirely an issue of reproducibility, response to the HDI vorinostat may be determined by the additive effect of multiple molecular factors, many of which have previously been demonstrated. We conducted a large-scale gene expression analysis using the Cancer Genome Project for discovery and generated another large independent cancer cell line dataset across different cancers for validation. We compared different approaches in terms of how accurately vorinostat response can be predicted on an independent out-of-batch set of samples and applied the polygenic marker prediction principles in a clinical trial. Using machine learning, the small effects that aggregate, resulting in sensitivity or resistance, can be recovered from gene expression data in a large panel of cancer cell lines. This approach can predict vorinostat response accurately, whereas single gene or pathway markers cannot. Our analyses recapitulated and contextualized many previous findings and suggest an important role for processes such as chromatin remodeling, autophagy, and apoptosis. As a proof of concept, we also discovered a novel causative role for CHD4, a helicase involved in the histone deacetylase complex, that is associated with poor clinical outcome. As a clinical validation, we demonstrated that a common dose-limiting toxicity of vorinostat, thrombocytopenia, can be predicted (r = 0.55, P = .004) several days before it is detected clinically. Our work suggests a paradigm shift from single-gene/pathway evaluation to simultaneously evaluating multiple independent high-throughput gene expression datasets, which can be easily extended to other investigational compounds where similar issues are hampering clinical adoption. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
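An illustrative sketch of polygenic response prediction from expression data, here with ridge regression on synthetic data (the study's actual pipeline and model are not reproduced); many small weights aggregate into an out-of-sample prediction:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_lines, n_genes = 300, 2000
X = rng.normal(size=(n_lines, n_genes))           # expression: cell lines x genes
w = rng.normal(size=n_genes) * (rng.random(n_genes) < 0.05)  # sparse small effects
y = X @ w + rng.normal(scale=2.0, size=n_lines)   # drug sensitivity, e.g. log IC50

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, y_tr)
r = np.corrcoef(model.predict(X_te), y_te)[0, 1]
print(f"out-of-sample correlation r = {r:.2f}")
```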
NASA Astrophysics Data System (ADS)
Reitz, M. D.; Sanford, W. E.; Senay, G. B.; Cazenas, J.
2015-12-01
Evapotranspiration (ET) is a key quantity in the hydrologic cycle, accounting for ~70% of precipitation across the contiguous United States (CONUS). However, it is a challenge to estimate, due to difficulty in making direct measurements and gaps in our theoretical understanding. Here we present a new data-driven, ~1-km2 resolution map of long-term average actual evapotranspiration rates across the CONUS. The new ET map is a function of the USGS Landsat-derived National Land Cover Database (NLCD), precipitation, temperature, and daily average temperature range (from the PRISM climate dataset), and is calibrated to long-term water balance data from 679 watersheds. It is unique from previously presented ET maps in that (1) it was co-developed with estimates of runoff and recharge; (2) the regression equation was chosen from among many tested, previously published and newly proposed functional forms for its optimal description of long-term water balance ET data; (3) it has values over open-water areas that are derived from separate mass-transfer and humidity equations; and (4) the data include additional precipitation representing amounts converted from 2005 USGS water-use census irrigation data. The regression equation is calibrated using data from 2000-2013, but can also be applied to individual years with their corresponding input datasets. Comparisons among this new map, the more detailed remote-sensing-based estimates of MOD16 and SSEBop, and AmeriFlux ET tower measurements show encouraging consistency, and indicate that the empirical ET estimate approach presented here produces closer agreement with independent flux tower data for annual average actual ET than other more complex remote sensing approaches.
Curtis, Tobey H.; McCandless, Camilla T.; Carlson, John K.; Skomal, Gregory B.; Kohler, Nancy E.; Natanson, Lisa J.; Burgess, George H.; Hoey, John J.; Pratt, Harold L.
2014-01-01
Despite recent advances in field research on white sharks (Carcharodon carcharias) in several regions around the world, opportunistic capture and sighting records remain the primary source of information on this species in the northwest Atlantic Ocean (NWA). Previous studies using limited datasets have suggested a precipitous decline in the abundance of white sharks from this region, but considerable uncertainty in these studies warrants additional investigation. This study builds upon previously published data combined with recent unpublished records and presents a synthesis of 649 confirmed white shark records from the NWA compiled over a 210-year period (1800-2010), resulting in the largest white shark dataset yet compiled from this region. These comprehensive records were used to update our understanding of their seasonal distribution, relative abundance trends, habitat use, and fisheries interactions. All life stages were present in continental shelf waters year-round, but median latitude of white shark occurrence varied seasonally. White sharks primarily occurred between Massachusetts and New Jersey during summer and off Florida during winter, with broad distribution along the coast during spring and fall. The majority of fishing gear interactions occurred with rod and reel, longline, and gillnet gears. Historic abundance trends from multiple sources support a significant decline in white shark abundance in the 1970s and 1980s, but there have been apparent increases in abundance since the 1990s when a variety of conservation measures were implemented. Though the white shark's inherent vulnerability to exploitation warrants continued protections, our results suggest a more optimistic outlook for the recovery of this iconic predator in the Atlantic. PMID:24918579
Li, Yong-Xin; Zhong, Zheng; Hou, Peng; Zhang, Wei-Peng; Qian, Pei-Yuan
2018-03-07
In the version of this article originally published, the links and files for the Supplementary Information, including Supplementary Tables 1-5, Supplementary Figures 1-25, Supplementary Note, Supplementary Datasets 1-4 and the Life Sciences Reporting Summary, were missing in the HTML. The error has been corrected in the HTML version of this article.
Hierarchical Recognition Scheme for Human Facial Expression Recognition Systems
Siddiqi, Muhammad Hameed; Lee, Sungyoung; Lee, Young-Koo; Khan, Adil Mehmood; Truc, Phan Tran Ho
2013-01-01
Over the last decade, human facial expression recognition (FER) has emerged as an important research area. Several factors make FER a challenging research problem. These include varying light conditions in training and test images; the need for automatic and accurate face detection before feature extraction; and high similarity among different expressions that makes it difficult to distinguish these expressions with high accuracy. This work implements a hierarchical linear discriminant analysis-based facial expression recognition (HL-FER) system to tackle these problems. Unlike previous systems, the HL-FER uses a pre-processing step to eliminate light effects, incorporates a new automatic face detection scheme, employs methods to extract both global and local features, and utilizes a hierarchical recognition scheme to overcome the problem of high similarity among different expressions. Unlike most previous works that were evaluated using a single dataset, the performance of the HL-FER is assessed using three publicly available datasets under three different experimental settings: n-fold cross-validation based on subjects for each dataset separately; n-fold cross-validation across the datasets; and, finally, a set of experiments to assess the effectiveness of each module of the HL-FER separately. A weighted average recognition accuracy of 98.7% across the three datasets, using three classifiers, indicates the success of employing the HL-FER for human FER. PMID:24316568
Utility-preserving anonymization for health data publishing.
Lee, Hyukki; Kim, Soohyung; Kim, Jong Wook; Chung, Yon Dohn
2017-07-11
Publishing raw electronic health records (EHRs) may be considered a breach of the privacy of individuals because they usually contain sensitive information. A common practice for privacy-preserving data publishing is to anonymize the data before publishing so that privacy models such as k-anonymity are satisfied. Among various anonymization techniques, generalization is the most commonly used in medical/health data processing. Generalization inevitably causes information loss, and thus various methods have been proposed to reduce it. However, existing generalization-based data anonymization methods cannot avoid excessive information loss and fail to preserve data utility. We propose a utility-preserving anonymization for privacy-preserving data publishing (PPDP). To preserve data utility, the proposed method comprises three parts: (1) a utility-preserving model, (2) counterfeit record insertion, and (3) a catalog of the counterfeit records. We also propose an anonymization algorithm using the proposed method, which applies a full-domain generalization algorithm. We evaluate our method against an existing method on two aspects: information loss, measured through various quality metrics, and the error rate of analysis results. Across all quality metrics, the proposed method shows lower information loss than the existing method. In a real-world EHR analysis, the results show only a small error between data anonymized by the proposed method and the original data. Through experiments on various datasets, we show that the utility of EHRs anonymized by the proposed method is significantly better than that of data anonymized by previous approaches.
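A minimal illustration of full-domain generalization and a k-anonymity check on a toy table; the counterfeit-record insertion and catalog that distinguish the proposed method are not shown:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 37, 52, 55, 36, 53],
                   "zipcode": ["13053", "13068", "14850", "14853", "13062", "14851"],
                   "diagnosis": ["flu", "flu", "asthma", "flu", "cancer", "asthma"]})

# full-domain generalization: every value of a quasi-identifier is recoded
# at the same level of its hierarchy (age -> decade, zip -> 3-digit prefix)
anon = df.assign(age=(df.age // 10 * 10).astype(str) + "s",
                 zipcode=df.zipcode.str[:3] + "**")

# the release is k-anonymous for the smallest quasi-identifier group size
k = anon.groupby(["age", "zipcode"]).size().min()
print(anon)
print(f"this release is {k}-anonymous")
```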
Richard, Arianne C; Lyons, Paul A; Peters, James E; Biasci, Daniele; Flint, Shaun M; Lee, James C; McKinney, Eoin F; Siegel, Richard M; Smith, Kenneth G C
2014-08-04
Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study. Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this "gold-standard" comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues. Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.
Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets
NASA Astrophysics Data System (ADS)
Fitriah Rusland, Nurul; Wahid, Norfaradilla; Kasim, Shahreen; Hafit, Hanayanti
2017-08-01
E-mail spam continues to be a problem on the Internet. Spammed e-mail may contain many copies of the same message, commercial advertisements or other irrelevant posts like pornographic content. In previous research, different filtering techniques were used to detect these e-mails, such as Random Forest, Naïve Bayes, Support Vector Machine (SVM) and Neural Network classifiers. In this research, we test the Naïve Bayes algorithm for e-mail spam filtering on two datasets, Spam Data and SPAMBASE [8], and evaluate its performance. The performance on the datasets is evaluated based on accuracy, recall, precision and F-measure. We use the WEKA tool for the evaluation of the Naïve Bayes algorithm for e-mail spam filtering on both datasets. The results show that the type of e-mail and the number of instances in the dataset influence the performance of Naïve Bayes.
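A compact equivalent of the experiment described above, using scikit-learn in place of WEKA; the six toy e-mails stand in for the Spam Data and SPAMBASE datasets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

emails = ["cheap meds buy now", "meeting agenda attached", "win cash now",
          "lunch tomorrow?", "free offer click now", "project status report"]
labels = [1, 0, 1, 0, 1, 0]                      # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(emails)      # bag-of-words counts
clf = MultinomialNB().fit(X, labels)             # Naive Bayes spam filter

# the same metrics reported in the study: precision, recall, F-measure
pred = clf.predict(X)
prec, rec, f1, _ = precision_recall_fscore_support(labels, pred, average="binary")
print(f"precision={prec:.2f} recall={rec:.2f} F1={f1:.2f}")
```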
Prediction and analysis of beta-turns in proteins by support vector machine.
Pham, Tho Hoan; Satou, Kenji; Ho, Tu Bao
2003-01-01
The tight turn has long been recognized as one of the three important features of proteins, after the alpha-helix and beta-sheet. Tight turns play an important role in globular proteins from both the structural and functional points of view. More than 90% of tight turns are beta-turns. Analysis and prediction of beta-turns in particular, and tight turns in general, are very useful for the design of new molecules such as drugs, pesticides, and antigens. In this paper, we introduce a support vector machine (SVM) approach to the prediction and analysis of beta-turns. We have investigated two aspects of applying SVMs to the prediction and analysis of beta-turns. First, we developed a new SVM method, called BTSVM, which predicts the beta-turns of a protein from its sequence. The prediction results on a dataset of 426 non-homologous protein chains, using the sevenfold cross-validation technique, showed that our method is superior to previous ones. Second, we analyzed how amino acid positions support (or prevent) the formation of beta-turns based on the "multivariable" classification model of a linear SVM. This model is more general than those of previous statistical methods. Our analysis results are more comprehensive and easier to use than previously published analysis results.
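A toy sketch of the BTSVM idea: one-hot-encoded residue windows classified by a linear SVM, whose per-position weights support the kind of positional analysis described above; the windows and labels are fabricated examples, not the 426-chain dataset:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

AA = "ACDEFGHIKLMNPQRSTVWY"

def encode(window):
    """One-hot encoding of a residue window (the SVM input representation)."""
    x = np.zeros(len(window) * 20)
    for i, aa in enumerate(window):
        x[i * 20 + AA.index(aa)] = 1.0
    return x

# toy 4-residue windows, labelled 1 if the window forms a beta-turn
windows = ["NPGD", "SPGK", "LVIA", "KPGE", "VILM", "AGSD", "FWYV", "DPGN"]
y = np.array([1, 1, 0, 1, 0, 0, 0, 1])
X = np.vstack([encode(w) for w in windows])

svm = SVC(kernel="linear")                  # linear weights expose per-position
scores = cross_val_score(svm, X, y, cv=4)   # amino-acid preferences
print(scores.mean())
```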
Birth month affects lifetime disease risk: a phenome-wide method.
Boland, Mary Regina; Shahn, Zachary; Madigan, David; Hripcsak, George; Tatonetti, Nicholas P
2015-09-01
An individual's birth month has a significant impact on the diseases they develop during their lifetime. Previous studies reveal relationships between birth month and several diseases, including atherothrombosis, asthma, attention deficit hyperactivity disorder, and myopia, leaving most diseases completely unexplored. This retrospective population study systematically explores the relationship between seasonal effects at birth and lifetime disease risk for 1688 conditions. We developed a hypothesis-free method that minimizes publication and disease selection biases by systematically investigating disease-birth month patterns across all conditions. Our dataset includes 1 749 400 individuals with records at New York-Presbyterian/Columbia University Medical Center born between 1900 and 2000 inclusive. We modeled associations between birth month and 1688 diseases using logistic regression. Significance was tested using a chi-squared test with multiplicity correction. We found 55 diseases that were significantly dependent on birth month. Of these, 19 were previously reported in the literature (P < .001), 20 were for conditions closely related to those reported, and 16 were previously unreported. We found distinct incidence patterns across disease categories. Lifetime disease risk is affected by birth month. Seasonally dependent early developmental mechanisms may play a role in increasing lifetime risk of disease. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.
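A minimal sketch of the association test described above: logistic regression of disease status on birth-month indicators, a likelihood-ratio chi-squared test with 11 degrees of freedom, and Bonferroni correction across the 1688 conditions; the data here are simulated:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 20000
month = rng.integers(1, 13, n)                 # birth month 1..12
p = 0.05 * (1 + 0.2 * (month == 3))            # hypothetical March excess risk
disease = (rng.random(n) < p).astype(float)

# month-indicator model vs intercept-only model
X = np.column_stack([month == m for m in range(2, 13)]).astype(float)
full = sm.Logit(disease, sm.add_constant(X)).fit(disp=0)
null = sm.Logit(disease, np.ones((n, 1))).fit(disp=0)

lr = 2 * (full.llf - null.llf)                 # chi-squared statistic, 11 df
p_val = chi2.sf(lr, df=11)
print(p_val < 0.05 / 1688)                     # Bonferroni across 1688 conditions
```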
NASA Technical Reports Server (NTRS)
Bond, Barbara J.; Peterson, David L.
1999-01-01
This project was a collaborative effort by researchers at ARC, OSU and the University of Arizona. The goal was to use a dataset obtained from a previous study to "empirically validate a new canopy radiative-transfer model (SART) which incorporates a recently-developed leaf-level model (LEAFMOD)". The document includes a short research summary.
Reconstructing the outburst history of Eta Carinae from WFPC2 proper motions
NASA Astrophysics Data System (ADS)
Smith, Nathan
2011-10-01
The HST archive contains several epochs of WFPC2 images of the nebula around Eta Carinae taken over a 15-year timespan, although only the earliest few years of data have been analyzed and published. The fact that all these images were taken with the same instrument, with the same pixel sampling and field distortion, makes them an invaluable resource for accurately measuring the expanding ejecta. So far, analysis of a subset of the data (with only a few-year baseline) has shown that Eta Car's nebula was ejected around the time of the Great Eruption in the 1840s, but the full 15-yr dataset has much greater untapped potential. Historical data show multiple peaks in the light curve during the 1840s eruption, possibly the result of violent stellar collisions in the eccentric binary system. Proper motions with the full 15-yr dataset will definitively show if one of these is associated with the main mass ejection. Older material outside the main bipolar nebula traces previous major outbursts of the star with no recorded historical observations. We propose an ambitious reduction and analysis of the complete WFPC2 imaging dataset of Eta Car. These data can reconstruct its violent mass-loss history over the past several thousand years. This will constrain the behavior and timescale of eruptive mass loss in pre-SN evolution. The existence of several epochs over a long timespan will date older parts of the nebula that have not yet been measured, and can even measure the deceleration of the ejecta for the first time, essential for understanding their shaping and shock excitation during the nebula's continuing hydrodynamic evolution.
Miceli, Antonio; Duggan, Simon M J; Capoun, Radek; Romeo, Francesco; Caputo, Massimo; Angelini, Gianni D
2010-08-01
There is no accepted consensus on the definition of high-risk patients who may benefit from the use of an intraaortic balloon pump (IABP) in coronary artery bypass grafting (CABG). The aim of this study was to develop a risk model to identify high-risk patients and predict the need for IABP insertion during CABG. From April 1996 to December 2006, 8,872 consecutive patients underwent isolated CABG; of these, 182 patients (2.1%) received intraoperative or postoperative IABP. The scoring risk model was developed in 4,575 patients (derivation dataset) and validated on the remaining patients (validation dataset). Predictive accuracy was evaluated by the area under the receiver operating characteristic curve. Mortality was 1% in the entire cohort and 18.7% (22 patients) in the group that received IABP. Multivariable analysis showed that age greater than 70 years, moderate or poor left ventricular dysfunction, previous cardiac surgery, emergency operation, left main disease, Canadian Cardiovascular Society class 3-4, and recent myocardial infarction were independent risk factors for the need of IABP insertion. Three risk groups were identified. The observed probabilities of receiving IABP and of mortality in the validation dataset were 36.4% and 10% in the high-risk group (score >14), 10.9% and 2.8% in the medium-risk group (score 7 to 13), and 1.7% and 0.7% in the low-risk group (score 0 to 6). This simple clinical risk model, based on preoperative clinical data, can be used to identify high-risk patients who may benefit from elective insertion of IABP during CABG. Copyright 2010 The Society of Thoracic Surgeons. Published by Elsevier Inc. All rights reserved.
Hajat, Shakoor; Whitmore, Ceri; Sarran, Christophe; Haines, Andy; Golding, Brian; Gordon-Brown, Harriet; Kessel, Anthony; Fleming, Lora E
2017-01-01
Improved data linkages between diverse environment and health datasets have the potential to provide new insights into the health impacts of environmental exposures, including complex climate change processes. Initiatives that link and explore big data in the environment and health arenas are now being established. To encourage advances in this nascent field, this article documents the development of a web browser application to facilitate such future research, the challenges encountered to date, and how they were addressed. A 'storyboard approach' was used to aid the initial design and development of the application. The application followed a 3-tier architecture: a spatial database server for storing and querying data, server-side code for processing and running models, and client-side browser code for user interaction and for displaying data and results. The browser was validated by reproducing previously published results from a regression analysis of time-series datasets of daily mortality, air pollution and temperature in London. Data visualisation and analysis options of the application are presented. The main factors that shaped the development of the browser were: accessibility, open-source software, flexibility, efficiency, user-friendliness, licensing restrictions and data confidentiality, visualisation limitations, cost-effectiveness, and sustainability. Creating dedicated data and analysis resources, such as the one described here, will become an increasingly vital step in improving understanding of the complex interconnections between the environment and human health and wellbeing, whilst still ensuring appropriate confidentiality safeguards. The issues raised in this paper can inform the future development of similar tools by other researchers working in this field. Copyright © 2016 Elsevier B.V. All rights reserved.
Juraeva, Dilafruz; Haenisch, Britta; Zapatka, Marc; Frank, Josef; Witt, Stephanie H; Mühleisen, Thomas W; Treutlein, Jens; Strohmaier, Jana; Meier, Sandra; Degenhardt, Franziska; Giegling, Ina; Ripke, Stephan; Leber, Markus; Lange, Christoph; Schulze, Thomas G; Mössner, Rainald; Nenadic, Igor; Sauer, Heinrich; Rujescu, Dan; Maier, Wolfgang; Børglum, Anders; Ophoff, Roel; Cichon, Sven; Nöthen, Markus M; Rietschel, Marcella; Mattheisen, Manuel; Brors, Benedikt
2014-06-01
In the present study, an integrated hierarchical approach was applied to: (1) identify pathways associated with susceptibility to schizophrenia; (2) detect genes that may be potentially affected in these pathways since they contain an associated polymorphism; and (3) annotate the functional consequences of such single-nucleotide polymorphisms (SNPs) in the affected genes or their regulatory regions. The Global Test was applied to detect schizophrenia-associated pathways using discovery and replication datasets comprising 5,040 and 5,082 individuals of European ancestry, respectively. Information concerning functional gene-sets was retrieved from the Kyoto Encyclopedia of Genes and Genomes, Gene Ontology, and the Molecular Signatures Database. Fourteen of the gene-sets or pathways identified in the discovery dataset were confirmed in the replication dataset. These include functional processes involved in transcriptional regulation and gene expression, synapse organization, cell adhesion, and apoptosis. For two genes, i.e. CTCF and CACNB2, evidence for association with schizophrenia was available (at the gene-level) in both the discovery study and published data from the Psychiatric Genomics Consortium schizophrenia study. Furthermore, these genes mapped to four of the 14 presently identified pathways. Several of the SNPs assigned to CTCF and CACNB2 have potential functional consequences, and a gene in close proximity to CACNB2, i.e. ARL5B, was identified as a potential gene of interest. Application of the present hierarchical approach thus allowed: (1) identification of novel biological gene-sets or pathways with potential involvement in the etiology of schizophrenia, as well as replication of these findings in an independent cohort; (2) detection of genes of interest for future follow-up studies; and (3) the highlighting of novel genes in previously reported candidate regions for schizophrenia.
Liu, Yaoming; Cohen, Mark E; Hall, Bruce L; Ko, Clifford Y; Bilimoria, Karl Y
2016-08-01
The American College of Surgeons (ACS) NSQIP Surgical Risk Calculator has been widely adopted as a decision aid and informed consent tool by surgeons and patients. Previous evaluations showed excellent discrimination and combined discrimination and calibration, but model calibration alone, and the potential benefits of recalibration, were not explored. Because lack of calibration can lead to systematic errors in assessing surgical risk, our objective was to assess calibration and determine whether spline-based adjustments could improve it. We evaluated Surgical Risk Calculator model calibration, as well as discrimination, for each of 11 outcomes modeled from nearly 3 million patients (2010 to 2014). Using independent random subsets of data, we evaluated model performance for the Development (60% of records), Validation (20%), and Test (20%) datasets, where prediction equations from the Development dataset were recalibrated using restricted cubic splines estimated from the Validation dataset. We also evaluated performance on data subsets composed of higher-risk operations. The nonrecalibrated Surgical Risk Calculator performed well, but there was a slight tendency for predicted risk to be overestimated for lowest- and highest-risk patients and underestimated for moderate-risk patients. After recalibration, this distortion was eliminated, and p values for miscalibration were most often nonsignificant. Calibration was also excellent for subsets of higher-risk operations, though observed calibration was reduced due to instability associated with smaller sample sizes. Performance of NSQIP Surgical Risk Calculator models was shown to be excellent and improved with recalibration. Surgeons and patients can rely on the calculator to provide accurate estimates of surgical risk. Copyright © 2016 American College of Surgeons. Published by Elsevier Inc. All rights reserved.
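A sketch of spline-based recalibration under stated assumptions: outcomes from a validation set are regressed on a restricted (natural) cubic spline of the model's predicted logit, and the fit is applied to new predictions; the data are simulated, and the calculator's actual models are not reproduced:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000

# validation set: model-predicted risks p_hat and observed 0/1 outcomes,
# with a mild distortion at the extremes (as described in the abstract)
p_hat = np.clip(rng.beta(1.2, 8.0, n), 1e-4, 1 - 1e-4)
logit = np.log(p_hat / (1 - p_hat))
true_p = 1 / (1 + np.exp(-(0.85 * logit - 0.3)))   # miscalibrated truth
val = pd.DataFrame({"y": (rng.random(n) < true_p).astype(int), "logit": logit})

# recalibration: regress outcomes on a natural cubic spline of the logit
# (patsy's cr() provides the restricted cubic spline basis)
recal = smf.glm("y ~ cr(logit, df=4)", data=val,
                family=sm.families.Binomial()).fit()

# apply the recalibration to new (test-set) predicted risks
p_new = np.array([0.01, 0.1, 0.5])
test = pd.DataFrame({"logit": np.log(p_new / (1 - p_new))})
print(recal.predict(test))
```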
Albert, Loren P; Keenan, Trevor F; Burns, Sean P; Huxman, Travis E; Monson, Russell K
2017-05-01
Eddy covariance (EC) datasets have provided insight into climate determinants of net ecosystem productivity (NEP) and evapotranspiration (ET) in natural ecosystems for decades, but most EC studies were published in serial fashion such that one study's result became the following study's hypothesis. This approach reflects the hypothetico-deductive process by focusing on previously derived hypotheses. A synthesis of this type of sequential inference reiterates subjective biases and may amplify past assumptions about the role, and relative importance, of controls over ecosystem metabolism. Long-term EC datasets facilitate an alternative approach to synthesis: the use of inductive data-based analyses to re-examine past deductive studies of the same ecosystem. Here we examined the seasonal climate determinants of NEP and ET by analyzing a 15-year EC time-series from a subalpine forest using an ensemble of Artificial Neural Networks (ANNs) at the half-day (daytime/nighttime) time-step. We extracted relative rankings of climate drivers and driver-response relationships directly from the dataset with minimal a priori assumptions. The ANN analysis revealed temperature variables as primary climate drivers of NEP and daytime ET, when all seasons are considered, consistent with the assembly of past studies. New relations uncovered by the ANN approach include the role of soil moisture in driving daytime NEP during the snowmelt period, the nonlinear response of NEP to temperature across seasons, and the low relevance of summer rainfall for NEP or ET at the same daytime/nighttime time step. These new results offer a more complete perspective of climate-ecosystem interactions at this site than traditional deductive analyses alone.
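A small illustration of the inductive, data-first approach described above, assuming scikit-learn's MLPRegressor and permutation importance as stand-ins for the paper's ANN ensemble and driver-ranking procedure; the drivers and the NEP response are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
drivers = {"Tair": rng.normal(5, 8, n), "soil_moist": rng.random(n),
           "PPFD": rng.random(n) * 1500, "VPD": rng.random(n) * 2}
X = np.column_stack(list(drivers.values()))
# hypothetical half-daily NEP with a nonlinear temperature response
y = (3 * np.tanh(drivers["Tair"] / 10) + 2 * drivers["soil_moist"]
     + 0.2 * rng.normal(size=n))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# small ensemble of ANNs differing only in random initialization
ens = [MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=s).fit(X_tr, y_tr) for s in range(5)]
# rank climate drivers by mean permutation importance across the ensemble
imp = np.mean([permutation_importance(m, X_te, y_te,
                                      random_state=0).importances_mean
               for m in ens], axis=0)
for name, v in sorted(zip(drivers, imp), key=lambda t: -t[1]):
    print(name, round(v, 3))
```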
Huh, Yong; Yu, Kiyun; Park, Woojin
2016-01-01
This paper proposes a method to detect corresponding vertex pairs between planar tessellation datasets. Applying agglomerative hierarchical co-clustering, the method finds geometrically corresponding cell-set pairs, from which corresponding vertex pairs are detected. The map transformation is then performed with the vertex pairs. Since these pairs are detected independently for each corresponding cell-set pair, the method presents improved matching performance regardless of locally uneven positional discrepancies between datasets. The proposed method was applied to complicated synthetic cell datasets representing a cadastral map and a topographical map, and showed an improved result, with an F-measure of 0.84 compared with 0.48 for a previous matching method.
McKinney, Bill; Meyer, Peter A; Crosas, Mercè; Sliz, Piotr
2017-01-01
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension: functionality supporting preservation of file system structure within Dataverse, which is essential both for in-place computation and for supporting non-HTTP data transfers. © 2016 New York Academy of Sciences.
NASA Technical Reports Server (NTRS)
Mocko, David M.; Rui, Hualan; Acker, James G.
2013-01-01
The North American Land Data Assimilation System (NLDAS) is a collaboration project between NASA/GSFC, NOAA, Princeton Univ., and the Univ. of Washington. NLDAS has created a surface meteorology dataset using the best-available observations and reanalyses; the backbone of this dataset is a gridded precipitation analysis from rain gauges. This dataset is used to drive four separate land-surface models (LSMs) to produce datasets of soil moisture, snow, runoff, and surface fluxes. NLDAS datasets are available hourly and extend from Jan 1979 to near real-time with a typical 4-day lag. The datasets are available at 1/8th-degree resolution over CONUS and portions of Canada and Mexico from 25 to 53 degrees North. The datasets have been extensively evaluated against observations and are also used as part of a drought monitor. NLDAS datasets are available from the NASA GES DISC and can be accessed via ftp, GDS, Mirador, and Giovanni. GES DISC news articles were published showing figures from the heat wave of 2011, Hurricane Irene, Tropical Storm Lee, and the low-snow winter of 2011-2012. For this presentation, Giovanni-generated figures using NLDAS data from the derecho across the U.S. Midwest and Mid-Atlantic will be presented. Similar figures will also be presented from the landfall of Hurricane Isaac and the before-and-after drought conditions along the path of tropical moisture into the central United States. Updates on future products and datasets from the NLDAS project will also be introduced.
NASA Astrophysics Data System (ADS)
Willis, D. M.; Coffey, H. E.; Henwood, R.; Erwin, E. H.; Hoyt, D. V.; Wild, M. N.; Denig, W. F.
2013-11-01
The measurements of sunspot positions and areas that were published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), as the Greenwich Photo-heliographic Results (GPR), 1874-1976, exist in both printed and digital forms. These printed and digital sunspot datasets have been archived in various libraries and data centres. Unfortunately, however, typographic, systematic and isolated errors can be found in the various datasets. The purpose of the present paper is to begin the task of identifying and correcting these errors. In particular, the intention is to provide in one foundational paper all the necessary background information on the original solar observations, their various applications in scientific research, the format of the different digital datasets, the necessary definitions of the quantities measured, and the initial identification of errors in both the printed publications and the digital datasets. Two companion papers address the question of specific identifiable errors; namely, typographic errors in the printed publications, and both isolated and systematic errors in the digital datasets. The existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the programme of solar observations supported for more than a century by the Royal Observatory, Greenwich, and the Royal Greenwich Observatory. This improved dataset should be of value in many future scientific investigations.
Arend, Daniel; Lange, Matthias; Pape, Jean-Michel; Weigelt-Fischer, Kathleen; Arana-Ceballos, Fernando; Mücke, Ingo; Klukas, Christian; Altmann, Thomas; Scholz, Uwe; Junker, Astrid
2016-01-01
With the implementation of novel automated, high-throughput methods and facilities in recent years, plant phenomics has developed into a highly interdisciplinary research domain integrating biology, engineering, and bioinformatics. Here we present a dataset from a non-invasive high-throughput plant phenotyping experiment, which uses image- and image-analysis-based approaches to monitor the growth and development of 484 Arabidopsis thaliana plants (thale cress). The result is a comprehensive dataset of images and extracted phenotypical features. Such datasets require detailed documentation, standardized description of experimental metadata, and sustainable data storage and publication in order to ensure the reproducibility of experiments, data reuse, and comparability within the scientific community. The dataset presented here has therefore been annotated using the standardized ISA-Tab format, following the recently published recommendations for the semantic description of plant phenotyping experiments. PMID:27529152
Madrigal, Pedro
2017-03-01
Computational evaluation of variability across DNA or RNA sequencing datasets is a crucial step in genomic science, as it allows researchers both to evaluate the reproducibility of biological or technical replicates and to compare different datasets to identify potential correlations. Here we present fCCAC, an application of functional canonical correlation analysis to assess covariance of nucleic acid sequencing datasets such as chromatin immunoprecipitation followed by deep sequencing (ChIP-seq). We show how this method differs from other measures of correlation, and exemplify how it can reveal shared covariance between histone modifications and DNA-binding proteins, such as the relationship between the H3K4me3 chromatin mark and its epigenetic writers and readers. An R/Bioconductor package is available at http://bioconductor.org/packages/fCCAC/. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
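To convey the underlying idea without reproducing the package's functional variant, here is a minimal Python sketch of canonical correlation between two genome-wide signals summarized in bins; the signals and bin counts are synthetic assumptions:

```python
# Minimal sketch: canonical correlation between two binned coverage profiles
# that share a latent component, standing in for two ChIP-seq experiments.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(7)
n_regions = 500
shared = rng.normal(size=(n_regions, 1))            # latent shared signal
chip_a = shared @ rng.normal(size=(1, 10)) + 0.3 * rng.normal(size=(n_regions, 10))
chip_b = shared @ rng.normal(size=(1, 10)) + 0.3 * rng.normal(size=(n_regions, 10))

cca = CCA(n_components=2).fit(chip_a, chip_b)
u, v = cca.transform(chip_a, chip_b)
r = [np.corrcoef(u[:, k], v[:, k])[0, 1] for k in range(2)]
print("canonical correlations:", np.round(r, 2))    # first component should be high
```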
He, Zilong; Zhang, Huangkai; Gao, Shenghan; Lercher, Martin J; Chen, Wei-Hua; Hu, Songnian
2016-07-08
Evolview is an online visualization and management tool for customized and annotated phylogenetic trees. It allows users to visualize phylogenetic trees in various formats, customize the trees through built-in functions and user-supplied datasets, and export the customization results to publication-ready figures. Its 'dataset system' contains not only the data to be visualized on the tree, but also 'modifiers' that control various aspects of the graphical annotation. Evolview is a single-page application (like Gmail); its carefully designed interface allows users to upload, visualize, manipulate, and manage trees and datasets all in a single webpage. Developments since the last public release include a modern dataset editor with keyword-highlighting functionality, seven newly added types of annotation datasets, collaboration support that allows users to share their trees and datasets, and various improvements to the web interface and performance. In addition, we included eleven new 'Demo' trees to demonstrate the basic functionalities of Evolview, and five new 'Showcase' trees inspired by publications to showcase the power of Evolview in producing publication-ready figures. Evolview is freely available at: http://www.evolgenius.info/evolview/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Historical glacier outlines from digitized topographic maps of the Swiss Alps
NASA Astrophysics Data System (ADS)
Freudiger, Daphné; Mennekes, David; Seibert, Jan; Weiler, Markus
2018-04-01
Since the end of the Little Ice Age around 1850, the total glacier area of the central European Alps has considerably decreased. In order to understand the changes in glacier coverage at various scales and to model past and future streamflow accurately, long-term and large-scale datasets of glacier outlines are needed. To fill the gap between the morphologically reconstructed glacier outlines from the moraine extent corresponding to the time period around 1850 and the first complete dataset of glacier areas in the Swiss Alps from aerial photographs in 1973, glacier areas from 80 sheets of a historical topographic map (the Siegfried map) were manually digitized for the publication years 1878-1918 (further called first period, with most sheets being published around 1900) and 1917-1944 (further called second period, with most sheets being published around 1935). The accuracy of the digitized glacier areas was then assessed through a two-step validation process: the data were (1) visually and (2) quantitatively compared to glacier area datasets of the years 1850, 1973, 2003, and 2010, which were derived from different sources, at the large scale, basin scale, and locally. The validation showed that at least 70 % of the digitized glaciers were comparable to the outlines from the other datasets and were therefore plausible. Furthermore, the inaccuracy of the manual digitization was found to be less than 5 %. The presented datasets of glacier outlines for the first and second periods are a valuable source of information for long-term glacier mass balance or hydrological modelling in glacierized basins. The uncertainty of the historical topographic maps should be considered during the interpretation of the results. The datasets can be downloaded from the FreiDok plus data repository (https://freidok.uni-freiburg.de/data/15008, https://doi.org/10.6094/UNIFR/15008).
ERIC Educational Resources Information Center
Horta, Hugo; Santos, João M.
2016-01-01
This study analyzes the impact that publishing during the period of PhD study has on researchers' future knowledge production, impact, and co-authorship. The analysis is based on a representative sample of PhDs from all fields of science working in Portugal. For each researcher in the dataset, we compiled a lifetime publication record and…
Enabling Linked Science in Global Climate Uncertainty Quantification (UQ) Research
NASA Astrophysics Data System (ADS)
Elsethagen, T.; Stephan, E.; Lin, G.; Williams, D.; Banks, E.
2012-12-01
This paper shares a real-world global climate UQ science use case and illustrates how a linked science application called Provenance Environment (ProvEn), currently being developed, enables and facilitates scientific teams to publish, share, link, and discover new links over their UQ research results. UQ results include terascale datasets that are published to an Earth Systems Grid Federation (ESGF) repository. ProvEn demonstrates how a scientific team conducting UQ studies can discover dataset links using its domain knowledgebase, allowing them to better understand the UQ study research objectives, the experimental protocol used, the resulting dataset lineage, related analytical findings, ancillary literature citations, and the social network of scientists associated with the study. This research claims that this linked science approach will not only help scientists understand a particular dataset within a knowledge context, but will also benefit them through the cross-referencing of knowledge among the numerous UQ studies stored in ESGF. ProvEn collects native forms of data provenance resources as the UQ study is carried out. The native data provenance resources can be collected from a variety of sources, such as scripts, workflow engine logs, simulation log files, and scientific team members. Schema alignment is used to translate the native forms of provenance into a set of W3C PROV-O semantic statements used as a common interchange format, which also contains URI references back to resources in the UQ study dataset for querying and cross-referencing. ProvEn leverages Fedora Commons' digital object model in a Resource Oriented Architecture (ROA) (i.e., a RESTful framework) to logically organize and partition native and translated provenance resources by UQ study. The ROA also provides scientists the means to search both native and translated forms of provenance.
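As an illustration of what PROV-O statements look like as a common interchange format, here is a minimal sketch using rdflib; the base URI, resource names, and timestamp are hypothetical, not ProvEn's actual identifiers:

```python
# Minimal sketch: express "dataset generated by a simulation run by a scientist"
# as W3C PROV-O triples with rdflib.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/uq-study/")   # placeholder base URI

g = Graph()
g.bind("prov", PROV)

dataset = EX["dataset/run-042"]                  # hypothetical resource names
simulation = EX["activity/climate-sim-042"]
scientist = EX["agent/jdoe"]

g.add((dataset, RDF.type, PROV.Entity))
g.add((simulation, RDF.type, PROV.Activity))
g.add((scientist, RDF.type, PROV.Agent))
g.add((dataset, PROV.wasGeneratedBy, simulation))
g.add((simulation, PROV.wasAssociatedWith, scientist))
g.add((simulation, PROV.endedAtTime,
       Literal("2012-06-30T12:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```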
Experimental feasibility of multistatic holography for breast microwave radar image reconstruction.
Flores-Tapia, Daniel; Rodriguez, Diego; Solis, Mario; Kopotun, Nikita; Latif, Saeed; Maizlish, Oleksandr; Fu, Lei; Gui, Yonsheng; Hu, Can-Ming; Pistorius, Stephen
2016-08-01
The goal of this study was to assess the experimental feasibility of circular multistatic holography, a novel breast microwave radar reconstruction approach, using experimental datasets recorded with a preclinical setup. The performance of this approach was quantitatively evaluated by calculating the signal-to-clutter ratio (SCR), contrast-to-clutter ratio (CCR), tumor-to-fibroglandular response ratio (TFRR), spatial accuracy, and reconstruction time. Five datasets were recorded from synthetic phantoms with the dielectric properties of breast tissue in the 1-6 GHz range, using a custom microwave radar system developed by the authors at the University of Manitoba. The datasets contained synthetic structures that mimic the dielectric properties of fibroglandular breast tissues, and four of them included an 8 mm inclusion that emulated a tumor. The datasets were reconstructed using the proposed multistatic approach as well as with a monostatic holography approach previously shown to yield the images with the highest contrast and focal quality. For all reconstructions, the location of the synthetic tumors in the experimental setup was consistent with their position in both the monostatic and multistatic reconstructed images. The average spatial error was less than 4 mm, which is half the spatial resolution of the data acquisition system. The average SCR, CCR, and TFRR of the images reconstructed with the multistatic approach were 15.0, 9.4, and 10.0 dB, respectively. In comparison, monostatic images obtained from the same experimental setups yielded average SCR, CCR, and TFRR values of 12.8, 4.9, and 5.9 dB. No artifacts, defined as responses generated by the reconstruction method with at least half the energy of the tumor signatures, were noted in the multistatic reconstructions. The average execution time for images formed using the proposed approach was 4 s, one order of magnitude faster than current state-of-the-art time-domain multistatic breast microwave radar reconstruction algorithms. The images generated by the proposed method show that multistatic holography is capable of forming spatially accurate images in real time, with signal-to-clutter levels and contrast values higher than other published monostatic and multistatic cylindrical radar reconstruction approaches. In comparison to the monostatic holographic approach, the images generated by the proposed multistatic approach had SCR values at least 50% higher, and CCR and TFRR values at least 200% greater.
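The SCR figure of merit can be computed directly from a reconstructed image given a tumor-region mask. A minimal sketch under one common definition (peak signal energy over peak clutter energy, in dB); the image and mask are synthetic:

```python
# Minimal sketch: signal-to-clutter ratio in dB for a reconstructed image,
# assuming a known tumor mask. Definitions vary across papers.
import numpy as np

def scr_db(image: np.ndarray, tumor_mask: np.ndarray) -> float:
    signal = np.max(np.abs(image[tumor_mask]) ** 2)     # peak tumor energy
    clutter = np.max(np.abs(image[~tumor_mask]) ** 2)   # peak clutter energy
    return 10 * np.log10(signal / clutter)

rng = np.random.default_rng(0)
img = rng.normal(scale=0.1, size=(64, 64))
img[30:34, 30:34] += 1.0                    # synthetic tumor response
mask = np.zeros_like(img, dtype=bool)
mask[30:34, 30:34] = True
print(f"SCR = {scr_db(img, mask):.1f} dB")
```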
Zhou, Zhen; Wang, Jian-Bao; Zang, Yu-Feng; Pan, Gang
2018-01-01
Classification approaches have been increasingly applied to differentiate patients and normal controls using resting-state functional magnetic resonance imaging data (RS-fMRI). Although most previous classification studies have reported promising accuracy within individual datasets, achieving high levels of accuracy with multiple datasets remains challenging for two main reasons: high dimensionality, and high variability across subjects. We used two independent RS-fMRI datasets (n = 31, 46, respectively), both with eyes closed (EC) and eyes open (EO) conditions. For each dataset, we first reduced the number of features to a small number of brain regions with paired t-tests, using the amplitude of low frequency fluctuation (ALFF) as a metric. Second, we employed a new method for feature extraction, named the PAIR method, examining EC and EO as paired conditions rather than independent conditions. Specifically, for each dataset, we obtained EC minus EO (EC-EO) maps of ALFF from half of the subjects (n = 15 for dataset-1, n = 23 for dataset-2) and obtained EO-EC maps from the other half (n = 16 for dataset-1, n = 23 for dataset-2). A support vector machine (SVM) method was used for classification of EC RS-fMRI mapping and EO mapping. The mean classification accuracy of the PAIR method was 91.40% for dataset-1, and 92.75% for dataset-2 in the conventional frequency band of 0.01–0.08 Hz. For cross-dataset validation, we applied the classifier from dataset-1 directly to dataset-2, and vice versa. The mean accuracy of cross-dataset validation was 94.93% for dataset-1 to dataset-2 and 90.32% for dataset-2 to dataset-1 in the 0.01–0.08 Hz range. For the UNPAIR method, classification accuracy was substantially lower (mean 69.89% for dataset-1 and 82.97% for dataset-2), and was much lower for cross-dataset validation (64.69% for dataset-1 to dataset-2 and 64.98% for dataset-2 to dataset-1) in the 0.01–0.08 Hz range. In conclusion, for within-group design studies (e.g., paired conditions or follow-up studies), we recommend the PAIR method for feature extraction. In addition, dimensionality reduction with strong prior knowledge of specific brain regions should also be considered for feature selection in neuroimaging studies. PMID:29375288
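A minimal sketch of the paired-design pipeline (paired t-tests on ALFF for region selection, then a linear SVM separating EC from EO maps) is shown below on synthetic data; the real study selected features and trained on independent subject halves, which this simplified version omits:

```python
# Minimal sketch: paired-condition feature selection followed by SVM
# classification of EC vs EO maps, on synthetic ALFF data.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_subj, n_regions = 30, 200
ec = rng.normal(size=(n_subj, n_regions))
eo = ec + rng.normal(scale=0.5, size=(n_subj, n_regions))
eo[:, :10] += 1.0                       # ten regions truly differ between conditions

# Feature selection: keep regions with significant paired EC-vs-EO differences.
t, p = ttest_rel(ec, eo)
selected = p < 0.001

X = np.vstack([ec[:, selected], eo[:, selected]])
y = np.array([0] * n_subj + [1] * n_subj)   # 0 = EC, 1 = EO
clf = SVC(kernel="linear").fit(X, y)
print(f"{selected.sum()} regions selected, training accuracy {clf.score(X, y):.2f}")
```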
Consolidating drug data on a global scale using Linked Data.
Jovanovik, Milos; Trajanov, Dimitar
2017-01-21
Drug product data is available on the Web in a distributed fashion. The reasons lie within the regulatory domains, which exist on a national level. As a consequence, the drug data available on the Web are independently curated by national institutions from each country, leaving the data in varying languages, with a varying structure, granularity level and format, on different locations on the Web. Therefore, one of the main challenges in the realm of drug data is the consolidation and integration of large amounts of heterogeneous data into a comprehensive dataspace, for the purpose of developing data-driven applications. In recent years, the adoption of the Linked Data principles has enabled data publishers to provide structured data on the Web and contextually interlink them with other public datasets, effectively de-siloing them. Defining methodological guidelines and specialized tools for generating Linked Data in the drug domain, applicable on a global scale, is a crucial step to achieving the necessary levels of data consolidation and alignment needed for the development of a global dataset of drug product data. This dataset would then enable a myriad of new usage scenarios, which can, for instance, provide insight into the global availability of different drug categories in different parts of the world. We developed a methodology and a set of tools which support the process of generating Linked Data in the drug domain. Using them, we generated the LinkedDrugs dataset by seamlessly transforming, consolidating and publishing high-quality, 5-star Linked Drug Data from twenty-three countries, containing over 248,000 drug products, over 99,000,000 RDF triples and over 278,000 links to generic drugs from the LOD Cloud. Using the linked nature of the dataset, we demonstrate its ability to support advanced usage scenarios in the drug domain. The process of generating the LinkedDrugs dataset demonstrates the applicability of the methodological guidelines and the supporting tools in transforming drug product data from various, independent and distributed sources, into a comprehensive Linked Drug Data dataset. The presented user-centric and analytical usage scenarios over the dataset show the advantages of having a de-siloed, consolidated and comprehensive dataspace of drug data available via the existing infrastructure of the Web.
Observation of a 27-day solar signature in noctilucent cloud altitude
NASA Astrophysics Data System (ADS)
Köhnke, Merlin C.; von Savigny, Christian; Robert, Charles E.
2018-05-01
Previous studies have identified solar 27-day signatures in several parameters in the Mesosphere/Lower thermosphere region, including temperature and noctilucent cloud (NLC) occurrence frequency. In this study we report on a solar 27-day signature in NLC altitude with peak-to-peak variations of about 400 m. We use SCIAMACHY limb-scatter observations from 2002 to 2012 to detect NLCs. The superposed epoch analysis method is applied to extract solar 27-day signatures. A 27-day signature in NLC altitude can be identified in both hemispheres in the SCIAMACHY dataset, but the signature is more pronounced in the northern hemisphere. The solar signature in NLC altitude is found to be in phase with solar activity and temperature for latitudes ≳ 70°N. We provide a qualitative explanation for the positive correlation between solar activity and NLC altitude based on published model simulations.
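The superposed epoch analysis used here reduces to a compact recipe: align windows of the time series on a set of key dates and average them, so that coherent 27-day variations survive while uncorrelated noise cancels. A minimal sketch on synthetic data:

```python
# Minimal sketch: superposed epoch analysis recovering a weak 27-day
# modulation buried in noise. All data are synthetic.
import numpy as np

rng = np.random.default_rng(3)
days = np.arange(3650)
signal = 0.2 * np.sin(2 * np.pi * days / 27)      # weak 27-day modulation
series = signal + rng.normal(scale=1.0, size=days.size)

key_days = np.arange(100, 3500, 27)               # epoch centers, one per cycle
window = np.arange(-13, 14)                       # +/- 13 days around each epoch

epochs = np.array([series[k + window] for k in key_days])
composite = epochs.mean(axis=0)                   # noise averages out
print(np.round(composite, 2))                     # 27-day shape emerges
```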
Abedini, Atosa A.; Hurwitz, S.; Evans, William C.
2006-01-01
The database (Version 1.0) is an MS-Excel file that contains close to 5,000 entries of published information on noble gas concentrations and isotopic ratios from volcanic systems in mid-ocean ridges, ocean islands, seamounts, and oceanic and continental arcs (location map). Where available, we also included the isotopic ratios of strontium, neodymium, and carbon. The database is sub-divided both by material sampled (e.g., volcanic glass, different minerals, fumarole, spring) and by tectonic setting (MOR, ocean islands, volcanic arcs). Also included is a reference list, in MS-Word and PDF formats, from which the data were derived. The database extends previous compilations by Ozima (1994), Farley and Neroda (1998), and Graham (2002). The extended database allows scientists to test competing hypotheses and provides a framework for analysis of noble gas data during periods of volcanic unrest.
Discrete Roughness Effects on Shuttle Orbiter at Mach 6
NASA Technical Reports Server (NTRS)
Berry, Scott A.; Hamilton, H. Harris, II
2002-01-01
Discrete roughness boundary layer transition results on a Shuttle Orbiter model in the NASA Langley Research Center 20-Inch Mach 6 Air Tunnel have been reanalyzed with new boundary layer calculations to provide consistency for comparison to other published results. The experimental results were previously obtained utilizing the phosphor thermography system to monitor the status of the boundary layer via global heat transfer images of the Orbiter windward surface. The size and location of discrete roughness elements were systematically varied along the centerline of the 0.0075-scale model at an angle of attack of 40 deg and the boundary layer response recorded. Various correlative approaches were attempted, with the roughness transition correlations based on edge properties providing the most reliable results. When a consistent computational method is used to compute edge conditions, transition datasets for different configurations at several angles of attack have been shown to collapse to a well-behaved correlation.
In vivo and in silico determination of essential genes of Campylobacter jejuni.
Metris, Aline; Reuter, Mark; Gaskin, Duncan J H; Baranyi, Jozsef; van Vliet, Arnoud H M
2011-11-01
In the United Kingdom, the thermophilic Campylobacter species C. jejuni and C. coli are the most frequent causes of food-borne gastroenteritis in humans. While campylobacteriosis is usually a relatively mild infection, it has a significant public health and economic impact, and possible complications include reactive arthritis and the autoimmune disease Guillain-Barré syndrome. The rapid developments in "omics" technologies have resulted in the availability of diverse datasets allowing predictions of the metabolism and physiology of pathogenic micro-organisms. When combined, these datasets may allow for the identification of potential weaknesses that can be used for the development of new antimicrobials to reduce or eliminate C. jejuni and C. coli from the food chain. A metabolic model of C. jejuni was constructed using the annotation of the NCTC 11168 genome sequence, a published model of the related bacterium Helicobacter pylori, and extensive literature mining. Using this model, we applied in silico Flux Balance Analysis (FBA) to determine key metabolic routes that are essential for generating energy and biomass, thus creating a list of genes potentially essential for growth under laboratory conditions. To complement this in silico approach, candidate essential genes were determined using a whole-genome transposon mutagenesis method. FBA and transposon mutagenesis (both this study and a published study) predict a similar number of essential genes (around 200). The analysis of the intersection between the three approaches highlights the shikimate pathway, where genes are predicted to be essential by one or more methods and tend to be network hubs, based on a previously published Campylobacter protein-protein interaction network; they could therefore be targets for novel antimicrobial therapy. We have constructed the first curated metabolic model for the food-borne pathogen Campylobacter jejuni and have presented the resulting metabolic insights. We have shown that the combination of in silico and in vivo approaches can point to non-redundant, indispensable genes associated with the well-characterised shikimate pathway, and also to genes of unknown function specific to C. jejuni, all of which are potential novel Campylobacter intervention targets.
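To make the FBA step concrete: essentiality is predicted by maximizing a biomass flux subject to the steady-state constraint S·v = 0 and flux bounds, then testing whether blocking a reaction drives the optimum to zero. A minimal sketch on a toy three-reaction network; the stoichiometry is invented, not the C. jejuni model:

```python
# Minimal sketch: flux balance analysis as a linear program with scipy.
import numpy as np
from scipy.optimize import linprog

# Metabolites (rows): A, B. Reactions (cols): uptake(->A), conv(A->B), biomass(B->).
S = np.array([[1, -1,  0],
              [0,  1, -1]])
bounds = [(0, 10), (0, None), (0, None)]   # uptake capped at 10 units

c = np.array([0, 0, -1])                   # linprog minimizes, so negate biomass
res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
print("optimal biomass flux:", -res.fun)   # 10.0

# A gene knockout is simulated by forcing its reaction flux to zero; if the
# optimal biomass drops to zero, the gene is predicted essential.
bounds_ko = [(0, 10), (0, 0), (0, None)]   # knock out the A->B conversion
res_ko = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds_ko)
print("knockout predicted essential:", -res_ko.fun < 1e-9)
```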
An, Ji-Yong; Meng, Fan-Rong; You, Zhu-Hong; Chen, Xing; Yan, Gui-Ying; Hu, Ji-Pu
2016-10-01
Predicting protein-protein interactions (PPIs) is a challenging task, essential for constructing protein interaction networks, which in turn are important for facilitating our understanding of the mechanisms of biological systems. Although a number of high-throughput technologies have been proposed to predict PPIs, there are unavoidable shortcomings, including high cost, time intensity, and inherently high false positive rates. For these reasons, many computational methods have been proposed for predicting PPIs. However, the problem is still far from being solved. In this article, we propose a novel computational method called RVM-BiGP that combines the relevance vector machine (RVM) model and Bi-gram Probabilities (BiGP) for PPI detection from protein sequences. The major improvements include: (1) protein sequences are represented using the Bi-gram Probabilities (BiGP) feature representation on a Position Specific Scoring Matrix (PSSM), in which the protein evolutionary information is contained; (2) to reduce the influence of noise, the Principal Component Analysis (PCA) method is used to reduce the dimension of the BiGP vector; (3) the powerful and robust Relevance Vector Machine (RVM) algorithm is used for classification. Five-fold cross-validation experiments executed on yeast and Helicobacter pylori datasets achieved very high accuracies of 94.57 and 90.57%, respectively. These experimental results are significantly better than those of previous methods. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the yeast dataset. The experimental results demonstrate that our RVM-BiGP method is significantly better than the SVM-based method. In addition, we achieved 97.15% accuracy on the imbalanced yeast dataset, which is higher than that on the balanced yeast dataset. The promising experimental results show the efficiency and robustness of the proposed method, which can serve as an automatic decision support tool for future proteomics research. To facilitate extensive studies in future proteomics research, we developed a freely available web server called RVM-BiGP-PPIs in Hypertext Preprocessor (PHP) for predicting PPIs. The web server, including source code and the datasets, is available at http://219.219.62.123:8888/BiGP/. © 2016 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
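A minimal sketch of one common way to form a bi-gram descriptor from a PSSM, summing products of consecutive-row probabilities into a 20 x 20 (400-dimensional) vector; a random matrix stands in for a real PSSM, and the paper's exact formulation may differ:

```python
# Minimal sketch: bi-gram feature extraction from a PSSM.
import numpy as np

rng = np.random.default_rng(6)
L = 120                                   # protein length
pssm = rng.random((L, 20))
pssm /= pssm.sum(axis=1, keepdims=True)   # rows as residue probabilities

# B[i, j] = sum over positions k of pssm[k, i] * pssm[k+1, j]
bigram = np.einsum("ki,kj->ij", pssm[:-1], pssm[1:])
features = bigram.ravel()                 # 400-dimensional descriptor per protein
print(features.shape)                     # (400,); PCA would follow in the pipeline
```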
Wang, James K. T.; Langfelder, Peter; Horvath, Steve; Palazzolo, Michael J.
2017-01-01
Huntington's disease (HD) is a progressive and autosomal dominant neurodegeneration caused by CAG expansion in the huntingtin gene (HTT), but the pathophysiological mechanism of mutant HTT (mHTT) remains unclear. To study HD using systems biological methodologies on all published data, we undertook the first comprehensive curation of two key PubMed HD datasets: perturbation genes that impact mHTT-driven endpoints and therefore are putatively linked causally to pathogenic mechanisms, and the protein interactome of HTT that reflects its biology. We perused PubMed articles containing co-citation of gene IDs and MeSH terms of interest to generate mechanistic gene sets for iterative enrichment analyses and rank ordering. The HD Perturbation database of 1,218 genes highly overlaps the HTT Interactome of 1,619 genes, suggesting links between normal HTT biology and mHTT pathology. These two HD datasets are enriched for protein networks of key genes underlying two mechanisms not previously implicated in HD nor in each other: exosome synaptic functions and homeostatic synaptic plasticity. Moreover, proteins, possibly including HTT, and miRNA detected in exosomes from a wide variety of sources also highly overlap the HD datasets, suggesting both mechanistic and biomarker links. Finally, the HTT Interactome highly intersects protein networks of pathogenic genes underlying Parkinson's, Alzheimer's and eight non-HD polyglutamine diseases, ALS, and spinal muscular atrophy. These protein networks in turn highly overlap the exosome and homeostatic synaptic plasticity gene sets. Thus, we hypothesize that HTT and other neurodegeneration pathogenic genes form a large interlocking protein network involved in exosome and homeostatic synaptic functions, particularly where the two mechanisms intersect. Mutant pathogenic proteins cause dysfunctions at distinct points in this network, each altering the two mechanisms in specific fashion that contributes to distinct disease pathologies, depending on the gene mutation and the cellular and biological context. This protein network is rich with drug targets, and exosomes may provide disease biomarkers, thus enabling drug discovery. All the curated datasets are made available for other investigators. Elucidating the roles of pathogenic neurodegeneration genes in exosome and homeostatic synaptic functions may provide a unifying framework for the age-dependent, progressive and tissue selective nature of multiple neurodegenerative diseases. PMID:28611571
ROS-based ground stereo vision detection: implementation and experiments.
Hu, Tianjiang; Zhao, Boxin; Tang, Dengqing; Zhang, Daibing; Kong, Weiwei; Shen, Lincheng
This article concentrates on an open-source implementation of flying object detection in cluttered scenes, which is of significance for ground stereo-aided autonomous landing of unmanned aerial vehicles. The ground stereo vision guidance system is presented, with details on system architecture and workflow. The Chan-Vese detection algorithm is further considered and implemented in the Robot Operating System (ROS) environment. A data-driven interactive scheme is developed to collect datasets for parameter tuning and performance evaluation. Outdoor flying-vehicle experiments captured a sequential stereo image dataset and recorded simultaneous data from the pan-and-tilt unit, onboard sensors, and differential GPS. Experimental results using the collected dataset validate the effectiveness of the published ROS-based detection algorithm.
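For readers unfamiliar with ROS, the node scaffolding such a detector sits on looks roughly like the sketch below, using rospy; the topic names are hypothetical and the callback is a stub rather than the published Chan-Vese implementation:

```python
# Minimal sketch: a rospy node subscribing to stereo image topics.
import rospy
from sensor_msgs.msg import Image

def on_image(msg: Image) -> None:
    # A real node would convert with cv_bridge and run Chan-Vese detection here.
    rospy.loginfo("frame %dx%d at t=%s", msg.width, msg.height, msg.header.stamp)

if __name__ == "__main__":
    rospy.init_node("stereo_detector")
    rospy.Subscriber("/stereo/left/image_raw", Image, on_image)   # hypothetical topics
    rospy.Subscriber("/stereo/right/image_raw", Image, on_image)
    rospy.spin()
```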
A global experimental dataset for assessing grain legume production
Cernay, Charles; Pelzer, Elise; Makowski, David
2016-01-01
Grain legume crops are a significant component of the human diet and animal feed and have an important role in the environment, but the global diversity of agricultural legume species is currently underexploited. Experimental assessments of grain legume performances are required, to identify potential species with high yields. Here, we introduce a dataset including results of field experiments published in 173 articles. The selected experiments were carried out over five continents on 39 grain legume species. The dataset includes measurements of grain yield, aerial biomass, crop nitrogen content, residual soil nitrogen content and water use. When available, yields for cereals and oilseeds grown after grain legumes in the crop sequence are also included. The dataset is arranged into a relational database with nine structured tables and 198 standardized attributes. Tillage, fertilization, pest and irrigation management are systematically recorded for each of the 8,581 crop*field site*growing season*treatment combinations. The dataset is freely reusable and easy to update. We anticipate that it will provide valuable information for assessing grain legume production worldwide. PMID:27676125
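To illustrate the kind of query a relational arrangement like this enables, here is a minimal sqlite3 sketch; the table and column names are hypothetical stand-ins, not the published nine-table schema:

```python
# Minimal sketch: querying average grain yield per species from a toy
# two-table relational layout.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE experiments (id INTEGER PRIMARY KEY, site TEXT, season TEXT);
CREATE TABLE yields (experiment_id INTEGER REFERENCES experiments(id),
                     species TEXT, grain_yield_t_ha REAL);
INSERT INTO experiments VALUES (1, 'Dijon', '2009'), (2, 'Saskatoon', '2010');
INSERT INTO yields VALUES (1, 'pea', 4.2), (1, 'faba bean', 3.8), (2, 'lentil', 1.9);
""")
for row in con.execute("""
    SELECT species, AVG(grain_yield_t_ha)
    FROM yields JOIN experiments ON experiments.id = experiment_id
    GROUP BY species ORDER BY 2 DESC"""):
    print(row)
```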
Maglione, Anton G; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio
2017-01-01
In this study, the cortical activity correlated with the perception and appreciation of different sets of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistically educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style datasets, respectively). Results suggest that the verbal appreciation of the Orig dataset, when compared to the Color and Style ones, was mainly correlated with the neuroelectric indexes estimated during the first 10 s of observation of the pictures. In the first 10 s of observation: (1) the Orig dataset induced more emotion and was perceived with more appreciation than the Color and Style datasets; (2) the Style dataset was perceived with more attentional effort than the other investigated datasets. Over the whole 30 s observation period: (1) the emotion induced by the Color and Style datasets increased over time, while that induced by the Orig dataset remained stable; (2) the Color and Style datasets were perceived with more attentional effort than the Orig dataset. Throughout the experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is coherent with the notion that active perception of the images, with sustained cognitive attention in parietal and central areas, leads to the generation of the judgment about their aesthetic appreciation in frontal areas. PMID:28790907
Rooting phylogenies using gene duplications: an empirical example from the bees (Apoidea).
Brady, Seán G; Litman, Jessica R; Danforth, Bryan N
2011-09-01
The placement of the root node in a phylogeny is fundamental to characterizing evolutionary relationships. The root node of bee phylogeny remains unclear despite considerable previous attention. In order to test alternative hypotheses for the location of the root node in bees, we used the F1 and F2 paralogs of elongation factor 1-alpha (EF-1α) to compare the tree topologies that result when using outgroup versus paralogous rooting. Fifty-two taxa representing each of the seven bee families were sequenced for both copies of EF-1α. Two datasets were analyzed. In the first (the "concatenated" dataset), the F1 and F2 copies for each species were concatenated and the tree was rooted using appropriate outgroups (sphecid and crabronid wasps). In the second (the "duplicated" dataset), the F1 and F2 copies were aligned to one another and each copy for each taxon was treated as a separate terminal. In this dataset, the root was placed between the F1 and F2 copies (i.e., paralog rooting). Bayesian analyses demonstrate that the outgroup rooting approach outperforms paralog rooting, recovering deeper clades and showing stronger support for groups well established by both morphological and other molecular data. Sequence characteristics of the two copies were compared at the amino acid level, but little evidence was found to suggest that one copy is more functionally conserved. Although neither approach yields an unambiguous root to the tree, both approaches strongly indicate that the root of bee phylogeny does not fall near Colletidae, as has been previously proposed. We discuss paralog rooting as a general strategy and why this approach performs relatively poorly with our particular dataset. Copyright © 2011 Elsevier Inc. All rights reserved.
DeepPap: Deep Convolutional Networks for Cervical Cell Classification.
Zhang, Ling; Le Lu; Nogues, Isabella; Summers, Ronald M; Liu, Shaoxiong; Yao, Jianhua
2017-11-01
Automation-assisted cervical screening via Pap smear or liquid-based cytology (LBC) is a highly effective cell-imaging-based cancer detection tool, where cells are partitioned into "abnormal" and "normal" categories. However, the success of most traditional classification methods relies on the presence of accurate cell segmentations. Despite sixty years of research in this field, accurate segmentation remains a challenge in the presence of cell clusters and pathologies. Moreover, previous classification methods are built only upon the extraction of hand-crafted features, such as morphology and texture. This paper addresses these limitations by proposing a method to directly classify cervical cells, without prior segmentation, based on deep features, using convolutional neural networks (ConvNets). First, the ConvNet is pretrained on a natural image dataset. It is subsequently fine-tuned on a cervical cell dataset consisting of adaptively resampled image patches coarsely centered on the nuclei. In the testing phase, aggregation is used to average the prediction scores of a similar set of image patches. The proposed method is evaluated on both Pap smear and LBC datasets. Results show that our method outperforms previous algorithms in classification accuracy (98.3%), area under the curve (0.99), and especially specificity (98.3%), when applied to the Herlev benchmark Pap smear dataset and evaluated using five-fold cross-validation. Similarly superior performance is also achieved on the HEMLBC (H&E stained manual LBC) dataset. Our method is promising for the development of automation-assisted reading systems in primary cervical screening.
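The test-time aggregation step is simple enough to show directly: per-patch prediction scores are averaged into one cell-level decision. A minimal sketch with hypothetical network outputs:

```python
# Minimal sketch: averaging per-patch softmax scores into one cell prediction.
import numpy as np

patch_scores = np.array([                 # P(normal), P(abnormal) per patch
    [0.30, 0.70],
    [0.20, 0.80],
    [0.45, 0.55],
    [0.25, 0.75],
])
cell_score = patch_scores.mean(axis=0)
label = ("normal", "abnormal")[int(cell_score.argmax())]
print(cell_score, "->", label)
```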
Multiple Myeloma and Glyphosate Use: A Re-Analysis of US Agricultural Health Study (AHS) Data
Sorahan, Tom
2015-01-01
A previous publication of 57,311 pesticide applicators enrolled in the US Agricultural Health Study (AHS) produced disparate findings in relation to multiple myeloma risks in the period 1993–2001 and ever-use of glyphosate (32 cases of multiple myeloma in the full dataset of 54,315 applicators without adjustment for other variables: rate ratio (RR) 1.1, 95% confidence interval (CI) 0.5 to 2.4; 22 cases of multiple myeloma in restricted dataset of 40,719 applicators with adjustment for other variables: RR 2.6, 95% CI 0.7 to 9.4). It seemed important to determine which result should be preferred. RRs for exposed and non-exposed subjects were calculated using Poisson regression; subjects with missing data were not excluded from the main analyses. Using the full dataset adjusted for age and gender the analysis produced a RR of 1.12 (95% CI 0.50 to 2.49) for ever-use of glyphosate. Additional adjustment for lifestyle factors and use of ten other pesticides had little effect (RR 1.24, 95% CI 0.52 to 2.94). There were no statistically significant trends for multiple myeloma risks in relation to reported cumulative days (or intensity weighted days) of glyphosate use. The doubling of risk reported previously arose from the use of an unrepresentative restricted dataset and analyses of the full dataset provides no convincing evidence in the AHS for a link between multiple myeloma risk and glyphosate use. PMID:25635915
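The rate-ratio machinery used in such a re-analysis can be sketched with statsmodels: a Poisson GLM with log person-years as an offset yields exp(coefficient) as the RR. The data below are synthetic; the covariates, counts, and true effect are invented, not AHS data:

```python
# Minimal sketch: rate ratio for ever-use of an exposure by Poisson regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 50000
exposed = rng.binomial(1, 0.75, n).astype(float)
age = rng.uniform(30, 70, n)
pyrs = rng.uniform(5, 9, n)                      # person-years of follow-up
rate = np.exp(-9 + 0.05 * age + 0.1 * exposed)   # true RR = exp(0.1)
cases = rng.poisson(rate * pyrs)

X = sm.add_constant(np.column_stack([exposed, age]))
fit = sm.GLM(cases, X, family=sm.families.Poisson(),
             offset=np.log(pyrs)).fit()
rr = np.exp(fit.params[1])
lo, hi = np.exp(fit.conf_int()[1])
print(f"RR {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```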
NASA Astrophysics Data System (ADS)
Car, Nicholas; Cox, Simon; Fitch, Peter
2015-04-01
With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty for multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects. We present a method to address both these issues by combining UncertML with PROV-O, and delivering the resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, and adds further elements to support the representation of uncertainty without a conceptual model and the integration of UncertML through links to documents. The Linked Data API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again. The Linked Data API's 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follow the UncertML model, they can be automatically interpreted and may also support automatic uncertainty propagation. Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements associated with uncertainty (PROV-O Entity and Activity classes) have UncertML elements recorded. This methodology is intentionally flexible, allowing uncertainty metadata in many forms, not limited to UncertML. While a more formal representation of uncertainty metadata is desirable (using UncertProv elements to implement the UncertML conceptual model), this will not always be possible, and any uncertainty data stored will be better than none. Since the UncertProv ontology contains a superset of UncertML elements to facilitate the representation of non-UncertML uncertainty data, it could easily be extended to include other formal uncertainty conceptual models, thus allowing non-UncertML propagation calculations.
Lacy-Jones, Kristin; Hayward, Philip; Andrews, Steve; Gledhill, Ian; McAllister, Mark; Abrahamsson, Bertil; Rostami-Hodjegan, Amin; Pepin, Xavier
2017-03-01
The OrBiTo IMI project was designed to improve the understanding and modelling of how drugs are absorbed. To achieve this, 13 pharmaceutical companies agreed to share biopharmaceutics drug properties and performance data, provided they were able to hide certain aspects of their datasets if required. These data were then used in simulations to test how three in silico Physiologically Based Pharmacokinetic (PBPK) tools performed. A unique database system was designed and implemented to store the drug data, with the ability to make different sections of a dataset visible or hidden depending on the stage of the project. Users were also given the option to hide identifying API attributes, to help prevent identification of project members from previously published data. This was achieved by applying blinding strategies to data parameters and adopting a unique numbering system. An anonymous communication tool was proposed for exchanging comments about the data, enabling its curation and evolution. This paper describes the strategy adopted for numbering and blinding the data, the tools developed to gather and search the data, and the tools used for communicating around the data, with the aim of publicising the approach for other pre-competitive research between organisations. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Li, Xinzhong; Long, Jintao; He, Taigang; Belshaw, Robert; Scott, James
2015-01-01
Previous studies have evaluated gene expression in Alzheimer’s disease (AD) brains to identify mechanistic processes, but have been limited by the size of the datasets studied. Here we have implemented a novel meta-analysis approach to identify differentially expressed genes (DEGs) in published datasets comprising 450 late onset AD (LOAD) brains and 212 controls. We found 3124 DEGs, many of which were highly correlated with Braak stage and cerebral atrophy. Pathway Analysis revealed the most perturbed pathways to be (a) nitric oxide and reactive oxygen species in macrophages (NOROS), (b) NFkB and (c) mitochondrial dysfunction. NOROS was also up-regulated, and mitochondrial dysfunction down-regulated, in healthy ageing subjects. Upstream regulator analysis predicted the TLR4 ligands, STAT3 and NFKBIA, for activated pathways and RICTOR for mitochondrial genes. Protein-protein interaction network analysis emphasised the role of NFKB; identified a key interaction of CLU with complement; and linked TYROBP, TREM2 and DOK3 to modulation of LPS signalling through TLR4 and to phosphatidylinositol metabolism. We suggest that NEUROD6, ZCCHC17, PPEF1 and MANBAL are potentially implicated in LOAD, with predicted links to calcium signalling and protein mannosylation. Our study demonstrates a highly injurious combination of TLR4-mediated NFKB signalling, NOROS inflammatory pathway activation, and mitochondrial dysfunction in LOAD. PMID:26202100
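One common building block of such a meta-analysis is combining per-dataset evidence for each gene; a minimal sketch using Fisher's method via scipy (the study's actual pipeline is more involved, and the p-values here are hypothetical):

```python
# Minimal sketch: combine per-dataset p-values for one gene with Fisher's method.
from scipy.stats import combine_pvalues

p_per_dataset = [0.04, 0.01, 0.20, 0.03]    # one gene across four cohorts
stat, p_combined = combine_pvalues(p_per_dataset, method="fisher")
print(f"combined p = {p_combined:.4g}")
```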
Sensitivity to sequencing depth in single-cell cancer genomics.
Alves, João M; Posada, David
2018-04-16
Querying cancer genomes at single-cell resolution is expected to provide a powerful framework to understand in detail the dynamics of cancer evolution. However, given the high costs currently associated with single-cell sequencing, together with the inevitable technical noise arising from single-cell genome amplification, cost-effective strategies that maximize the quality of single-cell data are critically needed. Taking advantage of previously published single-cell whole-genome and whole-exome cancer datasets, we studied the impact of sequencing depth and sampling effort towards single-cell variant detection. Five single-cell whole-genome and whole-exome cancer datasets were independently downscaled to 25, 10, 5, and 1× sequencing depth. For each depth level, ten technical replicates were generated, resulting in a total of 6280 single-cell BAM files. The sensitivity of variant detection, including structural and driver mutations, genotyping, clonal inference, and phylogenetic reconstruction to sequencing depth was evaluated using recent tools specifically designed for single-cell data. Altogether, our results suggest that for relatively large sample sizes (25 or more cells) sequencing single tumor cells at depths > 5× does not drastically improve somatic variant discovery, characterization of clonal genotypes, or estimation of single-cell phylogenies. We suggest that sequencing multiple individual tumor cells at a modest depth represents an effective alternative to explore the mutational landscape and clonal evolutionary patterns of cancer genomes.
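The downscaling itself amounts to read thinning: keeping each read with probability target/original depth reproduces a shallower experiment in expectation. A minimal sketch at the per-site count level (the study subsampled BAM files directly); depths are synthetic:

```python
# Minimal sketch: binomial thinning of per-site coverage to emulate
# shallower sequencing, showing why very low depth loses sites entirely.
import numpy as np

rng = np.random.default_rng(5)
original_depth = 25
site_depths = rng.poisson(original_depth, size=1_000_000)

for target in (10, 5, 1):
    thinned = rng.binomial(site_depths, target / original_depth)
    print(f"target {target}x -> mean {thinned.mean():.2f}, "
          f"sites with zero coverage: {(thinned == 0).mean():.1%}")
```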
NASA Astrophysics Data System (ADS)
Liu, Jianzhong; Kern, Petra S.; Gerberick, G. Frank; Santos-Filho, Osvaldo A.; Esposito, Emilio X.; Hopfinger, Anton J.; Tseng, Yufeng J.
2008-06-01
In previous studies we developed categorical QSAR models for predicting skin-sensitization potency based on 4D-fingerprint (4D-FP) descriptors and in vivo murine local lymph node assay (LLNA) measures. Only 4D-FPs derived from the ground-state (GMAX) structures of the molecules were used to build those QSAR models. In this study we generated 4D-FP descriptors from the first-excited-state (EMAX) structures of the molecules. The GMAX, EMAX, and combined ground- and excited-state 4D-FP descriptors (GEMAX) were employed in building categorical QSAR models. Logistic regression (LR) and partial least squares coupled logistic regression (PLS-CLR), found to be effective model-building methods for the LLNA skin-sensitization measures in our previous studies, were used again here. This also permitted comparison of the prior ground-state models to those involving first-excited-state 4D-FP descriptors. Three types of categorical QSAR models were constructed for each of the GMAX, EMAX, and GEMAX datasets: a binary model (2-state), an ordinal model (3-state), and a binary-binary model (two-2-state). No significant differences exist among the LR 2-state models constructed for the three datasets. However, the PLS-CLR 3-state and 2-state models based on the EMAX and GEMAX datasets have higher predictivity than those constructed using only the GMAX dataset. These EMAX and GEMAX categorical models are also more significant and predictive than the corresponding models built in our previous QSAR studies of LLNA skin-sensitization measures.
Attribute-driven transfer learning for detecting novel buried threats with ground-penetrating radar
NASA Astrophysics Data System (ADS)
Colwell, Kenneth A.; Collins, Leslie M.
2016-05-01
Ground-penetrating radar (GPR) technology is an effective method of detecting buried explosive threats. The system uses a binary classifier to distinguish "targets", or buried threats, from "nontargets" arising from system prescreener false alarms; this classifier is trained on a dataset of previously-observed buried threat types. However, the threat environment is not static, and new threat types that appear must be effectively detected even if they are not highly similar to every previously-observed type. Gathering a new dataset that includes a new threat type is expensive and time-consuming; minimizing the amount of new data required to effectively detect the new type is therefore valuable. This research aims to reduce the number of training examples needed to effectively detect new types using transfer learning, which leverages previous learning tasks to accelerate and improve new ones. Further, new types have attribute data, such as composition, components, construction, and size, which can be observed without GPR and typically are not explicitly included in the learning process. Since attribute tags for buried threats determine many aspects of their GPR representation, a new threat type's attributes can be highly relevant to the transfer-learning process. In this work, attribute data is used to drive transfer learning, both by using attributes to select relevant dataset examples for classifier fusion, and by extending a relevance vector machine (RVM) model to perform intelligent attribute clustering and selection. Classification performance results for both the attribute-only case and the low-data case are presented, using a dataset containing a variety of threat types.
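One way to read the attribute-driven selection idea is: rank previously-observed threat types by attribute similarity to the new type and train only on the closest ones. The sketch below does exactly that with invented attribute vectors and simulated GPR features, and substitutes an SVM for the paper's extended RVM.

```python
# Hedged sketch of attribute-driven example selection for transfer learning.
# Attribute encodings, feature arrays, and the classifier choice (SVM rather
# than the paper's RVM extension) are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

# attributes[i] describes known threat type i (e.g. size, metal content, ...)
attributes = np.array([[1.0, 0.2], [0.9, 0.3], [0.1, 0.8]])
new_type_attr = np.array([0.95, 0.25])      # attributes of the novel threat type

# select the k source types whose attributes are closest to the new type
k = 2
nearest = np.argsort(np.linalg.norm(attributes - new_type_attr, axis=1))[:k]

rng = np.random.default_rng(1)
X_by_type = [rng.normal(loc=i, size=(50, 10)) for i in range(3)]  # simulated GPR features
y_by_type = [rng.integers(0, 2, 50) for _ in range(3)]            # simulated labels

X_train = np.vstack([X_by_type[i] for i in nearest])
y_train = np.concatenate([y_by_type[i] for i in nearest])
clf = SVC(probability=True).fit(X_train, y_train)   # detector seeded from relevant types
```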
Thermodynamic Data Rescue and Informatics for Deep Carbon Science
NASA Astrophysics Data System (ADS)
Zhong, H.; Ma, X.; Prabhu, A.; Eleish, A.; Pan, F.; Parsons, M. A.; Ghiorso, M. S.; West, P.; Zednik, S.; Erickson, J. S.; Chen, Y.; Wang, H.; Fox, P. A.
2017-12-01
A large number of legacy datasets are contained in geoscience literature published between 1930 and 1980 and not expressed external to the publication text in digitized formats. Extracting, organizing, and reusing these "dark" datasets is highly valuable for many within the Earth and planetary science community. As a part of the Deep Carbon Observatory (DCO) data legacy missions, the DCO Data Science Team and Extreme Physics and Chemistry community identified thermodynamic datasets related to carbon, or more specifically datasets about the enthalpy and entropy of chemicals, as a proof-of-principle analysis. The data science team endeavored to develop a semi-automatic workflow, which includes identifying relevant publications, extracting contained datasets using OCR methods, collaborative reviewing, and registering the datasets via the DCO Data Portal, where the 'Linked Data' feature of the data portal provides a mechanism for connecting rescued datasets beyond their individual data sources, to research domains, DCO Communities, and more, making data discovery and retrieval more effective. To date, the team has successfully rescued, deposited and registered additional datasets from publications with thermodynamic sources. These datasets contain 3 main types of data: (1) heat content or enthalpy data determined for a given compound as a function of temperature using high-temperature calorimetry, (2) heat content or enthalpy data determined for a given compound as a function of temperature using adiabatic calorimetry, and (3) direct determination of heat capacity of a compound as a function of temperature using differential scanning calorimetry. The data science team integrated these datasets and delivered a spectrum of data analytics including visualizations, which will lead to a comprehensive characterization of the thermodynamics of carbon and carbon-related materials.
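A minimal sketch of the OCR extraction step, assuming Tesseract (via pytesseract) and a simple regex for two-column temperature/value rows; the DCO team's actual semi-automatic workflow is more elaborate, and the file name is hypothetical.

```python
# Hedged sketch of "dark data" rescue: OCR a scanned table image and keep
# lines that look like temperature/enthalpy pairs. The regex and file name
# are assumptions; real legacy tables need more careful parsing and review.
import re
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_table_page.png"))
rows = []
for line in text.splitlines():
    m = re.match(r"\s*(\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)", line)
    if m:                                   # e.g. "298.15  -393.51"
        rows.append((float(m.group(1)), float(m.group(2))))
print(rows)
```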
EEG datasets for motor imagery brain-computer interface.
Cho, Hohyun; Ahn, Minkyu; Ahn, Sangtae; Kwon, Moonyoung; Jun, Sung Chan
2017-07-01
Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not by evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in that the user can generate induced activity by imagining motor movements. However, variations in performance over sessions and subjects are too severe to overcome easily; therefore, a basic understanding and investigation of BCI performance variation is necessary to find critical evidence of performance variation. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. We validated our EEG datasets by using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we showed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of datasets (38 subjects) included reasonably discriminative information. Our EEG datasets included the information necessary to determine statistical significance; they consisted of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also achieve subject-to-subject transfer by using metadata, including a questionnaire, EEG coordinates, and EEGs for non-task-related states. © The Authors 2017. Published by Oxford University Press.
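A minimal sketch of the ERD/ERS quantity used in the validation, assuming a mu-band (8-12 Hz) band-pass, a pre-cue baseline window, and a motor-imagery window; the sampling rate and window bounds are illustrative, not the dataset's actual trial structure.

```python
# Hedged sketch of ERD/ERS on one EEG channel: band-pass in the mu band,
# square for power, and express task-window power as a percentage change
# from a baseline window. All timing parameters are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 512                                    # assumed sampling rate (Hz)
t = np.arange(0, 7, 1 / fs)
eeg = np.random.default_rng(2).normal(size=t.size)   # stand-in for a real trial

b, a = butter(4, [8 / (fs / 2), 12 / (fs / 2)], btype="band")  # mu band 8-12 Hz
power = filtfilt(b, a, eeg) ** 2

baseline = power[(t >= 0) & (t < 2)].mean()          # pre-cue reference interval
task = power[(t >= 3) & (t < 6)].mean()              # motor imagery interval
erd = 100 * (task - baseline) / baseline             # negative => ERD, positive => ERS
print(f"ERD/ERS: {erd:.1f}%")
```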
2013-06-01
Fragment of a report abstract: the described tool benefits from rapid, automated discrimination of specific predefined signals, and is free-standing (requiring no other plugins or packages); its functions include … a previously labeled dataset, and comparing two labeled datasets. Subject terms: artifact, signal detection, EEG, MATLAB, toolbox.
Context-Aware Generative Adversarial Privacy
NASA Astrophysics Data System (ADS)
Huang, Chong; Kairouz, Peter; Chen, Xiao; Sankar, Lalitha; Rajagopal, Ram
2017-12-01
Preserving the utility of published datasets while simultaneously providing provable privacy guarantees is a well-known challenge. On the one hand, context-free privacy solutions, such as differential privacy, provide strong privacy guarantees, but often lead to a significant reduction in utility. On the other hand, context-aware privacy solutions, such as information theoretic privacy, achieve an improved privacy-utility tradeoff, but assume that the data holder has access to dataset statistics. We circumvent these limitations by introducing a novel context-aware privacy framework called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to allow the data holder to learn privatization schemes from the dataset itself. Under GAP, learning the privacy mechanism is formulated as a constrained minimax game between two players: a privatizer that sanitizes the dataset in a way that limits the risk of inference attacks on the individuals' private variables, and an adversary that tries to infer the private variables from the sanitized dataset. To evaluate GAP's performance, we investigate two simple (yet canonical) statistical dataset models: (a) the binary data model, and (b) the binary Gaussian mixture model. For both models, we derive game-theoretically optimal minimax privacy mechanisms, and show that the privacy mechanisms learned from data (in a generative adversarial fashion) match the theoretically optimal ones. This demonstrates that our framework can be easily applied in practice, even in the absence of dataset statistics.
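To make the privacy-utility tension concrete for the binary data model, the sketch below uses a fixed randomized-response privatizer and measures what an optimal adversary could achieve; it illustrates the objective GAP optimizes but deliberately omits the GAN-based learning of the mechanism.

```python
# Deliberately simplified, non-adversarially-trained sketch of the
# privacy-utility tradeoff: a randomized-response privatizer flips each
# private bit with probability p; utility is fidelity of the release, and
# the optimal adversary guesses y itself when p < 0.5. Not the paper's
# GAN-based learning scheme.
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 2, 100_000)             # private binary variable

for p in (0.0, 0.1, 0.3, 0.5):
    flips = rng.random(x.size) < p
    y = np.where(flips, 1 - x, x)           # sanitized release
    utility = (y == x).mean()               # fidelity of the release
    adversary_acc = max(utility, 1 - utility)   # best inference accuracy
    print(f"p={p:.1f}  utility={utility:.3f}  adversary accuracy={adversary_acc:.3f}")
```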
Smith, Tanya; Page-Nicholson, Samantha; Morrison, Kerryn; Gibbons, Bradley; Jones, M Genevieve W; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin; Roxburgh, Lizanne
2016-01-01
The International Crane Foundation (ICF) / Endangered Wildlife Trust's (EWT) African Crane Conservation Programme has recorded 26 403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus, Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus. The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. This is the single most comprehensive dataset published on African Crane species that adds new information about the distribution of these three threatened species. We hope this will further aid conservation authorities to monitor and protect these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal.
Data publication - policies and procedures from the PREPARDE project
NASA Astrophysics Data System (ADS)
Callaghan, Sarah; Murphy, Fiona; Tedds, Jonathan; Kunze, John; Lawrence, Rebecca; Mayernik, Matthew S.; Whyte, Angus; Roberts, Timothy
2013-04-01
Data are widely acknowledged as a first class scientific output. Increases in researchers' abilities to create data need to be matched by corresponding infrastructures for them to manage and share their data. At the same time, the quality and persistence of the datasets need to be ensured, providing the dataset creators with the recognition they deserve for their efforts. Formal publication of data takes advantage of the processes and procedures already in place to publish academic articles about scientific results, enabling data to be reviewed and more broadly disseminated. Data are vastly more varied in format than papers, and so the policies required to manage and publish data must take into account the complexities associated with different data types, scientific fields, licensing rules etc. The Peer REview for Publication & Accreditation of Research Data in the Earth sciences (PREPARDE) project is JISC- and NERC-funded, and aims to investigate the policies and procedures required for the formal publication of research data. The project is investigating the whole workflow of data publication, from ingestion into a data repository, through to formal publication in a data journal. To limit the scope of the project, the focus is primarily on the policies required for the Royal Meteorological Society and Wiley's Geoscience Data Journal, though members of the project team include representatives from the life sciences (F1000Research), and will generalise the policies to other disciplines. PREPARDE addresses key issues arising in the data publication paradigm, such as: what criteria are needed for a repository to be considered objectively trustworthy; how does one peer-review a dataset; and how can datasets and journal publications be effectively cross-linked for the benefit of the wider research community and the completeness of the scientific record? To answer these questions, the project is hosting workshops addressing these issues, with interactions from key stakeholders, including data and repository managers, researchers, funders and publishers. The results of these workshops will be presented and further comment and interaction sought from interested parties.
NASA Astrophysics Data System (ADS)
Tarboton, D. G.; Idaszak, R.; Horsburgh, J. S.; Ames, D. P.; Goodall, J. L.; Band, L. E.; Merwade, V.; Couch, A.; Hooper, R. P.; Maidment, D. R.; Dash, P. K.; Stealey, M.; Yi, H.; Gan, T.; Castronova, A. M.; Miles, B.; Li, Z.; Morsy, M. M.; Crawley, S.; Ramirez, M.; Sadler, J.; Xue, Z.; Bandaragoda, C.
2016-12-01
How do you share and publish hydrologic data and models for a large collaborative project? HydroShare is a new, web-based system for sharing hydrologic data and models with specific functionality aimed at making collaboration easier. HydroShare has been developed with U.S. National Science Foundation support under the auspices of the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) to support the collaboration and community cyberinfrastructure needs of the hydrology research community. Within HydroShare, we have developed new functionality for creating datasets, describing them with metadata, and sharing them with collaborators. We cast hydrologic datasets and models as "social objects" that can be shared, collaborated around, annotated, published and discovered. In addition to data and model sharing, HydroShare supports web application programs (apps) that can act on data stored in HydroShare, just as software programs on your PC act on your data locally. This can free you from some of the limitations of local computing capacity and challenges in installing and maintaining software on your own PC. HydroShare's web-based cyberinfrastructure can take work off your desk or laptop computer and onto infrastructure or "cloud" based data and processing servers. This presentation will describe HydroShare's collaboration functionality that enables both public and private sharing with individual users and collaborative user groups, and makes it easier for collaborators to iterate on shared datasets and models, creating multiple versions along the way, and publishing them with a permanent landing page, metadata description, and citable Digital Object Identifier (DOI) when the work is complete. This presentation will also describe the web app architecture that supports interoperability with third party servers functioning as application engines for analysis and processing of big hydrologic datasets. While developed to support the cyberinfrastructure needs of the hydrology community, the informatics infrastructure for programmatic interoperability of web resources has a generality beyond the solution of hydrology problems that will be discussed.
Dey-Rao, Rama; Sinha, Animesh A
2017-01-28
Significant gaps remain regarding the pathomechanisms underlying the autoimmune response in vitiligo (VL), where the loss of self-tolerance leads to the targeted killing of melanocytes. Specifically, there is incomplete information regarding alterations in the systemic environment that are relevant to the disease state. We undertook a genome-wide profiling approach to examine gene expression in the peripheral blood of VL patients and healthy controls in the context of our previously published VL-skin gene expression profile. We used several in silico bioinformatics-based analyses to provide new insights into disease mechanisms and suggest novel targets for future therapy. Unsupervised clustering methods of the VL-blood dataset demonstrate a "disease-state"-specific set of co-expressed genes. Ontology enrichment analysis of 99 differentially expressed genes (DEGs) uncovers a down-regulated immune/inflammatory response, B-cell antigen receptor (BCR) pathways, apoptosis and catabolic processes in VL-blood. There is evidence for both type I and II interferon (IFN) playing a role in VL pathogenesis. We used interactome analysis to identify several key blood-associated transcription factors (TFs), both from within the dataset (STAT1, STAT6 and NF-kB) and "hidden" regulators inferred from it (CREB1, MYC, IRF4, IRF1, and TP53), that potentially affect disease pathogenesis. The TFs overlap with our reported lesional-skin transcriptional circuitry, underscoring their potential importance to the disease. We also identify a shared VL-blood and -skin transcriptional "hot spot" that maps to chromosome 6, and includes three VL-blood dysregulated genes (PSMB8, PSMB9 and TAP1) described as potential VL-associated genetic susceptibility loci. Finally, we provide bioinformatics-based support for prioritizing dysregulated genes in VL-blood or skin as potential therapeutic targets. We examined the VL-blood transcriptome in context with our (previously published) VL-skin transcriptional profile to address a major gap in knowledge regarding the systemic changes underlying the skin-specific manifestation of vitiligo. Several transcriptional "hot spots" observed in both environments offer prioritized targets for identifying disease risk genes. Finally, within the transcriptional framework of VL, we identify five novel molecules (STAT1, PRKCD, PTPN6, MYC and FGFR2) that lend themselves to being targeted by drugs for future potential VL-therapy.
DOT National Transportation Integrated Search
2014-10-01
For this project, researchers used an existing dataset from a previous research effort to investigate the "moth effect" theory, where it is believed that drivers drift toward bright lights. While the previous research study primarily focused on sig...
New gravity anomaly map of Taiwan and its surrounding regions with some tectonic interpretations
NASA Astrophysics Data System (ADS)
Doo, Wen-Bin; Lo, Chung-Liang; Hsu, Shu-Kun; Tsai, Ching-Hui; Huang, Yin-Sheng; Wang, Hsueh-Fen; Chiu, Shye-Donq; Ma, Yu-Fang; Liang, Chin-Wei
2018-04-01
In this study, we compiled recently collected (from 2005 to 2015) and previously reported (published and open access) gravity data, including land, shipborne and satellite-derived data, for Taiwan and its surrounding regions. Based on the cross-over error analysis, all data were adjusted, and new Free-air gravity anomalies were obtained, shedding light on the tectonics of the region. To obtain the Bouguer gravity anomalies, the densities of land terrain and marine sediments were assumed to be 2.53 and 1.80 g/cm3, respectively. The updated gravity dataset was gridded with a spacing of one arc-minute. Several previously unnoticed gravity features are revealed by the new maps and can be used in a broad range of applications: (1) An isolated gravity high is located between the Shoushan and the Kaoping Canyon off southwest Taiwan. (2) Along the Luzon Arc, both Free-air and Bouguer gravity anomaly maps reveal a significant gravity discontinuity feature at the latitude of 21°20′N. (3) In the southwestern Okinawa Trough, the NE-SW trending cross-back-arc volcanic trail (CBVT) marks the low-high gravity anomaly (both Free-air and Bouguer) boundary.
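For reference, the Bouguer reduction mentioned above amounts, in its simplest infinite-slab form, to subtracting 2πGρh; the function below evaluates that correction with the densities assumed in the study and illustrative thicknesses.

```python
# Hedged illustration of the infinite-slab Bouguer correction 2*pi*G*rho*h,
# evaluated with the study's assumed densities (2.53 g/cm^3 land terrain,
# 1.80 g/cm^3 marine sediment). Station heights are illustrative.
import math

G = 6.674e-11                               # gravitational constant, m^3 kg^-1 s^-2

def bouguer_slab_mgal(density_g_cm3: float, thickness_m: float) -> float:
    """Infinite-slab Bouguer correction in mGal (1 mGal = 1e-5 m/s^2)."""
    rho = density_g_cm3 * 1000.0            # g/cm^3 -> kg/m^3
    return 2 * math.pi * G * rho * thickness_m / 1e-5

print(bouguer_slab_mgal(2.53, 500.0))       # land station 500 m above datum: ~53 mGal
print(bouguer_slab_mgal(1.80, 200.0))       # 200 m marine sediment column
```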
Neopterygian phylogeny: the merger assay
López-Arbarello, Adriana; Sferco, Emilia
2018-01-01
The phylogenetic relationships of the recently described genus †Ticinolepis from the Middle Triassic of the Monte San Giorgio are explored through cladistic analyses of the so far largest morphological dataset for fossil actinopterygians, including representatives of the crown-neopterygian clades Halecomorphi, Ginglymodi and Teleostei, and merging the characters from previously published systematic studies together with newly proposed characters. †Ticinolepis is retrieved as the most basal Ginglymodi and our results support the monophyly of Teleostei and Holostei, as well as Halecomorphi and Ginglymodi within the latter clade. The patterns of relationships within these clades mostly agree with those of previous studies, although a few important differences require future research. According to our results, ionoscopiforms are not monophyletic, caturids are not amiiforms and leptolepids and luisiellids form a monophyletic clade. Our phylogenetic hypothesis confirms the rapid radiation of the holostean clades Halecomorphi and Ginglymodi during the Early and Middle Triassic and the radiation of pholidophoriform teleosts during the Late Triassic. Crown-group Halecomorphi have an enormous ghost lineage throughout half of the Mesozoic, but ginglymodians and teleosts show a second radiation during the Early Jurassic. The crown-groups of Halecomorphi, Ginglymodi and Teleostei originated within parallel events of radiation during the Late Jurassic. PMID:29657820
Repetitive deliberate fires: Development and validation of a methodology to detect series.
Bruenisholz, Eva; Delémont, Olivier; Ribaux, Olivier; Wilson-Wilde, Linzi
2017-08-01
The detection of repetitive deliberate fire events is challenging and still often ineffective due to a case-by-case approach. A previous study provided a critical review of the situation and an analysis of the main challenges. This study suggested that the intelligence process, integrating forensic data, could be a valid framework to provide follow-up and systematic analysis, provided it is adapted to the specificities of repetitive deliberate fires. In the current manuscript, a specific methodology to detect deliberate fire series, i.e. those set by the same perpetrators, is presented and validated. It is based on case profiles relying on specific elements previously identified. The method was validated using a dataset of approximately 8000 deliberate fire events collected over 12 years in a Swiss state. Twenty possible series were detected, including 6 of 9 known series. These results are very promising and pave the way for a systematic implementation of this methodology in an intelligence framework, whilst demonstrating the need for, and benefit of, increasing the collection of forensic-specific information to strengthen the value of links between cases. Crown Copyright © 2017. Published by Elsevier B.V. All rights reserved.
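A toy sketch of profile-based linkage: score pairwise similarity between case profiles on a few categorical elements and keep pairs above a threshold as candidate series links. The profile fields, the equal-weight similarity, and the threshold are illustrative assumptions, not the validated method's parameters.

```python
# Hedged sketch of case-profile linkage for series detection. Fields, weights,
# and the 2/3 threshold are invented for illustration.
from itertools import combinations

cases = {
    "case1": {"target": "vehicle", "ignition": "accelerant", "daytime": "night"},
    "case2": {"target": "vehicle", "ignition": "accelerant", "daytime": "night"},
    "case3": {"target": "bin", "ignition": "direct flame", "daytime": "day"},
}

def similarity(a: dict, b: dict) -> float:
    """Fraction of profile elements on which two cases agree."""
    return sum(a[k] == b[k] for k in a) / len(a)

links = [(i, j) for (i, a), (j, b) in combinations(cases.items(), 2)
         if similarity(a, b) >= 2 / 3]      # candidate same-series links
print(links)
```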
High-J rotational spectrum of toluene in |m| ⩽ 3 torsional states
NASA Astrophysics Data System (ADS)
Ilyushin, Vadim V.; Alekseev, Eugene A.; Kisiel, Zbigniew; Pszczółkowski, Lech
2017-09-01
The study of the rotational spectrum of toluene (C6H5CH3) is considerably extended to include transitions in |m| ⩽ 3 torsional states up to the onset of the submillimeter wave region. New data involving torsion-rotation transitions up to 336 GHz were combined with previously published measurements and fitted using the rho-axis-method torsion-rotation Hamiltonian. The final fit used 50 parameters to give an overall weighted root-mean-square deviation of 0.69 for a dataset consisting of 8924 transitions with J up to 94 and Ka up to 50. The new analysis allowed us to resolve all problems encountered previously for m = 0 transitions beyond a certain combination of quantum numbers J and Ka when many lines of appreciable intensity and unambiguous assignment deviated from the distorted asymmetric rotor treatment. Those discrepancies are now identified to result from m = 0 ↔ m = 3 and m = 0 ↔ m = -3 resonances, which have been successfully encompassed by the current fit. At the same time an analogous problem was discovered and fitted for m = 2 transitions, which were found to be affected by many m = 1 ↔ m = 2 resonances.
Federating heterogeneous datasets to enhance data sharing and experiment reproducibility
NASA Astrophysics Data System (ADS)
Prieto, Juan C.; Paniagua, Beatriz; Yatabe, Marilia S.; Ruellas, Antonio C. O.; Fattori, Liana; Muniz, Luciana; Styner, Martin; Cevidanes, Lucia
2017-03-01
Recent studies have demonstrated the difficulties of replicating scientific findings and/or experiments published in the past [1]. The effects seen in the replicated experiments were smaller than previously reported. Some of the explanations for these findings include the complexity of the experimental design and the pressure on researchers to report positive findings. The International Committee of Medical Journal Editors (ICMJE) suggests that every study considered for publication must submit a plan to share the de-identified patient data no later than 6 months after publication. There is a growing demand to enhance the management of clinical data, facilitate data sharing across institutions and also to keep track of the data from previous experiments. The ultimate goal is to assure the reproducibility of experiments in the future. This paper describes Shiny-tooth, a web-based application created to improve clinical data acquisition during clinical trials and to federate such data, as well as morphological data derived from medical images. Currently, this application is being used to store clinical data from an osteoarthritis (OA) study. This work is submitted to the SPIE Biomedical Applications in Molecular, Structural, and Functional Imaging conference.
Comparing ensemble learning methods based on decision tree classifiers for protein fold recognition.
Bardsiri, Mahshid Khatibi; Eftekhari, Mahdi
2014-01-01
In this paper, some methods for ensemble learning of protein fold recognition based on a decision tree (DT) are compared and contrasted against each other over three datasets taken from the literature. According to previously reported studies, the features of the datasets are divided into some groups. Then, for each of these groups, three ensemble classifiers, namely, random forest, rotation forest and AdaBoost.M1, are employed. Also, some fusion methods are introduced for combining the ensemble classifiers obtained in the previous step. After this step, three classifiers are produced based on the combination of classifiers of types random forest, rotation forest and AdaBoost.M1. Finally, the three different classifiers achieved are combined to make an overall classifier. Experimental results show that the overall classifier obtained by the genetic algorithm (GA) weighting fusion method is the best one in comparison to previously applied methods in terms of classification accuracy.
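A rough scikit-learn sketch of the fusion stage: average member probabilities with weights chosen by a simple random search standing in for the paper's GA weighting; rotation forest, which scikit-learn does not provide, is replaced here by a second random forest. Weights are tuned on a held-out validation fold.

```python
# Hedged sketch of weighted probability fusion of ensemble members. Random
# search stands in for the paper's genetic-algorithm weighting; data are
# synthetic placeholders for the protein-fold feature groups.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=30, random_state=0)
Xtr, Xval, ytr, yval = train_test_split(X, y, random_state=0)

members = [RandomForestClassifier(random_state=0).fit(Xtr, ytr),
           RandomForestClassifier(random_state=1).fit(Xtr, ytr),
           AdaBoostClassifier(random_state=0).fit(Xtr, ytr)]
probas = np.stack([m.predict_proba(Xval)[:, 1] for m in members])

rng = np.random.default_rng(0)
best_w, best_acc = None, 0.0
for _ in range(200):                        # random search over weight vectors
    w = rng.random(len(members))
    w /= w.sum()
    acc = (((w @ probas) > 0.5).astype(int) == yval).mean()
    if acc > best_acc:
        best_w, best_acc = w, acc
print(best_w, best_acc)
```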
NASA Astrophysics Data System (ADS)
Ferland, G. J.; Chatzikos, M.; Guzmán, F.; Lykins, M. L.; van Hoof, P. A. M.; Williams, R. J. R.; Abel, N. P.; Badnell, N. R.; Keenan, F. P.; Porter, R. L.; Stancil, P. C.
2017-10-01
We describe the 2017 release of the spectral synthesis code Cloudy, summarizing the many improvements to the scope and accuracy of the physics which have been made since the previous release. Exporting the atomic data into external data files has enabled many new large datasets to be incorporated into the code. The use of the complete datasets is not realistic for most calculations, so we describe the limited subset of data used by default, which predicts significantly more lines than the previous release of Cloudy. This version is nevertheless faster than the previous release, as a result of code optimizations. We give examples of the accuracy limits using small models, and the performance requirements of large complete models. We summarize several advances in the H- and He-like iso-electronic sequences and use our complete collisional-radiative models to establish the densities where the coronal and local thermodynamic equilibrium approximations work.
Robertson, Tim; Döring, Markus; Guralnick, Robert; Bloom, David; Wieczorek, John; Braak, Kyle; Otegui, Javier; Russell, Laura; Desmet, Peter
2014-01-01
The planet is experiencing an ongoing global biodiversity crisis. Measuring the magnitude and rate of change more effectively requires access to organized, easily discoverable, and digitally-formatted biodiversity data, both legacy and new, from across the globe. Assembling this coherent digital representation of biodiversity requires the integration of data that have historically been analog, dispersed, and heterogeneous. The Integrated Publishing Toolkit (IPT) is a software package developed to support biodiversity dataset publication in a common format. The IPT’s two primary functions are to 1) encode existing species occurrence datasets and checklists, such as records from natural history collections or observations, in the Darwin Core standard to enhance interoperability of data, and 2) publish and archive data and metadata for broad use in a Darwin Core Archive, a set of files following a standard format. Here we discuss the key need for the IPT, how it has developed in response to community input, and how it continues to evolve to streamline and enhance the interoperability, discoverability, and mobilization of new data types beyond basic Darwin Core records. We close with a discussion how IPT has impacted the biodiversity research community, how it enhances data publishing in more traditional journal venues, along with new features implemented in the latest version of the IPT, and future plans for more enhancements. PMID:25099149
NASA Astrophysics Data System (ADS)
Forkert, Nils Daniel; Siemonsen, Susanne; Dalski, Michael; Verleger, Tobias; Kemmling, Andre; Fiehler, Jens
2014-03-01
Acute ischemic stroke is a leading cause of death and disability in industrialized nations. When an acute ischemic stroke is present, prediction of the future tissue outcome is of high interest to clinicians, as it can support therapy decision making. Within this context, it has already been shown that voxel-wise multi-parametric tissue outcome prediction leads to more promising results than single-channel perfusion map thresholding. Most previously published multi-parametric predictions employ information from perfusion maps derived from perfusion-weighted MRI (PWI) together with other image sequences such as diffusion-weighted MRI. However, it remains unclear if the typically calculated perfusion maps used for this purpose really include all valuable information from the PWI dataset for an optimal tissue outcome prediction. To investigate this problem in more detail, two different methods to predict tissue outcome using a k-nearest-neighbor approach were developed in this work and evaluated based on 18 datasets of acute stroke patients with known tissue outcome. The first method integrates apparent diffusion coefficient and perfusion parameter (Tmax, MTT, CBV, CBF) information for the voxel-wise prediction, while the second method also employs apparent diffusion coefficient information but uses the complete perfusion information, in terms of the voxel-wise residue functions, instead of the perfusion parameter maps. Overall, the comparison of the results of the two prediction methods for the 18 patients using a leave-one-out cross validation revealed no considerable differences. Quantitatively, the parameter-based prediction of tissue outcome led to a mean Dice coefficient of 0.474, while the prediction using the residue functions led to a mean Dice coefficient of 0.461. Thus, it may be concluded from the results of this study that the perfusion parameter maps typically derived from PWI datasets include all valuable perfusion information required for a voxel-based tissue outcome prediction, while the complete analysis of the residue functions does not add further benefits and is also computationally more expensive.
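A minimal sketch of the voxel-wise k-NN idea from the first method: treat every voxel as a sample with ADC plus four perfusion-parameter features and a known outcome label. The arrays are simulated stand-ins for co-registered patient maps, and k is an arbitrary choice.

```python
# Hedged sketch of voxel-wise k-NN tissue outcome prediction. Features
# (columns: ADC, Tmax, MTT, CBV, CBF) and labels are simulated placeholders
# for co-registered patient maps; k=15 is illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
n_voxels = 5000
X_train = rng.normal(size=(n_voxels, 5))    # ADC + four perfusion parameters
y_train = (X_train[:, 1] > 0.8).astype(int) # 1 = infarcted at follow-up (simulated)

knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)
X_new = rng.normal(size=(2000, 5))          # voxels of a held-out patient
outcome_map = knn.predict(X_new)            # voxel-wise predicted tissue fate
```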
Pettey, Warren B P; Toth, Damon J A; Redd, Andrew; Carter, Marjorie E; Samore, Matthew H; Gundlapalli, Adi V
2016-06-01
Network projections of data can provide an efficient format for data exploration of co-incidence in large clinical datasets. We present and explore the utility of a network projection approach to finding patterns in health care data that could be exploited to prevent homelessness among U.S. Veterans. We divided Veteran ICD-9-CM (ICD9) data into two time periods (0-59 and 60-364 days prior to the first evidence of homelessness) and then used Pajek social network analysis software to visualize these data as three different networks. A multi-relational network simultaneously displayed the magnitude of ties between the most frequent ICD9 pairings. A new association network visualized ICD9 pairings that greatly increased or decreased. A signed, subtraction network visualized the presence, absence, and magnitude difference between ICD9 associations by time period. A cohort of 9468 U.S. Veterans was identified as having administrative evidence of homelessness and visits in both time periods. They were seen in 222,599 outpatient visits that generated 484,339 ICD9 codes (an average of 11.4 (range 1-23) visits and 2.2 (range 1-60) ICD9 codes per visit). Using the three network projection methods, we were able to show distinct differences in the pattern of co-morbidities in the two time periods. In the more distant time period preceding homelessness, the network was dominated by routine health maintenance visits and physical ailment diagnoses. In the 59 days immediately prior to the homelessness identification, alcohol-related diagnoses were noted, along with economic and legal circumstances such as unemployment and housing instability. Network visualizations of large clinical datasets traditionally treated as tabular and difficult to manipulate reveal rich, previously hidden connections between data variables related to homelessness. A key feature is the ability to visualize changes in variables with temporality and in proximity to the event of interest. These visualizations lend support to cognitive tasks such as exploration of large clinical datasets as a prelude to hypothesis generation. Published by Elsevier Inc.
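The same co-incidence projection can be prototyped outside Pajek; the sketch below counts ICD9 pairs co-occurring within visits with networkx and keeps frequent pairs as weighted edges. The codes, visits, and frequency threshold are invented for illustration.

```python
# Hedged sketch of a co-incidence network: count ICD9 codes co-occurring in
# the same visit and build a weighted graph of frequent pairs (a simplified,
# networkx-based analogue of the Pajek visualizations).
from itertools import combinations
from collections import Counter
import networkx as nx

visits = [["291.81", "V62.0"], ["291.81", "V60.0", "V62.0"], ["401.9", "V70.0"]]
pair_counts = Counter()
for codes in visits:
    pair_counts.update(combinations(sorted(set(codes)), 2))

G = nx.Graph()
for (a, b), n in pair_counts.items():
    if n >= 2:                              # keep only the most frequent pairings
        G.add_edge(a, b, weight=n)
print(G.edges(data=True))
```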
IMG/M: integrated genome and metagenome comparative data analysis system
Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; Palaniappan, Krishna; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Andersen, Evan; Huntemann, Marcel; Varghese, Neha; Hadjithomas, Michalis; Tennessen, Kristin; Nielsen, Torben; Ivanova, Natalia N.; Kyrpides, Nikos C.
2017-01-01
The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system. PMID:27738135
SCPortalen: human and mouse single-cell centric database
Noguchi, Shuhei; Böttcher, Michael; Hasegawa, Akira; Kouno, Tsukasa; Kato, Sachi; Tada, Yuhki; Ura, Hiroki; Abe, Kuniya; Shin, Jay W; Plessy, Charles; Carninci, Piero
2018-01-01
Published single-cell datasets are rich resources for investigators who want to address questions not originally asked by the creators of the datasets. The single-cell datasets might be obtained by different protocols and diverse analysis strategies. The main challenge in utilizing such single-cell data is how to make the various large-scale datasets comparable and reusable in a different context. To address this issue, we developed the single-cell centric database ‘SCPortalen’ (http://single-cell.clst.riken.jp/). The current version of the database covers human and mouse single-cell transcriptomics datasets that are publicly available from the INSDC sites. The original metadata was manually curated and single-cell samples were annotated with standard ontology terms. Following that, common quality assessment procedures were conducted to check the quality of the raw sequence. Furthermore, primary data processing of the raw data followed by advanced analyses and interpretation have been performed from scratch using our pipeline. In addition to the transcriptomics data, SCPortalen provides access to single-cell image files whenever available. The target users of SCPortalen are all researchers interested in specific cell types or population heterogeneity. Through the web interface of SCPortalen users are easily able to search, explore and download the single-cell datasets of their interests. PMID:29045713
NASA Technical Reports Server (NTRS)
Claverie, Martin; Matthews, Jessica L.; Vermote, Eric F.; Justice, Christopher O.
2016-01-01
In land surface models, which are used to evaluate the role of vegetation in the context of global climate change and variability, LAI and FAPAR play a key role, specifically with respect to the carbon and water cycles. The AVHRR-based LAI/FAPAR dataset offers daily temporal resolution, an improvement over previous products. This climate data record is based on a carefully calibrated and corrected land surface reflectance dataset to provide a high-quality, consistent time-series suitable for climate studies. It spans from mid-1981 to the present. Further, this operational dataset is available in near real-time, allowing use for monitoring purposes. The algorithm relies on artificial neural networks calibrated using the MODIS LAI/FAPAR dataset. Evaluation based on cross-comparison with MODIS products and in situ data shows the dataset is consistent and reliable, with overall uncertainties of 1.03 and 0.15 for LAI and FAPAR, respectively. However, a clear saturation effect is observed in the broadleaf forest biomes with high LAI (greater than 4.5) and FAPAR (greater than 0.8) values.
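A toy sketch of the retrieval idea: fit a small neural-network regressor mapping reflectance and geometry inputs to MODIS-style LAI targets. The inputs, network size, and synthetic target function are placeholders, not the operational CDR configuration.

```python
# Hedged sketch: calibrate a neural network to map AVHRR-like inputs to
# MODIS-like LAI. Inputs (reflectances plus angles) and the synthetic target
# are invented stand-ins for the operational training data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
X = rng.uniform(size=(2000, 4))             # e.g. red, NIR reflectance + sun/view angles
lai_modis = 6 * X[:, 1] / (X[:, 0] + X[:, 1] + 1e-6) * X[:, 2]  # synthetic target

net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
net.fit(X, lai_modis)
lai_retrieved = net.predict(X[:5])          # retrieved LAI for new observations
```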
Copes, Lynn E.; Lucas, Lynn M.; Thostenson, James O.; Hoekstra, Hopi E.; Boyer, Doug M.
2016-01-01
A dataset of high-resolution microCT scans of primate skulls (crania and mandibles) and certain postcranial elements was collected to address questions about primate skull morphology. The sample consists of 489 scans taken from 431 specimens, representing 59 species from most primate families. These data have transformative reuse potential, as such datasets are necessary for conducting high-powered research into primate evolution but require significant time and funding to collect. Similar datasets were previously only available to select research groups across the world. The physical specimens are vouchered at Harvard’s Museum of Comparative Zoology. The data collection took place at the Center for Nanoscale Systems at Harvard. The dataset is archived on MorphoSource.org. Though this is the largest high-fidelity comparative dataset yet available, its provisioning on a web archive that allows unlimited researcher contributions promises a future with vastly increased digital collections available at researchers’ fingertips. PMID:26836025
XRF and XANES Data for Kaplan U Paper
The dataset contains two XRF images of iron and uranium distribution on plant roots and a database of XANES data used to produce the XANES spectra shown in Figure 7 of the published paper. This dataset is associated with the following publication: Kaplan, D., R. Kukkadapu, J. Seaman, B. Arey, A. Dohnalkova, S. Buettner, D. Li, T. Varga, K. Scheckel, and P. Jaffe. Iron Mineralogy and Uranium-Binding Environment in the Rhizosphere of a Wetland Soil. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 569: 53-64, (2016).
A Computational Approach to Qualitative Analysis in Large Textual Datasets
Evans, Michael S.
2014-01-01
In this paper I introduce computational techniques to extend qualitative analysis into the study of large textual datasets. I demonstrate these techniques by using probabilistic topic modeling to analyze a broad sample of 14,952 documents published in major American newspapers from 1980 through 2012. I show how computational data mining techniques can identify and evaluate the significance of qualitatively distinct subjects of discussion across a wide range of public discourse. I also show how examining large textual datasets with computational methods can overcome methodological limitations of conventional qualitative methods, such as how to measure the impact of particular cases on broader discourse, how to validate substantive inferences from small samples of textual data, and how to determine if identified cases are part of a consistent temporal pattern. PMID:24498398
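A minimal sketch of probabilistic topic modeling on a toy corpus, using scikit-learn's LatentDirichletAllocation; the paper does not name its implementation, and the documents and topic count here are illustrative.

```python
# Hedged sketch of probabilistic topic modeling: fit LDA to a tiny toy corpus
# and print the top terms per topic. Corpus and n_components are invented.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stem cell research funding debate",
        "school board debates evolution curriculum",
        "new stem cell lines approved for research"]
counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:]]
    print(f"topic {k}: {top}")
```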
Medical Subject Headings (MeSH) for indexing and retrieving open-source healthcare data.
Marc, David T; Khairat, Saif S
2014-01-01
The US federal government initiated the Open Government Directive where federal agencies are required to publish high value datasets so that they are available to the public. Data.gov and the community site Healthdata.gov were initiated to disperse such datasets. However, data searches and retrieval for these sites are keyword driven and severely limited in performance. The purpose of this paper is to address the issue of extracting relevant open-source data by proposing a method of adopting the MeSH framework for indexing and data retrieval. A pilot study was conducted to compare the performance of traditional keywords to MeSH terms for retrieving relevant open-source datasets related to "mortality". The MeSH framework resulted in greater sensitivity with comparable specificity to the keyword search. MeSH showed promise as a method for indexing and retrieving data, yet future research should conduct a larger scale evaluation of the performance of the MeSH framework for retrieving relevant open-source healthcare datasets.
Discriminating Projections for Estimating Face Age in Wild Images
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tokola, Ryan A; Bolme, David S; Ricanek, Karl
2014-01-01
We introduce a novel approach to estimating the age of a human from a single uncontrolled image. Current face age estimation algorithms work well in highly controlled images, and some are robust to changes in illumination, but it is usually assumed that images are close to frontal. This bias is clearly seen in the datasets that are commonly used to evaluate age estimation, which either entirely or mostly consist of frontal images. Using pose-specific projections, our algorithm maps image features into a pose-insensitive latent space that is discriminative with respect to age. Age estimation is then performed using a multi-class SVM. We show that our approach outperforms other published results on the Images of Groups dataset, which is the only age-related dataset with a non-trivial number of off-axis face images, and that we are competitive with recent age estimation algorithms on the mostly-frontal FG-NET dataset. We also experimentally demonstrate that our feature projections introduce insensitivity to pose.
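A toy sketch of the pose-specific projection idea: learn one discriminative linear projection per pose bin (LDA here, standing in for the paper's learned projections), pool the projected features as a shared latent space, and classify with a multi-class SVM. All data and pose bins are simulated.

```python
# Hedged sketch of pose-specific projections into a shared latent space.
# LDA substitutes for the paper's learned discriminative projections; data,
# pose bins, and age groups are simulated placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

rng = np.random.default_rng(6)
n, d, n_ages = 300, 20, 4
X = rng.normal(size=(n, d))                 # face features
age = rng.integers(0, n_ages, n)            # age-group labels
pose = rng.integers(0, 3, n)                # pose bin per image

latent = np.zeros((n, n_ages - 1))
for p in range(3):                          # one projection per pose bin
    idx = pose == p
    latent[idx] = LinearDiscriminantAnalysis(
        n_components=n_ages - 1).fit_transform(X[idx], age[idx])

svm = SVC().fit(latent, age)                # pose-insensitive age classifier
```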
Kong, Xianyu; Sun, Yuyan; Su, Rongguo; Shi, Xiaoyong
2017-06-15
The development of techniques for real-time monitoring of the eutrophication status of coastal waters is of great importance for realizing potential cost savings in coastal monitoring programs and providing timely advice for marine health management. In this study, a grid-search (GS) optimized support vector machine (SVM) was proposed to model relationships between 6 easily measured parameters (DO, Chl-a, C1, C2, C3 and C4) and the TRIX index for rapidly assessing marine eutrophication states of coastal waters. The good predictive performance of the developed method was indicated by the R2 between the measured and predicted values (0.92 for the training dataset and 0.91 for the validation dataset) at a 95% confidence level. The classification accuracy of the eutrophication status was 86.5% for the training dataset and 85.6% for the validation dataset. The results indicated that it is feasible to develop an SVM technique for timely evaluation of the eutrophication status by easily measured parameters. Copyright © 2017. Published by Elsevier Ltd.
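A minimal sketch of a grid-search-optimized SVM regression relating six predictors to a TRIX-like response, using scikit-learn's GridSearchCV and SVR on simulated data; the authors' software, kernel, and parameter grid are not stated in the abstract.

```python
# Hedged sketch of GS-SVM: tune SVR hyperparameters by cross-validated grid
# search. The six predictors and the TRIX-like target are simulated.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))               # stand-ins for DO, Chl-a, C1..C4
trix = X @ rng.normal(size=6) + rng.normal(scale=0.3, size=200)

grid = GridSearchCV(SVR(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5, scoring="r2")
grid.fit(X, trix)
print(grid.best_params_, grid.best_score_)
```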
SPICE: exploration and analysis of post-cytometric complex multivariate datasets.
Roederer, Mario; Nozzi, Joshua L; Nason, Martha C
2011-02-01
Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.
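A small sketch of one nonparametric route to comparing multi-component measurements between groups: a permutation test on the distance between group mean composition vectors. This illustrates the flavor of such a statistic, not the paper's exact definition.

```python
# Hedged sketch of a permutation test for multi-component profiles: compare
# the distance between group means against its permutation distribution.
# The 4-component profiles are simulated stand-ins for functional profiles.
import numpy as np

rng = np.random.default_rng(8)
a = rng.dirichlet([2, 1, 1, 1], size=20)    # group A profiles (rows sum to 1)
b = rng.dirichlet([1, 1, 1, 2], size=20)    # group B profiles

def stat(x, y):
    return np.linalg.norm(x.mean(axis=0) - y.mean(axis=0))

observed = stat(a, b)
pooled = np.vstack([a, b])
perm_stats = []
for _ in range(10_000):
    rng.shuffle(pooled)                     # permute group membership
    perm_stats.append(stat(pooled[:20], pooled[20:]))
p_value = (np.sum(np.array(perm_stats) >= observed) + 1) / (10_000 + 1)
print(p_value)
```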
Assessing Metadata Quality of a Federally Sponsored Health Data Repository
Marc, David T.; Beattie, James; Herasevich, Vitaly; Gatewood, Laël; Zhang, Rui
2016-01-01
The U.S. Federal Government developed HealthData.gov to disseminate healthcare datasets to the public. Metadata is provided for each datasets and is the sole source of information to find and retrieve data. This study employed automated quality assessments of the HealthData.gov metadata published from 2012 to 2014 to measure completeness, accuracy, and consistency of applying standards. The results demonstrated that metadata published in earlier years had lower completeness, accuracy, and consistency. Also, metadata that underwent modifications following their original creation were of higher quality. HealthData.gov did not uniformly apply Dublin Core Metadata Initiative to the metadata, which is a widely accepted metadata standard. These findings suggested that the HealthData.gov metadata suffered from quality issues, particularly related to information that wasn’t frequently updated. The results supported the need for policies to standardize metadata and contributed to the development of automated measures of metadata quality. PMID:28269883
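A minimal sketch of an automated completeness measure: score each record by the fraction of required fields that are non-empty. The field list is a Dublin Core-flavored assumption, not HealthData.gov's actual schema.

```python
# Hedged sketch of automated metadata completeness scoring. The required
# fields and example records are illustrative assumptions.
REQUIRED = ["title", "description", "publisher", "modified", "license"]

records = [
    {"title": "Mortality rates", "description": "County-level deaths",
     "publisher": "CDC", "modified": "2014-06-01", "license": ""},
    {"title": "Hospital charges", "description": "", "publisher": "CMS"},
]

for r in records:
    score = sum(bool(r.get(f, "").strip()) for f in REQUIRED) / len(REQUIRED)
    print(f"{r['title']}: completeness = {score:.0%}")
```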
Somatic Mutations and Neoepitope Homology in Melanomas Treated with CTLA-4 Blockade.
Nathanson, Tavi; Ahuja, Arun; Rubinsteyn, Alexander; Aksoy, Bulent Arman; Hellmann, Matthew D; Miao, Diana; Van Allen, Eliezer; Merghoub, Taha; Wolchok, Jedd D; Snyder, Alexandra; Hammerbacher, Jeff
2017-01-01
Immune checkpoint inhibitors are promising treatments for patients with a variety of malignancies. Toward understanding the determinants of response to immune checkpoint inhibitors, it was previously demonstrated that the presence of somatic mutations is associated with benefit from checkpoint inhibition. A hypothesis was posited that neoantigen homology to pathogens may in part explain the link between somatic mutations and response. To further examine this hypothesis, we reanalyzed cancer exome data obtained from our previously published study of 64 melanoma patients treated with CTLA-4 blockade and a new dataset of RNA-Seq data from 24 of these patients. We found that the ability to accurately predict patient benefit did not increase as the analysis narrowed from somatic mutation burden, to inclusion of only those mutations predicted to be MHC class I neoantigens, to only including those neoantigens that were expressed or that had homology to pathogens. The only association between somatic mutation burden and response was found when examining samples obtained prior to treatment. Neoantigen and expressed neoantigen burden were also associated with response, but neither was more predictive than somatic mutation burden. Neither the previously described tetrapeptide signature nor an updated method to evaluate neoepitope homology to pathogens was more predictive than mutation burden. Cancer Immunol Res; 5(1); 84-91. ©2016 American Association for Cancer Research.
Highland, Steven; James, R R
2016-04-01
Honey bee (Apis mellifera L., Hymenoptera: Apidae) colonies have experienced profound fluctuations, especially declines, in the past few decades. Long-term datasets on honey bees are needed to identify the most important environmental and cultural factors associated with these changes. While a few such datasets exist, scientists have been hesitant to use some of these due to perceived shortcomings in the data. We compared data and trends for three datasets. Two come from the US Department of Agriculture's National Agricultural Statistics Service (NASS), Agricultural Statistics Board: one is the annual survey of honey-producing colonies from the Annual Bee and Honey program (ABH), and the other is colony counts from the Census of Agriculture conducted every five years. The third dataset we developed from the number of colonies registered annually by some states. We compared the long-term patterns of change in colony numbers among the datasets on a state-by-state basis. The three datasets often showed similar hive numbers, and how well their trends agreed varied by state, with differences between datasets being greatest for those states receiving a large number of migratory colonies. Dataset comparisons provide a method to estimate the number of colonies in a state used for pollination versus honey production. Some states also had separate data for local and migratory colonies, allowing one to determine whether the migratory colonies were typically used for pollination or honey production. The Census of Agriculture should provide the most accurate long-term data on colony numbers, but only every five years. © The Authors 2016. Published by Oxford University Press on behalf of Entomological Society of America. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Shen, Yi; Wang, Zhanwei; Loo, Lenora WM; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A.; Katsaros, Dionyssios; Yu, Herbert
2015-01-01
Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was regulated more possibly by promoter methylation than by the alteration of gene copy number. Analysis of additional datasets confirmed our previous findings of high expression of LINC00472 associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management. PMID:26564482
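The nine-dataset survival analysis above is a standard inverse-variance, fixed-effect meta-analysis of per-dataset effect estimates. The sketch below illustrates that generic calculation on invented log hazard ratios (the paper's actual estimates are not reproduced here), assuming Python with numpy and scipy:

    # Fixed-effect, inverse-variance meta-analysis of log hazard ratios.
    # The per-dataset values are toy numbers, not the paper's results.
    import numpy as np
    from scipy.stats import norm

    log_hr = np.log([0.72, 0.81, 0.64, 0.90, 0.77])   # per-dataset hazard ratios (toy)
    se     = np.array([0.15, 0.20, 0.25, 0.18, 0.22]) # standard errors of log HR

    w = 1.0 / se**2                                   # inverse-variance weights
    pooled = (w * log_hr).sum() / w.sum()
    pooled_se = np.sqrt(1.0 / w.sum())
    z = pooled / pooled_se
    p = 2 * norm.sf(abs(z))
    lo, hi = pooled + np.array([-1.96, 1.96]) * pooled_se
    print(f"pooled HR = {np.exp(pooled):.2f} "
          f"(95% CI {np.exp(lo):.2f}-{np.exp(hi):.2f}), p = {p:.3g}")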
Maes, Dirk; Vanreusel, Wouter; Herremans, Marc; Vantieghem, Pieter; Brosens, Dimitri; Gielen, Karin; Beck, Olivier; Van Dyck, Hans; Desmet, Peter; Natuurpunt, Vlinderwerkgroep
2016-01-01
In this data paper, we describe two datasets derived from two sources, which collectively represent the most complete overview of butterflies in Flanders and the Brussels Capital Region (northern Belgium). The first dataset (further referred to as the INBO dataset – http://doi.org/10.15468/njgbmh) contains 761,660 records of 70 species and is compiled by the Research Institute for Nature and Forest (INBO) in cooperation with the Butterfly working group of Natuurpunt (Vlinderwerkgroep). It is derived from the database Vlinderdatabank at the INBO, which consists of (historical) collection and literature data (1830-2001), for which all butterfly specimens in institutional and available personal collections were digitized and all entomological and other relevant publications were checked for butterfly distribution data. It also contains observations and monitoring data for the period 1991-2014. The latter type were collected by a (small) butterfly monitoring network where butterflies were recorded using a standardized protocol. The second dataset (further referred to as the Natuurpunt dataset – http://doi.org/10.15468/ezfbee) contains 612,934 records of 63 species and is derived from the database http://waarnemingen.be, hosted at the nature conservation NGO Natuurpunt in collaboration with Stichting Natuurinformatie. This dataset contains butterfly observations by volunteers (citizen scientists), mainly since 2008. Together, these datasets currently contain a total of 1,374,594 records, which are georeferenced using the centroid of their respective 5 × 5 km² Universal Transverse Mercator (UTM) grid cell. Both datasets are published as open data and are available through the Global Biodiversity Information Facility (GBIF). PMID:27199606
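Both butterfly datasets georeference records to the centroid of a 5 × 5 km UTM grid cell. A minimal sketch of that convention, assuming the pyproj package and UTM zone 31N (EPSG:32631), which covers most of Flanders; the coordinates are illustrative:

    # Snap a record's WGS84 coordinates to the centroid of its 5 x 5 km UTM cell.
    from math import floor
    from pyproj import Transformer

    to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32631", always_xy=True)
    to_wgs = Transformer.from_crs("EPSG:32631", "EPSG:4326", always_xy=True)

    def cell_centroid(lon, lat, cell=5000):
        e, n = to_utm.transform(lon, lat)              # project to UTM metres
        ce = floor(e / cell) * cell + cell / 2         # centre of the 5 km cell
        cn = floor(n / cell) * cell + cell / 2
        return to_wgs.transform(ce, cn)                # back to lon/lat

    print(cell_centroid(4.35, 50.85))   # a Brussels observation snapped to its cell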
De-identification of health records using Anonym: effectiveness and robustness across datasets.
Zuccon, Guido; Kotzur, Daniel; Nguyen, Anthony; Bergheim, Anton
2014-07-01
Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness. The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in type of health records, source of data, and their quality, with one of the datasets containing optical character recognition errors. Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training. Findings show that Anonym is comparable to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations of training size, data type and quality in the presence of sufficient training data. Crown Copyright © 2014. Published by Elsevier B.V. All rights reserved.
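Anonym itself is not distributed with this abstract, but the underlying pattern (a conditional random field labelling tokens as protected health information or not, from lexical and pattern-matching features) can be sketched with the third-party sklearn-crfsuite package. The tiny training set and tag scheme below are invented for illustration only:

    # Minimal CRF token-labelling sketch in the spirit of de-identification tools.
    import sklearn_crfsuite

    def token_features(tokens, i):
        t = tokens[i]
        return {
            "lower": t.lower(),
            "is_title": t.istitle(),
            "is_digit": t.isdigit(),
            "looks_like_date": any(c.isdigit() for c in t) and "/" in t,  # crude pattern feature
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        }

    train_sents = [["Patient", "John", "Smith", "seen", "on", "03/04/2013"],
                   ["Follow", "up", "with", "Dr", "Jones", "next", "week"]]
    train_tags = [["O", "B-NAME", "I-NAME", "O", "O", "B-DATE"],
                  ["O", "O", "O", "O", "B-NAME", "O", "O"]]

    X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
    crf.fit(X, train_tags)

    test = ["Seen", "by", "Dr", "Smith", "on", "01/02/2014"]
    print(crf.predict([[token_features(test, i) for i in range(len(test))]]))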
LEAP: biomarker inference through learning and evaluating association patterns.
Jiang, Xia; Neapolitan, Richard E
2015-03-01
Single nucleotide polymorphism (SNP) high-dimensional datasets are available from Genome Wide Association Studies (GWAS). Such data provide researchers opportunities to investigate the complex genetic basis of diseases. Much of genetic risk might be due to undiscovered epistatic interactions, which are interactions in which combinations of several genes affect disease. Research aimed at discovering interacting SNPs from GWAS datasets has proceeded in two directions. First, tools were developed to evaluate candidate interactions. Second, algorithms were developed to search over the space of candidate interactions. Another problem when learning interacting SNPs, which has not received much attention, is evaluating how likely it is that the learned SNPs are associated with the disease. A complete system should provide this information as well. We develop such a system. Our system, called LEAP, includes a new heuristic search algorithm for learning interacting SNPs, and a Bayesian network based algorithm for computing the probability of their association. We evaluated the performance of LEAP using 100 1,000-SNP simulated datasets, each of which contains 15 SNPs involved in interactions. When learning interacting SNPs from these datasets, LEAP outperformed seven other methods. Furthermore, only SNPs involved in interactions were found to be probable. We also used LEAP to analyze real Alzheimer's disease and breast cancer GWAS datasets. We obtained interesting and new results from the Alzheimer's dataset, but limited results from the breast cancer dataset. We conclude that our results support that LEAP is a useful tool for extracting candidate interacting SNPs from high-dimensional datasets and determining their probability. © 2015 The Authors. Genetic Epidemiology published by Wiley Periodicals, Inc.
Development of global sea ice 6.0 CICE configuration for the Met Office global coupled model
Rae, J. . G. L; Hewitt, H. T.; Keen, A. B.; ...
2015-03-05
The new sea ice configuration GSI6.0, used in the Met Office global coupled configuration GC2.0, is described and the sea ice extent, thickness and volume are compared with the previous configuration and with observationally-based datasets. In the Arctic, the sea ice is thicker in all seasons than in the previous configuration, and there is now better agreement of the modelled concentration and extent with the HadISST dataset. In the Antarctic, a warm bias in the ocean model has been exacerbated at the higher resolution of GC2.0, leading to a large reduction in ice extent and volume; further work is required to rectify this in future configurations.
Relationship between Defect Size and Fatigue Life Distributions in Al-7 Pct Si-Mg Alloy Castings
NASA Astrophysics Data System (ADS)
Tiryakioğlu, Murat
2009-07-01
A new method for predicting the variability in fatigue life of castings was developed by combining the size distribution for the fatigue-initiating defects and a fatigue life model based on the Paris-Erdoğan law for crack propagation. Two datasets for the fatigue-initiating defects in Al-7 pct Si-Mg alloy castings, reported previously in the literature, were used to demonstrate that (1) the sizes of fatigue-initiating defects follow the Gumbel distribution; (2) the crack propagation model developed previously provides respectable fits to experimental data; and (3) the method developed in the present study expresses the variability in both datasets almost as well as the lognormal distribution and better than the Weibull distribution.
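The two ingredients of the method combine naturally in code: fit a Gumbel distribution to initiating-defect sizes, then map sampled sizes to fatigue lives through the closed-form integral of the Paris-Erdoğan law. In this sketch the defect data and material constants are placeholders, not the paper's fitted values (units assumed: crack size a in m, stress range in MPa, stress intensity range in MPa·√m):

    # Gumbel fit of defect sizes + Paris-Erdogan life prediction (toy constants).
    import numpy as np
    from scipy.stats import gumbel_r

    defect_sizes = gumbel_r.rvs(loc=120e-6, scale=40e-6, size=60,
                                random_state=0)    # stand-in defect data (m)
    loc, scale = gumbel_r.fit(defect_sizes)        # step 1: Gumbel fit

    def paris_life(a0, af=5e-3, C=1e-11, m=4.0, Y=0.65, dsigma=150.0):
        # Closed-form integral of da / (C * (Y*dsigma*sqrt(pi*a))**m),
        # valid for m != 2; dsigma in MPa, ΔK in MPa·sqrt(m).
        k = C * (Y * dsigma * np.sqrt(np.pi)) ** m
        e = 1.0 - m / 2.0
        return (af**e - a0**e) / (k * e)

    # Step 2: propagate sampled defect sizes into a fatigue-life distribution.
    lives = paris_life(gumbel_r.rvs(loc, scale, size=10_000, random_state=1))
    print(f"median life: {np.median(lives):.3g} cycles")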
ExSTraCS 2.0: Description and Evaluation of a Scalable Learning Classifier System.
Urbanowicz, Ryan J; Moore, Jason H
2015-09-01
Algorithmic scalability is a major concern for any machine learning strategy in this age of 'big data'. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExSTraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to successfully tackle the challenges of classification, prediction, and knowledge discovery in complex, noisy, and heterogeneous problem domains. While Michigan-style learning classifier systems are powerful and flexible learners, they are not considered to be particularly scalable. For the first time, this paper presents a complete description of the ExSTraCS algorithm and introduces an effective strategy to dramatically improve learning classifier system scalability. ExSTraCS 2.0 addresses scalability with (1) a rule specificity limit, (2) new approaches to expert knowledge guided covering and mutation mechanisms, and (3) the implementation and utilization of the TuRF algorithm for improving the quality of expert knowledge discovery in larger datasets. Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improve nearly every performance metric on datasets with 20 attributes and made it possible for ExSTraCS to reliably scale up to perform on related 200- and 2000-attribute datasets. ExSTraCS 2.0 was also able to reliably solve the 6, 11, 20, 37, 70, and 135 multiplexer problems, and did so in similar or fewer learning iterations than previously reported, with smaller finite training sets, and without using building blocks discovered from simpler multiplexer problems. Furthermore, ExSTraCS usability was made simpler through the elimination of previously critical run parameters.
2011-01-01
Background. Increased understanding of the variability in normal breast biology will enable us to identify mechanisms of breast cancer initiation and the origin of different subtypes, and to better predict breast cancer risk. Methods. Gene expression patterns in breast biopsies from 79 healthy women referred to breast diagnostic centers in Norway were explored by unsupervised hierarchical clustering and supervised analyses, such as gene set enrichment analysis, gene ontology analysis, and comparison with previously published gene lists and independent datasets. Results. Unsupervised hierarchical clustering identified two separate clusters of normal breast tissue based on gene-expression profiling, regardless of clustering algorithm and gene filtering used. Comparison of the expression profile of the two clusters with several published gene lists describing breast cells revealed that the samples in cluster 1 share characteristics with stromal cells and stem cells, and to a certain degree with mesenchymal cells and myoepithelial cells. The samples in cluster 1 also share many features with the newly identified claudin-low breast cancer intrinsic subtype, which also shows characteristics of stromal and stem cells. More women belonging to cluster 1 have a family history of breast cancer and there is a slight overrepresentation of nulliparous women in cluster 1. Similar findings were seen in a separate dataset consisting of histologically normal tissue from both breasts harboring breast cancer and from mammoplasty reductions. Conclusion. This is the first study to explore the variability of gene expression patterns in whole biopsies from normal breasts and identified distinct subtypes of normal breast tissue. Further studies are needed to determine the specific cell contribution to the variation in the biology of normal breasts, how the clusters identified relate to breast cancer risk and their possible link to the origin of the different molecular subtypes of breast cancer. PMID:22044755
Lamontagne, Maxime; Timens, Wim; Hao, Ke; Bossé, Yohan; Laviolette, Michel; Steiling, Katrina; Campbell, Joshua D; Couture, Christian; Conti, Massimo; Sherwood, Karen; Hogg, James C; Brandsma, Corry-Anke; van den Berge, Maarten; Sandford, Andrew; Lam, Stephen; Lenburg, Marc E; Spira, Avrum; Paré, Peter D; Nickle, David; Sin, Don D; Postma, Dirkje S
2014-11-01
COPD is a complex chronic disease with poorly understood pathogenesis. Integrative genomic approaches have the potential to elucidate the biological networks underlying COPD and lung function. We recently combined genome-wide genotyping and gene expression in 1111 human lung specimens to map expression quantitative trait loci (eQTL). To determine causal associations between COPD and lung function-associated single nucleotide polymorphisms (SNPs) and lung tissue gene expression changes in our lung eQTL dataset. We evaluated causality between SNPs and gene expression for three COPD phenotypes: FEV1% predicted, FEV1/FVC and COPD as a categorical variable. Different models were assessed in the three cohorts independently and in a meta-analysis. SNPs associated with a COPD phenotype and gene expression were subjected to causal pathway modelling and manual curation. In silico analyses evaluated functional enrichment of biological pathways among newly identified causal genes. Biologically relevant causal genes were validated in two separate gene expression datasets of lung tissues and bronchial airway brushings. High-reliability causal relations were found in SNP-mRNA-phenotype triplets for FEV1% predicted (n=169) and FEV1/FVC (n=80). Several genes of potential biological relevance for COPD were revealed. eQTL-SNPs upregulating cystatin C (CST3) and CD22 were associated with worse lung function. Signalling pathways enriched with causal genes included xenobiotic metabolism, apoptosis, protease-antiprotease and oxidant-antioxidant balance. By using integrative genomics and analysing the relationships of COPD phenotypes with SNPs and gene expression in lung tissue, we identified CST3 and CD22 as potential causal genes for airflow obstruction. This study also augmented the understanding of previously described COPD pathways. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
An algorithm for direct causal learning of influences on patient outcomes.
Rathnam, Chandramouli; Lee, Sanghoon; Jiang, Xia
2017-01-01
This study aims at developing and introducing a new algorithm, called direct causal learner (DCL), for learning the direct causal influences of a single target. We applied it to both simulated and real clinical and genome wide association study (GWAS) datasets and compared its performance to classic causal learning algorithms. The DCL algorithm learns the causes of a single target from passive data using Bayesian-scoring, instead of using independence checks, and a novel deletion algorithm. We generate 14,400 simulated datasets and measure the number of datasets for which DCL correctly and partially predicts the direct causes. We then compare its performance with the constraint-based path consistency (PC) and conservative PC (CPC) algorithms, the Bayesian-score based fast greedy search (FGS) algorithm, and the partial ancestral graphs algorithm fast causal inference (FCI). In addition, we extend our comparison of all five algorithms to both a real GWAS dataset and real breast cancer datasets over various time-points in order to observe how effective they are at predicting the causal influences of Alzheimer's disease and breast cancer survival. DCL consistently outperforms FGS, PC, CPC, and FCI in discovering the parents of the target for the datasets simulated using a simple network. Overall, DCL predicts significantly more datasets correctly (McNemar's test significance: p<0.0001) than any of the other algorithms for these network types. For example, when assessing overall performance (simple and complex network results combined), DCL correctly predicts approximately 1400 more datasets than the top FGS method, 1600 more datasets than the top CPC method, 4500 more datasets than the top PC method, and 5600 more datasets than the top FCI method. Although FGS did correctly predict more datasets than DCL for the complex networks, and DCL correctly predicted only a few more datasets than CPC for these networks, there is no significant difference in performance between these three algorithms for this network type. However, when we use a more continuous measure of accuracy, we find that all the DCL methods are able to better partially predict more direct causes than FGS and CPC for the complex networks. In addition, DCL consistently had faster runtimes than the other algorithms. In the application to the real datasets, DCL identified rs6784615, located on the NISCH gene, and rs10824310, located on the PRKG1 gene, as direct causes of late onset Alzheimer's disease (LOAD) development. In addition, DCL identified ER category as a direct predictor of breast cancer mortality within 5 years, and HER2 status as a direct predictor of 10-year breast cancer mortality. These predictors have been identified in previous studies to have a direct causal relationship with their respective phenotypes, supporting the predictive power of DCL. When the other algorithms discovered predictors from the real datasets, these predictors were either also found by DCL or could not be supported by previous studies. Our results show that DCL outperforms FGS, PC, CPC, and FCI in almost every case, demonstrating its potential to advance causal learning. Furthermore, our DCL algorithm effectively identifies direct causes in the LOAD and Metabric GWAS datasets, which indicates its potential for clinical applications. Copyright © 2016 Elsevier B.V. All rights reserved.
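DCL's core loop (score candidate parent sets of a single target and keep additions that improve the score) can be sketched generically. The toy below uses a BIC score in place of DCL's Bayesian score and deletion step, applied to simulated discrete data; it illustrates score-based direct-cause search, not the authors' implementation:

    # Greedy score-based search for the direct causes of one target variable.
    import numpy as np
    import pandas as pd

    def bic_score(df, target, parents):
        """Log-likelihood of `target` given `parents`, minus a BIC penalty."""
        n = len(df)
        r = df[target].nunique()                       # target cardinality
        if parents:
            groups = [g for _, g in df.groupby(parents)[target]]
        else:
            groups = [df[target]]
        ll = 0.0
        for vals in groups:
            counts = vals.value_counts().to_numpy(dtype=float)
            ll += (counts * np.log(counts / counts.sum())).sum()
        penalty = 0.5 * np.log(n) * len(groups) * (r - 1)   # free parameters
        return ll - penalty

    def greedy_parents(df, target, candidates):
        """Forward-select candidate causes while the score improves."""
        parents, best = [], bic_score(df, target, [])
        improved = True
        while improved:
            improved = False
            for c in set(candidates) - set(parents):
                s = bic_score(df, target, parents + [c])
                if s > best:
                    best, parents, improved = s, parents + [c], True
        return parents

    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, 2000)
    b = rng.integers(0, 2, 2000)
    noise = rng.random(2000) < 0.1
    y = (a & b) ^ noise.astype(int)                    # Y depends on A AND B
    df = pd.DataFrame({"A": a, "B": b, "C": rng.integers(0, 2, 2000), "Y": y})
    print(greedy_parents(df, "Y", ["A", "B", "C"]))    # expected ['A', 'B'] (order may vary)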
Collaboration-Centred Cities through Urban Apps Based on Open and User-Generated Data
Aguilera, Unai; López-de-Ipiña, Diego; Pérez, Jorge
2016-01-01
This paper describes the IES Cities platform conceived to streamline the development of urban apps that combine heterogeneous datasets provided by diverse entities, namely, government, citizens, sensor infrastructure and other information data sources. This work pursues the challenge of achieving effective citizen collaboration by empowering them to prosume urban data across time. Particularly, this paper focuses on the query mapper; a key component of the IES Cities platform devised to democratize the development of open data-based mobile urban apps. This component allows developers not only to use available data, but also to contribute to existing datasets with the execution of SQL sentences. In addition, the component allows developers to create ad hoc storages for their applications, publishable as new datasets accessible by other consumers. As multiple users could be contributing and using a dataset, our solution also provides a data level permission mechanism to control how the platform manages the access to its datasets. We have evaluated the advantages brought forward by IES Cities from the developers’ perspective by describing an exemplary urban app created on top of it. In addition, we include an evaluation of the main functionalities of the query mapper. PMID:27376300
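The query-mapper idea (apps both read and contribute to datasets via SQL, subject to per-dataset permissions) can be sketched with an in-memory SQLite store. The dataset, users and permission table below are invented; the actual IES Cities API is not reproduced here:

    # Toy query mapper: SQL access to a shared dataset with a write-permission check.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE noise_reports (user TEXT, lat REAL, lon REAL, db REAL);
        CREATE TABLE permissions (dataset TEXT, user TEXT, can_write INTEGER);
        INSERT INTO permissions VALUES ('noise_reports', 'alice', 1),
                                       ('noise_reports', 'bob', 0);
    """)

    def run_sql(user, dataset, sql, params=()):
        """Execute SQL for `user`, rejecting writes unless permitted for `dataset`."""
        is_write = sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
        if is_write:
            row = conn.execute(
                "SELECT can_write FROM permissions WHERE dataset=? AND user=?",
                (dataset, user)).fetchone()
            if not row or not row[0]:
                raise PermissionError(f"{user} may not write to {dataset}")
        return conn.execute(sql, params).fetchall()

    run_sql("alice", "noise_reports",
            "INSERT INTO noise_reports VALUES (?, ?, ?, ?)",
            ("alice", 43.26, -2.93, 71.5))
    print(run_sql("bob", "noise_reports", "SELECT * FROM noise_reports"))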
Anguita, Alberto; García-Remesal, Miguel; Graf, Norbert; Maojo, Victor
2016-04-01
Modern biomedical research relies on the semantic integration of heterogeneous data sources to find data correlations. Researchers access multiple datasets of disparate origin, and identify elements (e.g. genes, compounds, pathways) that lead to interesting correlations. Normally, they must refer to additional public databases in order to enrich the information about the identified entities (e.g. scientific literature, published clinical trial results, etc.). While semantic integration techniques have traditionally focused on providing homogeneous access to private datasets (thus helping automate the first part of the research), and different solutions exist for browsing public data, there is still a need for tools that facilitate merging public repositories with private datasets. This paper presents a framework that automatically locates public data of interest to the researcher and semantically integrates it with existing private datasets. The framework has been designed as an extension of traditional data integration systems, and has been validated with an existing data integration platform from a European research project by integrating a private biological dataset with data from the National Center for Biotechnology Information (NCBI). Copyright © 2016 Elsevier Inc. All rights reserved.
The NASA Subsonic Jet Particle Image Velocimetry (PIV) Dataset
NASA Technical Reports Server (NTRS)
Bridges, James; Wernet, Mark P.
2011-01-01
Many tasks in fluids engineering require prediction of turbulence of jet flows. This report documents the single-point statistics of velocity, mean and variance, of cold and hot jet flows. The jet velocities ranged from 0.5 to 1.4 times the ambient speed of sound, and temperatures ranged from unheated to a static temperature ratio of 2.7. Further, the report assesses the accuracy of the data, e.g., establishing uncertainties for the data. This paper covers the following five tasks: (1) Document acquisition and processing procedures used to create the particle image velocimetry (PIV) datasets. (2) Compare PIV data with hotwire and laser Doppler velocimetry (LDV) data published in the open literature. (3) Compare different datasets acquired at the same flow conditions in multiple tests to establish uncertainties. (4) Create a consensus dataset for a range of hot jet flows, including uncertainty bands. (5) Analyze this consensus dataset for self-consistency and compare jet characteristics to those of the open literature. The final objective was fulfilled by using the potential core length and the spread rate of the half-velocity radius to collapse the mean and turbulent velocity fields over the first 20 jet diameters.
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.
Goldstein, Markus; Uchida, Seiichi
2016-01-01
Anomaly detection is the process of identifying unexpected items or events in datasets, which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied on unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection as well as in the life science and medical domain. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, we outline computational effort, the impact of parameter settings, and global/local anomaly detection behavior. As a conclusion, we give advice on algorithm selection for typical real-world tasks.
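A reduced version of such a benchmark is easy to set up with scikit-learn: score each point with several unsupervised detectors and compare ranking quality against known outlier labels. The sketch below uses three common detectors on synthetic 2-D data; the study itself covers 19 algorithms and 10 real datasets:

    # Compare unsupervised anomaly detectors by ROC AUC on synthetic data.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.svm import OneClassSVM
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    X_in, _ = make_blobs(n_samples=500, centers=2, random_state=0)   # normal points
    X_out = rng.uniform(-12, 12, size=(25, 2))                       # scattered anomalies
    X = np.vstack([X_in, X_out])
    y = np.r_[np.zeros(len(X_in)), np.ones(len(X_out))]              # 1 = anomaly

    # Higher score_samples means "more normal", so negate to get anomaly scores.
    scores = {
        "IsolationForest": -IsolationForest(random_state=0).fit(X).score_samples(X),
        "LOF": -LocalOutlierFactor(n_neighbors=20).fit(X).negative_outlier_factor_,
        "OneClassSVM": -OneClassSVM(gamma="scale").fit(X).score_samples(X),
    }
    for name, s in scores.items():
        print(f"{name:15s} ROC AUC = {roc_auc_score(y, s):.3f}")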
Statistical tests and identifiability conditions for pooling and analyzing multisite datasets.
Zhou, Hao Henry; Singh, Vikas; Johnson, Sterling C; Wahba, Grace
2018-02-13
When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer's disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies. Copyright © 2018 the Author(s). Published by PNAS.
NASA Technical Reports Server (NTRS)
Goseva-Popstojanova, Katerina; Tyo, Jacob
2017-01-01
While some prior research work exists on characteristics of software faults (i.e., bugs) and failures, very little work has been published on analysis of software application vulnerabilities. This paper aims to contribute towards filling that gap by presenting an empirical investigation of application vulnerabilities. The results are based on data extracted from issue tracking systems of two NASA missions. These data were organized in three datasets: Ground mission IVV issues, Flight mission IVV issues, and Flight mission Developers issues. In each dataset, we identified security related software bugs and classified them in specific vulnerability classes. Then, we created the security vulnerability profiles, i.e., determined where and when the security vulnerabilities were introduced and what were the dominating vulnerabilities classes. Our main findings include: (1) In the IVV issues datasets the majority of vulnerabilities were code related and were introduced in the Implementation phase. (2) For all datasets, around 90% of the vulnerabilities were located in two to four subsystems. (3) Out of 21 primary classes, five dominated: Exception Management, Memory Access, Other, Risky Values, and Unused Entities. Together, they contributed from 80% to 90% of the vulnerabilities in each dataset.
MoonDB — A Data System for Analytical Data of Lunar Samples
NASA Astrophysics Data System (ADS)
Lehnert, K.; Ji, P.; Cai, M.; Evans, C.; Zeigler, R.
2018-04-01
MoonDB is a data system that makes analytical data from the Apollo lunar sample collection and lunar meteorites accessible by synthesizing published and unpublished datasets in a relational database with an online search interface.
A correlation comparison between Altmetric Attention Scores and citations for six PLOS journals.
Huang, Wenya; Wang, Peiling; Wu, Qiang
2018-01-01
This study considered all articles published in six Public Library of Science (PLOS) journals in 2012 and Web of Science citations for these articles as of May 2015. A total of 2,406 articles were analyzed to examine the relationships between Altmetric Attention Scores (AAS) and Web of Science citations. The AAS for an article, provided by Altmetric, aggregates activities surrounding research outputs in social media (news outlet mentions, tweets, blogs, Wikipedia, etc.). Spearman correlation testing was done on all articles and articles with AAS. Further analysis compared the stratified datasets based on percentile ranks of AAS: top 50%, top 25%, top 10%, and top 1%. Comparisons across the six journals provided additional insights. The results show significant positive correlations between AAS and citations with varied strength for all articles and articles with AAS (or social media mentions), as well as for normalized AAS in the top 50%, top 25%, top 10%, and top 1% datasets. Four of the six PLOS journals, Genetics, Pathogens, Computational Biology, and Neglected Tropical Diseases, show significant positive correlations across all datasets. However, for the two journals with high impact factors, PLOS Biology and Medicine, the results are unexpected: the Medicine articles showed no significant correlations but the Biology articles tested positive for correlations with the whole dataset and the set with AAS. Both journals published substantially fewer articles than the other four journals. Further research to validate the AAS algorithm, adjust the weighting scheme, and include appropriate social media sources is needed to understand the potential uses and meaning of AAS in different contexts and its relationship to other metrics.
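The core analysis is a Spearman rank correlation computed on the whole sample and on AAS-percentile strata. A sketch with synthetic scores stands in for the Altmetric and Web of Science data, which are not reproduced here:

    # Spearman correlation of attention scores vs. citations, overall and by stratum.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(1)
    aas = rng.lognormal(mean=1.0, sigma=1.2, size=2406)                   # toy AAS
    citations = np.round(aas * rng.lognormal(0.5, 1.0, size=aas.size))    # loosely related

    for top in (1.0, 0.5, 0.25, 0.10, 0.01):       # whole set, then top strata
        cut = np.quantile(aas, 1 - top)
        m = aas >= cut
        rho, p = spearmanr(aas[m], citations[m])
        print(f"top {top:>5.0%}: n={m.sum():4d}  rho={rho:+.3f}  p={p:.2g}")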
NASA Astrophysics Data System (ADS)
Santhana Vannan, S. K.; Ramachandran, R.; Deb, D.; Beaty, T.; Wright, D.
2017-12-01
This paper summarizes the workflow challenges of curating and publishing data produced from disparate data sources and provides a generalized workflow solution to efficiently archive data generated by researchers. The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) for biogeochemical dynamics and the Global Hydrology Resource Center (GHRC) DAAC have been collaborating on the development of a generalized workflow solution to efficiently manage the data publication process. The generalized workflow presented here is built on lessons learned from implementations of the workflow system. Data publication consists of the following steps: (1) accepting the data package from the data providers and ensuring the full integrity of the data files; (2) identifying and addressing data quality issues; (3) assembling standardized, detailed metadata and documentation, including file-level details, processing methodology, and characteristics of data files; (4) setting up data access mechanisms; (5) setting up the data in data tools and services for improved data dissemination and user experience; (6) registering the dataset in online search and discovery catalogues; and (7) preserving the data location through Digital Object Identifiers (DOIs). We will describe the steps taken to automate, and realize efficiencies in, the above process. The goals of the workflow system are to reduce the time taken to publish a dataset, to increase the quality of documentation and metadata, and to track individual datasets through the data curation process. Utilities developed to achieve these goals will be described. We will also share the metrics-driven value of the workflow system and discuss future steps towards the creation of a common software framework.
Polling, C; Tulloch, A; Banerjee, S; Cross, S; Dutta, R; Wood, D M; Dargan, P I; Hotopf, M
2015-07-16
Self-harm is a significant public health concern in the UK. This is reflected in the recent addition to the English Public Health Outcomes Framework of rates of attendance at Emergency Departments (EDs) following self-harm. However there is currently no source of data to measure this outcome. Routinely available data for inpatient admissions following self-harm miss the majority of cases presenting to services. We aimed to investigate (i) whether a dataset of ED presentations could be produced using a combination of routinely collected clinical and administrative data and (ii) whether this dataset could be validated against another one produced using methods similar to those used in previous studies. Using the Clinical Record Interactive Search system, the electronic health records (EHRs) used in four EDs were linked to Hospital Episode Statistics to create a dataset of attendances following self-harm. This dataset was compared with an audit dataset of ED attendances created by manual searching of ED records. The proportion of total cases detected by each dataset was compared. There were 1932 attendances detected by the EHR dataset and 1906 by the audit. The EHR and audit datasets detected 77% and 76% of all attendances respectively, and both detected 82% of individual patients. There were no differences in terms of age, sex, ethnicity or marital status between those detected and those missed using the EHR method. Both datasets revealed more than double the number of self-harm incidents that could be identified from inpatient admission records. It was possible to use routinely collected EHR data to create a dataset of attendances at EDs following self-harm. The dataset detected the same proportion of attendances and individuals as the audit dataset, proved more comprehensive than the use of inpatient admission records, and did not show a systematic bias in those cases it missed.
Smith, Tanya; Page-Nicholson, Samantha; Gibbons, Bradley; Jones, M. Genevieve W.; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin
2016-01-01
Background. The International Crane Foundation (ICF)/Endangered Wildlife Trust (EWT) African Crane Conservation Programme has recorded 26,403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus, Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus. The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. New information. This is the single most comprehensive dataset published on African crane species, adding new information about the distribution of these three threatened species. We hope this will further aid conservation authorities to monitor and protect these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal. PMID:27956850
Data You May Like: A Recommender System for Research Data Discovery
NASA Astrophysics Data System (ADS)
Devaraju, A.; Davy, R.; Hogan, D.
2016-12-01
Various data portals have been developed to facilitate access to research datasets from different sources, for example the Data Publisher for Earth & Environmental Science (PANGAEA), the Registry of Research Data Repositories (re3data.org), and the National Geoscience Data Centre (NGDC). Due to data quantity and heterogeneity, finding relevant datasets on these portals may be difficult and tedious. Keyword searches based on specific metadata elements or multi-key indexes may return irrelevant results. Faceted searches may be unsatisfactory and time consuming, especially when facet values are exhaustive. We need a much more intelligent way to complement existing search mechanisms in order to enhance user experience of the data portals. We developed a recommender system that helps users to find the most relevant research datasets on CSIRO's Data Access Portal (DAP). The system is based on content-based filtering. We computed the similarity of datasets based on data attributes (e.g., descriptions, fields of research, location, contributors, and provenance) and inference from transaction logs (e.g., the relations among datasets and between queries and datasets). We improved the recommendation quality by assigning weights to data similarities. The weight values are drawn from a survey involving data users. The recommender results for a given dataset are accessible programmatically via a web service. Taking both data attributes and user actions into account, the recommender system will make it easier for researchers to find and reuse data offered through the data portal.
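A content-based similarity of the kind described (attribute similarities combined with survey-derived weights) can be sketched with scikit-learn's TF-IDF tools. The records, fields and weight values below are invented; the DAP system additionally folds in provenance and transaction-log signals:

    # Weighted content-based dataset similarity: TF-IDF text + field overlap.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    datasets = [
        {"title": "Soil moisture grids",  "desc": "daily soil moisture from satellite", "fields": {"hydrology"}},
        {"title": "Streamflow gauges",    "desc": "daily river discharge observations", "fields": {"hydrology"}},
        {"title": "Galaxy survey images", "desc": "optical imaging of southern sky",    "fields": {"astronomy"}},
    ]
    tfidf = TfidfVectorizer().fit_transform(d["desc"] for d in datasets)
    text_sim = cosine_similarity(tfidf)

    W_TEXT, W_FIELD = 0.7, 0.3              # survey-derived weights (assumed values)

    def similarity(i, j):
        field_sim = len(datasets[i]["fields"] & datasets[j]["fields"]) / \
                    len(datasets[i]["fields"] | datasets[j]["fields"])
        return W_TEXT * text_sim[i, j] + W_FIELD * field_sim

    query = 0
    ranked = sorted(((similarity(query, j), datasets[j]["title"])
                     for j in range(len(datasets)) if j != query), reverse=True)
    print(ranked)   # most similar dataset first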
Castandet, Benoît; Hotto, Amber M.; Strickler, Susan R.; ...
2016-07-06
Although RNA-Seq has revolutionized transcript analysis, organellar transcriptomes are rarely assessed even when present in published datasets. Here, we describe the development and application of a rapid and convenient method, ChloroSeq, to delineate qualitative and quantitative features of chloroplast RNA metabolism from strand-specific RNA-Seq datasets, including processing, editing, splicing, and relative transcript abundance. The use of a single experiment to analyze systematically chloroplast transcript maturation and abundance is of particular interest due to frequent pleiotropic effects observed in mutants that affect chloroplast gene expression and/or photosynthesis. To illustrate its utility, ChloroSeq was applied to published RNA-Seq datasets derived from Arabidopsis thaliana grown under control and abiotic stress conditions, where the organellar transcriptome had not been examined. The most appreciable effects were found for heat stress, which induces a global reduction in splicing and editing efficiency, and leads to increased abundance of chloroplast transcripts, including genic, intergenic, and antisense transcripts. Moreover, by concomitantly analyzing nuclear transcripts that encode chloroplast gene expression regulators from the same libraries, we demonstrate the possibility of achieving a holistic understanding of the nucleus-organelle system. In conclusion, ChloroSeq thus represents a unique method for streamlining RNA-Seq data interpretation of the chloroplast transcriptome and its regulators.
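One representative metric of this kind is per-intron splicing efficiency, the fraction of junction-spanning reads that are spliced. The counts in this sketch are invented and the calculation is much simplified; ChloroSeq itself derives such quantities from strand-specific alignments:

    # Toy splicing-efficiency table: spliced / (spliced + unspliced) read counts.
    import pandas as pd

    counts = pd.DataFrame({
        "intron":    ["trnK.1", "atpF.1", "ndhB.1"],   # hypothetical chloroplast introns
        "spliced":   [940, 410, 388],
        "unspliced": [60, 95, 412],
    })
    counts["splicing_efficiency"] = counts["spliced"] / (counts["spliced"] + counts["unspliced"])
    print(counts)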
Duan, Qiaonan; Wang, Zichen; Fernandez, Nicolas F; Rouillard, Andrew D; Tan, Christopher M; Benes, Cyril H; Ma'ayan, Avi
2014-11-15
Recently, several high profile studies collected cell viability data from panels of cancer cell lines treated with many drugs applied at different concentrations. Such drug sensitivity data for cancer cell lines provide suggestive treatments for different types and subtypes of cancer. Visualization of these datasets can reveal patterns that may not be obvious by examining the data without such efforts. Here we introduce Drug/Cell-line Browser (DCB), an online interactive HTML5 data visualization tool for interacting with three of the recently published datasets of cancer cell lines/drug-viability studies. DCB uses clustering and canvas visualization of the drugs and the cell lines, as well as a bar graph that summarizes drug effectiveness for the tissue of origin or the cancer subtypes for single or multiple drugs. DCB can help in understanding drug response patterns and prioritizing drug/cancer cell line interactions by tissue of origin or cancer subtype. DCB is an open source Web-based tool that is freely available at http://www.maayanlab.net/LINCS/DCB. Contact: avi.maayan@mssm.edu. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Zhao, Yongan; Wang, Xiaofeng; Jiang, Xiaoqian; Ohno-Machado, Lucila; Tang, Haixu
2015-01-01
To propose a new approach to privacy-preserving data selection, which helps the data users access human genomic datasets efficiently without undermining patients' privacy. Our idea is to let each data owner publish a set of differentially-private pilot data, on which a data user can test-run arbitrary association-test algorithms, including those not known to the data owner a priori. We developed a suite of new techniques, including a pilot-data generation approach that leverages the linkage disequilibrium in the human genome to preserve both the utility of the data and the privacy of the patients, and a utility evaluation method that helps the user assess the value of the real data from its pilot version with high confidence. We evaluated our approach on real human genomic data using four popular association tests. Our study shows that the proposed approach can help data users make the right choices in most cases. Even though the pilot data cannot be directly used for scientific discovery, it provides a useful indication of which datasets are more likely to be useful to data users, who can therefore approach the appropriate data owners to gain access to the data. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.
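The underlying idea can be illustrated, in much-simplified form, with the Laplace mechanism applied to per-SNP allele frequencies. This toy sketch ignores the paper's key refinement of preserving linkage disequilibrium, and the sensitivity argument in the comments applies only to this simplified release:

    # Epsilon-differentially-private release of allele frequencies (toy example).
    import numpy as np

    def dp_allele_freqs(genotypes, epsilon):
        # genotypes: (n, m) array of 0/1/2 minor-allele counts. One person can
        # change each column sum by at most 2, so the L1 sensitivity of the m
        # frequencies is 2m/(2n) = m/n; Laplace noise of scale sensitivity/epsilon
        # then gives epsilon-DP for this release.
        n, m = genotypes.shape
        freqs = genotypes.sum(axis=0) / (2 * n)
        scale = (m / n) / epsilon
        noisy = freqs + np.random.default_rng(0).laplace(0.0, scale, size=m)
        return np.clip(noisy, 0.0, 1.0)      # clipping is post-processing, still DP

    rng = np.random.default_rng(42)
    g = rng.integers(0, 3, size=(500, 20))   # toy cohort: 500 people, 20 SNPs
    print(dp_allele_freqs(g, epsilon=1.0)[:5])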
Quality Controlling CMIP datasets at GFDL
NASA Astrophysics Data System (ADS)
Horowitz, L. W.; Radhakrishnan, A.; Balaji, V.; Adcroft, A.; Krasting, J. P.; Nikonov, S.; Mason, E. E.; Schweitzer, R.; Nadeau, D.
2017-12-01
As GFDL makes the switch from model development to production in light of the Climate Model Intercomparison Project (CMIP), its efforts have shifted to testing and, more importantly, to establishing guidelines and protocols for quality control and semi-automated data publishing. Every CMIP cycle introduces key challenges, and the upcoming CMIP6 is no exception. The new CMIP experimental design comprises multiple MIPs facilitating research in different focus areas. This paradigm has implications not only for the groups that develop the models and conduct the runs, but also for the groups that monitor, analyze and quality control the datasets before they are published and before their knowledge makes its way into reports like the IPCC (Intergovernmental Panel on Climate Change) Assessment Reports. In this talk, we discuss some of the paths taken at GFDL to quality control the CMIP-ready datasets, including Jupyter notebooks, PrePARE, and a LAMP (Linux, Apache, MySQL, PHP/Python/Perl) technology-driven tracker system to monitor the status of experiments qualitatively and quantitatively and to provide additional metadata and analysis services, along with some built-in controlled-vocabulary validations in the workflow. In addition to this, we also discuss the integration of community-based model evaluation software (ESMValTool, PCMDI Metrics Package, and ILAMB) as part of our CMIP6 workflow.
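One concrete quality-control step of this kind is validating a dataset's global attributes against a controlled vocabulary before publication. The sketch below uses CMIP6-style attribute names with an abbreviated, illustrative vocabulary, not the official CV files:

    # Check required global attributes against a (toy) controlled vocabulary.
    REQUIRED = ["activity_id", "experiment_id", "frequency", "grid_label"]
    CV = {
        "activity_id": {"CMIP", "ScenarioMIP", "OMIP"},
        "frequency":   {"mon", "day", "6hr"},
    }

    def check_global_attrs(attrs):
        problems = []
        for key in REQUIRED:
            if key not in attrs:
                problems.append(f"missing attribute: {key}")
            elif key in CV and attrs[key] not in CV[key]:
                problems.append(f"{key}={attrs[key]!r} not in controlled vocabulary")
        return problems

    attrs = {"activity_id": "CMIP", "experiment_id": "historical",
             "frequency": "month", "grid_label": "gn"}
    print(check_global_attrs(attrs))   # -> ["frequency='month' not in controlled vocabulary"]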
Mashburn, Shana L.; Winton, Kimberly T.
2010-01-01
This CD-ROM contains spatial datasets that describe natural and anthropogenic features and county-level estimates of agricultural pesticide use and pesticide data for surface-water, groundwater, and biological specimens in the state of Oklahoma. County-level estimates of pesticide use were compiled from the Pesticide National Synthesis Project of the U.S. Geological Survey, National Water-Quality Assessment Program. Pesticide data for surface water, groundwater, and biological specimens were compiled from the U.S. Geological Survey National Water Information System database. These spatial datasets that describe natural and manmade features were compiled from several agencies and contain information collected by the U.S. Geological Survey. The U.S. Geological Survey datasets were not collected specifically for this compilation, but were previously collected for projects with various objectives. The spatial datasets were created by different agencies from sources with varied quality. As a result, features common to multiple layers may not overlay exactly. Users should check the metadata to determine proper use of these spatial datasets. These data were not checked for accuracy or completeness. If a question of accuracy or completeness arises, the user should contact the originator cited in the metadata.
Full-motion video analysis for improved gender classification
NASA Astrophysics Data System (ADS)
Flora, Jeffrey B.; Lochtefeld, Darrell F.; Iftekharuddin, Khan M.
2014-06-01
The ability of computer systems to perform gender classification using the dynamic motion of the human subject has important applications in medicine, human factors, and human-computer interface systems. Previous works in motion analysis have used data from sensors (including gyroscopes, accelerometers, and force plates), radar signatures, and video. However, full-motion video from motion capture provides a dataset with higher temporal and spatial resolution for the analysis of dynamic motion. Works using motion capture data have been limited by small datasets in a controlled environment. In this paper, we explore machine learning techniques on a new dataset that has a larger number of subjects. Additionally, these subjects move unrestricted through a capture volume, representing a more realistic, less controlled environment. We conclude that existing linear classification methods are insufficient for gender classification on the larger dataset captured in a relatively uncontrolled environment. A method based on a nonlinear support vector machine classifier is proposed to obtain gender classification for the larger dataset. In experimental testing with a dataset consisting of 98 trials (49 subjects, 2 trials per subject), classification rates using leave-one-out cross-validation are improved from 73% using linear discriminant analysis to 88% using the nonlinear support vector machine classifier.
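The reported comparison (linear discriminant analysis versus a nonlinear SVM under leave-one-out cross-validation) is straightforward to reproduce schematically with scikit-learn. The synthetic features below merely stand in for the motion-capture data, so the accuracies will not match the paper's 73% and 88%:

    # LDA vs. RBF-SVM under leave-one-out cross-validation on synthetic features.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n, d = 98, 12                                   # 98 trials, 12 motion features
    y = np.repeat([0, 1], n // 2)                   # two gender classes
    X = rng.normal(size=(n, d)) + 0.6 * y[:, None] * rng.normal(size=d)

    for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                      ("RBF-SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10)))]:
        acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
        print(f"{name}: leave-one-out accuracy = {acc:.2f}")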
Reliable and Persistent Identification of Linked Data Elements
NASA Astrophysics Data System (ADS)
Wood, David
Linked Data techniques rely upon common terminology in a manner similar to a relational database's reliance on a schema. Linked Data terminology anchors metadata descriptions and facilitates navigation of information. Common vocabularies ease the human, social tasks of understanding datasets sufficiently to construct queries and help to relate otherwise disparate datasets. Vocabulary terms must, when using the Resource Description Framework, be grounded in URIs. A current best practice on the World Wide Web is to serve vocabulary terms as Uniform Resource Locators (URLs) and present both human-readable and machine-readable representations to the public. Linked Data terminology published to the World Wide Web may be used by others without reference or notification to the publishing party. That presents a problem: vocabulary publishers take on an implicit responsibility to maintain and publish their terms via the URLs originally assigned, regardless of the inconvenience such a responsibility may cause. Over the course of years, people change jobs, publishing organizations change Internet domain names, computers change IP addresses, and systems administrators publish old material in new ways. Clearly, a mechanism is required to manage Web-based vocabularies over the long term. This chapter places Linked Data vocabularies in context with the wider concepts of metadata in general and specifically metadata on the Web. Persistent identifier mechanisms are reviewed, with a particular emphasis on Persistent URLs, or PURLs. PURLs and PURL services are discussed in the context of Linked Data. Finally, historic weaknesses of PURLs are resolved by the introduction of a federation of PURL services to address needs specific to Linked Data.
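A PURL service is, at its core, a maintained table of stable paths redirecting to current locations. A minimal resolver sketch using Flask, with invented example mappings:

    # Toy PURL-style resolver: persistent paths redirect to current target URLs.
    from flask import Flask, abort, redirect

    app = Flask(__name__)

    # Editable mapping: persistent path -> current location (examples invented).
    PURLS = {
        "/vocab/creator": "https://example.org/ontology/v2/creator",
        "/vocab/title":   "https://example.org/ontology/v2/title",
    }

    @app.route("/<path:purl_path>")
    def resolve(purl_path):
        target = PURLS.get("/" + purl_path)
        if target is None:
            abort(404)
        # 303 See Other is conventional for RDF vocabulary terms.
        return redirect(target, code=303)

    if __name__ == "__main__":
        app.run(port=8080)

When the vocabulary moves, only the PURLS table changes; the published term URLs stay stable, which is the property the chapter's federation of PURL services is designed to preserve.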
Richmond, Sarah A; Willan, Andrew R; Rothman, Linda; Camden, Andi; Buliung, Ron; Macarthur, Colin; Howard, Andrew
2014-06-01
To perform a more sophisticated analysis of previously published data that advances the understanding of the efficacy of pedestrian countdown signal (PCS) installation on pedestrian-motor vehicle collisions (PMVCs), in the city of Toronto, Canada. This is an updated analysis of the same dataset from Camden et al. A quasi-experimental design was used to evaluate the effect of PCS on PMVC. A Poisson regression analysis was used for a one-group comparison of PMVC pre-PCS installation to post-PCS installation, controlling for season and temporal effects. The outcome was the frequency of reported PMVC (January 2000-December 2009). Similar models were used to analyse specific types of collisions defined by age of pedestrian, injury severity, and pedestrian and vehicle action. Incidence rate ratios with 95% CI are presented. This analysis included 9262 PMVC, 2760 during or after PCS installation, at 1965 intersections. There was a 26% increase in the rate of collisions, pre to post-PCS installation (incidence rate ratio=1.26, 95% CI 1.11 to 1.42). The installation of PCS at 1965 signalised intersections in the city of Toronto resulted in an increase in PMVC rates post-PCS installation. PCSs may have an unintended consequence of increasing pedestrian-motor vehicle collisions in some settings. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
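The analysis type is a Poisson regression of collision counts on a pre/post indicator with season controls, where exponentiated coefficients are incidence rate ratios. A schematic on simulated monthly counts (not the Toronto data), assuming statsmodels:

    # Poisson regression with a pre/post indicator; exp(coef) is the IRR.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "month": np.tile(np.arange(12), 10),              # 10 years of monthly counts
        "post":  np.r_[np.zeros(60), np.ones(60)].astype(int),
    })
    lam = np.exp(1.0 + 0.23 * df["post"] + 0.1 * np.sin(2 * np.pi * df["month"] / 12))
    df["collisions"] = rng.poisson(lam)                   # simulated true IRR ~ exp(0.23)

    fit = smf.glm("collisions ~ post + C(month)", data=df,
                  family=sm.families.Poisson()).fit()
    irr = np.exp(fit.params["post"])
    lo, hi = np.exp(fit.conf_int().loc["post"])
    print(f"IRR = {irr:.2f} (95% CI {lo:.2f}-{hi:.2f})")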
Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K
2015-06-04
Collective analysis of the growing number of available gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth-knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods, and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.
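The consensus step behind Bi-CoPaM/UNCLES can be reduced to: cluster each dataset separately, then measure how consistently each gene pair co-clusters across datasets. The sketch below uses plain k-means and a hard threshold in place of the method's tunable binarisation:

    # Consensus co-membership across multiple expression datasets (simplified).
    import numpy as np
    from sklearn.cluster import KMeans

    def consensus_comembership(datasets, k=3, seed=0):
        n_genes = datasets[0].shape[0]
        consensus = np.zeros((n_genes, n_genes))
        for X in datasets:                  # X: genes x samples expression matrix
            labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
            consensus += (labels[:, None] == labels[None, :])
        return consensus / len(datasets)    # fraction of datasets co-clustering each pair

    rng = np.random.default_rng(0)
    sets = [rng.normal(size=(50, 8)) + np.repeat([0, 3, 6], [20, 15, 15])[:, None]
            for _ in range(4)]              # 4 datasets sharing 3 gene groups
    C = consensus_comembership(sets)
    print(int((C == 1.0).sum()), "gene-pair entries co-clustered in every dataset")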
Damage and protection cost curves for coastal floods within the 600 largest European cities
NASA Astrophysics Data System (ADS)
Prahl, Boris F.; Boettle, Markus; Costa, Luís; Kropp, Jürgen P.; Rybski, Diego
2018-03-01
The economic assessment of the impacts of storm surges and sea-level rise in coastal cities requires high-level information on the damage and protection costs associated with varying flood heights. We provide a systematically and consistently calculated dataset of macroscale damage and protection cost curves for the 600 largest European coastal cities, opening the perspective for a wide range of applications. Offering the first comprehensive dataset to include the costs of dike protection, we provide the underpinning information to run comparative assessments of costs and benefits of coastal adaptation. Aggregate cost curves for coastal flooding at the city level are commonly regarded as by-products of impact assessments and are generally not published as a standalone dataset. Hence, our work also aims at initiating a more critical discussion on the availability and derivation of cost curves.
Klein, Max; Sharma, Rati; Bohrer, Chris H; Avelis, Cameron M; Roberts, Elijah
2017-01-15
Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open source Hadoop and Spark projects, bringing domain-specific features for biology. Source code is licensed under the Apache 2.0 open source license and is available at the project website: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark. Contact: eroberts@jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Research on cross-project software defect prediction based on transfer learning
NASA Astrophysics Data System (ADS)
Chen, Ya; Ding, Xiaoming
2018-04-01
To address two challenges in cross-project software defect prediction, namely the distribution differences between the source project and target project datasets and the class imbalance in the datasets, we propose a cross-project software defect prediction method based on transfer learning, named NTrA. Firstly, the class imbalance of the source project data is resolved using the Augmented Neighborhood Cleaning Algorithm. Secondly, the data gravitation method is used to assign different weights on the basis of the attribute similarity of the source project and target project data. Finally, a defect prediction model is constructed using the TrAdaBoost algorithm. Experiments were conducted using data from NASA and SOFTLAB, respectively, drawn from the published PROMISE dataset. The results show that the method achieved good recall and F-measure values and produced good prediction results.
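A hedged sketch of the instance-weighting step described above; the abstract does not give the exact data gravitation formula, so the inverse-squared-distance weighting below is an assumption, not the paper's equation:

    import numpy as np

    # Source instances that resemble the target project's feature
    # distribution receive larger weights (gravitation-style analogy).
    def gravitation_weights(X_src, X_tgt, eps=1e-8):
        mu = X_tgt.mean(axis=0)                  # target "mass centre"
        d2 = ((X_src - mu) ** 2).sum(axis=1)     # squared distance per instance
        w = 1.0 / (d2 + eps)                     # closer instances weigh more
        return w / w.sum()                       # normalised instance weights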
Leung, Yuk Yee; Chang, Chun Qi; Hung, Yeung Sam
2012-01-01
Using a hybrid approach for gene selection and classification is common, as results obtained are generally better than performing the two tasks independently. Yet, for some microarray datasets, both classification accuracy and stability of gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own. We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) based on the same synthetic datasets. MFMW-outlier gave better average precision and recall values on three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the 'wrong' (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.
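An illustrative sketch of the unbiased external leave-one-out loop described above, in which gene selection is redone inside every fold so the held-out sample never influences its own model; select_genes is a hypothetical user-supplied function returning column indices:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    # External LOOCV: selection and fitting happen strictly on the training
    # fold, so the held-out sample's prediction is unbiased.
    def external_loocv(X, y, select_genes):
        preds = np.empty_like(y)
        for train, test in LeaveOneOut().split(X):
            genes = select_genes(X[train], y[train])   # selection inside the fold
            clf = SVC(kernel="linear").fit(X[train][:, genes], y[train])
            preds[test] = clf.predict(X[test][:, genes])
        return preds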
Analysis Of Navy Hornet Squadron Mishap Costs With Regard To Previously Flown Flight Hours
2017-06-01
mishaps occur more frequently in a squadron when flight hours are reduced. This thesis correlates F/A-18 Hornet and Super Hornet squadron previously flown flight hours with mishap costs... correlated to the flight hours flown during the previous three and six months. A linear multivariate model was developed and used to analyze a dataset... It uses a macro
The cost of a small membrane bioreactor.
Lo, C H; McAdam, E; Judd, S
2015-01-01
The individual cost contributions to the mechanical components of a small membrane bioreactor (MBR) (100-2,500 m3/d flow capacity) are itemised and collated to generate overall capital and operating costs (CAPEX and OPEX) as a function of size. The outcomes are compared to those from previously published detailed cost studies provided for both very small containerised plants (<40 m3/day capacity) and larger municipal plants (2,200-19,000 m3/d). Cost curves as a function of flow capacity, determined for OPEX, CAPEX and net present value (NPV) from the heuristic data used, indicate a logarithmic function for OPEX and a power-based one for CAPEX. OPEX correlations were in good quantitative agreement with those reported in the literature. Disparities in the calculated CAPEX trend compared with reported data were attributed to differences in assumptions concerning cost contributions. More reasonable agreement was obtained with the reported membrane separation component CAPEX data from published studies. The heuristic approach taken appears appropriate for small-scale MBRs with minimal costs associated with installation. An overall relationship of NPV = (a·t^b)·Q^(-c·ln(t)+d) was determined for the net present value, where a=1.265, b=0.44, c=0.00385 and d=0.868 according to the dataset employed for the analysis.
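Evaluating the reported relationship numerically; this short sketch assumes Q is the flow capacity (m3/d) and t the time horizon in years, as the source analysis appears to use them, and relies on the equation as reconstructed above:

    import numpy as np

    # Fitted constants quoted in the abstract.
    a, b, c, d = 1.265, 0.44, 0.00385, 0.868

    def npv(Q, t):
        # NPV = (a * t**b) * Q**(-c*ln(t) + d)
        return (a * t**b) * Q ** (-c * np.log(t) + d)

    print(npv(Q=1000.0, t=20.0))   # e.g. a 1,000 m3/d plant over 20 years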
Mu, John C.; Tootoonchi Afshar, Pegah; Mohiyuddin, Marghoob; Chen, Xi; Li, Jian; Bani Asadi, Narges; Gerstein, Mark B.; Wong, Wing H.; Lam, Hugo Y. K.
2015-01-01
A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools. PMID:26412485
Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article
Shotton, David; Portwin, Katie; Klyne, Graham; Miles, Alistair
2009-01-01
Scientific innovation depends on finding, integrating, and re-using the products of previous research. Here we explore how recent developments in Web technology, particularly those related to the publication of data and metadata, might assist that process by providing semantic enhancements to journal articles within the mainstream process of scholarly journal publishing. We exemplify this by describing semantic enhancements we have made to a recent biomedical research article taken from PLoS Neglected Tropical Diseases, providing enrichment to its content and increased access to datasets within it. These semantic enhancements include provision of live DOIs and hyperlinks; semantic markup of textual terms, with links to relevant third-party information resources; interactive figures; a re-orderable reference list; a document summary containing a study summary, a tag cloud, and a citation analysis; and two novel types of semantic enrichment: the first, a Supporting Claims Tooltip to permit “Citations in Context”, and the second, Tag Trees that bring together semantically related terms. In addition, we have published downloadable spreadsheets containing data from within tables and figures, have enriched these with provenance information, and have demonstrated various types of data fusion (mashups) with results from other research articles and with Google Maps. We have also published machine-readable RDF metadata both about the article and about the references it cites, for which we developed a Citation Typing Ontology, CiTO (http://purl.org/net/cito/). The enhanced article, which is available at http://dx.doi.org/10.1371/journal.pntd.0000228.x001, presents a compelling existence proof of the possibilities of semantic publication. We hope the showcase of examples and ideas it contains, described in this paper, will excite the imaginations of researchers and publishers, stimulating them to explore the possibilities of semantic publishing for their own research articles, and thereby break down present barriers to the discovery and re-use of information within traditional modes of scholarly communication. PMID:19381256
Computational Psychiatry and the Challenge of Schizophrenia.
Krystal, John H; Murray, John D; Chekroud, Adam M; Corlett, Philip R; Yang, Genevieve; Wang, Xiao-Jing; Anticevic, Alan
2017-05-01
Schizophrenia research is plagued by enormous challenges in integrating and analyzing large datasets and difficulties developing formal theories related to the etiology, pathophysiology, and treatment of this disorder. Computational psychiatry provides a path to enhance analyses of these large and complex datasets and to promote the development and refinement of formal models for features of this disorder. This presentation introduces the reader to the notion of computational psychiatry and describes discovery-oriented and theory-driven applications to schizophrenia involving machine learning, reinforcement learning theory, and biophysically-informed neural circuit models. Published by Oxford University Press on behalf of the Maryland Psychiatric Research Center 2017.
GLEAM v3: updated land evaporation and root-zone soil moisture datasets
NASA Astrophysics Data System (ADS)
Martens, Brecht; Miralles, Diego; Lievens, Hans; van der Schalie, Robin; de Jeu, Richard; Fernández-Prieto, Diego; Verhoest, Niko
2016-04-01
Evaporation determines the availability of surface water resources and the requirements for irrigation. In addition, through its impacts on the water, carbon and energy budgets, evaporation influences the occurrence of rainfall and the dynamics of air temperature. Therefore, reliable estimates of this flux at regional to global scales are of major importance for water management and meteorological forecasting of extreme events. However, the global-scale magnitude and variability of the flux, and the sensitivity of the underlying physical process to changes in environmental factors, are still poorly understood due to the limited global coverage of in situ measurements. Remote sensing techniques can help to overcome the lack of ground data. However, evaporation is not directly observable from satellite systems. As a result, recent efforts have focussed on combining the observable drivers of evaporation within process-based models. The Global Land Evaporation Amsterdam Model (GLEAM, www.gleam.eu) estimates terrestrial evaporation based on daily satellite observations of meteorological drivers of terrestrial evaporation, vegetation characteristics and soil moisture. Since the publication of the first version of the model in 2011, GLEAM has been widely applied for the study of trends in the water cycle, interactions between land and atmosphere and hydrometeorological extreme events. A third version of the GLEAM global datasets will be available from the beginning of 2016 and will be distributed using www.gleam.eu as gateway. The updated datasets include separate estimates for the different components of the evaporative flux (i.e. transpiration, bare-soil evaporation, interception loss, open-water evaporation and snow sublimation), as well as variables like the evaporative stress, potential evaporation, root-zone soil moisture and surface soil moisture. A new dataset using SMOS-based input data of surface soil moisture and vegetation optical depth will also be distributed. The most important updates in GLEAM include the revision of the soil moisture data assimilation system, the evaporative stress functions and the infiltration of rainfall. In this presentation, we will highlight the changes of the methodology and present the new datasets, their validation against in situ observations and the comparisons against alternative datasets of terrestrial evaporation, such as GLDAS-Noah, ERA-Interim and previous GLEAM datasets. Preliminary results indicate that the magnitude and the spatio-temporal variability of the evaporation estimates have been slightly improved upon previous versions of the datasets.
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E.
2011-12-01
Under several NASA grants, we are generating multi-sensor merged atmospheric datasets to enable the detection of instrument biases and studies of climate trends over decades of data. For example, under a NASA MEASURES grant we are producing a water vapor climatology from the A-Train instruments, stratified by the Cloudsat cloud classification for each geophysical scene. The generation and proper use of such multi-sensor climate data records (CDR's) requires a high level of openness, transparency, and traceability. To make the datasets self-documenting and provide access to full metadata and traceability, we have implemented a set of capabilities and services using known, interoperable protocols. These protocols include OpenSearch, OPeNDAP, Open Provenance Model, service & data casting technologies using Atom feeds, and REST-callable analysis workflows implemented as SciFlo (XML) documents. We advocate that our approach can serve as a blueprint for how to openly "document and serve" complex, multi-sensor CDR's with full traceability. The capabilities and services provided include:
- Discovery of the collections by keyword search, exposed using the OpenSearch protocol;
- Space/time query across the CDR's granules and all of the input datasets via OpenSearch;
- User-level configuration of the production workflows so that scientists can select additional physical variables from the A-Train to add to the next iteration of the merged datasets;
- Efficient data merging using on-the-fly OPeNDAP variable slicing & spatial subsetting of data out of input netCDF and HDF files (without moving the entire files);
- Self-documenting CDR's published in a highly usable netCDF4 format with groups used to organize the variables, CF-style attributes for each variable, numeric array compression, & links to OPM provenance;
- Recording of processing provenance and data lineage into a query-able provenance trail in Open Provenance Model (OPM) format, auto-captured by the workflow engine;
- Open publishing of all of the workflows used to generate products as machine-callable REST web services, using the capabilities of the SciFlo workflow engine;
- Advertising of the metadata (e.g. physical variables provided, space/time bounding box, etc.) for our prepared datasets as "datacasts" using the Atom feed format;
- Publishing of all datasets via our "DataDrop" service, which exploits the WebDAV protocol to enable scientists to access remote data directories as local files on their laptops;
- Rich "web browse" of the CDR's with full metadata and the provenance trail one click away;
- Advertising of all services as Google-discoverable "service casts" using the Atom format.
The presentation will describe our use of the interoperable protocols and demonstrate the capabilities and service GUI's.
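As one concrete illustration of the on-the-fly OPeNDAP slicing mentioned above (the endpoint URL and variable name are placeholders, not the project's actual services):

    import xarray as xr

    # Opening an OPeNDAP endpoint is lazy: only metadata is fetched here.
    url = "http://example.org/opendap/merged_cdr.nc"     # hypothetical endpoint
    ds = xr.open_dataset(url)

    # Only the requested variable and space slab cross the network,
    # not the entire file (coordinate names 'lat'/'lon' are assumed).
    h2o = ds["water_vapor"].sel(lat=slice(-10, 10),
                                lon=slice(100, 160)).load()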
van der Krieke, Lian; Emerencia, Ando C; Bos, Elisabeth H; Rosmalen, Judith Gm; Riese, Harriëtte; Aiello, Marco; Sytema, Sjoerd; de Jonge, Peter
2015-08-07
Health promotion can be tailored by combining ecological momentary assessments (EMA) with time series analysis. This combined method allows for studying the temporal order of dynamic relationships among variables, which may provide concrete indications for intervention. However, application of this method in health care practice is hampered because analyses are conducted manually and advanced statistical expertise is required. This study aims to show how this limitation can be overcome by introducing automated vector autoregressive modeling (VAR) of EMA data and to evaluate its feasibility through comparisons with results of previously published manual analyses. We developed a Web-based open source application, called AutoVAR, which automates time series analyses of EMA data and provides output that is intended to be interpretable by nonexperts. The statistical technique we used was VAR. AutoVAR tests and evaluates all possible VAR models within a given combinatorial search space and summarizes their results, thereby replacing the researcher's tasks of conducting the analysis, making an informed selection of models, and choosing the best model. We compared the output of AutoVAR to the output of a previously published manual analysis (n=4). An illustrative example consisting of 4 analyses was provided. Compared to the manual output, the AutoVAR output presents similar model characteristics and statistical results in terms of the Akaike information criterion, the Bayesian information criterion, and the test statistic of the Granger causality test. Results suggest that automated analysis and interpretation of time series is feasible. Compared to a manual procedure, the automated procedure is more robust and can save days of time. These findings may pave the way for using time series analysis for health promotion on a larger scale. AutoVAR was evaluated using the results of a previously conducted manual analysis. Analysis of additional datasets is needed in order to validate and refine the application for general use. PMID:26254160
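A minimal sketch of the automated model-search idea, fitting VAR models over a grid of lag orders and keeping the best by information criterion; this uses statsmodels rather than the AutoVAR application itself, and the EMA column names are illustrative:

    import pandas as pd
    from statsmodels.tsa.api import VAR

    # Fit a VAR for each candidate lag order and keep the lowest-AIC model
    # (swap f.aic for f.bic to select on the Bayesian criterion instead).
    def best_var(df: pd.DataFrame, max_lags: int = 5):
        fits = [VAR(df).fit(p) for p in range(1, max_lags + 1)]
        return min(fits, key=lambda f: f.aic)

    # Granger causality between two hypothetical EMA variables:
    # best_var(df).test_causality("mood", ["activity"]).summary()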
Fernandez-Lozano, Carlos; Gestal, Marcos; Munteanu, Cristian R; Dorado, Julian; Pazos, Alejandro
2016-01-01
The design of experiments and the validation of the results achieved with them are vital in any research study. This paper focuses on the use of different Machine Learning approaches for regression tasks in the field of Computational Intelligence and especially on a correct comparison between the different results provided for different methods, as those techniques are complex systems that require further study to be fully understood. A methodology commonly accepted in Computational Intelligence is implemented in an R package called RRegrs. This package includes ten simple and complex regression models to carry out predictive modeling using Machine Learning and well-known regression algorithms. The framework for experimental design presented herein is evaluated and validated against RRegrs. Our results are different for three out of five state-of-the-art simple datasets, and it can be stated that the selection of the best model according to our proposal is statistically significant and relevant. It is of relevance to use a statistical approach to indicate whether the differences are statistically significant using this kind of algorithms. Furthermore, our results with three real complex datasets report different best models than with the previously published methodology. Our final goal is to provide a complete methodology for the use of different steps in order to compare the results obtained in Computational Intelligence problems, as well as from other fields, such as bioinformatics, cheminformatics, etc., given that our proposal is open and modifiable. PMID:27920952
YM500v2: a small RNA sequencing (smRNA-seq) database for human cancer miRNome research.
Cheng, Wei-Chung; Chung, I-Fang; Tsai, Cheng-Fong; Huang, Tse-Shun; Chen, Chen-Yang; Wang, Shao-Chuan; Chang, Ting-Yu; Sun, Hsing-Jen; Chao, Jeffrey Yung-Chuan; Cheng, Cheng-Chung; Wu, Cheng-Wen; Wang, Hsei-Wei
2015-01-01
We previously presented YM500, which is an integrated database for miRNA quantification, isomiR identification, arm switching discovery and novel miRNA prediction from 468 human smRNA-seq datasets. Here in this updated YM500v2 database (http://ngs.ym.edu.tw/ym500/), we focus on the cancer miRNome to make the database more disease-orientated. New miRNA-related algorithms developed after YM500 were included in YM500v2, and, more significantly, more than 8000 cancer-related smRNA-seq datasets (including those of primary tumors, paired normal tissues, PBMC, recurrent tumors, and metastatic tumors) were incorporated into YM500v2. Novel miRNAs (miRNAs not included in miRBase R21) were not only predicted by three independent algorithms but also cleaned by a new in silico filtration strategy and validated by wet-lab data such as Cross-Linked ImmunoPrecipitation sequencing (CLIP-seq) to reduce the false-positive rate. A new function 'Meta-analysis' is additionally provided, allowing users to identify real-time differentially expressed miRNAs and arm-switching events according to customer-defined sample groups and dozens of clinical criteria tidied up by proficient clinicians. Cancer miRNAs identified hold the potential for both basic research and biotech applications. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Pattern of hematological malignancies in adolescents and young adults in Bangladesh.
Hasan, Md Mahbub; Raheem, Enayetur; Sultana, Tanvira Afroze; Hossain, Mohammad Sorowar
2017-12-01
The adolescent and young adult (AYA) age group (15-39 years) bears distinct characteristics in terms of cancer biology, long-term health and treatment-related complications and psychosocial aspects. The overall scenario of cancer, including hematological malignancies (HMs), is largely unknown in Bangladesh, where a significant proportion of people (44% of the total population) belong to the AYA age group. This study aims to describe the patterns of HM among AYA in the context of Bangladesh. Two previously published datasets (on hematological malignancies and childhood and adolescent cancer) were merged to construct a comprehensive dataset focusing exclusively on HMs in the AYA age group. Univariate descriptive statistics were calculated and bivariate associations were tested using Pearson's chi-square test. A total of 2144 diagnosed HM cases over the period 2007-2014 were analyzed. Acute myeloid leukemia (AML) was the most frequent HM (35.1%) in AYAs, followed by acute lymphoblastic leukemia (ALL) and chronic myeloid leukemia (CML) constituting 22.7% and 20.8%, respectively. Among lymphomas, non-Hodgkin lymphoma (NHL) constituted 13.9% of all HMs while 4.6% was Hodgkin's lymphoma (HL). This is the first attempt to provide a glimpse of the pattern and distribution of HMs among AYA in Bangladesh. Future studies are essential to gain better insight into the epidemiology, biology, potential risk factors and treatment outcomes for the AYA age group. Copyright © 2017 Elsevier Ltd. All rights reserved.
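A short sketch of the bivariate testing reported above, using Pearson's chi-square on a contingency table; the counts below are invented for illustration only, not the study's data:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical malignancy-type by sex contingency table.
    table = np.array([[420, 333],    # AML: male, female
                      [260, 227],    # ALL: male, female
                      [240, 206]])   # CML: male, female

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3g}")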
Can currently available non-animal methods detect pre and ...
Predictive testing to identify and characterise substances for their skin sensitisation potential has historically been based on animal tests such as the Local Lymph Node Assay (LLNA). In recent years, regulations in the cosmetics and chemicals sectors have provided a strong impetus to develop and evaluate non-animal alternative methods. The adverse outcome pathway (AOP) for skin sensitisation provides a framework to anchor non-animal test methods to key events in the pathway, helping to identify which tests can be combined to generate the potency information required for risk assessment. The three test methods that have undergone extensive development and validation are the direct peptide reactivity assay (DPRA), the KeratinoSensTM and the human Cell Line Activation Test (h-CLAT). Whilst these methods have been shown to perform relatively well in predicting LLNA results (accuracy ~80%), a particular concern that has been raised is their ability to predict chemicals that need to be activated to act as sensitisers (either abiotically on the skin (pre-haptens) or metabolically in the skin (pro-haptens)). The DPRA is a cell-free system, whereas the other two methods make use of cells that do not fully represent the in vivo metabolic situation. Based on previously published datasets of LLNA data, it has been found that approximately 25% of sensitisers are pre- and/or pro-haptens. This study reviewed an EURL ECVAM dataset of 127 substances for which information was available in the LLNA and the
Multiple timescales of cyclical behaviour observed at two dome-forming eruptions
NASA Astrophysics Data System (ADS)
Lamb, Oliver D.; Varley, Nick R.; Mather, Tamsin A.; Pyle, David M.; Smith, Patrick J.; Liu, Emma J.
2014-09-01
Cyclic behaviour over a range of timescales is a well-documented feature of many dome-forming volcanoes, but has not previously been identified in high resolution seismic data from Volcán de Colima (Mexico). Using daily seismic count datasets from Volcán de Colima and Soufrière Hills volcano (Montserrat), this study explores parallels in the long-term behaviour of seismicity at two long-lived systems. Datasets are examined using multiple techniques, including Fast-Fourier Transform, Detrended Fluctuation Analysis and Probabilistic Distribution Analysis, and the comparison of results from the two systems reveals interesting parallels in sub-surface processes operating at both. Patterns of seismicity at both systems reveal complex but broadly similar long-term temporal patterns, with cycles on the order of ~50 to ~200 days. These patterns are consistent with previously published spectral analyses of SO2 flux time-series at Soufrière Hills volcano, and are attributed to variations in the movement of magma in each system. Detrended Fluctuation Analysis determined that both volcanic systems showed a systematic relationship between the number of seismic events and the relative 'roughness' of the time-series, and explosions at Volcán de Colima showed a 1.5-2 year cycle; neither observation has a clear explanatory mechanism. At Volcán de Colima, analysis of repose intervals between seismic events shows long-term behaviour that responds to changes in activity at the system. Similar patterns for both volcanic systems suggest a common process or processes driving the observed signal, but it is not clear from these results alone what those processes may be. Further attempts to model conduit processes at each volcano must account for the similarities and differences in activity within each system. The identification of some commonalities in the patterns of behaviour during long-lived dome-forming eruptions at andesitic volcanoes provides a motivation for investigating further use of time-series analysis as a monitoring tool.
Del Fiol, Guilherme; Michelson, Matthew; Iorio, Alfonso; Cotoi, Chris; Haynes, R Brian
2018-06-25
A major barrier to the practice of evidence-based medicine is efficiently finding scientifically sound studies on a given clinical topic. To investigate a deep learning approach to retrieve scientifically sound treatment studies from the biomedical literature. We trained a Convolutional Neural Network using a noisy dataset of 403,216 PubMed citations with title and abstract as features. The deep learning model was compared with state-of-the-art search filters, such as PubMed's Clinical Query Broad treatment filter, McMaster's textword search strategy (no Medical Subject Heading, MeSH, terms), and Clinical Query Balanced treatment filter. A previously annotated dataset (Clinical Hedges) was used as the gold standard. The deep learning model obtained significantly lower recall than the Clinical Queries Broad treatment filter (96.9% vs 98.4%; P<.001); and equivalent recall to McMaster's textword search (96.9% vs 97.1%; P=.57) and Clinical Queries Balanced filter (96.9% vs 97.0%; P=.63). Deep learning obtained significantly higher precision than the Clinical Queries Broad filter (34.6% vs 22.4%; P<.001) and McMaster's textword search (34.6% vs 11.8%; P<.001), but was significantly lower than the Clinical Queries Balanced filter (34.6% vs 40.9%; P<.001). Deep learning performed well compared to state-of-the-art search filters, especially when citations were not indexed. Unlike previous machine learning approaches, the proposed deep learning model does not require feature engineering, or time-sensitive or proprietary features, such as MeSH terms and bibliometrics. Deep learning is a promising approach to identifying reports of scientifically rigorous clinical research. Further work is needed to optimize the deep learning model and to assess generalizability to other areas, such as diagnosis, etiology, and prognosis. ©Guilherme Del Fiol, Matthew Michelson, Alfonso Iorio, Chris Cotoi, R Brian Haynes. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 25.06.2018.
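A generic text-CNN sketch of the kind of classifier described above; this is an illustrative PyTorch architecture under assumed hyperparameters, not the authors' model:

    import torch
    import torch.nn as nn

    # Token ids -> embeddings -> parallel convolutions over the sequence ->
    # max-pool -> one logit for "scientifically sound / not sound".
    class TextCNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, n_filters=64, widths=(3, 4, 5)):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.convs = nn.ModuleList(
                [nn.Conv1d(embed_dim, n_filters, w) for w in widths])
            self.out = nn.Linear(n_filters * len(widths), 1)

        def forward(self, token_ids):                    # (batch, seq_len)
            x = self.embed(token_ids).transpose(1, 2)    # (batch, embed, seq)
            pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
            return self.out(torch.cat(pooled, dim=1)).squeeze(1)  # logits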
Tsuchiya, Mariko; Amano, Kojiro; Abe, Masaya; Seki, Misato; Hase, Sumitaka; Sato, Kengo; Sakakibara, Yasubumi
2016-06-15
Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns can be detected, called read mapping profiles, which are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs. We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods in correctly clustering the read mapping profiles with respect to 5'-end processing and 3'-end processing from degradation patterns, and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain. The source code of our program SHARAKU is available at http://www.dna.bio.keio.ac.jp/sharaku/, and the simulated dataset used in this work is available at the same link. Accession code: The sequence data from the whole RNA transcripts in the hippocampus of the left brain used in this work is available from the DNA DataBank of Japan (DDBJ) Sequence Read Archive (DRA) under the accession number DRA004502. Contact: yasu@bio.keio.ac.jp. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Enhancing studies of the connectome in autism using the autism brain imaging data exchange II
Di Martino, Adriana; O’Connor, David; Chen, Bosi; Alaerts, Kaat; Anderson, Jeffrey S.; Assaf, Michal; Balsters, Joshua H.; Baxter, Leslie; Beggiato, Anita; Bernaerts, Sylvie; Blanken, Laura M. E.; Bookheimer, Susan Y.; Braden, B. Blair; Byrge, Lisa; Castellanos, F. Xavier; Dapretto, Mirella; Delorme, Richard; Fair, Damien A.; Fishman, Inna; Fitzgerald, Jacqueline; Gallagher, Louise; Keehn, R. Joanne Jao; Kennedy, Daniel P.; Lainhart, Janet E.; Luna, Beatriz; Mostofsky, Stewart H.; Müller, Ralph-Axel; Nebel, Mary Beth; Nigg, Joel T.; O’Hearn, Kirsten; Solomon, Marjorie; Toro, Roberto; Vaidya, Chandan J.; Wenderoth, Nicole; White, Tonya; Craddock, R. Cameron; Lord, Catherine; Leventhal, Bennett; Milham, Michael P.
2017-01-01
The second iteration of the Autism Brain Imaging Data Exchange (ABIDE II) aims to enhance the scope of brain connectomics research in Autism Spectrum Disorder (ASD). Consistent with the initial ABIDE effort (ABIDE I), that released 1112 datasets in 2012, this new multisite open-data resource is an aggregate of resting state functional magnetic resonance imaging (MRI) and corresponding structural MRI and phenotypic datasets. ABIDE II includes datasets from an additional 487 individuals with ASD and 557 controls previously collected across 16 international institutions. The combination of ABIDE I and ABIDE II provides investigators with 2156 unique cross-sectional datasets allowing selection of samples for discovery and/or replication. This sample size can also facilitate the identification of neurobiological subgroups, as well as preliminary examinations of sex differences in ASD. Additionally, ABIDE II includes a range of psychiatric variables to inform our understanding of the neural correlates of co-occurring psychopathology; 284 diffusion imaging datasets are also included. It is anticipated that these enhancements will contribute to unraveling key sources of ASD heterogeneity. PMID:28291247
ConnectViz: Accelerated Approach for Brain Structural Connectivity Using Delaunay Triangulation.
Adeshina, A M; Hashim, R
2016-03-01
Stroke is a cardiovascular disease with high mortality and long-term disability in the world. Normal functioning of the brain is dependent on the adequate supply of oxygen and nutrients to the brain's complex network through the blood vessels. Stroke, occasionally a hemorrhagic stroke, ischemia or other blood vessel dysfunctions can affect patients during a cerebrovascular incident. Structurally, the left and the right carotid arteries, and the right and the left vertebral arteries, are responsible for supplying blood to the brain, scalp and the face. However, a number of impairments in the function of the frontal lobes may occur as a result of any decrease in the flow of the blood through one of the internal carotid arteries. Such impairment commonly results in numbness, weakness or paralysis. Recently, the concept of the brain's wiring representation, the connectome, was introduced. However, construction and visualization of such a brain network requires tremendous computation. Consequently, previously proposed approaches have been identified with common problems of high memory consumption and slow execution. Furthermore, interactivity in the previously proposed frameworks for brain networks is also an outstanding issue. This study proposes an accelerated approach for brain connectomic visualization based on the graph theory paradigm using Compute Unified Device Architecture (CUDA), extending the previously proposed SurLens Visualization and Computer Aided Hepatocellular Carcinoma (CAHECA) frameworks. The accelerated brain structural connectivity framework was evaluated with stripped brain datasets from the Department of Surgery, University of North Carolina, Chapel Hill, USA. Significantly, our proposed framework is able to generate and extract points and edges of datasets, displays nodes and edges in the datasets in the form of a network and clearly maps data volume to the corresponding brain surface. Moreover, with the framework, surfaces of the dataset were simultaneously displayed with the nodes and the edges. The framework is very efficient in providing greater interactivity as a way of representing the nodes and the edges intuitively, all achieved at a considerably interactive speed for instantaneous mapping of the datasets' features. Uniquely, the connectomic algorithm performed remarkably fast with normal hardware requirement specifications.
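A compact sketch of the Delaunay-based connectivity idea, building a graph over 3-D node coordinates; GPU acceleration is omitted, so this is a CPU illustration with SciPy and NetworkX rather than the paper's CUDA implementation:

    import numpy as np
    import networkx as nx
    from scipy.spatial import Delaunay

    # Triangulate 3-D points and connect every pair of vertices that share
    # a simplex (tetrahedron), yielding an edge set for connectivity work.
    def delaunay_graph(points):                 # points: (n, 3) array
        tri = Delaunay(points)
        g = nx.Graph()
        for simplex in tri.simplices:
            for i in range(len(simplex)):
                for j in range(i + 1, len(simplex)):
                    g.add_edge(int(simplex[i]), int(simplex[j]))
        return g

    g = delaunay_graph(np.random.rand(200, 3))  # toy node coordinates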
This dataset supports the modeling study of Seltzer et al. (2016) published in Atmospheric Environment. In this study, techniques typically used for future air quality projections are applied to a historical 11-year period to assess the performance of the modeling system when the driving meteorological conditions are obtained using dynamical downscaling of coarse-scale fields without correcting toward higher resolution observations. The Weather Research and Forecasting model and the Community Multiscale Air Quality model are used to simulate regional climate and air quality over the contiguous United States for 2000-2010. The air quality simulations for that historical period are then compared to observations from four national networks. Comparisons are drawn between defined performance metrics and other published modeling results for predicted ozone, fine particulate matter, and speciated fine particulate matter. The results indicate that the historical air quality simulations driven by dynamically downscaled meteorology are typically within defined modeling performance benchmarks and are consistent with results from other published modeling studies using finer-resolution meteorology. This indicates that the regional climate and air quality modeling framework utilized here does not introduce substantial bias, which provides confidence in the method's use for future air quality projections. This dataset is associated with the following publication: Seltzer, K., C
Solutions for research data from a publisher's perspective
NASA Astrophysics Data System (ADS)
Cotroneo, P.
2015-12-01
Sharing research data has the potential to make research more efficient and reproducible. Elsevier has developed several initiatives to address the different needs of research data users. These include PANGAEA linked data, which provides geo-referenced, citable datasets from earth and life sciences, archived as supplementary data from publications by the PANGAEA data repository; Mendeley Data, which allows users to freely upload and share their data; a database linking program that creates links between articles on ScienceDirect and datasets held in external data repositories such as EarthRef and EarthChem; a pilot for searching for research data through a map interface; an open data pilot that allows authors publishing in Elsevier journals to store and share research data and make this publicly available as a supplementary file alongside their article; and data journals, including Data in Brief, which allow researchers to share their data open access. Through these initiatives, researchers are not only encouraged to share their research data, but also supported in optimizing their research data management. By making data more readily citable and visible, and hence generating citations for authors, these initiatives also aim to ensure that researchers get the recognition they deserve for publishing their data.
Santos, Eduardo Jose Melos Dos; McCabe, Antony; Gonzalez-Galarza, Faviel F; Jones, Andrew R; Middleton, Derek
2016-03-01
The Allele Frequencies Net Database (AFND) is a freely accessible database which stores population frequencies for alleles or genes of the immune system in worldwide populations. Herein we introduce two new tools. We have defined new classifications of data (gold, silver and bronze) to assist users in identifying the most suitable populations for their tasks. The gold standard datasets are defined by allele frequencies summing to 1, sample sizes >50 and high resolution genotyping, while silver standard datasets do not meet gold standard genotyping resolution and/or sample size criteria. The bronze standard datasets are those that could not be classified under the silver or gold standards. The gold standard includes >500 datasets covering over 3 million individuals from >100 countries at one or more of the following loci: HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1 and -DRB1 - with all loci except DPA1 present in more than 220 datasets. Three out of 12 geographic regions have low representation (the majority of their countries having less than five datasets) and the Central Asia region has no representation. There are 18 countries that are not represented by any gold standard datasets but are represented by at least one dataset that is either silver or bronze standard. We also briefly summarize the data held by AFND for KIR genes, alleles and their ligands. Our second new component is a data submission tool to assist users in the collection of the genotypes of the individuals (raw data), facilitating submission of short population reports to Human Immunology, as well as simplifying the submission of population demographics and frequency data. Copyright © 2015 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.
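The gold/silver/bronze rules stated above translate into a simple classifier; a sketch with hypothetical field names:

    # Gold: allele frequencies sum to 1, sample size > 50, high-resolution
    # genotyping. Silver: frequencies sum to 1 but resolution and/or size
    # criteria are missed. Bronze: everything else.
    def classify_dataset(freq_sum, sample_size, high_resolution, tol=1e-3):
        sums_to_one = abs(freq_sum - 1.0) <= tol
        if sums_to_one and sample_size > 50 and high_resolution:
            return "gold"
        if sums_to_one:
            return "silver"
        return "bronze"

    print(classify_dataset(freq_sum=1.0, sample_size=120, high_resolution=True))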
Mars Global Geologic Mapping: Amazonian Results
NASA Technical Reports Server (NTRS)
Tanaka, K. L.; Dohm, J. M.; Irwin, R.; Kolb, E. J.; Skinner, J. A., Jr.; Hare, T. M.
2008-01-01
We are in the second year of a five-year effort to map the geology of Mars using mainly Mars Global Surveyor, Mars Express, and Mars Odyssey imaging and altimetry datasets. Previously, we have reported on details of project management, mapping datasets (local and regional), initial and anticipated mapping approaches, and tactics of map unit delineation and description [1-2]. For example, we have seen how the multiple types and huge quantity of image data as well as more accurate and detailed altimetry data now available allow for broader and deeper geologic perspectives, based largely on improved landform perception, characterization, and analysis. Here, we describe early mapping results, which include updating of previous northern plains mapping [3], including delineation of mainly Amazonian units and regional fault mapping, as well as other advances.
NASA Technical Reports Server (NTRS)
Platnick, Steven; Meyer, Kerry G.; King, Michael D.; Wind, Galina; Amarasinghe, Nandana; Marchant, Benjamin G.; Arnold, G. Thomas; Zhang, Zhibo; Hubanks, Paul A.; Holz, Robert E.; Yang, Ping; Ridgway, William L.; Riedi, Jérôme
2016-01-01
The MODIS Level-2 cloud product (Earth Science Data Set names MOD06 and MYD06 for Terra and Aqua MODIS, respectively) provides pixel-level retrievals of cloud-top properties (day and night pressure, temperature, and height) and cloud optical properties (optical thickness, effective particle radius, and water path for both liquid water and ice cloud thermodynamic phases; daytime only). Collection 6 (C6) reprocessing of the product was completed in May 2014 and March 2015 for MODIS Aqua and Terra, respectively. Here we provide an overview of major C6 optical property algorithm changes relative to the previous Collection 5 (C5) product. Notable C6 optical and microphysical algorithm changes include: (i) new ice cloud optical property models and a more extensive cloud radiative transfer code lookup table (LUT) approach, (ii) improvement in the skill of the shortwave-derived cloud thermodynamic phase, (iii) separate cloud effective radius retrieval datasets for each spectral combination used in previous collections, (iv) separate retrievals for partly cloudy pixels and those associated with cloud edges, (v) failure metrics that provide diagnostic information for pixels having observations that fall outside the LUT solution space, and (vi) enhanced pixel-level retrieval uncertainty calculations. The C6 algorithm changes collectively can result in significant changes relative to C5, though the magnitude depends on the dataset and the pixel's retrieval location in the cloud parameter space. Example Level-2 granule and Level-3 gridded dataset differences between the two collections are shown. While the emphasis is on the suite of cloud optical property datasets, other MODIS cloud datasets are discussed when relevant. PMID:29657349
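A toy sketch of the lookup-table retrieval approach mentioned above: reflectances forward-modelled on an (optical thickness, effective radius) grid are inverted by nearest match. The LUT values below are random placeholders, not radiative-transfer output, and the two-band cost function is a simplification:

    import numpy as np

    # Retrieval grid: cloud optical thickness (tau) and effective radius (reff).
    tau = np.linspace(1, 100, 50)
    reff = np.linspace(4, 30, 27)

    # In practice these come from a radiative transfer code; random here.
    rng = np.random.default_rng(0)
    lut_vis = rng.random((50, 27))    # visible-band reflectance LUT
    lut_swir = rng.random((50, 27))   # shortwave-IR reflectance LUT

    def retrieve(obs_vis, obs_swir):
        # Nearest grid point in the two-band reflectance space.
        cost = (lut_vis - obs_vis) ** 2 + (lut_swir - obs_swir) ** 2
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        return tau[i], reff[j]

    print(retrieve(0.6, 0.3))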
2013-01-01
Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies. PMID:24564333
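The coarsening idea can be illustrated with one level of heavy-edge merging over a read-overlap graph. This is a minimal sketch with an invented merge rule and invented overlap weights, not the authors' exact scheme:

```python
def coarsen(graph):
    """One level of heavy-edge coarsening.
    graph: {node: {neighbour: overlap_weight}} (undirected).
    Returns (coarse_graph, mapping from old node -> supernode)."""
    matched, mapping = set(), {}
    for u in sorted(graph, key=lambda n: -max(graph[n].values(), default=0)):
        if u in matched:
            continue
        # pair u with its heaviest-overlap unmatched neighbour, if any
        cands = [(w, v) for v, w in graph[u].items() if v not in matched]
        if cands:
            _, v = max(cands)
            matched |= {u, v}
            mapping[u] = mapping[v] = ('super', u, v)
        else:
            matched.add(u)
            mapping[u] = u
    coarse = {}
    for u, nbrs in graph.items():           # re-aggregate edge weights
        for v, w in nbrs.items():
            cu, cv = mapping[u], mapping[v]
            if cu != cv:
                coarse.setdefault(cu, {})
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, mapping

reads = {'r1': {'r2': 40, 'r3': 10}, 'r2': {'r1': 40, 'r3': 25},
         'r3': {'r1': 10, 'r2': 25}, 'r4': {}}
print(coarsen(reads)[0])
```

Repeating the step yields the spectrum of graphs the abstract describes, and the supernode membership at each level gives the read clusters.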
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lamarque, Jean-Francois; Dentener, Frank; McConnell, J.R.
2013-08-20
We present multi-model global datasets of nitrogen and sulfate deposition covering time periods from 1850 to 2100, calculated within the Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP). The computed deposition fluxes are compared to surface wet deposition and ice-core measurements. We use a new dataset of wet deposition for 2000-2002 based on critical assessment of the quality of existing regional network data. We show that for present-day (year 2000 ACCMIP time-slice), the ACCMIP results perform similarly to previously published multi-model assessments. The analysis of changes between 1980 and 2000 indicates significant differences between model and measurements over the United States, but less so over Europe. This difference points towards misrepresentation of 1980 NH3 emissions over North America. Based on ice-core records, the 1850 deposition fluxes agree well with Greenland ice cores but the change between 1850 and 2000 seems to be overestimated in the Northern Hemisphere for both nitrogen and sulfur species. Using the Representative Concentration Pathways to define the projected climate and atmospheric chemistry related emissions and concentrations, we find large regional nitrogen deposition increases in 2100 in Latin America, Africa and parts of Asia under some of the scenarios considered. Increases in South Asia are especially large, and are seen in all scenarios, with 2100 values more than double 2000 in some scenarios and reaching >1300 mgN/m2/yr averaged over regional to continental scale regions in RCP 2.6 and 8.5, ~30-50% larger than the values in any region currently (2000). Despite known issues, the new ACCMIP deposition dataset provides novel, consistent and evaluated global gridded deposition fields for use in a wide range of climate and ecological studies.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Setati, Mathabatha E.; Jacobson, Daniel; Bauer, Florian F.
Recent microbiomic research of agricultural habitats has highlighted tremendous microbial biodiversity associated with such ecosystems. In addition, data generated in vineyards have highlighted significant regional differences in vineyard biodiversity, hinting at the possibility that such differences might be responsible for regional differences in wine style and character, a hypothesis referred to as "microbial terroir." The current study further contributes to this body of work by comparing the mycobiome associated with South African (SA) Cabernet Sauvignon grapes in three neighboring vineyards that employ different agronomic approaches, and comparing the outcome with similar data sets from Californian vineyards. The aim of this study was to fully characterize the mycobiomes associated with the grapes from these vineyards. The data revealed approximately 10 times more fungal diversity than what is typically retrieved from culture-based studies. The Biodynamic vineyard was found to harbor a more diverse fungal community (H = 2.6) than the conventional (H = 2.1) and integrated (H = 1.8) vineyards. The data show that ascomycota are the most abundant phylum in the three vineyards, with Aureobasidium pullulans and its close relative Kabatiella microsticta being the most dominant fungi. This is the first report to reveal a high incidence of K. microsticta in the grape/wine ecosystem. Different common wine yeast species, such as Metschnikowia pulcherrima and Starmerella bacillaris dominated the mycobiome in the three vineyards. The data show that the filamentous fungi are the most abundant community in grape must although they are not regarded as relevant during wine fermentation. Comparison of metagenomic datasets from the three SA vineyards and previously published data from Californian vineyards revealed only 25% of the fungi in the SA dataset was also present in the Californian dataset, with greater variation evident amongst ubiquitous epiphytic fungi.
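The H values quoted are Shannon diversity indices, H = -Σ p_i ln p_i over taxon relative abundances. A small sketch on made-up abundance profiles (not the study's data) shows how evenness drives H:

```python
import math

def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxon abundances."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c)

# illustrative abundance profiles only:
print(round(shannon([30, 25, 20, 15, 10]), 2))  # more even -> higher H
print(round(shannon([80, 10, 5, 3, 2]), 2))     # one dominant taxon -> lower H
```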
NASA Astrophysics Data System (ADS)
Tóbiás, Roland; Furtenbacher, Tibor; Császár, Attila G.; Naumenko, Olga V.; Tennyson, Jonathan; Flaud, Jean-Marie; Kumar, Praveen; Poirier, Bill
2018-03-01
A critical evaluation and validation of the complete set of previously published experimental rotational-vibrational line positions is reported for the four stable sulphur isotopologues of the semirigid SO2 molecule - i.e., 32S16O2, 33S16O2, 34S16O2, and 36S16O2. The experimentally measured, assigned, and labeled transitions are collated from 43 sources. The 32S16O2, 33S16O2, 34S16O2, and 36S16O2 datasets contain 40,269, 15,628, 31,080, and 31 lines, respectively. Of the datasets collated, only the extremely limited 36S16O2 dataset is not subjected to a detailed analysis. As part of a detailed analysis of the experimental spectroscopic networks corresponding to the ground electronic states of the 32S16O2, 33S16O2, and 34S16O2 isotopologues, the MARVEL (Measured Active Rotational-Vibrational Energy Levels) procedure is used to determine the rovibrational energy levels. The rovibrational levels and their vibrational parent and asymmetric-top quantum numbers are compared to ones obtained from accurate variational nuclear-motion computations as well as to results of carefully designed effective Hamiltonian models. The rovibrational energy levels of the three isotopologues having the same labels are also compared against each other to ensure self-consistency. This careful, multifaceted analysis gives rise to 15,130, 5852, and 10,893 validated rovibrational energy levels, with a typical accuracy of a few 0.0001 cm⁻¹, for 32S16O2, 33S16O2, and 34S16O2, respectively. The extensive list of validated experimental lines and empirical (MARVEL) energy levels of the S16O2 isotopologues studied are deposited in the Supplementary Material of this article, as well as in the distributed information system ReSpecTh (http://respecth.hu).
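At its core, the MARVEL procedure is a weighted least-squares inversion from measured transition wavenumbers to energy levels on a spectroscopic network. A toy Python sketch with three invented transitions and the ground level pinned at zero:

```python
import numpy as np

# transitions: (upper level, lower level, wavenumber, uncertainty) - invented
transitions = [(1, 0, 10.02, 0.01), (2, 1, 15.01, 0.01), (2, 0, 25.00, 0.02)]
n_levels = 3

A = np.zeros((len(transitions), n_levels))
y = np.zeros(len(transitions))
w = np.zeros(len(transitions))
for k, (up, lo, nu, sig) in enumerate(transitions):
    A[k, up], A[k, lo], y[k], w[k] = 1.0, -1.0, nu, 1.0 / sig

# weighted least squares with the ground-state energy E_0 pinned to zero
Aw, yw = A[:, 1:] * w[:, None], y * w
levels, *_ = np.linalg.lstsq(Aw, yw, rcond=None)
print(np.concatenate([[0.0], levels]))  # empirical energy levels
```

The actual procedure additionally handles component analysis of the network, outlier detection, and uncertainty propagation, which this sketch omits.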
NASA Technical Reports Server (NTRS)
Lamarque, J.-F.; Dentener, F.; McConnell, J.; Ro, C.-U.; Shaw, M.; Vet, R.; Bergmann, D.; Cameron-Smith, P.; Doherty, R.; Faluvegi, G.;
2013-01-01
We present multi-model global datasets of nitrogen and sulfate deposition covering time periods from 1850 to 2100, calculated within the Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP). The computed deposition fluxes are compared to surface wet deposition and ice-core measurements. We use a new dataset of wet deposition for 2000-2002 based on critical assessment of the quality of existing regional network data. We show that for present-day (year 2000 ACCMIP time-slice), the ACCMIP results perform similarly to previously published multi-model assessments. For this time slice, we find a multi-model mean deposition of 50 Tg(N) yr⁻¹ from nitrogen oxide emissions, 60 Tg(N) yr⁻¹ from ammonia emissions, and 83 Tg(S) yr⁻¹ from sulfur emissions. The analysis of changes between 1980 and 2000 indicates significant differences between model and measurements over the United States but less so over Europe. This difference points towards misrepresentation of 1980 NH3 emissions over North America. Based on ice-core records, the 1850 deposition fluxes agree well with Greenland ice cores but the change between 1850 and 2000 seems to be overestimated in the Northern Hemisphere for both nitrogen and sulfur species. Using the Representative Concentration Pathways to define the projected climate and atmospheric chemistry related emissions and concentrations, we find large regional nitrogen deposition increases in 2100 in Latin America, Africa and parts of Asia under some of the scenarios considered. Increases in South Asia are especially large, and are seen in all scenarios, with 2100 values more than double 2000 in some scenarios and reaching >1300 mg(N) m⁻² yr⁻¹ averaged over regional to continental scale regions in RCP 2.6 and 8.5, ~30-50% larger than the values in any region currently (2000). The new ACCMIP deposition dataset provides novel, consistent and evaluated global gridded deposition fields for use in a wide range of climate and ecological studies.
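To connect the regional flux unit to the global budgets quoted above, a back-of-envelope conversion over an illustrative 10 million km² region (the region size is an assumption for the arithmetic, not a figure from the study):

```python
# flux in mg(N) per m^2 per yr, integrated over an illustrative region
flux_mg_per_m2 = 1300.0
area_m2 = 10e6 * 1e6                 # 10 million km^2 expressed in m^2
total_mg = flux_mg_per_m2 * area_m2
total_tg = total_mg * 1e-3 * 1e-12   # mg -> g -> Tg (1 Tg = 1e12 g)
print(total_tg)                      # 13 Tg(N)/yr on that region alone
```

That single region would thus receive roughly a quarter of the 50 Tg(N) yr⁻¹ global total from nitrogen oxide emissions, which is why the South Asian increases are notable.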
Mahony, Stephen; Foley, Nicole M; Biju, S D; Teeling, Emma C
2017-03-01
Molecular dating studies typically need fossils to calibrate the analyses. Unfortunately, the fossil record is extremely poor or presently nonexistent for many species groups, rendering such dating analysis difficult. One such group is the Asian horned frogs (Megophryinae). Sampling all generic nomina, we combined a novel ∼5 kb dataset composed of four nuclear and three mitochondrial gene fragments to produce a robust phylogeny, with an extensive external morphological study to produce a working taxonomy for the group. Expanding the molecular dataset to include out-groups of fossil-represented ancestral anuran families, we compared the priorless RelTime dating method with the widely used prior-based Bayesian timetree method, MCMCtree, utilizing a novel combination of fossil priors for anuran phylogenetic dating. The phylogeny was then subjected to ancestral phylogeographic analyses, and dating estimates were compared with likely biogeographic vicariant events. Phylogenetic analyses demonstrated that previously proposed systematic hypotheses were incorrect due to the paraphyly of genera. Molecular phylogenetic, morphological, and timetree results support the recognition of Megophryinae as a single genus, Megophrys, with a subgenus level classification. Timetree results using RelTime better corresponded with the known fossil record for the out-group anuran tree. For the priorless in-group, it also outperformed MCMCtree when node date estimates were compared with likely influential historical biogeographic events, providing novel insights into the evolutionary history of this pan-Asian anuran group. Given a relatively small molecular dataset, and limited prior knowledge, this study demonstrates that the computationally rapid RelTime dating tool may outperform more popular and complex prior reliant timetree methodologies. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Setati, Mathabatha E.; Jacobson, Daniel; Bauer, Florian F.
2015-01-01
Recent microbiomic research of agricultural habitats has highlighted tremendous microbial biodiversity associated with such ecosystems. Data generated in vineyards have furthermore highlighted significant regional differences in vineyard biodiversity, hinting at the possibility that such differences might be responsible for regional differences in wine style and character, a hypothesis referred to as “microbial terroir.” The current study further contributes to this body of work by comparing the mycobiome associated with South African (SA) Cabernet Sauvignon grapes in three neighboring vineyards that employ different agronomic approaches, and comparing the outcome with similar data sets from Californian vineyards. The aim of this study was to fully characterize the mycobiomes associated with the grapes from these vineyards. The data revealed approximately 10 times more fungal diversity than what is typically retrieved from culture-based studies. The Biodynamic vineyard was found to harbor a more diverse fungal community (H = 2.6) than the conventional (H = 2.1) and integrated (H = 1.8) vineyards. The data show that ascomycota are the most abundant phylum in the three vineyards, with Aureobasidium pullulans and its close relative Kabatiella microsticta being the most dominant fungi. This is the first report to reveal a high incidence of K. microsticta in the grape/wine ecosystem. Different common wine yeast species, such as Metschnikowia pulcherrima and Starmerella bacillaris dominated the mycobiome in the three vineyards. The data show that the filamentous fungi are the most abundant community in grape must although they are not regarded as relevant during wine fermentation. Comparison of metagenomic datasets from the three SA vineyards and previously published data from Californian vineyards revealed only 25% of the fungi in the SA dataset was also present in the Californian dataset, with greater variation evident amongst ubiquitous epiphytic fungi. PMID:26648930
Setati, Mathabatha E.; Jacobson, Daniel; Bauer, Florian F.
2015-11-30
Recent microbiomic research of agricultural habitats has highlighted tremendous microbial biodiversity associated with such ecosystems. In addition, data generated in vineyards have highlighted significant regional differences in vineyard biodiversity, hinting at the possibility that such differences might be responsible for regional differences in wine style and character, a hypothesis referred to as "microbial terroir." The current study further contributes to this body of work by comparing the mycobiome associated with South African (SA) Cabernet Sauvignon grapes in three neighboring vineyards that employ different agronomic approaches, and comparing the outcome with similar data sets from Californian vineyards. The aim of this study was to fully characterize the mycobiomes associated with the grapes from these vineyards. The data revealed approximately 10 times more fungal diversity than what is typically retrieved from culture-based studies. The Biodynamic vineyard was found to harbor a more diverse fungal community (H = 2.6) than the conventional (H = 2.1) and integrated (H = 1.8) vineyards. The data show that ascomycota are the most abundant phylum in the three vineyards, with Aureobasidium pullulans and its close relative Kabatiella microsticta being the most dominant fungi. This is the first report to reveal a high incidence of K. microsticta in the grape/wine ecosystem. Different common wine yeast species, such as Metschnikowia pulcherrima and Starmerella bacillaris dominated the mycobiome in the three vineyards. The data show that the filamentous fungi are the most abundant community in grape must although they are not regarded as relevant during wine fermentation. Comparison of metagenomic datasets from the three SA vineyards and previously published data from Californian vineyards revealed only 25% of the fungi in the SA dataset was also present in the Californian dataset, with greater variation evident amongst ubiquitous epiphytic fungi.
Historical greenhouse gas concentrations for climate modelling (CMIP6)
NASA Astrophysics Data System (ADS)
Meinshausen, Malte; Vogel, Elisabeth; Nauels, Alexander; Lorbacher, Katja; Meinshausen, Nicolai; Etheridge, David M.; Fraser, Paul J.; Montzka, Stephen A.; Rayner, Peter J.; Trudinger, Cathy M.; Krummel, Paul B.; Beyerle, Urs; Canadell, Josep G.; Daniel, John S.; Enting, Ian G.; Law, Rachel M.; Lunder, Chris R.; O'Doherty, Simon; Prinn, Ron G.; Reimann, Stefan; Rubino, Mauro; Velders, Guus J. M.; Vollmer, Martin K.; Wang, Ray H. J.; Weiss, Ray
2017-05-01
Atmospheric greenhouse gas (GHG) concentrations are at unprecedented, record-high levels compared to the last 800 000 years. Those elevated GHG concentrations warm the planet and - partially offset by net cooling effects by aerosols - are largely responsible for the observed warming over the past 150 years. An accurate representation of GHG concentrations is hence important to understand and model recent climate change. So far, community efforts to create composite datasets of GHG concentrations with seasonal and latitudinal information have focused on marine boundary layer conditions and recent trends since the 1980s. Here, we provide consolidated datasets of historical atmospheric concentrations (mole fractions) of 43 GHGs to be used in the Climate Model Intercomparison Project - Phase 6 (CMIP6) experiments. The presented datasets are based on AGAGE and NOAA networks, firn and ice core data, and archived air data, and a large set of published studies. In contrast to previous intercomparisons, the new datasets are latitudinally resolved and include seasonality. We focus on the period 1850-2014 for historical CMIP6 runs, but data are also provided for the last 2000 years. We provide consolidated datasets in various spatiotemporal resolutions for carbon dioxide (CO2), methane (CH4) and nitrous oxide (N2O), as well as 40 other GHGs, namely 17 ozone-depleting substances, 11 hydrofluorocarbons (HFCs), 9 perfluorocarbons (PFCs), sulfur hexafluoride (SF6), nitrogen trifluoride (NF3) and sulfuryl fluoride (SO2F2). In addition, we provide three equivalence species that aggregate concentrations of GHGs other than CO2, CH4 and N2O, weighted by their radiative forcing efficiencies. For the year 1850, which is used for pre-industrial control runs, we estimate annual global-mean surface concentrations of CO2 at 284.3 ppm, CH4 at 808.2 ppb and N2O at 273.0 ppb. The data are available at https://esgf-node.llnl.gov/search/input4mips/ and http://www.climatecollege.unimelb.edu.au/cmip6. While the minimum CMIP6 recommendation is to use the global- and annual-mean time series, modelling groups can also choose our monthly and latitudinally resolved concentrations, which imply a stronger radiative forcing in the Northern Hemisphere winter (due to the latitudinal gradient and seasonality).
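When using the monthly, latitudinally resolved product, a global annual mean must be recovered with area weighting. A minimal sketch with fabricated CH4 values on 5° latitude bands (the actual files at the URLs above define their own grids and variable names):

```python
import numpy as np

# hypothetical monthly, latitudinally resolved CH4 mole fractions (ppb)
lats = np.linspace(-87.5, 87.5, 36)            # 5-degree band centres
months = 12
conc = (1800
        + 10 * np.sin(np.radians(lats))[:, None]                 # N-S gradient
        + 5 * np.cos(2 * np.pi * np.arange(months) / 12)[None, :])  # seasonality

# area weights proportional to cos(latitude) for equal-angle bands
w = np.cos(np.radians(lats))
global_annual_mean = (conc * w[:, None]).sum() / (w.sum() * months)
print(round(global_annual_mean, 2))  # ppb; the gradient averages out
```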
ERIC Educational Resources Information Center
Seah, Lay Hoon; Clarke, David; Hart, Christina
2015-01-01
This study examines how a class of Grade 7 students employed linguistic resources to explain density differences. Drawing from the same data-set as a previous study, we take a language perspective to investigate the challenges students face in learning the concept of density. Our study thus complements previous research on learning about…
Parallel Visualization of Large-Scale Aerodynamics Calculations: A Case Study on the Cray T3E
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Crockett, Thomas W.
1999-01-01
This paper reports the performance of a parallel volume rendering algorithm for visualizing a large-scale, unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times larger than the one we examined previously. This high resolution dataset also allows us to see fine, three-dimensional features in the flow field. All our tests were performed on the Silicon Graphics Inc. (SGI)/Cray T3E operated by NASA's Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was achieved with a parallel overhead of 26%.
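A quick back-of-envelope from the figures quoted, treating the rendering rate and overhead as given:

```python
tets = 18_000_000          # dataset size (over 18 million tetrahedra)
rate = 9_000_000           # tetrahedra rendered per second on 511 PEs
procs, overhead = 511, 0.26

print(tets / rate)                # ~2 s to render the full unstructured grid
print(round(rate / procs))        # ~17.6k tetrahedra/s per processor
print((1 - overhead) * 100, "%")  # useful-work fraction implied by 26% overhead
```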
Time-Series Analysis: A Cautionary Tale
NASA Technical Reports Server (NTRS)
Damadeo, Robert
2015-01-01
Time-series analysis has often been a useful tool in atmospheric science for deriving long-term trends in various atmospherically important parameters (e.g., temperature or the concentration of trace gas species). In particular, time-series analysis has been repeatedly applied to satellite datasets in order to derive the long-term trends in stratospheric ozone, which is a critical atmospheric constituent. However, many of the potential pitfalls relating to the non-uniform sampling of the datasets were often ignored and the results presented by the scientific community have been unknowingly biased. A newly developed and more robust application of this technique is applied to the Stratospheric Aerosol and Gas Experiment (SAGE) II version 7.0 ozone dataset and the previous biases and newly derived trends are presented.
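The core of such a trend analysis is a regression of the form trend plus seasonal harmonics, and non-uniform sampling is exactly where naive binning goes wrong. A minimal sketch on synthetic, irregularly sampled data (not the SAGE II record):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10, 300))   # years, deliberately non-uniform
y = 0.5 * t + 2 * np.sin(2 * np.pi * t) + rng.normal(0, 0.5, t.size)

# design matrix: intercept, linear trend, one annual harmonic pair
X = np.column_stack([np.ones_like(t), t,
                     np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(coef[1], 3))   # recovered trend, ~0.5 per year
```

Fitting the harmonics jointly with the trend, rather than averaging first, is what keeps irregular sampling from aliasing the seasonal cycle into the trend estimate.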
"Where did my data layer come from?" The semantics of data release
NASA Astrophysics Data System (ADS)
Leadbetter, Adam; Buck, Justin
2015-04-01
In his lecture, "Theory of Creative Fitting" (Margulis, Corner & Hawthorne, 2006), Ian McHarg introduced his vision for cross-disciplinary data and information sharing networks with the end goal of producing detailed overlay maps for the purposes of ecological architectural planning. Within McHarg's networks, experts in various fields, such as hydrology or surface geology, would provide data layers to the final overlay map with full provenance, such that the users of the overlay maps would know the originator of the data and the "value systems" by which the data were created, and could place their trust in the outcomes. In the light of McHarg's statements, and in order to allow the encoding of value systems in a cyber-GIS, analyses of data quality (Giarlo, 2013); data publication networks (Reinsfelder, 2012); trust in collaborative research networks (Leadbetter, 2015); and the metaphors of data publication, data release and data ecosystems (Parsons & Fox, 2013) have been synthesised into a logical model of the data release lifecycle. This model concerns the actors in the data release process; the data-information-knowledge ecosystem through the various stages of the data release process; and the impact of data release on perceptions of trust through the data release lifecycle. The data-information-knowledge ecosystem describes how the collection of data can be presented in new ways to form information products, and how these information products can inform conversations amongst information-consumers who integrate the information into new knowledge. The actors concerned in the process comprise researchers; data publishers; and academic publishers & academic administrators. Finally, the lifecycle of data release involves the initial release of a data-layer, possibly with a Persistent Identifier (PID) more generic than a Digital Object Identifier (DOI). A data description paper can be written about the dataset, which then necessitates the assignment of a DOI to the dataset; the DOI can be seen as an indicator of trust through "benevolence". A technical document citing the dataset may then be informed by the dataset release or the dataset description paper. These citations may show the "competence" (in terms of a trust model) of the original datasets, and the dataset description papers or other technical articles show the integrity of the dataset. The synthesised logical model has been represented in freely available ontologies, such that data layers can be annotated with metadata about their provenance and stage within the data release lifecycle before incorporation into a cyber-GIS, in which distributed data providers provide for a collaborative research environment. References: Giarlo, M. (2013). Academic libraries as data quality hubs. Journal of Librarianship and Scholarly Communication 1(3): eP1059. doi: 10.7710/2162-3309.1059. Leadbetter, A. (2015). Examining trust in collaborative research networks. In P. Diviacco, P. Fox, C. Pshenichny, and A. Leadbetter (Eds.), Collaborative Knowledge in Scientific Research Networks. Hershey, PA: IGI Global. doi: 10.4018/978-1-4666-6567-5.ch002. Margulis, L., Corner, J. and Hawthorne, B. (Eds.) (2006). Ian McHarg: conversations with students. Dwelling with nature. New York, NY: Princeton Architectural Press. Parsons, M., and Fox, P. (2013). Is data publication the right metaphor? Data Science Journal 12, 32-46. doi: 10.2481/dsj.WDS-042. Reinsfelder, T. (2012). Open access publishing practices in a complex environment: conditions, barriers and bases of power. Journal of Librarianship and Scholarly Communication 1(1): eP1029. doi: 10.7710/2162-3309.1029.
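The logical model lends itself to a triple-style encoding of provenance and lifecycle stage. A toy sketch with an invented vocabulary and placeholder identifiers (the paper's actual ontologies are not reproduced here):

```python
# hypothetical vocabulary and identifiers, for illustration only
triples = [
    ("dataset:42", "dct:identifier", "doi:10.xxxx/placeholder"),
    ("dataset:42", "release:stage", "release:DataDescriptionPaperPublished"),
    ("dataset:42", "prov:wasAttributedTo", "orcid:0000-0000-0000-0000"),
    ("paper:7", "cito:citesAsDataSource", "dataset:42"),
]

def stage(dataset, triples):
    """Return the lifecycle stage(s) recorded for a dataset."""
    return [o for s, p, o in triples
            if s == dataset and p == "release:stage"]

print(stage("dataset:42", triples))
```

A cyber-GIS ingesting a layer could query such annotations to decide how much trust ("benevolence", "competence", integrity) to attach to the layer before compositing it.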
Data for Regional Heat flow Studies in and around Japan and its relationship to seismogenic layer
NASA Astrophysics Data System (ADS)
Tanaka, A.
2017-12-01
Heat flow is a fundamental parameter to constrain the thermal structure of the lithosphere. It also provides a constraint on lithospheric rheology, which is sensitive to temperature. General features of the heat flow distribution in and around Japan had been revealed by the early 1970s, and heat flow data have been continuously updated since then, mainly through compilation of published data and new investigations. These include additional data that were not published individually but were included in site-specific reports. Also, thermal conductivity measurements were conducted on cores from boreholes using a line-source device with a half-space type box probe and an optical scanning device, and previously unpublished thermal conductivities were compiled. More than 10 years have passed since the last published compilation and analysis of heat flow data, that of Tanaka et al. (2004), which presented all of the heat flow data in the northwestern Pacific area (from 0° to 60°N and from 120° to 160°E) and geothermal gradient data in and around Japan. Because the added data and information are drawn from various sources, the updated database is compiled as separate datasets: heat flow, geothermal gradient, and thermal conductivity. The updated and improved database represents a considerable improvement over past updates and presents an opportunity to revisit the thermal state of the lithosphere along with other geophysical/geochemical constraints on heat flow extrapolation. The spatial distribution of the cut-off depth of shallow seismicity in Japan, derived using relocated hypocentres from the last decade (Omuralieva et al., 2012), and this updated database are used to quantify the concept of temperature as a fundamental parameter for determining the seismogenic thickness.
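Relating surface heat flow to the seismogenic cut-off depth ultimately runs through a geotherm. A minimal sketch of the steady-state 1-D conductive case, with purely illustrative parameter values rather than the study's:

```python
import numpy as np

# steady-state 1-D conductive geotherm with uniform heat production A:
# T(z) = T0 + (q0 / k) * z - (A / (2 * k)) * z**2
T0 = 10.0     # surface temperature, deg C
q0 = 70e-3    # surface heat flow, W/m^2 (70 mW/m^2, illustrative)
k = 2.5       # thermal conductivity, W/m/K (illustrative)
A = 1e-6      # radiogenic heat production, W/m^3 (illustrative)

z = np.linspace(0, 15e3, 4)                  # depth, m (0 to 15 km)
T = T0 + q0 / k * z - A / (2 * k) * z ** 2
print(T)  # temperatures at 0, 5, 10, 15 km depth
```

Comparing such temperature-depth curves with the depth at which shallow seismicity cuts off is one standard way to tie heat flow to seismogenic thickness.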
Data discovery with DATS: exemplar adoptions and lessons learned.
Gonzalez-Beltran, Alejandra N; Campbell, John; Dunn, Patrick; Guijarro, Diana; Ionescu, Sanda; Kim, Hyeoneui; Lyle, Jared; Wiser, Jeffrey; Sansone, Susanna-Assunta; Rocca-Serra, Philippe
2018-01-01
The DAta Tag Suite (DATS) is a model supporting dataset description, indexing, and discovery. It is available as an annotated serialization with schema.org, a vocabulary used by major search engines, thus making the datasets discoverable on the web. DATS underlies DataMed, the National Institutes of Health Big Data to Knowledge Data Discovery Index prototype, which aims to provide a "PubMed for datasets." The experience gained while indexing a heterogeneous range of >60 repositories in DataMed helped in evaluating DATS's entities, attributes, and scope. In this work, 3 additional exemplary and diverse data sources were mapped to DATS by their representatives or experts, offering a deep scan of DATS fitness against a new set of existing data. The procedure, including feedback from users and implementers, resulted in DATS implementation guidelines and best practices, and identification of a path for evolving and optimizing the model. Finally, the work exposed additional needs when defining datasets for indexing, especially in the context of clinical and observational information. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association.
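For flavour, here is what a schema.org-annotated dataset record of the kind DATS serializes to can look like. The field values below are illustrative inventions, not drawn from DataMed or the DATS documentation:

```python
import json

# a minimal schema.org Dataset record; values are placeholders
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example clinical imaging dataset",
    "description": "De-identified MRI scans with accompanying study metadata.",
    "identifier": "https://doi.org/10.xxxx/placeholder",
    "keywords": ["MRI", "clinical study"],
    "creator": {"@type": "Organization", "name": "Example Repository"},
}
print(json.dumps(record, indent=2))
```

Because major search engines index schema.org markup, embedding such a block in a dataset landing page is what makes the dataset discoverable on the web, as the abstract notes.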
Murray, David; Stankovic, Lina; Stankovic, Vladimir
2017-01-01
Smart meter roll-outs provide easy access to granular meter measurements, enabling advanced energy services, ranging from demand response measures, tailored energy feedback and smart home/building automation. To design such services, train and validate models, access to data that resembles what is expected of smart meters, collected in a real-world setting, is necessary. The REFIT electrical load measurements dataset described in this paper includes whole house aggregate loads and nine individual appliance measurements at 8-second intervals per house, collected continuously over a period of two years from 20 houses. During monitoring, the occupants were conducting their usual routines. At the time of publishing, the dataset has the largest number of houses monitored in the United Kingdom at less than 1-minute intervals over a period greater than one year. The dataset comprises 1,194,958,790 readings, that represent over 250,000 monitored appliance uses. The data is accessible in an easy-to-use comma-separated format, is time-stamped and cleaned to remove invalid measurements, correctly label appliance data and fill in small gaps of missing data. PMID:28055033
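A minimal sketch of loading and downsampling one such house file with pandas. The file name and column names are assumptions for illustration; consult the dataset's documentation for the actual headers:

```python
import pandas as pd

# "House_1.csv" and the "Time" column are hypothetical placeholders
df = pd.read_csv("House_1.csv")
df["Time"] = pd.to_datetime(df["Time"])
df = df.set_index("Time")

# resample the ~8-second readings to 1-minute means for model training
minutely = df.resample("1min").mean()
print(minutely.head())
```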
NASA Astrophysics Data System (ADS)
Murray, David; Stankovic, Lina; Stankovic, Vladimir
2017-01-01
Smart meter roll-outs provide easy access to granular meter measurements, enabling advanced energy services, ranging from demand response measures, tailored energy feedback and smart home/building automation. To design such services, train and validate models, access to data that resembles what is expected of smart meters, collected in a real-world setting, is necessary. The REFIT electrical load measurements dataset described in this paper includes whole house aggregate loads and nine individual appliance measurements at 8-second intervals per house, collected continuously over a period of two years from 20 houses. During monitoring, the occupants were conducting their usual routines. At the time of publishing, the dataset has the largest number of houses monitored in the United Kingdom at less than 1-minute intervals over a period greater than one year. The dataset comprises 1,194,958,790 readings, that represent over 250,000 monitored appliance uses. The data is accessible in an easy-to-use comma-separated format, is time-stamped and cleaned to remove invalid measurements, correctly label appliance data and fill in small gaps of missing data.
Murray, David; Stankovic, Lina; Stankovic, Vladimir
2017-01-05
Smart meter roll-outs provide easy access to granular meter measurements, enabling advanced energy services, ranging from demand response measures, tailored energy feedback and smart home/building automation. To design such services, train and validate models, access to data that resembles what is expected of smart meters, collected in a real-world setting, is necessary. The REFIT electrical load measurements dataset described in this paper includes whole house aggregate loads and nine individual appliance measurements at 8-second intervals per house, collected continuously over a period of two years from 20 houses. During monitoring, the occupants were conducting their usual routines. At the time of publishing, the dataset has the largest number of houses monitored in the United Kingdom at less than 1-minute intervals over a period greater than one year. The dataset comprises 1,194,958,790 readings, that represent over 250,000 monitored appliance uses. The data is accessible in an easy-to-use comma-separated format, is time-stamped and cleaned to remove invalid measurements, correctly label appliance data and fill in small gaps of missing data.
Daily Temperature and Precipitation Data for 518 Russian Meteorological Stations (1881 - 2010)
Bulygina, O. N. [All-Russian Research Institute of Hydrometeorological Information-World Data Centre; Razuvaev, V. N. [All-Russian Research Institute of Hydrometeorological Information-World Data Centre
2012-01-01
Over the past several decades, many climate datasets have been exchanged directly between the principal climate data centers of the United States (NOAA's National Climatic Data Center (NCDC)) and the former-USSR/Russia (All-Russian Research Institute for Hydrometeorological Information-World Data Center (RIHMI-WDC)). This data exchange has its roots in a bilateral initiative known as the Agreement on Protection of the Environment (Tatusko 1990). CDIAC has partnered with NCDC and RIHMI-WDC since the early 1990s to help make former-USSR climate datasets available to the public. The first former-USSR daily temperature and precipitation dataset released by CDIAC was initially created within the framework of the international cooperation between RIHMI-WDC and CDIAC and was published by CDIAC as NDP-040, consisting of data from 223 stations over the former USSR whose data were published in USSR Meteorological Monthly (Part 1: Daily Data). The database presented here consists of records from 518 Russian stations (excluding the former-USSR stations outside the Russian territory contained in NDP-040), for the most part extending through 2010. Records not extending through 2010 result from stations having closed or else their data were not published in Meteorological Monthly of CIS Stations (Part 1: Daily Data). The database was created from the digital media of the State Data Holding. The station inventory was arrived at using (a) the list of Roshydromet stations that are included in the Global Climate Observation Network (this list was approved by the Head of Roshydromet on 25 March 2004) and (b) the list of Roshydromet benchmark meteorological stations prepared by V.I. Kodratyuk, Head of the Department at Voeikov Main Geophysical Observatory.
Illuminating the Depths of the MagIC (Magnetics Information Consortium) Database
NASA Astrophysics Data System (ADS)
Koppers, A. A. P.; Minnett, R.; Jarboe, N.; Jonestrask, L.; Tauxe, L.; Constable, C.
2015-12-01
The Magnetics Information Consortium (http://earthref.org/MagIC/) is a grass-roots cyberinfrastructure effort envisioned by the paleo-, geo-, and rock magnetic scientific community. Its mission is to archive their wealth of peer-reviewed raw data and interpretations from magnetics studies on natural and synthetic samples. Many of these valuable data are legacy datasets that were never published in their entirety, some resided in other databases that are no longer maintained, and others were never digitized from the field notebooks and lab work. Due to the volume of data collected, most studies, modern and legacy, only publish the interpreted results and, occasionally, a subset of the raw data. MagIC is making an extraordinary effort to archive these data in a single data model, including the raw instrument measurements if possible. This facilitates the reproducibility of the interpretations, the re-interpretation of the raw data as the community introduces new techniques, and the compilation of heterogeneous datasets that are otherwise distributed across multiple formats and physical locations. MagIC has developed tools to assist the scientific community in many stages of their workflow. Contributors easily share studies (in a private mode if so desired) in the MagIC Database with colleagues and reviewers prior to publication, publish the data online after the study is peer reviewed, and visualize their data in the context of the rest of the contributions to the MagIC Database. From organizing their data in the MagIC Data Model with an online editable spreadsheet, to validating the integrity of the dataset with automated plots and statistics, MagIC is continually lowering the barriers to transforming dark data into transparent and reproducible datasets. Additionally, this web application generalizes to other databases in MagIC's umbrella website (EarthRef.org) so that the Geochemical Earth Reference Model (http://earthref.org/GERM/) portal, Seamount Biogeosciences Network (http://earthref.org/SBN/), EarthRef Digital Archive (http://earthref.org/ERDA/) and EarthRef Reference Database (http://earthref.org/ERR/) benefit from its development.
Guerrero, Rafael; Almansa, Julio F; Torres, Javier; Lallena, Antonio M
2014-12-01
⁶⁰Co sources are being used as an alternative to ¹⁹²Ir sources in high dose rate brachytherapy treatments. In a recent document from AAPM and ESTRO, a consensus dataset for the ⁶⁰Co BEBIG (model Co0.A86) high dose rate source was prepared by using results taken from different publications due to discrepancies observed among them. The aim of the present work is to provide a new calculation of the dosimetric characteristics of that ⁶⁰Co source according to the recommendations of the AAPM and ESTRO report. Radial dose function, anisotropy function, air-kerma strength, dose rate constant and absorbed dose rate in water have been calculated and compared to the results of previous works. Simulations using the two different geometries considered by other authors have been carried out and the effect of the cable density and length has been studied. Copyright © 2014 Associazione Italiana di Fisica Medica. Published by Elsevier Ltd. All rights reserved.
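The quantities listed (radial dose function, anisotropy function, air-kerma strength, dose rate constant) combine in the AAPM TG-43 formalism. A sketch of the point-source approximation with invented parameter values, not the consensus data for this source:

```python
import numpy as np

# TG-43 point-source approximation:
# dose_rate(r) = S_K * Lambda * (r0 / r)**2 * g(r) * phi_an(r)
S_K = 40000.0        # air-kerma strength, U (hypothetical)
Lambda = 1.087       # dose rate constant, cGy h^-1 U^-1 (illustrative)
r0 = 1.0             # reference distance, cm

r_grid = np.array([0.5, 1.0, 2.0, 5.0])      # cm
g = np.array([1.01, 1.00, 0.98, 0.90])       # radial dose function (made up)
phi_an = np.array([0.99, 0.98, 0.97, 0.95])  # 1D anisotropy function (made up)

dose_rate = S_K * Lambda * (r0 / r_grid) ** 2 * g * phi_an
print(dose_rate)  # cGy/h at each distance from the source
```

The full line-source formalism replaces the inverse-square term with a geometry function G(r, θ)/G(r0, θ0) and the 1-D anisotropy function with a 2-D F(r, θ).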
cisTEM, user-friendly software for single-particle image processing.
Grant, Timothy; Rohou, Alexis; Grigorieff, Nikolaus
2018-03-07
We have developed new open-source software called cisTEM (computational imaging system for transmission electron microscopy) for the processing of data for high-resolution electron cryo-microscopy and single-particle averaging. cisTEM features a graphical user interface that is used to submit jobs, monitor their progress, and display results. It implements a full processing pipeline including movie processing, image defocus determination, automatic particle picking, 2D classification, ab-initio 3D map generation from random parameters, 3D classification, and high-resolution refinement and reconstruction. Some of these steps implement newly-developed algorithms; others were adapted from previously published algorithms. The software is optimized to enable processing of typical datasets (2000 micrographs, 200 k - 300 k particles) on a high-end, CPU-based workstation in half a day or less, comparable to GPU-accelerated processing. Jobs can also be scheduled on large computer clusters using flexible run profiles that can be adapted for most computing environments. cisTEM is available for download from cistem.org. © 2018, Grant et al.
Diabetes Changes Symptoms Cluster Patterns in Persons Living With HIV.
Zuniga, Julie Ann; Bose, Eliezer; Park, Jungmin; Lapiz-Bluhm, M Danet; García, Alexandra A
Approximately 10-15% of persons living with HIV (PLWH) have a comorbid diagnosis of diabetes mellitus (DM). Both of these long-term chronic conditions are associated with high rates of symptom burden. The purpose of our study was to describe symptom patterns for PLWH with DM (PLWH+DM) using a large secondary dataset. The prevalence, burden, and bothersomeness of symptoms reported by patients in routine clinic visits during 2015 were assessed using the 20-item HIV Symptom Index. Principal component analysis was used to identify symptom clusters. Three main clusters were identified: (a) neurological/psychological, (b) gastrointestinal/flu-like, and (c) physical changes. The most prevalent symptoms were fatigue, poor sleep, aches, neuropathy, and sadness. When compared to a previous symptom study with PLWH, symptoms clustered differently in our sample of patients with dual diagnoses of HIV and diabetes. Clinicians should appropriately assess symptoms for their patients' comorbid conditions. Copyright © 2017 Association of Nurses in AIDS Care. Published by Elsevier Inc. All rights reserved.
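The clustering step can be sketched with a principal component analysis on a patient-by-symptom matrix. The data below are random stand-ins, not the 20-item HIV Symptom Index responses the study analyzed:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# stand-in burden scores: 200 patients x 20 symptoms
X = rng.normal(size=(200, 20))

pca = PCA(n_components=3).fit(X)
# symptoms that load heavily on the same component form a cluster
loadings = pca.components_.T              # 20 symptoms x 3 components
print(np.argmax(np.abs(loadings), axis=1))  # crude cluster assignment
```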
Comprehensive discovery of noncoding RNAs in acute myeloid leukemia cell transcriptomes.
Zhang, Jin; Griffith, Malachi; Miller, Christopher A; Griffith, Obi L; Spencer, David H; Walker, Jason R; Magrini, Vincent; McGrath, Sean D; Ly, Amy; Helton, Nichole M; Trissal, Maria; Link, Daniel C; Dang, Ha X; Larson, David E; Kulkarni, Shashikant; Cordes, Matthew G; Fronick, Catrina C; Fulton, Robert S; Klco, Jeffery M; Mardis, Elaine R; Ley, Timothy J; Wilson, Richard K; Maher, Christopher A
2017-11-01
To detect diverse and novel RNA species comprehensively, we compared deep small RNA and RNA sequencing (RNA-seq) methods applied to a primary acute myeloid leukemia (AML) sample. We were able to discover previously unannotated small RNAs using deep sequencing of a library method using broader insert size selection. We analyzed the long noncoding RNA (lncRNA) landscape in AML by comparing deep sequencing from multiple RNA-seq library construction methods for the sample that we studied and then integrating RNA-seq data from 179 AML cases. This identified lncRNAs that are completely novel, differentially expressed, and associated with specific AML subtypes. Our study revealed the complexity of the noncoding RNA transcriptome through a combined strategy of strand-specific small RNA and total RNA-seq. This dataset will serve as an invaluable resource for future RNA-based analyses. Copyright © 2017 ISEH – Society for Hematology and Stem Cells. Published by Elsevier Inc. All rights reserved.
Recognition of Arabic Sign Language Alphabet Using Polynomial Classifiers
NASA Astrophysics Data System (ADS)
Assaleh, Khaled; Al-Rousan, M.
2005-12-01
Building an accurate automatic sign language recognition system is of great importance in facilitating efficient communication with deaf people. In this paper, we propose the use of polynomial classifiers as a classification engine for the recognition of Arabic sign language (ArSL) alphabet. Polynomial classifiers have several advantages over other classifiers in that they do not require iterative training, and that they are highly computationally scalable with the number of classes. Based on polynomial classifiers, we have built an ArSL system and measured its performance using real ArSL data collected from deaf people. We show that the proposed system provides superior recognition results when compared with previously published results using ANFIS-based classification on the same dataset and feature extraction methodology. The comparison is shown in terms of the number of misclassified test patterns. The reduction in the rate of misclassified patterns was very significant. In particular, we have achieved a 36% reduction of misclassifications on the training data and 57% on the test data.
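The two advantages claimed (no iterative training, graceful scaling with class count) come from pairing a polynomial feature expansion with a linear model solved in closed form. A minimal sketch; the digits data is a stand-in for the ArSL gesture features, and the exact expansion order used in the paper is not assumed here:

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# stand-in data; the paper used ArSL gesture features, not digit images
X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# 2nd-order polynomial expansion + a ridge classifier solved in closed form:
# no iterative training, and cost grows gracefully with the number of classes
clf = make_pipeline(PolynomialFeatures(degree=2), RidgeClassifier())
clf.fit(Xtr, ytr)
print((clf.predict(Xte) != yte).mean())  # misclassification rate
```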
Automated Axon Counting in Rodent Optic Nerve Sections with AxonJ.
Zarei, Kasra; Scheetz, Todd E; Christopher, Mark; Miller, Kathy; Hedberg-Buenz, Adam; Tandon, Anamika; Anderson, Michael G; Fingert, John H; Abràmoff, Michael David
2016-05-26
We have developed a publicly available tool, AxonJ, which quantifies the axons in optic nerve sections of rodents stained with paraphenylenediamine (PPD). In this study, we compare AxonJ's performance to human experts on 100x and 40x images of optic nerve sections obtained from multiple strains of mice, including mice with defects relevant to glaucoma. AxonJ produced reliable axon counts with high sensitivity of 0.959 and high precision of 0.907, high repeatability of 0.95 when compared to a gold-standard of manual assessments and high correlation of 0.882 to the glaucoma damage staging of a previously published dataset. AxonJ allows analyses that are quantitative, consistent, fully-automated, parameter-free, and rapid on whole optic nerve sections at 40x. As a freely available ImageJ plugin that requires no highly specialized equipment to utilize, AxonJ represents a powerful new community resource augmenting studies of the optic nerve using mice.
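For reference, the two headline metrics are simple ratios of true positives to misses and to false alarms. The counts below are invented to reproduce the quoted values:

```python
# sensitivity = TP / (TP + FN); precision = TP / (TP + FP)
tp, fp, fn = 959, 98, 41     # illustrative counts, not AxonJ's raw tallies
print(tp / (tp + fn))        # ~0.959 sensitivity
print(tp / (tp + fp))        # ~0.907 precision
```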
The (un)reliability of item-level semantic priming effects.
Heyman, Tom; Bruninx, Anke; Hutchison, Keith A; Storms, Gert
2018-04-05
Many researchers have tried to predict semantic priming effects using a myriad of variables (e.g., prime-target associative strength or co-occurrence frequency). The idea is that relatedness varies across prime-target pairs, which should be reflected in the size of the priming effect (e.g., cat should prime dog more than animal does). However, it is only insightful to predict item-level priming effects if they can be measured reliably. Thus, in the present study we examined the split-half and test-retest reliabilities of item-level priming effects under conditions that should discourage the use of strategies. The resulting priming effects proved extremely unreliable, and reanalyses of three published priming datasets revealed similar cases of low reliability. These results imply that previous attempts to predict semantic priming were unlikely to be successful. However, one study with an unusually large sample size yielded more favorable reliability estimates, suggesting that big data, in terms of items and participants, should be the future for semantic priming research.
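A minimal sketch of the split-half approach with the Spearman-Brown correction, on synthetic per-item priming effects (the distributions are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
# per-item priming effects computed on odd vs. even trial halves (stand-ins)
half1 = rng.normal(20, 15, 100)
half2 = 0.2 * half1 + rng.normal(16, 15, 100)   # weakly consistent halves

r = np.corrcoef(half1, half2)[0, 1]
# Spearman-Brown projects the half-test correlation to full-test reliability
reliability = 2 * r / (1 + r)
print(round(r, 2), round(reliability, 2))
```

If this reliability is near zero, as the authors report for typical designs, then no item-level predictor (associative strength, co-occurrence, etc.) can correlate strongly with the measured effects.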
cisTEM, user-friendly software for single-particle image processing
2018-01-01
We have developed new open-source software called cisTEM (computational imaging system for transmission electron microscopy) for the processing of data for high-resolution electron cryo-microscopy and single-particle averaging. cisTEM features a graphical user interface that is used to submit jobs, monitor their progress, and display results. It implements a full processing pipeline including movie processing, image defocus determination, automatic particle picking, 2D classification, ab-initio 3D map generation from random parameters, 3D classification, and high-resolution refinement and reconstruction. Some of these steps implement newly-developed algorithms; others were adapted from previously published algorithms. The software is optimized to enable processing of typical datasets (2000 micrographs, 200 k – 300 k particles) on a high-end, CPU-based workstation in half a day or less, comparable to GPU-accelerated processing. Jobs can also be scheduled on large computer clusters using flexible run profiles that can be adapted for most computing environments. cisTEM is available for download from cistem.org. PMID:29513216
Combined analysis of fourteen nuclear genes refines the Ursidae phylogeny.
Pagès, Marie; Calvignac, Sébastien; Klein, Catherine; Paris, Mathilde; Hughes, Sandrine; Hänni, Catherine
2008-04-01
Despite numerous studies, questions remain about the evolutionary history of Ursidae and additional independent genetic markers were needed to elucidate these ambiguities. For this purpose, we sequenced ten nuclear genes for all the eight extant bear species. By combining these new sequences with those of four other recently published nuclear markers, we provide new insights into the phylogenetic relationships of the Ursidae family members. The hypothesis that the giant panda was the first species to diverge among ursids is definitively confirmed and the precise branching order within the Ursus genus is clarified for the first time. Moreover, our analyses indicate that the American and the Asiatic black bears do not cluster as sister taxa, as had been previously hypothesised. Sun and sloth bears clearly appear as the most basal ursine species but uncertainties about their exact relationships remain. Since our larger dataset did not enable us to clarify this last question, identifying rare genomic changes in bear genomes could be a promising solution for further studies.
Automated Axon Counting in Rodent Optic Nerve Sections with AxonJ
NASA Astrophysics Data System (ADS)
Zarei, Kasra; Scheetz, Todd E.; Christopher, Mark; Miller, Kathy; Hedberg-Buenz, Adam; Tandon, Anamika; Anderson, Michael G.; Fingert, John H.; Abràmoff, Michael David
2016-05-01
We have developed a publicly available tool, AxonJ, which quantifies the axons in optic nerve sections of rodents stained with paraphenylenediamine (PPD). In this study, we compare AxonJ’s performance to human experts on 100x and 40x images of optic nerve sections obtained from multiple strains of mice, including mice with defects relevant to glaucoma. AxonJ produced reliable axon counts with high sensitivity of 0.959 and high precision of 0.907, high repeatability of 0.95 when compared to a gold-standard of manual assessments and high correlation of 0.882 to the glaucoma damage staging of a previously published dataset. AxonJ allows analyses that are quantitative, consistent, fully-automated, parameter-free, and rapid on whole optic nerve sections at 40x. As a freely available ImageJ plugin that requires no highly specialized equipment to utilize, AxonJ represents a powerful new community resource augmenting studies of the optic nerve using mice.
The processing of mispredicted and unpredicted sensory inputs interact differently with attention.
Hsu, Yi-Fang; Hämäläinen, Jarmo A; Waszak, Florian
2018-03-01
Prediction and attention are fundamental brain functions in the service of perception. Interestingly, previous investigations found prediction effects independent of attention in some cases but attention-dependent in other cases. The discrepancy might be related to whether the prediction effect was revealed by comparing mispredicted event (where there is incorrect prediction) or unpredicted event (where there is no precise prediction) against predicted event, which are associated with different precision-weighted prediction error. Here we conducted a joint analysis on four published electroencephalography (EEG) datasets which allow for proper dissociation of mispredicted and unpredicted conditions when there was orthogonal manipulation of prediction and attention. We found that the mispredicted-versus-predicted contrast revealed an attention-independent effect of prediction suppression, whereas the unpredicted-versus-predicted contrast revealed a prediction effect that was reversed by attention on auditory N1. The results suggest that mispredicted and unpredicted processing interact with attention in distinct manners. Copyright © 2018 Elsevier Ltd. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Van de Velde, Joris, E-mail: joris.vandevelde@ugent.be; Department of Radiotherapy, Ghent University, Ghent; Audenaert, Emmanuel
Purpose: To develop contouring guidelines for the brachial plexus (BP) using anatomically validated cadaver datasets. Magnetic resonance imaging (MRI) and computed tomography (CT) were used to obtain detailed visualizations of the BP region, with the goal of achieving maximal inclusion of the actual BP in a small contoured volume while also accommodating for anatomic variations. Methods and Materials: CT and MRI were obtained for 8 cadavers positioned for intensity modulated radiation therapy. 3-dimensional reconstructions of soft tissue (from MRI) and bone (from CT) were combined to create 8 separate enhanced CT project files. Dissection of the corresponding cadavers anatomically validated the reconstructions created. Seven enhanced CT project files were then automatically fitted, separately in different regions, to obtain a single dataset of superimposed BP regions that incorporated anatomic variations. From this dataset, improved BP contouring guidelines were developed. These guidelines were then applied to the 7 original CT project files and also to 1 additional file, left out from the superimposing procedure. The percentage of BP inclusion was compared with the published guidelines. Results: The anatomic validation procedure showed a high level of conformity for the BP regions examined between the 3-dimensional reconstructions generated and the dissected counterparts. Accurate and detailed BP contouring guidelines were developed, which provided corresponding guidance for each level in a clinical dataset. An average margin of 4.7 mm around the anatomically validated BP contour is sufficient to accommodate for anatomic variations. Using the new guidelines, 100% inclusion of the BP was achieved, compared with a mean inclusion of 37.75% when published guidelines were applied. Conclusion: Improved guidelines for BP delineation were developed using combined MRI and CT imaging with validation by anatomic dissection.
Who shares? Who doesn't? Factors associated with openly archiving raw research data.
Piwowar, Heather A
2011-01-01
Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.
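The kind of multivariate model described can be sketched as a logistic regression of sharing status on study attributes. The covariates and coefficients below are fabricated stand-ins for illustration; the real analysis used 124 bibliometric attributes reduced to 15 factors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# stand-in covariates: prior sharing experience (0/1), open access journal
# (0/1), number of NIH grants
X = np.column_stack([rng.integers(0, 2, 500),
                     rng.integers(0, 2, 500),
                     rng.poisson(2, 500)])
logit = 1.5 * X[:, 0] + 0.8 * X[:, 1] + 0.3 * X[:, 2] - 2.0
y = rng.random(500) < 1 / (1 + np.exp(-logit))   # simulated sharing outcome

model = LogisticRegression().fit(X, y)
print(np.exp(model.coef_))   # odds ratios for each predictor
```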
Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets.
Heath, Allison P; Greenway, Matthew; Powell, Raymond; Spring, Jonathan; Suarez, Rafael; Hanley, David; Bandlamudi, Chai; McNerney, Megan E; White, Kevin P; Grossman, Robert L
2014-01-01
As large genomics and phenotypic datasets are becoming more common, it is increasingly difficult for most researchers to access, manage, and analyze them. One possible approach is to provide the research community with several petabyte-scale cloud-based computing platforms containing these data, along with tools and resources to analyze it. Bionimbus is an open source cloud-computing platform that is based primarily upon OpenStack, which manages on-demand virtual machines that provide the required computational resources, and GlusterFS, which is a high-performance clustered file system. Bionimbus also includes Tukey, which is a portal, and associated middleware that provides a single entry point and a single sign on for the various Bionimbus resources; and Yates, which automates the installation, configuration, and maintenance of the software infrastructure required. Bionimbus is used by a variety of projects to process genomics and phenotypic data. For example, it is used by an acute myeloid leukemia resequencing project at the University of Chicago. The project requires several computational pipelines, including pipelines for quality control, alignment, variant calling, and annotation. For each sample, the alignment step requires eight CPUs for about 12 h. BAM file sizes ranged from 5 GB to 10 GB for each sample. Most members of the research community have difficulty downloading large genomics datasets and obtaining sufficient storage and computer resources to manage and analyze the data. Cloud computing platforms, such as Bionimbus, with data commons that contain large genomics datasets, are one choice for broadening access to research data in genomics. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
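The per-sample figures quoted translate directly into cluster requirements. A back-of-envelope for an illustrative cohort size (the cohort size is an assumption):

```python
samples = 100                        # illustrative cohort size
cpu_hours = samples * 8 * 12         # alignment alone: 8 CPUs for ~12 h each
storage_gb = samples * (5 + 10) / 2  # mean BAM size of ~7.5 GB per sample
print(cpu_hours, storage_gb)         # 9600 CPU-hours, 750 GB of BAMs
```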
A correlation comparison between Altmetric Attention Scores and citations for six PLOS journals
Huang, Wenya; Wang, Peiling
2018-01-01
This study considered all articles published in six Public Library of Science (PLOS) journals in 2012 and Web of Science citations for these articles as of May 2015. A total of 2,406 articles were analyzed to examine the relationships between Altmetric Attention Scores (AAS) and Web of Science citations. The AAS for an article, provided by Altmetric, aggregates activity surrounding research outputs in social media (news outlet mentions, tweets, blogs, Wikipedia, etc.). Spearman correlation testing was done on all articles and on articles with AAS. Further analysis compared stratified datasets based on percentile ranks of AAS: top 50%, top 25%, top 10%, and top 1%. Comparisons across the six journals provided additional insights. The results show significant positive correlations, of varied strength, between AAS and citations for all articles and for articles with AAS (or social media mentions), as well as for normalized AAS in the top 50%, top 25%, top 10%, and top 1% datasets. Four of the six PLOS journals, Genetics, Pathogens, Computational Biology, and Neglected Tropical Diseases, show significant positive correlations across all datasets. However, for the two journals with high impact factors, PLOS Biology and Medicine, the results are unexpected: the Medicine articles showed no significant correlations, while the Biology articles tested positive for correlations with the whole dataset and the set with AAS. Both journals published substantially fewer articles than the other four journals. Further research to validate the AAS algorithm, adjust the weighting scheme, and include appropriate social media sources is needed to understand the potential uses and meaning of AAS in different contexts and its relationship to other metrics. PMID:29621253
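A sketch of the stratified Spearman design in Python with scipy; the arrays are synthetic stand-ins for the 2,406 article records, not the PLOS/Web of Science data.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    aas = rng.gamma(shape=1.5, scale=10.0, size=2406)   # synthetic AAS
    cites = rng.poisson(lam=1 + 0.3 * aas)              # synthetic citations

    rho, p = spearmanr(aas, cites)
    print(f"all articles: rho={rho:.2f}, p={p:.3g}")
    for pct in (50, 75, 90, 99):                        # top 50/25/10/1%
        mask = aas >= np.percentile(aas, pct)
        rho, p = spearmanr(aas[mask], cites[mask])
        print(f"top {100 - pct}% by AAS: rho={rho:.2f}, p={p:.3g}")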
Oscar, Nels; Fox, Pamela A; Croucher, Racheal; Wernick, Riana; Keune, Jessica; Hooker, Karen
2017-09-01
Social scientists need practical methods for harnessing large, publicly available datasets that inform the social context of aging. We describe our development of a semi-automated text coding method and use a content analysis of Alzheimer's disease (AD) and dementia portrayal on Twitter to demonstrate its use. The approach improves feasibility of examining large publicly available datasets. Machine learning techniques modeled stigmatization expressed in 31,150 AD-related tweets collected via Twitter's search API based on 9 AD-related keywords. Two researchers manually coded 311 random tweets on 6 dimensions. This input from 1% of the dataset was used to train a classifier against the tweet text and code the remaining 99% of the dataset. Our automated process identified that 21.13% of the AD-related tweets used AD-related keywords to perpetuate public stigma, which could impact stereotypes and negative expectations for individuals with the disease and increase "excess disability". This technique could be applied to questions in social gerontology related to how social media outlets reflect and shape attitudes bearing on other developmental outcomes. Recommendations for the collection and analysis of large Twitter datasets are discussed.
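One plausible shape of the train-on-1%, classify-the-99% step, sketched with scikit-learn; the study's actual features, model, and six coding dimensions may differ, and the tweets below are toy stand-ins.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hand-coded ~1% sample (toy examples; label 1 = stigmatizing).
    coded_texts = ["joke tweet about alzheimers", "news on AD biomarker trial"]
    coded_labels = [1, 0]
    remaining_texts = ["another AD-related tweet"]    # the other ~99%

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(coded_texts, coded_labels)
    print(clf.predict(remaining_texts))              # predicted codes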
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data
Goldstein, Markus; Uchida, Seiichi
2016-01-01
Anomaly detection is the process of identifying unexpected items or events in datasets, which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied to unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection and fraud detection, as well as in the life science and medical domains. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to provide a new, well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, the computational effort, the impact of parameter settings, and the global/local anomaly detection behavior are outlined. In conclusion, we offer advice on algorithm selection for typical real-world tasks. PMID:27093601
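A toy version of such a benchmark, sketched with two scikit-learn detectors and ROC AUC scoring (the study itself covers 19 algorithms and 10 datasets); the data are synthetic with injected global anomalies.

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)
    normal = rng.normal(0, 1, size=(980, 5))
    anomalies = rng.normal(4, 1, size=(20, 5))     # global anomalies
    X = np.vstack([normal, anomalies])
    y = np.r_[np.zeros(980), np.ones(20)]          # ground-truth labels

    iso = IsolationForest(random_state=0).fit(X)
    # score_samples is higher for normal points, so negate for anomaly score.
    print("IsolationForest AUC:", roc_auc_score(y, -iso.score_samples(X)))

    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X)
    print("LOF AUC:", roc_auc_score(y, -lof.negative_outlier_factor_))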
Differentially Private Histogram Publication For Dynamic Datasets: An Adaptive Sampling Approach
Li, Haoran; Jiang, Xiaoqian; Xiong, Li; Liu, Jinfei
2016-01-01
Differential privacy has recently become a de facto standard for private statistical data release. Many algorithms have been proposed to generate differentially private histograms or synthetic data. However, most of them focus on “one-time” release of a static dataset and do not adequately address the increasing need of releasing series of dynamic datasets in real time. A straightforward application of existing histogram methods on each snapshot of such dynamic datasets will incur high accumulated error due to the composability of differential privacy and correlations or overlapping users between the snapshots. In this paper, we address the problem of releasing series of dynamic datasets in real time with differential privacy, using a novel adaptive distance-based sampling approach. Our first method, DSFT, uses a fixed distance threshold and releases a differentially private histogram only when the current snapshot is sufficiently different from the previous one, i.e., with a distance greater than a predefined threshold. Our second method, DSAT, further improves DSFT and uses a dynamic threshold adaptively adjusted by a feedback control mechanism to capture the data dynamics. Extensive experiments on real and synthetic datasets demonstrate that our approach achieves better utility than baseline methods and existing state-of-the-art methods. PMID:26973795
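A much-simplified sketch of the fixed-threshold idea (DSFT): re-release the previous noisy histogram when the new snapshot is close, and spend budget on a fresh Laplace release when it is not. The real method also accounts for the privacy cost of the distance test itself; the threshold and epsilon below are arbitrary illustrative choices.

    import numpy as np

    def laplace_hist(counts, epsilon):
        # Laplace mechanism for a histogram (L1 sensitivity 1 per user).
        return counts + np.random.laplace(0.0, 1.0 / epsilon, size=counts.shape)

    def dsft_release(snapshots, eps_per_release, threshold):
        released, last_raw, last_noisy = [], None, None
        for counts in snapshots:
            if last_raw is None or np.abs(counts - last_raw).sum() > threshold:
                last_raw = counts.copy()
                last_noisy = laplace_hist(counts, eps_per_release)
            released.append(last_noisy)   # re-release if the snapshot is close
        return released

    snaps = [np.array([10.0, 12, 8]), np.array([11.0, 12, 9]),
             np.array([30.0, 5, 2])]
    for h in dsft_release(snaps, eps_per_release=0.5, threshold=10):
        print(np.round(h, 1))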
Abràmoff, Michael D; Niemeijer, Meindert; Suttorp-Schulten, Maria S A; Viergever, Max A; Russell, Stephen R; van Ginneken, Bram
2008-02-01
To evaluate the performance of a system for automated detection of diabetic retinopathy in digital retinal photographs, built from published algorithms, in a large, representative screening population. We conducted a retrospective analysis of 10,000 consecutive patient visits, specifically exams (four retinal photographs, two left and two right) from 5,692 unique patients from the EyeCheck diabetic retinopathy screening project, imaged with three types of cameras at 10 centers. Inclusion criteria included no previous diagnosis of diabetic retinopathy, no previous visit to an ophthalmologist for a dilated eye exam, and both eyes photographed. One of three retinal specialists evaluated each exam as unacceptable quality, no referable retinopathy, or referable retinopathy. We then selected exams with sufficient image quality and determined the presence or absence of referable retinopathy. Outcome measures included the area under the receiver operating characteristic curve, the number needed to miss one case (NNM), and the type of false negative. Total area under the receiver operating characteristic curve was 0.84, and NNM was 80 at a sensitivity of 0.84 and a specificity of 0.64. At this operating point, 7,689 of 10,000 exams had sufficient image quality; 4,648 of 7,689 (60%) were true negatives, 59 of 7,689 (0.8%) were false negatives, 319 of 7,689 (4%) were true positives, and 2,581 of 7,689 (33%) were false positives. Twenty-seven percent of false negatives contained large hemorrhages and/or neovascularizations. Automated detection of diabetic retinopathy using published algorithms cannot yet be recommended for clinical practice. However, performance is such that evaluation on validated, publicly available datasets should be pursued. If algorithms can be improved, such a system may in the future lead to improved prevention of blindness and vision loss in patients with diabetes.
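As a check on the arithmetic, the reported sensitivity and specificity follow directly from the counts quoted above; a two-line Python verification:

    # Sensitivity and specificity from the reported confusion counts.
    tp, fn, tn, fp = 319, 59, 4648, 2581
    print(round(tp / (tp + fn), 2))   # sensitivity -> 0.84
    print(round(tn / (tn + fp), 2))   # specificity -> 0.64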
Mixed Model Association with Family-Biased Case-Control Ascertainment.
Hayeck, Tristan J; Loh, Po-Ru; Pollack, Samuela; Gusev, Alexander; Patterson, Nick; Zaitlen, Noah A; Price, Alkes L
2017-01-05
Mixed models have become the tool of choice for genetic association studies; however, standard mixed model methods may be poorly calibrated or underpowered under family sampling bias and/or case-control ascertainment. Previously, we introduced a liability threshold-based mixed model association statistic (LTMLM) to address case-control ascertainment in unrelated samples. Here, we consider family-biased case-control ascertainment, where case and control subjects are ascertained non-randomly with respect to family relatedness. Previous work has shown that this type of ascertainment can severely bias heritability estimates; we show here that it also impacts mixed model association statistics. We introduce a family-based association statistic (LT-Fam) that is robust to this problem. Similar to LTMLM, LT-Fam is computed from posterior mean liabilities (PML) under a liability threshold model; however, LT-Fam uses published narrow-sense heritability estimates to avoid the problem of biased heritability estimation, enabling correct calibration. In simulations with family-biased case-control ascertainment, LT-Fam was correctly calibrated (average χ² = 1.00-1.02 for null SNPs), whereas the Armitage trend test (ATT), standard mixed model association (MLM), and case-control retrospective association test (CARAT) were mis-calibrated (e.g., average χ² = 0.50-1.22 for MLM, 0.89-2.65 for CARAT). LT-Fam also attained higher power than other methods in some settings. In 1,259 type 2 diabetes-affected case subjects and 5,765 control subjects from the CARe cohort, downsampled to induce family-biased ascertainment, LT-Fam was correctly calibrated whereas ATT, MLM, and CARAT were again mis-calibrated. Our results highlight the importance of modeling family sampling bias in case-control datasets with related samples. Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
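A minimal Python sketch of the calibration check implied by these numbers: the mean 1-df chi-square across presumed-null SNPs should sit near 1.0 for a well-calibrated statistic. The simulated statistics below are an idealized null, not the study's data.

    import numpy as np

    # Mean chi-square over null SNPs; values near 1.0 indicate correct
    # calibration, inflation (>1) or deflation (<1) indicate mis-calibration.
    null_stats = np.random.chisquare(df=1, size=100_000)
    print(round(float(np.mean(null_stats)), 3))   # ~= 1.0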
Scott, Frank I; Mamtani, Ronac; Haynes, Kevin; Goldberg, David S; Mahmoud, Najjia N.; Lewis, James D
2016-01-01
PURPOSE Epidemiological data on adhesion-related complications following intra-abdominal surgery are limited. We tested the accuracy of recording of these surgeries and complications within The Health Improvement Network (THIN), a primary care database within the United Kingdom. METHODS Individuals within THIN from 1995–2011 with an incident intra-abdominal surgery and subsequent bowel obstruction (SBO) or adhesiolysis were identified using diagnostic codes. To compute positive predictive values (PPVs), requests were sent to the treating physicians of patients with these diagnostic codes to confirm the surgery, SBO, or adhesiolysis code. Completeness of recording was estimated by comparing observed surgical rates within THIN to expected rates derived from the Hospital Episode Statistics (HES) dataset within England. Cumulative incidence rates of adhesion-related complications at 5 years were compared to a previously published cohort within Scotland. RESULTS 217 of 245 (89%) questionnaires were returned (180 SBO and 37 adhesiolysis). The PPV of codes for surgery was 94.5% (95% CI: 91-97%). 88.8% of procedure types were correctly coded. The PPVs for SBO and adhesiolysis were 86.1% (95% CI: 80-91%) and 89.2% (95% CI: 75-97%), respectively. Colectomy, appendectomy, and cholecystectomy rates within THIN were 99%, 95%, and 84% of the rates observed in national HES data, respectively. Cumulative incidence rates of adhesion-related complications following colectomy, appendectomy, and small bowel surgery were similar to those published previously. CONCLUSIONS Surgical procedures, SBO, and adhesiolysis can be accurately identified within THIN using diagnostic codes. THIN represents a new tool for assessing patient-specific risk factors for adhesion-related complications and long-term outcomes. PMID:26860870
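A sketch of the PPV calculation with a binomial confidence interval, using statsmodels; 205/217 is an illustrative split that happens to reproduce the reported 94.5%, not the study's exact tabulation.

    from statsmodels.stats.proportion import proportion_confint

    confirmed, returned = 205, 217        # illustrative counts
    ppv = confirmed / returned
    lo, hi = proportion_confint(confirmed, returned, alpha=0.05,
                                method="wilson")
    print(f"PPV = {ppv:.1%} (95% CI {lo:.0%}-{hi:.0%})")   # ~94.5% (91%-97%)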
Walsh, Kyle M; Anderson, Erik; Hansen, Helen M; Decker, Paul A; Kosel, Matt L; Kollmeyer, Thomas; Rice, Terri; Zheng, Shichun; Xiao, Yuanyuan; Chang, Jeffrey S; McCoy, Lucie S; Bracci, Paige M; Wiemels, Joe L; Pico, Alexander R; Smirnov, Ivan; Lachance, Daniel H; Sicotte, Hugues; Eckel-Passow, Jeanette E; Wiencke, John K; Jenkins, Robert B; Wrensch, Margaret R
2013-02-01
Genomewide association studies (GWAS) and candidate-gene studies have implicated single-nucleotide polymorphisms (SNPs) in at least 45 different genes as putative glioma risk factors. Attempts to validate these associations have yielded variable results and few genetic risk factors have been consistently replicated. We conducted a case-control study of Caucasian glioma cases and controls from the University of California San Francisco (810 cases, 512 controls) and the Mayo Clinic (852 cases, 789 controls) in an attempt to replicate previously reported genetic risk factors for glioma. Sixty SNPs selected from the literature (eight from GWAS and 52 from candidate-gene studies) were successfully genotyped on an Illumina custom genotyping panel. Eight SNPs in/near seven different genes (TERT, EGFR, CCDC26, CDKN2A, PHLDB1, RTEL1, TP53) were significantly associated with glioma risk in the combined dataset (P < 0.05), with all associations in the same direction as in previous reports. Several SNP associations showed considerable differences across histologic subtype. All eight successfully replicated associations were first identified by GWAS, although none of the putative risk SNPs from candidate-gene studies was associated in the full case-control sample (all P values > 0.05). Although several confirmed associations are located near genes long known to be involved in gliomagenesis (e.g., EGFR, CDKN2A, TP53), these associations were first discovered by the GWAS approach and are in noncoding regions. These results highlight that the deficiencies of the candidate-gene approach lay in selecting both appropriate genes and relevant SNPs within these genes.
Morace, Jennifer L.
2007-01-01
Growth and decomposition of dense blooms of Aphanizomenon flos-aquae in Upper Klamath Lake frequently cause extreme water-quality conditions that have led to critical fishery concerns for the region, including the listing of two species of endemic suckers as endangered. The Bureau of Reclamation has asked the U.S. Geological Survey (USGS) to examine water-quality data collected by the Klamath Tribes for relations with lake level. This analysis evaluates a 17-year dataset (1990-2006) and updates a previous USGS analysis of a 5-year dataset (1990-94). Both univariate hypothesis testing and multivariable analyses evaluated using an information-theoretic approach revealed the same result: no one overarching factor emerged from the data, nor could any single factor be removed from consideration. The lack of statistically significant, strong correlations between water-quality conditions, lake level, and climatic factors does not necessarily show that these factors do not influence water-quality conditions; it is more likely that these conditions work in conjunction with each other to affect water quality. A few different conclusions could be drawn from the larger dataset than from the smaller dataset examined in 1996, but for the most part, the outcome was the same. Using an observational dataset that may not capture all variation in water-quality conditions (samples were collected at a two-week interval) and that has a limited range of conditions for evaluation (confined to the operational range of the lake) may have confounded the exploration of explanatory factors. In the end, all years experienced some variation in poor water-quality conditions, either in the timing of occurrence of the poor conditions or in their duration. The 17-year dataset simply provided 17 different patterns of lake level, cumulative degree-days, timing of bloom onset, and poor water-quality conditions, with no overriding causal factor emerging from the variations. Water-quality conditions were evaluated for their potential to be harmful to the endangered sucker species on the basis of high-stress thresholds: water temperature values greater than 28 degrees Celsius, dissolved-oxygen concentrations less than 4 milligrams per liter, and pH values greater than 9.7. Few water temperatures were greater than 28 degrees Celsius, and dissolved-oxygen concentrations less than 4 milligrams per liter generally were recorded in mid to late summer. In contrast, high pH values were more frequent, occurring earlier in the season and in parallel with growth of the algal bloom. The 10 hypotheses relating water-quality variables, lake level, and climatic factors from the earlier USGS study were tested in this analysis on the larger 1990-2006 dataset. These hypotheses proposed relations between lake level and chlorophyll-a, pH, dissolved oxygen, total phosphorus, and water temperature. As in the previous study, no evidence was found in the larger dataset for any of these relations based on a seasonal (May-October) distribution. When analyzing only the June data, the previous 5-year study did find evidence for three hypotheses relating lake level to the onset of the bloom, chlorophyll-a concentrations, and the frequency of high pH values in June. These hypotheses were not supported by the 1990-2006 dataset, but the two hypotheses related to cumulative degree-days from the previous study were supported: chlorophyll-a concentrations were lower and onset of the algal bloom was delayed when spring air temperatures were cooler.
Other relations between water-quality variables and cumulative degree-days were not significant. In an attempt to identify interrelations among variables not detected by univariate analysis, multiple regressions were performed between lakewide measures of low dissolved-oxygen concentrations or high pH values in July and August and six physical and biological variables (peak chlorophyll-a concentrations, degree-days, water temperature, median October-May discharge
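A sketch, in Python with statsmodels, of the information-theoretic model ranking this kind of analysis typically uses (AIC over candidate regressions). The variables and the synthetic 17-year table are illustrative stand-ins, not the USGS data.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    df = pd.DataFrame({                      # 17 synthetic "years"
        "min_do": rng.normal(5, 1, 17),      # lakewide minimum DO (mg/L)
        "lake_level": rng.normal(1262, 0.5, 17),
        "degree_days": rng.normal(900, 80, 17),
        "peak_chla": rng.normal(200, 40, 17),
    })
    candidates = ["min_do ~ lake_level",
                  "min_do ~ degree_days",
                  "min_do ~ lake_level + degree_days + peak_chla"]
    fits = {f: smf.ols(f, data=df).fit() for f in candidates}
    for f, res in sorted(fits.items(), key=lambda kv: kv[1].aic):
        print(f"AIC = {res.aic:7.1f}   {f}")   # lower AIC = better support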
Bidirectional segmentation of prostate capsule from ultrasound volumes: an improved strategy
NASA Astrophysics Data System (ADS)
Wei, Liyang; Narayanan, Ramkrishnan; Kumar, Dinesh; Fenster, Aaron; Barqawi, Albaha; Werahera, Priya; Crawford, E. David; Suri, Jasjit S.
2008-03-01
Prostate volume is an indirect indicator for several prostate diseases. Volume estimation is a desired requirement during prostate biopsy, therapy, and clinical follow-up, and image segmentation is thus necessary. Previously, the discrete dynamic contour (DDC) was implemented unidirectionally along the orthogonal axis on a slice-by-slice basis for prostate boundary estimation. This suffered from the drawback that it needed stopping criteria during the propagation of the segmentation from slice to slice. To overcome this, an axial DDC was implemented, but it suffered from the fact that the central axis never remains fixed and wobbles during propagation from slice to slice, producing a multi-fold reconstructed surface. This paper presents a bidirectional DDC approach that removes both drawbacks. Our bidirectional DDC protocol was tested on a clinical dataset of 28 3-D ultrasound volumes acquired using a side-fire Philips transrectal ultrasound probe. We demonstrate that the orthogonal bidirectional DDC strategy achieved the most accurate volume estimation compared with the previously published orthogonal unidirectional and axial DDC methods. Compared to ground truth, the mean volume estimation errors were 18.48%, 9.21%, and 7.82% for the unidirectional, axial, and bidirectional DDC methods, respectively. The segmentation architecture is implemented in Visual C++ in the Windows environment.
Carbon footprint of aerobic biological treatment of winery wastewater.
Rosso, D; Bolzonella, D
2009-01-01
The carbon associated with wastewater and its treatment accounts for approximately 6% of the global carbon balance. Within the wastewater treatment industry, winery wastewater is a minor contributor, although it can have a major impact in wine-producing regions. Typically, winery wastewater is treated by biological processes, such as the activated sludge process. Biomass produced during treatment is usually disposed of directly, i.e. without digestion or other anaerobic processes. We applied our previously published model for carbon-footprint calculation to the areas worldwide producing more than 10^6 m^3 of wine yearly (i.e., France, Italy, Spain, California, Argentina, Australia, China, and South Africa). Datasets on wine production from the Food and Agriculture Organisation were processed and wastewater flow rates calculated with assumptions based on our previous experience. Results show that wine production, and hence the calculated wastewater flow, was fairly constant in the period 2005-2007. Nevertheless, treatment process efficiency and energy conservation may play a significant role in the overall carbon footprint. We performed a sensitivity analysis on the efficiency of the aeration process (αSOTE per unit depth, or αSOTE/Z) in the biological treatment operations and showed significant margin for improvement. Our results show that the carbon-footprint reduction achievable via improved aeration efficiency is in the range of 8.1 to 12.3%.
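Illustrative arithmetic only: the flow-rate step described above amounts to scaling production volumes by an assumed wastewater-to-wine ratio. Both numbers below are placeholders, not the paper's values.

    # Converting an annual wine production volume to a wastewater volume
    # with an assumed ratio (hypothetical figures).
    wine_m3_per_year = 45_000_000
    wastewater_per_wine = 1.5      # assumed m^3 of wastewater per m^3 of wine
    print(f"{wine_m3_per_year * wastewater_per_wine:.2e} m^3/yr of wastewater")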
Multi-scale graph-cut algorithm for efficient water-fat separation.
Berglund, Johan; Skorpil, Mikael
2017-09-01
To improve the accuracy and robustness to noise of water-fat separation by unifying the multi-scale and graph-cut based approaches to B0 correction. A previously proposed water-fat separation algorithm that corrects for B0 field inhomogeneity in 3D by a single quadratic pseudo-Boolean optimization (QPBO) graph cut was incorporated into a multi-scale framework, where field map solutions are propagated from coarse to fine scales for voxels that are not resolved by the graph cut. The accuracy of the single-scale and multi-scale QPBO algorithms was evaluated against benchmark reference datasets. The robustness to noise was evaluated by adding noise to the input data prior to water-fat separation. Both algorithms achieved the highest accuracy when compared with seven previously published methods, while computation times were acceptable for implementation in clinical routine. The multi-scale algorithm was more robust to noise than the single-scale algorithm, while causing only a small increase (+10%) in reconstruction time. The proposed 3D multi-scale QPBO algorithm offers accurate water-fat separation, robustness to noise, and fast reconstruction. The software implementation is freely available to the research community. Magn Reson Med 78:941-949, 2017.
BLIND ordering of large-scale transcriptomic developmental timecourses.
Anavy, Leon; Levin, Michal; Khair, Sally; Nakanishi, Nagayasu; Fernandez-Valverde, Selene L; Degnan, Bernard M; Yanai, Itai
2014-03-01
RNA-Seq enables the efficient transcriptome sequencing of many samples from small amounts of material, but the analysis of these data remains challenging. In particular, in developmental studies, RNA-Seq is challenged by the morphological staging of samples, such as embryos, since these often lack clear markers at any particular stage. In such cases, the automatic identification of the stage of a sample would enable previously infeasible experimental designs. Here we present the 'basic linear index determination of transcriptomes' (BLIND) method for ordering samples comprising different developmental stages. The method is an implementation of a traveling salesman algorithm to order the transcriptomes according to their inter-relationships as defined by principal components analysis. To establish the direction of the ordered samples, we show that an appropriate indicator is the entropy of transcriptomic gene expression levels, which increases over developmental time. Using BLIND, we correctly recover the annotated order of previously published embryonic transcriptomic timecourses for frog, mosquito, fly and zebrafish. We further demonstrate the efficacy of BLIND by collecting 59 embryos of the sponge Amphimedon queenslandica and ordering their transcriptomes according to developmental stage. BLIND is thus useful in establishing the temporal order of samples within large datasets and is of particular relevance to the study of organisms with asynchronous development and when morphological staging is difficult.
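A compact sketch of the BLIND recipe: PCA, a traveling-salesman-style path through the samples (a greedy nearest-neighbor tour here, for brevity), and expression entropy to orient the path. Synthetic data; this is not the authors' implementation.

    import numpy as np
    from sklearn.decomposition import PCA

    def greedy_path(points):
        # Nearest-neighbor tour: a cheap stand-in for a full TSP solver.
        path, unvisited = [0], list(range(1, len(points)))
        while unvisited:
            last = points[path[-1]]
            nxt = min(unvisited, key=lambda i: np.linalg.norm(points[i] - last))
            path.append(nxt)
            unvisited.remove(nxt)
        return path

    def entropy(profile):
        p = profile / profile.sum()
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def blind_order(expr):              # expr: samples x genes, nonnegative
        pcs = PCA(n_components=2).fit_transform(np.log1p(expr))
        order = greedy_path(pcs)
        if entropy(expr[order[0]]) > entropy(expr[order[-1]]):
            order = order[::-1]         # orient so entropy rises with time
        return order

    demo = np.random.default_rng(3).gamma(2.0, 1.0, size=(10, 50))
    print(blind_order(demo))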
Damage and protection cost curves for coastal floods within the 600 largest European cities
Prahl, Boris F.; Boettle, Markus; Costa, Luís; Kropp, Jürgen P.; Rybski, Diego
2018-01-01
The economic assessment of the impacts of storm surges and sea-level rise in coastal cities requires high-level information on the damage and protection costs associated with varying flood heights. We provide a systematically and consistently calculated dataset of macroscale damage and protection cost curves for the 600 largest European coastal cities, opening up a wide range of applications. Offering the first comprehensive dataset to include the costs of dike protection, we provide the underpinning information needed to run comparative assessments of the costs and benefits of coastal adaptation. Aggregate cost curves for coastal flooding at the city level are commonly regarded as by-products of impact assessments and are generally not published as a standalone dataset. Hence, our work also aims at initiating a more critical discussion on the availability and derivation of cost curves. PMID:29557944
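Operationally, a cost curve of this kind is a lookup from flood height to damage; a minimal sketch with illustrative values (not the published dataset's):

    import numpy as np

    # Reading a damage estimate off a macroscale cost curve by interpolation.
    flood_height_m = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
    damage_meur = np.array([0.0, 10.0, 40.0, 160.0, 380.0])
    print(np.interp(1.5, flood_height_m, damage_meur))   # -> 100.0 (M EUR)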
Evaluation of nine popular de novo assemblers in microbial genome assembly.
Forouzan, Esmaeil; Maleki, Masoumeh Sadat Mousavi; Karkhane, Ali Asghar; Yakhchali, Bagher
2017-12-01
Next generation sequencing (NGS) technologies are revolutionizing biology, with Illumina being the most popular NGS platform. Short-read assembly is a critical part of most genome studies using NGS. Hence, in this study, the performance of nine well-known assemblers was evaluated in the assembly of seven different microbial genomes. The effects of different read coverages and k-mer parameters on assembly quality were also evaluated, on both simulated and actual read datasets. Our results show that the performance of assemblers on real and simulated datasets can differ significantly, mainly because of coverage bias. On actual read datasets, for all studied read coverages (7×, 25× and 100×), SPAdes and IDBA-UD clearly outperformed the other assemblers on the NGA50 and accuracy metrics. Velvet was the most conservative assembler, with the lowest NGA50 and error rate.
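For reference, NGA50 is an alignment-aware variant of NG50 (tools such as QUAST first break contigs at misassemblies before computing it). A plain NG50 in Python:

    def ng50(contig_lengths, genome_size):
        # Length L such that contigs of length >= L cover half the genome.
        total = 0
        for length in sorted(contig_lengths, reverse=True):
            total += length
            if total >= genome_size / 2:
                return length
        return 0   # assembly covers less than half the genome

    print(ng50([150_000, 90_000, 40_000, 10_000], genome_size=400_000))  # 90000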
The International Human Epigenome Consortium Data Portal.
Bujold, David; Morais, David Anderson de Lima; Gauthier, Carol; Côté, Catherine; Caron, Maxime; Kwan, Tony; Chen, Kuang Chung; Laperle, Jonathan; Markovits, Alexei Nordell; Pastinen, Tomi; Caron, Bryan; Veilleux, Alain; Jacques, Pierre-Étienne; Bourque, Guillaume
2016-11-23
The International Human Epigenome Consortium (IHEC) coordinates the production of reference epigenome maps through the characterization of the regulome, methylome, and transcriptome from a wide range of tissues and cell types. To define conventions ensuring the compatibility of datasets and establish an infrastructure enabling data integration, analysis, and sharing, we developed the IHEC Data Portal (http://epigenomesportal.ca/ihec). The portal provides access to >7,000 reference epigenomic datasets, generated from >600 tissues, which have been contributed by seven international consortia: ENCODE, NIH Roadmap, CEEHRC, Blueprint, DEEP, AMED-CREST, and KNIH. The portal enhances the utility of these reference maps by facilitating the discovery, visualization, analysis, download, and sharing of epigenomics data. The IHEC Data Portal is the official source to navigate through IHEC datasets and represents a strategy for unifying the distributed data produced by international research consortia.
The Brainomics/Localizer database.
Papadopoulos Orfanos, Dimitri; Michel, Vincent; Schwartz, Yannick; Pinel, Philippe; Moreno, Antonio; Le Bihan, Denis; Frouin, Vincent
2017-01-01
The Brainomics/Localizer database exposes part of the data collected by the in-house Localizer project, which planned to acquire four types of data from volunteer research subjects: anatomical MRI scans, functional MRI data, behavioral and demographic data, and DNA sampling. Over the years, this local project has been collecting such data from hundreds of subjects. We had selected 94 of these subjects for their complete datasets, including all four types of data, as the basis for a prior publication; the Brainomics/Localizer database publishes the data associated with these 94 subjects. Since regulatory rules prevent us from making genetic data available for download, the database serves only anatomical MRI scans, functional MRI data, behavioral and demographic data. To publish this set of heterogeneous data, we use dedicated software based on the open-source CubicWeb semantic web framework. Through genericity in the data model and flexibility in the display of data (web pages, CSV, JSON, XML), CubicWeb helps us expose these complex datasets in original and efficient ways.
Data Publication: The Evolving Lifecycle
NASA Astrophysics Data System (ADS)
Studwell, S.; Elliott, J.; Anderson, A.
2015-12-01
Datasets are recognized as valuable information entities in their own right that, now and in the future, need to be available for citation, discovery, retrieval, and reuse. The U.S. Department of Energy's Office of Scientific and Technical Information (OSTI) provides Digital Object Identifiers (DOIs) to DOE-funded data through partnership with DataCite. The Geothermal Data Repository (GDR) has been using OSTI's Data ID Service since summer 2014 and is a success story for data publishing in several different ways. This presentation attributes the initial success to the insistence of DOE's Geothermal Technologies Office on detailed planning, robust data curation, and submitter participation. OSTI widely disseminates these data products across both U.S. and international platforms and continually enhances the Data ID Service to facilitate better linkage between published literature, supplementary data components, and the underlying datasets within the structure of the GDR repository. Issues of granularity in DOI assignment, the role of new federal government guidelines on public access to digital data, and the challenges still ahead will be addressed.
McCann, Liza J; Arnold, Katie; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Beard, Laura; Beresford, Michael W; Wedderburn, Lucy R
2014-01-01
Juvenile dermatomyositis (JDM) is a rare but severe autoimmune inflammatory myositis of childhood. International collaboration is essential in order to undertake clinical trials, understand the disease and improve long-term outcome. The aim of this study was to propose from existing collaborative initiatives a preliminary minimal dataset for JDM. This will form the basis of the future development of an international consensus-approved minimum core dataset to be used both in clinical care and inform research, allowing integration of data between centres. A working group of internationally-representative JDM experts was formed to develop a provisional minimal dataset. Clinical and laboratory variables contained within current national and international collaborative databases of patients with idiopathic inflammatory myopathies were scrutinised. Judgements were informed by published literature and a more detailed analysis of the Juvenile Dermatomyositis Cohort Biomarker Study and Repository, UK and Ireland. A provisional minimal JDM dataset has been produced, with an associated glossary of definitions. The provisional minimal dataset will request information at time of patient diagnosis and during on-going prospective follow up. At time of patient diagnosis, information will be requested on patient demographics, diagnostic criteria and treatments given prior to diagnosis. During on-going prospective follow-up, variables will include the presence of active muscle or skin disease, major organ involvement or constitutional symptoms, investigations, treatment, physician global assessments and patient reported outcome measures. An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients to provide a greater understanding of this disease. This preliminary dataset can now be developed into a consensus-approved minimum core dataset and tested in a wider setting with the aim of achieving international agreement. PMID:25075205
So many genes, so little time: A practical approach to divergence-time estimation in the genomic era
Smith, Stephen A; Brown, Joseph W; Walker, Joseph F
2018-01-01
Phylogenomic datasets have been successfully used to address questions involving evolutionary relationships, patterns of genome structure, signatures of selection, and gene and genome duplications. However, despite the recent explosion in genomic and transcriptomic data, the utility of these data sources for efficient divergence-time inference remains unexamined. Phylogenomic datasets pose two distinct problems for divergence-time estimation: (i) the volume of data makes inference of the entire dataset intractable, and (ii) the extent of underlying topological and rate heterogeneity across genes makes model mis-specification a real concern. "Gene shopping", wherein a phylogenomic dataset is winnowed to a set of genes with desirable properties, represents an alternative approach that holds promise in alleviating these issues. We implemented an approach for phylogenomic datasets (available in SortaDate) that filters genes by three criteria: (i) clock-likeness, (ii) reasonable tree length (i.e., discernible information content), and (iii) least topological conflict with a focal species tree (presumed to have already been inferred). Such a winnowing procedure ensures that errors associated with model (both clock and topology) mis-specification are minimized, therefore reducing error in divergence-time estimation. We demonstrated the efficacy of this approach through simulation and applied it to published animal (Aves, Diplopoda, and Hymenoptera) and plant (carnivorous Caryophyllales, broad Caryophyllales, and Vitales) phylogenomic datasets. By quantifying rate heterogeneity across both genes and lineages we found that every empirical dataset examined included genes with clock-like, or nearly clock-like, behavior. Moreover, many datasets had genes that were clock-like, exhibited reasonable evolutionary rates, and were mostly compatible with the species tree. We identified overlap in age estimates when analyzing these filtered genes under strict clock and uncorrelated lognormal (UCLN) models. However, this overlap was often due to imprecise estimates from the UCLN model. We find that "gene shopping" can be an efficient approach to divergence-time inference for phylogenomic datasets that may otherwise be characterized by extensive gene tree heterogeneity. PMID:29772020
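A minimal sketch of the winnowing step: given per-gene statistics of the three kinds SortaDate computes, keep informative, concordant genes and rank by clock-likeness. The thresholds and numbers below are illustrative, not SortaDate's defaults.

    genes = [  # (name, root_to_tip_variance, tree_length, share_of_concordant_bipartitions)
        ("g1", 0.0004, 1.8, 0.95),
        ("g2", 0.0090, 0.1, 0.90),   # too little signal (short tree)
        ("g3", 0.0006, 2.2, 0.40),   # high conflict with the species tree
    ]
    shopped = [g for g in genes
               if g[2] > 0.5 and g[3] > 0.8]   # tree length + concordance
    shopped.sort(key=lambda g: g[1])           # most clock-like first
    print([g[0] for g in shopped])             # -> ['g1']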
Changes to the Fossil Record of Insects through Fifteen Years of Discovery
Nicholson, David B.; Mayhew, Peter J.; Ross, Andrew J.
2015-01-01
The first and last occurrences of hexapod families in the fossil record are compiled from publications up to end-2009. The major features of these data are compared with those of previous datasets (1993 and 1994). About a third of families (>400) are new to the fossil record since 1994, over half of the earlier, existing families have experienced changes in their known stratigraphic range and only about ten percent have unchanged ranges. Despite these significant additions to knowledge, the broad pattern of described richness through time remains similar, with described richness increasing steadily through geological history and a shift in dominant taxa, from Palaeoptera and Polyneoptera to Paraneoptera and Holometabola, after the Palaeozoic. However, after detrending, described richness is not well correlated with the earlier datasets, indicating significant changes in shorter-term patterns. There is reduced Palaeozoic richness, peaking at a different time, and a less pronounced Permian decline. A pronounced Triassic peak and decline is shown, and the plateau from the mid Early Cretaceous to the end of the period remains, albeit at substantially higher richness compared to earlier datasets. Origination and extinction rates are broadly similar to before, with a broad decline in both through time but episodic peaks, including end-Permian turnover. Origination more consistently exceeds extinction compared to previous datasets and exceptions are mainly in the Palaeozoic. These changes suggest that some inferences about causal mechanisms in insect macroevolution are likely to differ as well. PMID:26176667
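The richness curves discussed above reduce to range-through counting over first/last occurrences; a toy Python example:

    # Range-through richness: a family counts toward every interval between
    # its first and last known occurrence (toy data, 6 intervals).
    first_last = {             # family -> (first, last) interval indices
        "A": (0, 3), "B": (1, 1), "C": (2, 5),
    }
    richness = [sum(f <= t <= l for f, l in first_last.values())
                for t in range(6)]
    print(richness)            # -> [1, 2, 2, 2, 1, 1]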
NASA Astrophysics Data System (ADS)
Patton, E. W.; West, P.; Greer, R.; Jin, B.
2011-12-01
Following on work presented at the 2010 AGU Fall Meeting, we present a number of real-world collections of semantically-enabled scientific metadata ingested into the Tetherless World RDF2HTML system as structured data and presented and edited using that system. Two separate datasets from two different domains (oceanography and solar sciences) are made available using existing web standards and services, e.g. encoded using ontologies represented with the Web Ontology Language (OWL) and stored in a SPARQL endpoint for querying. These datasets are deployed for use in three different web environments, i.e. Drupal, MediaWiki, and a custom web portal written in Java, to highlight the cross-platform nature of the data presentation. Stylesheets used to transform concepts in each domain as well as shared terms into HTML will be presented to show the power of using common ontologies to publish data and support reuse of existing terminologies. In addition, a single domain dataset is shared between two separate portal instances to demonstrate the ability for this system to offer distributed access and modification of content across the Internet. Lastly, we will highlight challenges that arose in the software engineering process, outline the design choices we made in solving those issues, and discuss how future improvements to this and other systems will enable the evolution of distributed, decentralized collaborations for scientific data sharing across multiple research groups.
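A sketch of how such a SPARQL endpoint can be queried from Python with SPARQLWrapper; the endpoint URL is a placeholder, and the Dublin Core title predicate is an assumption about the ontologies used, not a detail from the abstract.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://example.org/sparql")   # hypothetical endpoint
    sparql.setQuery("""
        SELECT ?dataset ?title WHERE {
            ?dataset <http://purl.org/dc/terms/title> ?title .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["dataset"]["value"], "-", row["title"]["value"])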
Khosrow-Khavar, Farzad; Tavakolian, Kouhyar; Blaber, Andrew; Menon, Carlo
2016-10-12
The purpose of this research was to design a delineation algorithm that could detect specific fiducial points of the seismocardiogram (SCG) signal with or without using the electrocardiogram (ECG) R-wave as the reference point. The detected fiducial points were used to estimate cardiac time intervals. Due to the complexity and sensitivity of the SCG signal, the algorithm was designed to robustly discard low-quality cardiac cycles, i.e., those containing unrecognizable fiducial points. The algorithm was trained on a dataset containing 48,318 manually annotated cardiac cycles. It was then applied to three test datasets: 65 young healthy individuals (dataset 1), 15 individuals above 44 years old (dataset 2), and 25 patients with previous heart conditions (dataset 3). The algorithm achieved high prediction accuracy, with a root-mean-square error of less than 5 ms for all test datasets. The algorithm's overall mean detection rates per individual recording (DRI) were 74%, 68%, and 42% for the three test datasets when concurrent ECG and SCG were used. For the standalone SCG case, the mean DRI values were 32%, 14%, and 21%. When the proposed algorithm was applied to concurrent ECG and SCG signals, the desired fiducial points of the SCG signal were successfully estimated with a high detection rate. For the standalone case, however, the algorithm achieved high prediction accuracy and detection rate only for the young healthy individuals dataset.
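A small Python sketch of the two reported metrics: RMSE over accepted cycles, and the detection rate as the fraction of cycles not discarded. The values are toy stand-ins, not study data.

    import numpy as np

    est = np.array([12.0, 15.5, np.nan, 14.0])   # estimated timings (ms); NaN = discarded cycle
    ref = np.array([11.0, 16.0, 13.0, 14.5])     # annotated timings (ms)

    kept = ~np.isnan(est)
    rmse = np.sqrt(np.mean((est[kept] - ref[kept]) ** 2))
    detection_rate = kept.mean()
    print(f"RMSE = {rmse:.2f} ms, detection rate = {detection_rate:.0%}")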
Analyzing contentious relationships and outlier genes in phylogenomics.
Walker, Joseph F; Brown, Joseph W; Smith, Stephen A
2018-06-08
Recent studies have demonstrated that conflict is common among gene trees in phylogenomic studies, and that less than one percent of genes may ultimately drive species tree inference in supermatrix analyses. Here, we examined two datasets where supermatrix and coalescent-based species trees conflict. We identified two highly influential "outlier" genes in each dataset. When removed from each dataset, the inferred supermatrix trees matched the topologies obtained from coalescent analyses. We also demonstrate that, while the outlier genes in the vertebrate dataset have been shown in a previous study to be the result of errors in orthology detection, the outlier genes from a plant dataset did not exhibit any obvious systematic error and therefore may be the result of some biological process yet to be determined. While topological comparisons among a small set of alternate topologies can be helpful in discovering outlier genes, they can be limited in several ways, such as assuming all genes share the same topology. Coalescent species tree methods relax this assumption but do not explicitly facilitate the examination of specific edges. Coalescent methods often also assume that conflict is the result of incomplete lineage sorting (ILS). Here we explored a framework that allows for quickly examining alternative edges and support for large phylogenomic datasets that does not assume a single topology for all genes. For both datasets, these analyses provided detailed results confirming the support for coalescent-based topologies. This framework suggests that we can improve our understanding of the underlying signal in phylogenomic datasets by asking more targeted edge-based questions.
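One simple way to flag such outlier genes, assuming per-gene log-likelihood differences between two candidate topologies are available: a robust z-score on the ΔlnL values, so that a single extreme gene cannot mask itself by inflating the spread. The values below are illustrative, not from either study.

    import numpy as np

    delta_lnl = np.array([0.4, -0.2, 0.1, 35.0, -0.3, 0.2])  # per-gene ΔlnL
    med = np.median(delta_lnl)
    mad = np.median(np.abs(delta_lnl - med))       # median absolute deviation
    robust_z = (delta_lnl - med) / (1.4826 * mad)  # MAD-based z-score
    print(np.where(np.abs(robust_z) > 5)[0])       # outlier gene indices -> [3]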