Kim, Seongsoon; Park, Donghyeon; Choi, Yonghwa; Lee, Kyubum; Kim, Byounggun; Jeon, Minji; Kim, Jihye; Tan, Aik Choon; Kang, Jaewoo
2018-01-05
With the development of artificial intelligence (AI) technology centered on deep learning, the computer has evolved to a point where it can read a given text and answer a question based on the context of the text. This specific task is known as machine comprehension. Existing machine comprehension tasks mostly use datasets of general texts, such as news articles or elementary school-level storybooks. However, no attempt has been made to determine whether an up-to-date deep learning-based machine comprehension model can also process scientific literature containing expert-level knowledge, especially in the biomedical domain. This study aims to investigate whether a machine comprehension model can process biomedical articles as well as general texts. Since there is no dataset for the biomedical literature comprehension task, our work includes generating a large-scale question answering dataset using PubMed and manually evaluating the generated dataset. We present an attention-based deep neural model tailored to the biomedical domain. To further enhance the performance of our model, we used a pretrained word vector and biomedical entity type embedding. We also developed an ensemble method of combining the results of several independent models to reduce the variance of the answers from the models. The experimental results showed that our proposed deep neural network model outperformed the baseline model by more than 7% on the new dataset. We also evaluated human performance on the new dataset. The human evaluation result showed that our deep neural model outperformed humans in comprehension by 22% on average. In this work, we introduced a new task of machine comprehension in the biomedical domain using a deep neural model. Since there was no large-scale dataset for training deep neural models in the biomedical domain, we created the new cloze-style datasets Biomedical Knowledge Comprehension Title (BMKC_T) and Biomedical Knowledge Comprehension Last Sentence (BMKC_LS) (together referred to as BioMedical Knowledge Comprehension) using the PubMed corpus. The experimental results showed that the performance of our model is much higher than that of humans. We observed that our model performed consistently better regardless of the degree of difficulty of a text, whereas humans have difficulty when performing biomedical literature comprehension tasks that require expert-level knowledge. ©Seongsoon Kim, Donghyeon Park, Yonghwa Choi, Kyubum Lee, Byounggun Kim, Minji Jeon, Jihye Kim, Aik Choon Tan, Jaewoo Kang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 05.01.2018.
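The variance-reducing ensemble mentioned in this abstract can be illustrated with a minimal Python sketch, assuming each model emits a probability per candidate answer entity; the candidate names and scores below are invented, and this is not the authors' implementation:

    import numpy as np

    def ensemble_answer(model_probs, candidates):
        # One probability vector per independently trained model, each aligned
        # to the candidate list; averaging the vectors reduces variance.
        mean_probs = np.mean(np.stack(model_probs), axis=0)
        return candidates[int(np.argmax(mean_probs))]

    candidates = ["@entity1", "@entity2", "@entity3"]
    probs = [np.array([0.2, 0.7, 0.1]),
             np.array([0.3, 0.5, 0.2]),
             np.array([0.1, 0.8, 0.1])]
    print(ensemble_answer(probs, candidates))  # -> "@entity2"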
Harvard Aging Brain Study: Dataset and accessibility.
Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G; Chatwal, Jasmeer P; Papp, Kathryn V; Amariglio, Rebecca E; Blacker, Deborah; Rentz, Dorene M; Johnson, Keith A; Sperling, Reisa A; Schultz, Aaron P
2017-01-01
The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging. To promote more extensive analyses, imaging data was designed to be compatible with other publicly available datasets. A cloud-based system enables access to interested researchers with blinded data available contingent upon completion of a data usage agreement and administrative approval. Data collection is ongoing and currently in its fifth year. Copyright © 2015 Elsevier Inc. All rights reserved.
Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K
2015-06-04
Collective analysis of the growing number of gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data with the realistic expression values of real data, and therefore overcomes the problem of faithfulness in modelling synthetic expression data. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets, as it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.
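A minimal sketch of the consensus idea, assuming k-means as a stand-in clusterer (this is not the Bi-CoPaM/UNCLES implementation, and the threshold is a placeholder): cluster each dataset separately, average the binary co-membership matrices, and keep gene pairs co-clustered in every dataset.

    import numpy as np
    from sklearn.cluster import KMeans

    def consensus_comembership(datasets, k=3):
        # datasets: genes x samples matrices sharing the same gene ordering
        n_genes = datasets[0].shape[0]
        consensus = np.zeros((n_genes, n_genes))
        for X in datasets:
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
            consensus += (labels[:, None] == labels[None, :]).astype(float)
        return consensus / len(datasets)  # 1.0 = co-clustered in all datasets

    rng = np.random.default_rng(0)
    C = consensus_comembership([rng.normal(size=(50, 12)) for _ in range(3)])
    always_together = np.argwhere(np.triu(C >= 1.0, k=1))  # tunable cutoff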
Xia, Jiaqi; Peng, Zhenling; Qi, Dawei; Mu, Hongbo; Yang, Jianyi
2017-03-15
Protein fold classification is a critical step in protein structure prediction. There are two possible ways to classify protein folds. One is through template-based fold assignment and the other is ab-initio prediction using machine learning algorithms. A combination of both solutions to improve prediction accuracy has never been explored before. We developed two algorithms, HH-fold and SVM-fold, for protein fold classification. HH-fold is a template-based fold assignment algorithm using the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm, in which a comprehensive set of features is extracted from three complementary sequence profiles. These two algorithms are then combined, resulting in the ensemble approach TA-fold. We performed a comprehensive assessment of the proposed methods by comparing with ab-initio methods and template-based threading methods on six benchmark datasets. An accuracy of 0.799 was achieved by TA-fold on the DD dataset, which consists of proteins from 27 folds. This represents an improvement of 5.4-11.7% over ab-initio methods. After updating this dataset to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from complementary sequence profiles that contain rich evolution information. http://yanglab.nankai.edu.cn/TA-fold/. yangjy@nankai.edu.cn or mhb-506@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
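The template/ab-initio combination lends itself to a simple fallback rule; the sketch below is a guess at the flavor of such an ensemble, with the confidence cutoff, features, and fold labels all invented:

    import numpy as np
    from sklearn.svm import SVC

    def combined_fold(template_hit, svm_model, features, prob_cutoff=0.9):
        # template_hit: (fold_label, probability) from a template search, or None
        if template_hit is not None and template_hit[1] >= prob_cutoff:
            return template_hit[0]                 # template-based assignment
        return svm_model.predict([features])[0]    # ab-initio fallback

    X = np.random.rand(40, 8)                      # toy profile-derived features
    y = np.random.randint(0, 3, 40)                # toy fold labels
    svm = SVC().fit(X, y)
    print(combined_fold(("fold_a", 0.95), svm, X[0]))  # trusts the template
    print(combined_fold(None, svm, X[0]))              # falls back to the SVM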
Li, Zhucui; Lu, Yan; Guo, Yufeng; Cao, Haijie; Wang, Qinhong; Shui, Wenqing
2018-10-31
Data analysis represents a key challenge for untargeted metabolomics studies, as it commonly requires extensive processing of thousands of metabolite peaks included in raw high-resolution MS data. Although a number of software packages have been developed to facilitate untargeted data processing, they have not been comprehensively scrutinized for their capability in feature detection, quantification and marker selection using a well-defined benchmark sample set. In this study, we acquired a benchmark dataset from standard mixtures consisting of 1100 compounds with specified concentration ratios, including 130 compounds with significant variation of concentrations. The five software packages evaluated here (MS-DIAL, MZmine 2, XCMS, MarkerView, and Compound Discoverer) showed similar performance in detection of true features derived from compounds in the mixtures. However, significant differences between the packages were observed in relative quantification of true features in the benchmark dataset. MZmine 2 outperformed the other software in terms of quantification accuracy, and it reported the most true discriminating markers together with the fewest false markers. Furthermore, we assessed selection of discriminating markers by different software using both the benchmark dataset and a real-case metabolomics dataset, and propose the combined use of two packages to increase confidence in biomarker identification. Our findings from this comprehensive evaluation of untargeted metabolomics software should help guide future improvements of these widely used bioinformatics tools and enable users to properly interpret their metabolomics results. Copyright © 2018 Elsevier B.V. All rights reserved.
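The benchmark scoring described here reduces to comparing reported fold changes against the specified concentration ratios; a toy sketch (cutoffs and numbers invented, not those of the study):

    import numpy as np

    def score_markers(reported_fc, true_fc, change_cut=1.5):
        # a marker counts as "discriminating" if its fold change clears the cutoff
        flagged = np.abs(np.log2(reported_fc)) >= np.log2(change_cut)
        changed = np.abs(np.log2(true_fc)) >= np.log2(change_cut)
        return int(np.sum(flagged & changed)), int(np.sum(flagged & ~changed))

    true_fc = np.array([1.0, 2.0, 1.0, 0.5])    # specified concentration ratios
    reported = np.array([1.1, 1.9, 1.8, 0.6])   # one package's output
    tp, fp = score_markers(reported, true_fc)
    print(tp, "true markers,", fp, "false markers")  # -> 2 true markers, 1 false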
A dataset of forest biomass structure for Eurasia.
Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael
2017-05-16
The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below ground); green forest floor (above- and below ground); and coarse woody debris (snags, logs, dead branches of living trees and dead roots). The dataset consists of 10,351 unique records of sample plots and 9,613 sample trees from ca 1,200 experiments for the period 1930-2014, with some overlap between the plot and tree records. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, growing stock volume, etc., when available. Such a dataset can be used for the development of models of biomass structure, biomass expansion factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.
Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.
Ernst, Jason; Kellis, Manolis
2015-04-01
With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.
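A toy analogue of the imputation idea (not ChromImpute itself; the signal construction is synthetic): treat co-measured tracks as features and train an ensemble of regression trees to predict the unobserved track at held-out genomic bins.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)
    n_bins = 2000
    observed = rng.normal(size=(n_bins, 5))        # five observed signal tracks
    target = observed @ rng.normal(size=5) + 0.1 * rng.normal(size=n_bins)

    has_data = rng.random(n_bins) < 0.5            # bins with experimental data
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(observed[has_data], target[has_data])
    imputed = model.predict(observed[~has_data])   # the imputed signal track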
REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations
NASA Astrophysics Data System (ADS)
Moulik, P.; Lekic, V.; Romanowicz, B. A.
2017-12-01
A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history. By assessing inter-dataset consistency across similar paths, we quantify travel-time measurement errors for both surface and body waves. Finally, we discuss challenges associated with combining high-frequency (~1 Hz) and long-period (10-20 s) body-wave measurements into the REM-3D reference dataset.
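The summary-ray homogenization step (item 3 above) can be sketched as coarse geographic binning with a robust median per bin; the cell size and field names are assumptions:

    import numpy as np
    from collections import defaultdict

    def summary_rays(src_lat, src_lon, rcv_lat, rcv_lon, residuals, cell=5.0):
        # group picks whose sources and receivers fall in the same cells;
        # the median residual per group is robust to outlying measurements
        bins = defaultdict(list)
        for i, r in enumerate(residuals):
            key = (int(src_lat[i] // cell), int(src_lon[i] // cell),
                   int(rcv_lat[i] // cell), int(rcv_lon[i] // cell))
            bins[key].append(r)
        return {k: float(np.median(v)) for k, v in bins.items()}

    rays = summary_rays([10.2, 10.4], [40.1, 40.3], [55.0, 55.1],
                        [12.0, 12.2], [1.3, 1.7])
    print(rays)  # both picks share one cell -> one summary ray, residual 1.5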
The LANDFIRE Refresh strategy: updating the national dataset
Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley
2013-01-01
The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote sensing based disturbance detection methods, field collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.
Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset
NASA Technical Reports Server (NTRS)
Ramasso, Emmanuel; Saxena, Abhinav
2014-01-01
Benchmarking of prognostic algorithms has been challenging due to the limited availability of common datasets suitable for prognostics. In an attempt to alleviate this problem, several benchmarking datasets have been collected by NASA's Prognostics Center of Excellence and made available to the Prognostics and Health Management (PHM) community to allow evaluation and comparison of prognostics algorithms. Among those datasets are five C-MAPSS datasets that have been extremely popular due to their unique characteristics, which make them suitable for prognostics. The C-MAPSS datasets pose several challenges that have been tackled by different methods in the PHM literature. In particular, management of high variability due to sensor noise, effects of operating conditions, and presence of multiple simultaneous fault modes are some factors that have great impact on the generalization capabilities of prognostics algorithms. More than 70 publications have used the C-MAPSS datasets for developing data-driven prognostic algorithms. The C-MAPSS datasets are also shown to be well-suited for development of new machine learning and pattern recognition tools for several key preprocessing steps such as feature extraction and selection, failure mode assessment, operating conditions assessment, health status estimation, uncertainty management, and prognostics performance evaluation. This paper summarizes a comprehensive literature review of publications using C-MAPSS datasets and provides guidelines and references for further usage of these datasets in a manner that allows clear and consistent comparison between different approaches.
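A common data-driven baseline on run-to-failure data of this kind can be sketched as follows; the column names and file path are hypothetical, not the C-MAPSS distribution format:

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    def add_rul(df):
        # df columns assumed: "unit", "cycle", and sensor columns "s1".."sN"
        last = df.groupby("unit")["cycle"].transform("max")
        out = df.copy()
        out["RUL"] = last - out["cycle"]   # remaining useful life in cycles
        return out

    # train = add_rul(pd.read_csv("train_FD001.csv"))     # hypothetical file
    # sensors = [c for c in train.columns if c.startswith("s")]
    # model = GradientBoostingRegressor().fit(train[sensors], train["RUL"])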
Damage and protection cost curves for coastal floods within the 600 largest European cities
NASA Astrophysics Data System (ADS)
Prahl, Boris F.; Boettle, Markus; Costa, Luís; Kropp, Jürgen P.; Rybski, Diego
2018-03-01
The economic assessment of the impacts of storm surges and sea-level rise in coastal cities requires high-level information on the damage and protection costs associated with varying flood heights. We provide a systematically and consistently calculated dataset of macroscale damage and protection cost curves for the 600 largest European coastal cities, opening the perspective for a wide range of applications. Offering the first comprehensive dataset to include the costs of dike protection, we provide the underpinning information to run comparative assessments of costs and benefits of coastal adaptation. Aggregate cost curves for coastal flooding at the city level are commonly regarded as by-products of impact assessments and are generally not published as a standalone dataset. Hence, our work also aims at initiating a more critical discussion on the availability and derivation of cost curves.
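A sketch of how such city-level cost curves are typically applied (all numbers invented): interpolate damage at a given flood height, then integrate event damages over exceedance probabilities to approximate expected annual damage.

    import numpy as np

    heights = np.array([0.0, 0.5, 1.0, 2.0, 3.0])     # flood height (m)
    damage = np.array([0.0, 5.0, 20.0, 90.0, 220.0])  # damage (million EUR)

    def damage_at(h):
        return np.interp(h, heights, damage)          # piecewise-linear curve

    return_periods = np.array([500, 100, 50, 10])     # years, rare to frequent
    event_heights = np.array([2.8, 2.0, 1.5, 0.8])    # surge height per event
    probs = 1.0 / return_periods                      # increasing exceedance
    ead = np.trapz(damage_at(event_heights), probs)   # expected annual damage
    print(f"EAD ~ {ead:.2f} million EUR per year")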
Scaling of global input-output networks
NASA Astrophysics Data System (ADS)
Liang, Sai; Qi, Zhengling; Qu, Shen; Zhu, Ji; Chiu, Anthony S. F.; Jia, Xiaoping; Xu, Ming
2016-06-01
Examining scaling patterns of networks can help understand how structural features relate to the behavior of the networks. Input-output networks consist of industries as nodes and inter-industrial exchanges of products as links. Previous studies consider limited measures for node strengths and link weights, and also ignore the impact of dataset choice. We consider a comprehensive set of indicators in this study that are important in economic analysis, and also examine the impact of dataset choice, by studying input-output networks in individual countries and the entire world. Results show that Burr, Log-Logistic, Log-normal, and Weibull distributions can better describe scaling patterns of global input-output networks. We also find that dataset choice has limited impacts on the observed scaling patterns. Our findings can help examine the quality of economic statistics, estimate missing data in economic statistics, and identify key nodes and links in input-output networks to support economic policymaking.
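The distribution comparison can be reproduced in outline with standard tools; the synthetic sample below stands in for node strengths, and ranking by AIC is an assumption rather than the paper's exact criterion:

    import numpy as np
    from scipy import stats

    strengths = stats.lognorm.rvs(s=1.2, size=500, random_state=0)

    candidates = {"Burr": stats.burr12, "Log-Logistic": stats.fisk,
                  "Log-normal": stats.lognorm, "Weibull": stats.weibull_min}
    for name, dist in candidates.items():
        params = dist.fit(strengths, floc=0)       # location pinned at zero
        loglik = np.sum(dist.logpdf(strengths, *params))
        aic = 2 * len(params) - 2 * loglik         # rough parameter count
        print(f"{name}: AIC = {aic:.1f}")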
Joint Sparse Representation for Robust Multimodal Biometrics Recognition
2012-01-01
[Abstract not available; only indexed full-text fragments were recovered.] The fragments reference experimental evaluations on a comprehensive multimodal dataset and a face database, and describe the WVU multimodal dataset as a comprehensive collection of different biometric modalities such as fingerprint, iris, and palmprint; cited works include the AR face database (Martinez and Benavente, CVC Technical Report, June 1998) and Park and Jain on face matching and retrieval.
Diana, Mark L; Kazley, Abby Swanson; Menachemi, Nir
2011-01-01
Objective To assess the internal consistency and agreement between the Health Care Information and Management Systems Society (HIMSS) and the Leapfrog computerized provider order entry (CPOE) data. Data Sources Secondary hospital data collected by HIMSS Analytics, the Leapfrog Group, and the American Hospital Association from 2005 to 2007. Study Design Dichotomous measures of full CPOE status were created for the HIMSS and Leapfrog datasets in each year. We assessed internal consistency by calculating the percent of full adopters in a given year that report full CPOE status in subsequent years. We assessed the level of agreement between the two datasets by calculating the κ statistic and McNemar's test. We examined responsiveness by assessing the change in full CPOE status rates, over time, reported by HIMSS and Leapfrog data, respectively. Principal Findings Findings indicate minimal agreement between the two datasets regarding positive hospital CPOE status, but adequate agreement within a given dataset from year to year. Relative to each other, the HIMSS data tend to overestimate increases in full CPOE status over time, while the Leapfrog data may underestimate year over year increases in national CPOE status. Conclusions Both Leapfrog and HIMSS data have strengths and weaknesses. Those interested in studying outcomes associated with CPOE use or adoption should be aware of the strengths and limitations of the Leapfrog and HIMSS datasets. Future development of a standard definition of CPOE status in hospitals will allow for a more comprehensive validation of these data. PMID:21449956
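The two agreement statistics named here can be computed directly; the indicator vectors below are toy stand-ins for hospital-level full-CPOE flags:

    import numpy as np
    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.contingency_tables import mcnemar

    himss = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 0])
    leapfrog = np.array([1, 0, 0, 0, 0, 0, 1, 1, 0, 0])

    print("kappa:", cohen_kappa_score(himss, leapfrog))
    table = [[np.sum((himss == 1) & (leapfrog == 1)),
              np.sum((himss == 1) & (leapfrog == 0))],
             [np.sum((himss == 0) & (leapfrog == 1)),
              np.sum((himss == 0) & (leapfrog == 0))]]
    print(mcnemar(table, exact=True))  # tests the paired disagreements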
Carbone, V; Fluit, R; Pellikaan, P; van der Krogt, M M; Janssen, D; Damsgaard, M; Vigneron, L; Feilkas, T; Koopman, H F J M; Verdonschot, N
2015-03-18
When analyzing complex biomechanical problems such as predicting the effects of orthopedic surgery, subject-specific musculoskeletal models are essential to achieve reliable predictions. The aim of this paper is to present the Twente Lower Extremity Model 2.0, a new comprehensive dataset of the musculoskeletal geometry of the lower extremity, which is based on medical imaging data and dissection performed on the right lower extremity of a fresh male cadaver. Bone, muscle and subcutaneous fat (including skin) volumes were segmented from computed tomography and magnetic resonance image scans. Inertial parameters were estimated from the image-based segmented volumes. A complete cadaver dissection was performed, in which bony landmarks, attachment sites and lines of action of 55 muscle actuators and 12 ligaments, bony wrapping surfaces, and joint geometry were measured. The obtained musculoskeletal geometry dataset was finally implemented in the AnyBody Modeling System (AnyBody Technology A/S, Aalborg, Denmark), resulting in a model consisting of 12 segments, 11 joints and 21 degrees of freedom, and including 166 muscle-tendon elements for each leg. The new TLEM 2.0 dataset was purposely built to be easily combined with novel image-based scaling techniques, such as bone surface morphing, muscle volume registration and muscle-tendon path identification, in order to obtain subject-specific musculoskeletal models in a quick and accurate way. The complete dataset, including CT and MRI scans and segmented volumes and surfaces, is made available at http://www.utwente.nl/ctw/bw/research/projects/TLEMsafe for the biomechanical community, in order to accelerate the development and adoption of subject-specific models on a large scale. TLEM 2.0 is freely shared for non-commercial use only, under acceptance of the TLEMsafe Research License Agreement. Copyright © 2014 Elsevier Ltd. All rights reserved.
Toward a complete dataset of drug-drug interaction information from publicly available sources.
Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D
2015-06-01
Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources, including 5 clinically oriented information sources, 4 Natural Language Processing (NLP) corpora, and 5 bioinformatics/pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and by specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. Future work will focus on improvement of the methods for mapping between PDDI information sources, identifying methods to improve the use of the merged dataset in PDDI NLP algorithms, integrating high-quality PDDI information from the merged dataset into Wikidata, and making the combined dataset accessible as Semantic Web Linked Data. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
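The overlap analysis reduces to set operations once each drug pair is normalized to an unordered tuple; a minimal sketch with illustrative drug names:

    def normalize(pairs):
        return {tuple(sorted(p)) for p in pairs}   # order-independent pairs

    def jaccard(a, b):
        a, b = normalize(a), normalize(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    source_a = [("warfarin", "aspirin"), ("simvastatin", "amiodarone")]
    source_b = [("aspirin", "warfarin"), ("digoxin", "verapamil")]
    print(jaccard(source_a, source_b))             # -> 0.333...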
The FaceBase Consortium: A comprehensive program to facilitate craniofacial research
Hochheiser, Harry; Aronow, Bruce J.; Artinger, Kristin; Beaty, Terri H.; Brinkley, James F.; Chai, Yang; Clouthier, David; Cunningham, Michael L.; Dixon, Michael; Donahue, Leah Rae; Fraser, Scott E.; Hallgrimsson, Benedikt; Iwata, Junichi; Klein, Ophir; Marazita, Mary L.; Murray, Jeffrey C.; Murray, Stephen; de Villena, Fernando Pardo-Manuel; Postlethwait, John; Potter, Steven; Shapiro, Linda; Spritz, Richard; Visel, Axel; Weinberg, Seth M.; Trainor, Paul A.
2012-01-01
The FaceBase Consortium consists of ten interlinked research and technology projects whose goal is to generate craniofacial research data and technology for use by the research community through a central data management and integrated bioinformatics hub. Funded by the National Institute of Dental and Craniofacial Research (NIDCR) and currently focused on studying the development of the middle region of the face, the Consortium will produce comprehensive datasets of global gene expression patterns, regulatory elements and sequencing; will generate anatomical and molecular atlases; will provide human normative facial data and other phenotypes; will conduct follow-up studies of a completed genome-wide association study; will generate independent data on the genetics of craniofacial development; will build repositories of animal models and of human samples and data for community access and analysis; and will develop software tools and animal models for analyzing and functionally testing and integrating these data. The FaceBase website (http://www.facebase.org) will serve as a web home for these efforts, providing interactive tools for exploring these datasets, together with discussion forums and other services to support and foster collaboration within the craniofacial research community. PMID:21458441
FMAP: Functional Mapping and Analysis Pipeline for metagenomics and metatranscriptomics studies.
Kim, Jiwoong; Kim, Min Soo; Koh, Andrew Y; Xie, Yang; Zhan, Xiaowei
2016-10-10
Given the lack of a complete and comprehensive library of microbial reference genomes, determining the functional profile of diverse microbial communities is challenging. The available functional analysis pipelines lack several key features: (i) an integrated alignment tool, (ii) operon-level analysis, and (iii) the ability to process large datasets. Here we introduce our open-source, stand-alone functional analysis pipeline for analyzing whole metagenomic and metatranscriptomic sequencing data, FMAP (Functional Mapping and Analysis Pipeline). FMAP performs alignment, gene family abundance calculations, and statistical analysis (three levels of analysis are provided: differentially abundant genes, operons and pathways). The resulting output can be easily visualized with heatmaps and functional pathway diagrams. FMAP functional predictions are consistent with those of currently available functional analysis pipelines. FMAP is a comprehensive tool for functional analysis of metagenomic/metatranscriptomic sequencing data. With the added features of integrated alignment, operon-level analysis, and the ability to process large datasets, FMAP will be a valuable addition to the currently available functional analysis toolbox. We believe that this software will be of great value to the wider biology and bioinformatics communities.
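One level of the statistical analysis can be sketched as a per-family differential-abundance test with multiple-testing correction; the test and correction below are generic choices, not necessarily those implemented in FMAP:

    import numpy as np
    from scipy.stats import mannwhitneyu
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(2)
    abundance = rng.poisson(10.0, (100, 12)).astype(float)  # families x samples
    abundance[:5, 6:] *= 3                                  # 5 shifted families
    group = np.array([0] * 6 + [1] * 6)

    pvals = [mannwhitneyu(row[group == 0], row[group == 1]).pvalue
             for row in abundance]
    reject, qvals, _, _ = multipletests(pvals, method="fdr_bh")
    print(int(reject.sum()), "gene families flagged")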
Status and interconnections of selected environmental issues in the global coastal zones
Shi, Hua; Singh, Ashbindu
2003-01-01
This study focuses on assessing the state of population distribution, land cover distribution, biodiversity hotspots, and protected areas in global coastal zones. The coastal zone is defined as land within 100 km of the coastline. This study attempts to answer such questions as: how crowded are the coastal zones, what is the pattern of land cover distribution in these areas, how much of these areas is designated as protected, what is the state of the biodiversity hotspots, and what are the interconnections between people and the coastal environment. This study uses globally consistent and comprehensive geospatial datasets based on remote sensing and other sources. The application of Geographic Information System (GIS) layering methods and consistent datasets has made it possible to identify and quantify selected coastal zone environmental issues and their interconnections. It is expected that such information provides a scientific basis for global coastal zone management and assists in policy formulation at the national and international levels.
Reading Profiles in Multi-Site Data With Missingness.
Eckert, Mark A; Vaden, Kenneth I; Gebregziabher, Mulugeta
2018-01-01
Children with reading disability exhibit varied deficits in reading and cognitive abilities that contribute to their reading comprehension problems. Some children exhibit primary deficits in phonological processing, while others can exhibit deficits in oral language and executive functions that affect comprehension. This behavioral heterogeneity is problematic when missing data prevent the characterization of different reading profiles, which often occurs in retrospective data sharing initiatives without coordinated data collection. Here we show that reading profiles can be reliably identified based on Random Forest classification of incomplete behavioral datasets, after the missForest method is used to multiply impute missing values. Results from simulation analyses showed that reading profiles could be accurately classified across degrees of missingness (e.g., ∼5% classification error for 30% missingness across the sample). The application of missForest to a real multi-site dataset with missingness (n = 924) showed that reading disability profiles significantly and consistently differed in reading and cognitive abilities for cases with and without missing data. The results of validation analyses indicated that the reading profiles (cases with and without missing data) exhibited significant differences for an independent set of behavioral variables that were not used to classify reading profiles. Together, the results show how multiple imputation can be applied to the classification of cases with missing data and can increase the integrity of results from multi-site open access datasets.
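A Python analogue of the impute-then-classify workflow (missForest itself is an R package; this sketch substitutes scikit-learn's iterative imputer with random-forest regressors, and all data are synthetic):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 6))              # behavioral scores
    y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy reading-profile labels
    X[rng.random(X.shape) < 0.3] = np.nan      # 30% missingness

    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=30, random_state=0),
        random_state=0)
    X_filled = imputer.fit_transform(X)
    clf = RandomForestClassifier(random_state=0).fit(X_filled, y)
    print("training accuracy:", clf.score(X_filled, y))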
Gesch, D.; Williams, J.; Miller, W.
2001-01-01
Elevation models produced from Shuttle Radar Topography Mission (SRTM) data will be the most comprehensive, consistently processed, highest-resolution topographic dataset ever produced for the Earth's land surface. Many applications that currently use elevation data will benefit from the increased availability of data with higher accuracy, quality, and resolution, especially in poorly mapped areas of the globe. SRTM data will be produced as seamless data, thereby avoiding many of the problems inherent in existing multi-source topographic databases. The U.S. Geological Survey (USGS) has produced, and is distributing, seamless elevation datasets that serve as precursors to SRTM datasets and facilitate scientific use of elevation data over large areas. GTOPO30 is a global elevation model with a 30 arc-second resolution (approximately 1 kilometer). The National Elevation Dataset (NED) covers the United States at a resolution of 1 arc-second (approximately 30 meters). Due to their seamless format and broad area coverage, both GTOPO30 and NED represent an advance in the usability of elevation data, but each still includes artifacts from the highly variable source data used to produce them. The consistent source data and processing approach for SRTM data will result in elevation products that will be a significant addition to the current availability of seamless datasets, specifically for many areas outside the U.S. One application that demonstrates some advantages that may be realized with SRTM data is delineation of land surface drainage features (watersheds and stream channels). Seamless distribution of elevation data, in which a user interactively specifies the area of interest and order parameters via a map server, is already being successfully demonstrated with existing USGS datasets. Such an approach to distributing SRTM data is ideal for a dataset that undoubtedly will be of very high interest to the spatial data user community.
3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study.
Dolz, Jose; Desrosiers, Christian; Ben Ayed, Ismail
2018-04-15
This study investigates a 3D and fully convolutional neural network (CNN) for subcortical brain structure segmentation in MRI. 3D CNN architectures have been generally avoided due to their computational and memory requirements during inference. We address the problem via small kernels, allowing deeper architectures. We further model both local and global context by embedding intermediate-layer outputs in the final prediction, which encourages consistency between features extracted at different scales and embeds fine-grained information directly in the segmentation process. Our model is efficiently trained end-to-end on a graphics processing unit (GPU), in a single stage, exploiting the dense inference capabilities of fully convolutional networks. We performed comprehensive experiments over two publicly available datasets. First, we demonstrate state-of-the-art performance on the IBSR dataset. Then, we report a large-scale multi-site evaluation over 1112 unregistered subject datasets acquired from 17 different sites (ABIDE dataset), with ages ranging from 7 to 64 years, showing that our method is robust to various acquisition protocols, demographics and clinical factors. Our method yielded segmentations that are highly consistent with a standard atlas-based approach, while running in a fraction of the time needed by atlas-based methods and avoiding registration/normalization steps. This makes it convenient for massive multi-site neuroanatomical imaging studies. To the best of our knowledge, our work is the first to study subcortical structure segmentation on such large-scale and heterogeneous data. Copyright © 2017 Elsevier Inc. All rights reserved.
Nilsson, R Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M; Bengtsson-Palme, Johan; Walker, Donald M; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C; Abarenkov, Kessy
2015-01-01
The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric (artificially joined) DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but they rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation.
Individualized Prediction of Reading Comprehension Ability Using Gray Matter Volume.
Cui, Zaixu; Su, Mengmeng; Li, Liangjie; Shu, Hua; Gong, Gaolang
2018-05-01
Reading comprehension is a crucial reading skill for learning and putatively contains 2 key components: reading decoding and linguistic comprehension. Current understanding of the neural mechanism underlying these reading comprehension components is lacking, and whether and how neuroanatomical features can be used to predict these 2 skills remain largely unexplored. In the present study, we analyzed a large sample from the Human Connectome Project (HCP) dataset and successfully built multivariate predictive models for these 2 skills using whole-brain gray matter volume features. The results showed that these models effectively captured individual differences in these 2 skills and were able to significantly predict these components of reading comprehension for unseen individuals. The strict cross-validation using the HCP cohort and another independent cohort of children demonstrated the model generalizability. The identified gray matter regions contributing to the skill prediction consisted of a wide range of regions covering the putative reading, cerebellum, and subcortical systems. Interestingly, there were gender differences in the predictive models, with the female-specific model overestimating the males' abilities. Moreover, the identified contributing gray matter regions for the female-specific and male-specific models exhibited considerable differences, supporting a gender-dependent neuroanatomical substrate for reading comprehension.
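The individualized-prediction setup generalizes to a short sketch: cross-validated regression of a behavioral score on gray matter features, scored by the correlation between predicted and observed values. The feature matrix and scores below are synthetic placeholders, not HCP data:

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(4)
    gm = rng.normal(size=(300, 500))           # subjects x gray matter features
    score = gm[:, :10].sum(axis=1) + rng.normal(scale=2.0, size=300)

    predicted = cross_val_predict(RidgeCV(), gm, score, cv=10)
    r, p = pearsonr(predicted, score)
    print(f"prediction r = {r:.2f} (p = {p:.1e})")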
Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae
Reguly, Teresa; Breitkreutz, Ashton; Boucher, Lorrie; Breitkreutz, Bobby-Joe; Hon, Gary C; Myers, Chad L; Parsons, Ainslie; Friesen, Helena; Oughtred, Rose; Tong, Amy; Stark, Chris; Ho, Yuen; Botstein, David; Andrews, Brenda; Boone, Charles; Troyanskaya, Olga G; Ideker, Trey; Dolinski, Kara; Batada, Nizar N; Tyers, Mike
2006-01-01
Background The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference. Results We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID and SGD databases. Conclusion Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks. PMID:16762047
FAIMS Mobile: Flexible, open-source software for field research
NASA Astrophysics Data System (ADS)
Ballsun-Stanton, Brian; Ross, Shawn A.; Sobotkova, Adela; Crook, Penny
2018-01-01
FAIMS Mobile is a native Android application supported by an Ubuntu server facilitating human-mediated field research across disciplines. It consists of 'core' Java and Ruby software providing a platform for data capture, which can be deeply customised using 'definition packets' consisting of XML documents (data schema and UI) and Beanshell scripts (automation). Definition packets can also be generated using an XML-based domain-specific language, making customisation easier. FAIMS Mobile includes features allowing rich and efficient data capture tailored to the needs of fieldwork. It also promotes synthetic research and improves transparency and reproducibility through the production of comprehensive datasets that can be mapped to vocabularies or ontologies as they are created.
Wang, Yaping; Nie, Jingxin; Yap, Pew-Thian; Li, Gang; Shi, Feng; Geng, Xiujuan; Guo, Lei; Shen, Dinggang
2014-01-01
Accurate and robust brain extraction is a critical step in most neuroimaging analysis pipelines. In particular, for large-scale multi-site neuroimaging studies involving a significant number of subjects with diverse age and diagnostic groups, extracting the brain automatically, accurately, and consistently is highly desirable. In this paper, we introduce population-specific probability maps to guide the brain extraction of diverse subject groups, including both healthy and diseased adult human populations, both developing and aging human populations, as well as non-human primates. Specifically, the proposed method combines an atlas-based approach, for coarse skull-stripping, with a deformable-surface-based approach that is guided by local intensity information and population-specific prior information learned from a set of real brain images for more localized refinement. Comprehensive quantitative evaluations were performed on the diverse large-scale populations of the ADNI dataset with over 800 subjects (55-90 years of age, multi-site, various diagnosis groups), the OASIS dataset with over 400 subjects (18-96 years of age, wide age range, various diagnosis groups), and the NIH pediatrics dataset with 150 subjects (5-18 years of age, multi-site, wide age range as a complementary age group to the adult dataset). The results demonstrate that our method consistently yields the best overall results across almost the entire human life span, with only a single set of parameters. To demonstrate its capability to work on non-human primates, the proposed method is further evaluated using a rhesus macaque dataset with 20 subjects. Quantitative comparisons with popularly used state-of-the-art methods, including BET, Two-pass BET, BET-B, BSE, HWA, ROBEX and AFNI, demonstrate that the proposed method performs favorably, with superior performance on all testing datasets, indicating its robustness and effectiveness. PMID:24489639
Shao, Wenguang; Pedrioli, Patrick G A; Wolski, Witold; Scurtescu, Cristian; Schmid, Emanuel; Courcelles, Mathieu; Schuster, Heiko; Kowalewski, Daniel; Marino, Fabio; Arlehamn, Cecilia S L; Vaughan, Kerrie; Peters, Bjoern; Sette, Alessandro; Ottenhoff, Tom H M; Meijgaarden, Krista E; Nieuwenhuizen, Natalie; Kaufmann, Stefan H E; Schlapbach, Ralph; Castle, John C; Nesvizhskii, Alexey I; Nielsen, Morten; Deutsch, Eric W; Campbell, David S; Moritz, Robert L; Zubarev, Roman A; Ytterberg, Anders Jimmy; Purcell, Anthony W; Marcilla, Miguel; Paradela, Alberto; Wang, Qi; Costello, Catherine E; Ternette, Nicola; van Veelen, Peter A; van Els, Cécile A C M; de Souza, Gustavo A; Sollid, Ludvig M; Admon, Arie; Stevanovic, Stefan; Rammensee, Hans-Georg; Thibault, Pierre; Perreault, Claude; Bassani-Sternberg, Michal
2018-01-01
Mass spectrometry (MS)-based immunopeptidomics investigates the repertoire of peptides presented at the cell surface by major histocompatibility complex (MHC) molecules. The broad clinical relevance of MHC-associated peptides, e.g. in precision medicine, provides a strong rationale for the large-scale generation of immunopeptidomic datasets and recent developments in MS-based peptide analysis technologies now support the generation of the required data. Importantly, the availability of diverse immunopeptidomic datasets has resulted in an increasing need to standardize, store and exchange this type of data to enable better collaborations among researchers, to advance the field more efficiently and to establish quality measures required for the meaningful comparison of datasets. Here we present the SysteMHC Atlas (https://systemhcatlas.org), a public database that aims at collecting, organizing, sharing, visualizing and exploring immunopeptidomic data generated by MS. The Atlas includes raw mass spectrometer output files collected from several laboratories around the globe, a catalog of context-specific datasets of MHC class I and class II peptides, standardized MHC allele-specific peptide spectral libraries consisting of consensus spectra calculated from repeat measurements of the same peptide sequence, and links to other proteomics and immunology databases. The SysteMHC Atlas project was created and will be further expanded using a uniform and open computational pipeline that controls the quality of peptide identifications and peptide annotations. Thus, the SysteMHC Atlas disseminates quality controlled immunopeptidomic information to the public domain and serves as a community resource toward the generation of a high-quality comprehensive map of the human immunopeptidome and the support of consistent measurement of immunopeptidomic sample cohorts. PMID:28985418
Northern Hemisphere winter storm track trends since 1959 derived from multiple reanalysis datasets
NASA Astrophysics Data System (ADS)
Chang, Edmund K. M.; Yau, Albert M. W.
2016-09-01
In this study, a comprehensive comparison of Northern Hemisphere winter storm track trends since 1959, derived from multiple reanalysis datasets and rawinsonde observations, has been conducted. In addition, trends in terms of variance and cyclone track statistics have been compared. Previous studies, based largely on the National Centers for Environmental Prediction-National Center for Atmospheric Research Reanalysis (NNR), have suggested that both the Pacific and Atlantic storm tracks significantly intensified between the 1950s and 1990s. Comparison with trends derived from rawinsonde observations suggests that the trends derived from NNR are significantly biased high, while those from the European Centre for Medium-Range Weather Forecasts 40-year Reanalysis and the Japanese 55-year Reanalysis are much less biased but still too high. Those from the two twentieth century reanalysis datasets are most consistent with observations but may exhibit slight biases of opposite signs. Between 1959 and 2010, Pacific storm track activity has likely increased by 10% or more, while Atlantic storm track activity has likely increased by <10%. Our analysis suggests that trends in Pacific and Atlantic basin-wide storm track activity prior to the 1950s derived from the two twentieth century reanalysis datasets are unlikely to be reliable due to changes in the density of surface observations. Nevertheless, these datasets may provide useful information on interannual variability, especially over the Atlantic.
Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000
NASA Astrophysics Data System (ADS)
Reba, Meredith; Reitsma, Femke; Seto, Karen C.
2016-06-01
How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski. The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point. The dataset provides the first spatially explicit archive of the location and size of urban populations over the last 6,000 years and can contribute to an improved understanding of contemporary and historical urbanization trends.
Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000
Reba, Meredith; Reitsma, Femke; Seto, Karen C.
2016-01-01
How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski. The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point. The dataset provides the first spatially explicit archive of the location and size of urban populations over the last 6,000 years and can contribute to an improved understanding of contemporary and historical urbanization trends. PMID:27271481
Cognitive coupling during reading.
Mills, Caitlin; Graesser, Art; Risko, Evan F; D'Mello, Sidney K
2017-06-01
We hypothesize that cognitively engaged readers dynamically adjust their reading times with respect to text complexity (i.e., reading times should increase for difficult sections and decrease for easier ones) and that failure to do so should impair comprehension. This hypothesis is consistent with theories of text comprehension but has surprisingly remained untested. We tested it by analyzing 4 datasets in which participants (N = 484) read expository texts using a self-paced reading paradigm. Participants self-reported mind wandering in response to pseudorandom thought-probes during reading and completed comprehension assessments after reading. We computed two measures of cognitive coupling by regressing each participant's paragraph-level reading times on two measures of text complexity: Flesch-Kincaid Grade Level and Word Concreteness scores. The two coupling measures yielded convergent findings: coupling was a negative predictor of mind wandering and a positive predictor of both text- and inference-level comprehension. Goodness-of-fit, measured with the Akaike information criterion, also improved after adding coupling to the reading-time-only models. Furthermore, cognitive coupling mediated the relationship between mind wandering and comprehension, supporting the hypothesis that mind wandering engenders a decoupling of attention from external stimuli. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
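For concreteness, here is a minimal sketch of the coupling measure described above, assuming hypothetical per-paragraph arrays: a participant's coupling index is taken as the standardized slope from regressing paragraph reading times on a text complexity measure such as Flesch-Kincaid Grade Level.

    import numpy as np

    def coupling_index(reading_times, complexity):
        """Standardized regression slope of paragraph reading times on text complexity.
        Positive values indicate the reader slows down for more difficult paragraphs."""
        rt = (reading_times - reading_times.mean()) / reading_times.std()
        cx = (complexity - complexity.mean()) / complexity.std()
        slope, _ = np.polyfit(cx, rt, 1)
        return slope

    # Hypothetical participant: 10 paragraphs with Flesch-Kincaid grade levels.
    fk_grade = np.array([8.2, 9.1, 11.5, 7.8, 10.2, 12.0, 9.5, 8.9, 11.1, 10.7])
    rt_sec = np.array([22.0, 25.1, 33.4, 20.5, 28.0, 35.2, 26.3, 24.0, 31.8, 29.5])
    print(f"coupling = {coupling_index(rt_sec, fk_grade):.2f}")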
Time-resolved metabolomics reveals metabolic modulation in rice foliage
Sato, Shigeru; Arita, Masanori; Soga, Tomoyoshi; Nishioka, Takaaki; Tomita, Masaru
2008-01-01
Background To elucidate the interaction of dynamics among modules that constitute biological systems, comprehensive datasets obtained from "omics" technologies have been used. In recent plant metabolomics approaches, the reconstruction of metabolic correlation networks has been attempted using statistical techniques. However, the results were unsatisfactory, and effective data-mining techniques that exploit appropriate comprehensive datasets are needed. Results Using capillary electrophoresis mass spectrometry (CE-MS) and capillary electrophoresis diode-array detection (CE-DAD), we analyzed the dynamic changes in the levels of 56 basic metabolites in plant foliage (Oryza sativa L. ssp. japonica) at hourly intervals over a 24-hr period. Unsupervised clustering of comprehensive metabolic profiles using Kohonen's self-organizing map (SOM) allowed classification of the biochemical pathways activated by the light and dark cycle. The carbon and nitrogen (C/N) metabolism in both periods was also visualized as a phenotypic linkage map that connects network modules on the basis of traditional metabolic pathways rather than pairwise correlations among metabolites. The regulatory networks of C/N assimilation/dissimilation at each time point were consistent with previous work on plant metabolism. In response to environmental stress, glutathione and spermidine fluctuated synchronously with their regulatory targets. Adenine nucleosides and nicotinamide coenzymes were regulated by phosphorylation and dephosphorylation. We also demonstrated that SOM analysis was applicable to the estimation of unidentifiable metabolites in metabolome analysis. Hierarchical clustering of a correlation coefficient matrix could help identify the bottleneck enzymes that regulate metabolic networks. Conclusion Our results showed that our SOM analysis with appropriate metabolic time-courses effectively revealed the synchronous dynamics among metabolic modules and elucidated the underlying biochemical functions. Applying the discrimination of unidentified metabolites and the identification of bottleneck enzymatic steps even to non-targeted comprehensive analyses promises to facilitate an understanding of large-scale interactions among components in biological systems. PMID:18564421
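A minimal sketch of the clustering step, assuming the third-party minisom package as a stand-in for the SOM implementation actually used: metabolite time courses (rows) are mapped onto a small Kohonen grid so that metabolites with similar 24-hr profiles land on nearby nodes.

    import numpy as np
    from minisom import MiniSom  # third-party package; any SOM implementation would do

    # Hypothetical data: 56 metabolites x 24 hourly measurements, z-scored per metabolite.
    profiles = np.random.rand(56, 24)
    profiles = (profiles - profiles.mean(axis=1, keepdims=True)) / profiles.std(axis=1, keepdims=True)

    som = MiniSom(4, 4, input_len=24, sigma=1.0, learning_rate=0.5, random_seed=0)
    som.train_random(profiles, num_iteration=2000)

    # Each metabolite is assigned to its best-matching node, i.e. a cluster of co-varying profiles.
    assignments = [som.winner(p) for p in profiles]
    print(assignments[:5])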
Rapid Fine Conformational Epitope Mapping Using Comprehensive Mutagenesis and Deep Sequencing*
Kowalsky, Caitlin A.; Faber, Matthew S.; Nath, Aritro; Dann, Hailey E.; Kelly, Vince W.; Liu, Li; Shanker, Purva; Wagner, Ellen K.; Maynard, Jennifer A.; Chan, Christina; Whitehead, Timothy A.
2015-01-01
Knowledge of the fine location of neutralizing and non-neutralizing epitopes on human pathogens affords a better understanding of the structural basis of antibody efficacy, which will expedite rational design of vaccines, prophylactics, and therapeutics. However, full utilization of the wealth of information from single cell techniques and antibody repertoire sequencing awaits the development of a high throughput, inexpensive method to map the conformational epitopes for antibody-antigen interactions. Here we show such an approach that combines comprehensive mutagenesis, cell surface display, and DNA deep sequencing. We develop analytical equations to identify epitope positions and show the method effectiveness by mapping the fine epitope for different antibodies targeting TNF, pertussis toxin, and the cancer target TROP2. In all three cases, the experimentally determined conformational epitope was consistent with previous experimental datasets, confirming the reliability of the experimental pipeline. Once the comprehensive library is generated, fine conformational epitope maps can be prepared at a rate of four per day. PMID:26296891
Identifying Martian Hydrothermal Sites: Geological Investigation Utilizing Multiple Datasets
NASA Technical Reports Server (NTRS)
Dohm, J. M.; Baker, V. R.; Anderson, R. C.; Scott, D. H.; Rice, J. W., Jr.; Hare, T. M.
2000-01-01
Comprehensive geological investigations of martian landscapes that may have been modified by magmatic-driven hydrothermal activity, utilizing multiple datasets, will yield prime target sites for future hydrological, mineralogical, and biological investigations.
Detecting text in natural scenes with multi-level MSER and SWT
NASA Astrophysics Data System (ADS)
Lu, Tongwei; Liu, Renjun
2018-04-01
The detection of characters in natural scenes is susceptible to factors such as complex backgrounds, variable viewing angles and diverse languages, which lead to poor detection results. To address these problems, a new text detection method is proposed, consisting of two main stages: candidate region extraction and text region detection. In the first stage, the method uses multiple scale transformations of the original image and multiple thresholds of maximally stable extremal regions (MSER) so that character regions can be detected comprehensively. In the second stage, stroke width transform (SWT) maps are computed for the candidate regions, and cascaded classifiers are then used to filter out non-text regions. The proposed method was evaluated on the standard ICDAR2011 benchmark datasets and on datasets of our own. The experimental results showed that the proposed method achieves clear improvements over other text detection methods.
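A minimal sketch of the first stage under stated assumptions (OpenCV's MSER detector, a synthetic test image, and the paper's multi-level scheme reduced to two image scales):

    import cv2
    import numpy as np

    def candidate_regions(gray, scales=(1.0, 0.5)):
        """Collect MSER bounding boxes at multiple image scales (a coarse stand-in
        for the multi-level MSER stage; SWT-based filtering would follow)."""
        boxes = []
        mser = cv2.MSER_create()
        for s in scales:
            resized = cv2.resize(gray, None, fx=s, fy=s)
            regions, bboxes = mser.detectRegions(resized)
            for (x, y, w, h) in bboxes:
                boxes.append((int(x / s), int(y / s), int(w / s), int(h / s)))
        return boxes

    # Synthetic grayscale scene with some text-like content.
    img = np.full((240, 320), 255, np.uint8)
    cv2.putText(img, "SAMPLE", (40, 120), cv2.FONT_HERSHEY_SIMPLEX, 1.5, 0, 3)
    print(len(candidate_regions(img)))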
Missing value imputation for microarray data: a comprehensive comparison study and a web tool.
Chiu, Chia-Chun; Chan, Shih-Yao; Wang, Chung-Ching; Wu, Wei-Sheng
2013-01-01
Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of dataset but not on the species the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets. In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers can easily choose the optimal algorithm for their datasets. Moreover, new imputation algorithms can be compared with existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and the imputed results can then be downloaded for downstream data analyses.
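As an illustration of the kind of imputation the study benchmarks, a k-nearest-neighbours imputer from scikit-learn stands in below for the local-least-squares methods the authors recommend (which are not part of scikit-learn); the expression matrix is synthetic:

    import numpy as np
    from sklearn.impute import KNNImputer

    # Hypothetical microarray matrix: genes x samples, with NaNs marking missing values.
    rng = np.random.default_rng(0)
    expr = rng.normal(size=(100, 8))
    expr[rng.random(expr.shape) < 0.05] = np.nan  # ~5% missing rate

    imputer = KNNImputer(n_neighbors=10)  # neighbours found among similar gene rows
    expr_complete = imputer.fit_transform(expr)
    print(np.isnan(expr_complete).sum())  # 0: downstream analyses need complete data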
Hydrodynamic modelling and global datasets: Flow connectivity and SRTM data, a Bangkok case study.
NASA Astrophysics Data System (ADS)
Trigg, M. A.; Bates, P. B.; Michaelides, K.
2012-04-01
The rise of globally interconnected manufacturing supply chains requires an understanding and consistent quantification of flood risk at a global scale. Flood risk is often better quantified (or at least more precisely defined) in regions where there has been investment in comprehensive topographical data collection, such as LiDAR, coupled with detailed hydrodynamic modelling. Yet in regions where these data and modelling are unavailable, the implications of flooding and the knock-on effects for global industries can be dramatic, as evidenced by the recent floods in Bangkok, Thailand. There is a growing momentum in terms of global modelling initiatives to address this lack of a consistent understanding of flood risk, and they will rely heavily on the application of available global datasets relevant to hydrodynamic modelling, such as Shuttle Radar Topography Mission (SRTM) data and its derivatives. These global datasets bring opportunities to apply consistent methodologies on an automated basis in all regions, while the use of coarser-scale datasets also brings many challenges, such as sub-grid process representation and downscaled hydrology data from global climate models. There are significant opportunities for hydrological science in helping define new, realistic and physically based methodologies that can be applied globally, as well as the possibility of gaining new insights into flood risk through analysis of the many large datasets that will be derived from this work. We use Bangkok as a case study to explore some of the issues related to using these available global datasets for hydrodynamic modelling, with particular focus on using SRTM data to represent topography. Research has shown that flow connectivity on the floodplain is an important component in the dynamics of flood flows onto and off the floodplain, and indeed within different areas of the floodplain. A lack of representation of flow connectivity, often due to data resolution limitations, means that important subgrid processes are missing from hydrodynamic models, leading to poor model predictive capabilities. Specifically, the issue of flow connectivity during flood events is explored here using geostatistical techniques to quantify the change of flow connectivity on floodplains due to grid rescaling methods. We also test whether this method of assessing connectivity can be used as a new tool in the quantification of flood risk that moves beyond the simple flood extent approach, encapsulating threshold changes and data limitations.
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie
2018-01-01
Abstract Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigation are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database, TranslatomeDB (http://www.translatomedb.net/), which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of the same species and type, on both transcriptome and translatome levels. The translation indices (translation ratio, elongation velocity index and translational efficiency) can be calculated to quantitatively evaluate translation initiation efficiency and elongation velocity. All datasets were analyzed using a unified, robust, accurate and experimentally verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyses. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that TranslatomeDB is a comprehensive platform and knowledgebase for translatome and proteome research, releasing biologists from searching, analyzing and comparing huge sequencing datasets without local computational power. PMID:29106630
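A minimal sketch of one of the translation indices mentioned above: translational efficiency (TE) is commonly computed per gene as the ratio of ribosome-footprint abundance to mRNA abundance. The column names and values are hypothetical, and TranslatomeDB's exact formulas may differ:

    import pandas as pd

    # Hypothetical normalized abundances (e.g., RPKM) per gene.
    df = pd.DataFrame({
        "gene": ["A", "B", "C"],
        "ribo_rpkm": [120.0, 15.0, 40.0],   # Ribo-seq (ribosome footprints)
        "mrna_rpkm": [60.0, 30.0, 10.0],    # matched mRNA-seq
    })

    # TE > 1 suggests the transcript is translated more efficiently than average.
    df["translational_efficiency"] = df["ribo_rpkm"] / df["mrna_rpkm"]
    print(df)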
Shao, Wenguang; Pedrioli, Patrick G A; Wolski, Witold; Scurtescu, Cristian; Schmid, Emanuel; Vizcaíno, Juan A; Courcelles, Mathieu; Schuster, Heiko; Kowalewski, Daniel; Marino, Fabio; Arlehamn, Cecilia S L; Vaughan, Kerrie; Peters, Bjoern; Sette, Alessandro; Ottenhoff, Tom H M; Meijgaarden, Krista E; Nieuwenhuizen, Natalie; Kaufmann, Stefan H E; Schlapbach, Ralph; Castle, John C; Nesvizhskii, Alexey I; Nielsen, Morten; Deutsch, Eric W; Campbell, David S; Moritz, Robert L; Zubarev, Roman A; Ytterberg, Anders Jimmy; Purcell, Anthony W; Marcilla, Miguel; Paradela, Alberto; Wang, Qi; Costello, Catherine E; Ternette, Nicola; van Veelen, Peter A; van Els, Cécile A C M; Heck, Albert J R; de Souza, Gustavo A; Sollid, Ludvig M; Admon, Arie; Stevanovic, Stefan; Rammensee, Hans-Georg; Thibault, Pierre; Perreault, Claude; Bassani-Sternberg, Michal; Aebersold, Ruedi; Caron, Etienne
2018-01-04
Mass spectrometry (MS)-based immunopeptidomics investigates the repertoire of peptides presented at the cell surface by major histocompatibility complex (MHC) molecules. The broad clinical relevance of MHC-associated peptides, e.g. in precision medicine, provides a strong rationale for the large-scale generation of immunopeptidomic datasets and recent developments in MS-based peptide analysis technologies now support the generation of the required data. Importantly, the availability of diverse immunopeptidomic datasets has resulted in an increasing need to standardize, store and exchange this type of data to enable better collaborations among researchers, to advance the field more efficiently and to establish quality measures required for the meaningful comparison of datasets. Here we present the SysteMHC Atlas (https://systemhcatlas.org), a public database that aims at collecting, organizing, sharing, visualizing and exploring immunopeptidomic data generated by MS. The Atlas includes raw mass spectrometer output files collected from several laboratories around the globe, a catalog of context-specific datasets of MHC class I and class II peptides, standardized MHC allele-specific peptide spectral libraries consisting of consensus spectra calculated from repeat measurements of the same peptide sequence, and links to other proteomics and immunology databases. The SysteMHC Atlas project was created and will be further expanded using a uniform and open computational pipeline that controls the quality of peptide identifications and peptide annotations. Thus, the SysteMHC Atlas disseminates quality controlled immunopeptidomic information to the public domain and serves as a community resource toward the generation of a high-quality comprehensive map of the human immunopeptidome and the support of consistent measurement of immunopeptidomic sample cohorts. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.
Yooseph, Shibu; Sutton, Granger; Rusch, Douglas B; Halpern, Aaron L; Williamson, Shannon J; Remington, Karin; Eisen, Jonathan A; Heidelberg, Karla B; Manning, Gerard; Li, Weizhong; Jaroszewski, Lukasz; Cieplak, Piotr; Miller, Christopher S; Li, Huiying; Mashiyama, Susan T; Joachimiak, Marcin P; van Belle, Christopher; Chandonia, John-Marc; Soergel, David A; Zhai, Yufeng; Natarajan, Kannan; Lee, Shaun; Raphael, Benjamin J; Bafna, Vineet; Friedman, Robert; Brenner, Steven E; Godzik, Adam; Eisenberg, David; Dixon, Jack E; Taylor, Susan S; Strausberg, Robert L; Frazier, Marvin; Venter, J Craig
2007-03-01
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
Knoll, Michaela; Ciaccia, Ettore; Dekeling, René; Kvadsheim, Petter; Liddell, Kate; Gunnarsson, Stig-Lennart; Ludwig, Stefan; Nissen, Ivor; Lorenzen, Dirk; Kreimeyer, Roman; Pavan, Gianni; Meneghetti, Nello; Nordlund, Nina; Benders, Frank; van der Zwan, Timo; van Zon, Tim; Fraser, Leanne; Johansson, Torbjörn; Garmelius, Martin
2016-01-01
Within the European Defense Agency (EDA), the Protection of Marine Mammals (PoMM) project, a comprehensive common marine mammal database essential for risk mitigation tools, was established. The database, built on an extensive dataset collection with the focus on areas of operational interest for European navies, consists of annual and seasonal distribution and density maps, random and systematic sightings, an encyclopedia providing knowledge on the characteristics of 126 marine mammal species, data on marine mammal protection areas, and audio information including numerous examples of various vocalizations. Special investigations on marine mammal acoustics were carried out to improve the detection and classification capabilities.
Development of a global historic monthly mean precipitation dataset
NASA Astrophysics Data System (ADS)
Yang, Su; Xu, Wenhui; Xu, Yan; Li, Qingxiang
2016-04-01
A global historic precipitation dataset is the basis for climate and water cycle research. Several global historic land surface precipitation datasets have been developed by international data centers such as the US National Climatic Data Center (NCDC), the European Climate Assessment & Dataset project team, the Met Office, etc., but so far no such dataset has been developed by any research institute in China. In addition, each dataset has its own focus of study region, and the existing global precipitation datasets contain only sparse observational stations over China, which may result in uncertainties in East Asian precipitation studies. In order to take comprehensive historic information into account, users might need to employ two or more datasets. However, the non-uniform data formats, data units, station IDs, and so on add extra difficulties for users exploiting these datasets. For this reason, a complete historic precipitation dataset that takes advantage of the various datasets has been developed and produced at the National Meteorological Information Center of China. Precipitation observations from 12 sources are aggregated, and the data formats, data units and station IDs are unified. Duplicated stations with the same ID are identified, and duplicated observations removed. A consistency test, correlation coefficient test, significance t-test at the 95% confidence level and significance F-test at the 95% confidence level are conducted first to ensure data reliability. Only those datasets that satisfy all four criteria are integrated to produce the China Meteorological Administration global precipitation (CGP) historic precipitation dataset version 1.0. It contains observations at 31 thousand stations with 1.87 × 10^7 data records, among which 4152 precipitation time series are longer than 100 yr. This dataset plays a critical role in climate research due to its large data volume and high station network density compared to other datasets. Using the Penalized Maximal t-test method, significant inhomogeneity has been detected in historic precipitation series at 340 stations. The ratio method is then employed to effectively remove these change points. Global precipitation analysis based on CGP v1.0 shows that rainfall increased during 1901-2013 at a rate of 3.52 ± 0.5 mm (10 yr)^-1, slightly higher than that in the NCDC data. The analysis also reveals distinct long-term trends at different latitude zones.
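The trend quoted above (mm per decade) amounts to a least-squares fit of annual precipitation against year. A minimal sketch on synthetic data:

    import numpy as np

    # Hypothetical annual-mean precipitation series, 1901-2013 (mm).
    years = np.arange(1901, 2014)
    rng = np.random.default_rng(1)
    precip = 1000 + 0.35 * (years - 1901) + rng.normal(0, 20, years.size)

    slope_per_year, _ = np.polyfit(years, precip, 1)
    print(f"trend = {10 * slope_per_year:.2f} mm per decade")  # x10: per year -> per decade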
Evaluation of the Global Land Data Assimilation System (GLDAS) air temperature data products
Ji, Lei; Senay, Gabriel B.; Verdin, James P.
2015-01-01
There is a high demand for agrohydrologic models to use gridded near-surface air temperature data as the model input for estimating regional and global water budgets and cycles. The Global Land Data Assimilation System (GLDAS) developed by combining simulation models with observations provides a long-term gridded meteorological dataset at the global scale. However, the GLDAS air temperature products have not been comprehensively evaluated, although the accuracy of the products was assessed in limited areas. In this study, the daily 0.25° resolution GLDAS air temperature data are compared with two reference datasets: 1) 1-km-resolution gridded Daymet data (2002 and 2010) for the conterminous United States and 2) global meteorological observations (2000–11) archived from the Global Historical Climatology Network (GHCN). The comparison of the GLDAS datasets with the GHCN datasets, including 13 511 weather stations, indicates a fairly high accuracy of the GLDAS data for daily temperature. The quality of the GLDAS air temperature data, however, is not always consistent in different regions of the world; for example, some areas in Africa and South America show relatively low accuracy. Spatial and temporal analyses reveal a high agreement between GLDAS and Daymet daily air temperature datasets, although spatial details in high mountainous areas are not sufficiently estimated by the GLDAS data. The evaluation of the GLDAS data demonstrates that the air temperature estimates are generally accurate, but caution should be taken when the data are used in mountainous areas or places with sparse weather stations.
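Gridded-versus-station evaluations of this kind reduce to pairing each station with its enclosing grid cell and computing bias, RMSE and correlation. A minimal sketch with hypothetical arrays:

    import numpy as np

    def evaluate(gridded, observed):
        """Bias, RMSE and Pearson correlation of gridded air temperature
        against co-located station observations (same-length 1-D arrays)."""
        diff = gridded - observed
        bias = diff.mean()
        rmse = np.sqrt((diff ** 2).mean())
        corr = np.corrcoef(gridded, observed)[0, 1]
        return bias, rmse, corr

    rng = np.random.default_rng(2)
    station_t = rng.normal(15, 8, 365)               # hypothetical daily station temps (deg C)
    gldas_t = station_t + rng.normal(0.5, 1.5, 365)  # gridded estimate with bias + noise
    print(evaluate(gldas_t, station_t))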
Smith, Richard Gavin; Berry, Philippa A M
2011-06-01
The new ACE2 (Altimeter Corrected Elevations 2) Global Digital Elevation Model (GDEM), which has recently been released, aims to provide the most accurate GDEM to date. ACE2 was created by synergistically merging the SRTM and altimetry datasets. The comprehensive comparison carried out between the two datasets yielded a wealth of information, with the areas of disagreement providing as much valuable information as the areas of agreement. Analysis of the comparison dataset revealed that certain topographic features displayed consistent differences between the two datasets. The largest differences globally are present over the rainforests, particularly the two largest, around the Amazon and the Congo. The differences range between 10 m and 40 m; these differences can be attributed to the height of the rainforest canopy, as the SRTM returned height values from somewhere within the uppermost reaches of the vegetation, whereas the altimeter was able to penetrate through and return true ground heights. The second major class of terrain feature to demonstrate coherent differences is desert regions; here, different deserts present different characteristics. The final area of interest is wetlands; these are areas of special significance because even a slight misrepresentation of the heights can have wide-ranging effects in modelling wetland areas. These examples illustrate the valuable additional information content gleaned from the synergistic global combination of the two datasets.
Watson, Nathanial E; Parsons, Brendon A; Synovec, Robert E
2016-08-12
Performance of tile-based Fisher ratio (F-ratio) data analysis, recently developed for discovery-based studies using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC-TOFMS), is evaluated with a metabolomics dataset that had previously been analyzed in great detail, albeit using a brute-force approach. The previously analyzed data (referred to herein as the benchmark dataset) were intracellular extracts from Saccharomyces cerevisiae (yeast), either metabolizing glucose (repressed) or ethanol (derepressed), which define the two classes in the discovery-based analysis to find metabolites that are statistically different in concentration between the two classes. Beneficially, this previously analyzed dataset provides a concrete means to validate the tile-based F-ratio software. Herein, we demonstrate and validate the significant benefits of applying tile-based F-ratio analysis. The yeast metabolomics data are analyzed more rapidly, in about one week versus one year for the prior studies with this dataset. Furthermore, a null distribution analysis is implemented to statistically determine an adequate F-ratio threshold, whereby the variables with F-ratio values below the threshold can be ignored as not class distinguishing, which provides the analyst with confidence when analyzing the hit table. Forty-six of the fifty-four benchmarked changing metabolites were discovered by the new methodology, while all but one of the nineteen benchmarked false positive metabolites previously identified were consistently excluded. Copyright © 2016 Elsevier B.V. All rights reserved.
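A minimal sketch of the statistics behind the F-ratio and its null-distribution threshold, reduced to a single feature with synthetic data (the 99th-percentile cutoff is an assumption for illustration):

    import numpy as np

    def f_ratio(x, labels):
        """Between-class variance over within-class variance for one feature (one-way ANOVA F)."""
        classes = np.unique(labels)
        overall = x.mean()
        between = sum(x[labels == c].size * (x[labels == c].mean() - overall) ** 2
                      for c in classes) / (len(classes) - 1)
        within = sum(((x[labels == c] - x[labels == c].mean()) ** 2).sum()
                     for c in classes) / (x.size - len(classes))
        return between / within

    rng = np.random.default_rng(3)
    labels = np.array([0] * 6 + [1] * 6)  # repressed vs derepressed samples
    signal = np.concatenate([rng.normal(0, 1, 6), rng.normal(2, 1, 6)])

    # Null distribution: recompute the F-ratio under many random label permutations.
    null = [f_ratio(signal, rng.permutation(labels)) for _ in range(1000)]
    threshold = np.quantile(null, 0.99)
    print(f_ratio(signal, labels) > threshold)  # True -> class-distinguishing feature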
A global distributed basin morphometric dataset
NASA Astrophysics Data System (ADS)
Shen, Xinyi; Anagnostou, Emmanouil N.; Mei, Yiwen; Hong, Yang
2017-01-01
Basin morphometry is vital information for relating storms to hydrologic hazards, such as landslides and floods. In this paper we present the first comprehensive global dataset of distributed basin morphometry at 30 arc seconds resolution. The dataset includes nine prime morphometric variables; in addition, we present formulas for generating twenty-one additional morphometric variables based on combinations of the prime variables. The dataset can aid different applications, including studies of land-atmosphere interaction and modelling of floods and droughts for sustainable water management. The validity of the dataset has been confirmed by successfully reproducing Hack's law.
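Hack's law, used above as the validity check, relates mainstem channel length L to drainage area A as L = c·A^h, with h typically near 0.6; reproducing it amounts to a linear fit in log-log space. A minimal sketch on synthetic basins:

    import numpy as np

    rng = np.random.default_rng(4)
    area_km2 = 10 ** rng.uniform(1, 5, 200)                         # hypothetical basin areas
    length_km = 1.4 * area_km2 ** 0.6 * rng.lognormal(0, 0.1, 200)  # noisy Hack scaling

    h, log_c = np.polyfit(np.log10(area_km2), np.log10(length_km), 1)
    print(f"Hack exponent h = {h:.2f}, coefficient c = {10 ** log_c:.2f}")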
Resource Purpose: The National Hydrography Dataset (NHD) is a comprehensive set of digital spatial data that contains information about surface water features such as lakes, ponds, streams, rivers, springs and wells. Within the NHD, surface water features are combined to fo...
Missing value imputation for microarray data: a comprehensive comparison study and a web tool
2013-01-01
Background Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. Results In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of dataset but not on the species the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets. Conclusions In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers can easily choose the optimal algorithm for their datasets. Moreover, new imputation algorithms can be compared with existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and the imputed results can then be downloaded for downstream data analyses. PMID:24565220
ChiLin: a comprehensive ChIP-seq and DNase-seq quality control and analysis pipeline.
Qin, Qian; Mei, Shenglin; Wu, Qiu; Sun, Hanfei; Li, Lewyn; Taing, Len; Chen, Sujun; Li, Fugen; Liu, Tao; Zang, Chongzhi; Xu, Han; Chen, Yiwen; Meyer, Clifford A; Zhang, Yong; Brown, Myles; Long, Henry W; Liu, X Shirley
2016-10-03
Transcription factor binding, histone modification, and chromatin accessibility studies are important approaches to understanding the biology of gene regulation. ChIP-seq and DNase-seq have become the standard techniques for studying protein-DNA interactions and chromatin accessibility, respectively, and comprehensive quality control (QC) and analysis tools are critical to extracting the most value from these assay types. Although many analysis and QC tools have been reported, few combine ChIP-seq and DNase-seq data analysis and quality control in a unified framework with a comprehensive and unbiased reference of data quality metrics. ChiLin is a computational pipeline that automates the quality control and data analyses of ChIP-seq and DNase-seq data. It is developed using a flexible and modular software framework that can be easily extended and modified. ChiLin is ideal for batch processing of many datasets and is well suited for large collaborative projects involving ChIP-seq and DNase-seq from different designs. ChiLin generates comprehensive quality control reports that include comparisons with historical data derived from over 23,677 public ChIP-seq and DNase-seq samples (11,265 datasets) from eight literature-based classified categories. To the best of our knowledge, this atlas represents the most comprehensive ChIP-seq and DNase-seq related quality metric resource currently available. These historical metrics provide useful heuristic quality references for experiments across all commonly used assay types. Using representative datasets, we demonstrate the versatility of the pipeline by applying it to different assay types of ChIP-seq data. The pipeline software is available open source at https://github.com/cfce/chilin . ChiLin is a scalable and powerful tool to process large batches of ChIP-seq and DNase-seq datasets. The analysis output and quality metrics have been structured into user-friendly directories and reports. We have successfully compiled 23,677 profiles into a comprehensive quality atlas with fine classification for users.
Comparison of recent SnIa datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sanchez, J.C. Bueno; Perivolaropoulos, L.; Nesseris, S., E-mail: jbueno@cc.uoi.gr, E-mail: nesseris@nbi.ku.dk, E-mail: leandros@uoi.gr
2009-11-01
We rank the six latest Type Ia supernova (SnIa) datasets (Constitution (C), Union (U), ESSENCE (Davis) (E), Gold06 (G), SNLS 1yr (S) and SDSS-II (D)) in the context of the Chevallier-Polarski-Linder (CPL) parametrization w(a) = w_0 + w_1(1 − a), according to their Figure of Merit (FoM), their consistency with the cosmological constant (ΛCDM), their consistency with standard rulers (Cosmic Microwave Background (CMB) and Baryon Acoustic Oscillations (BAO)) and their mutual consistency. We find a significant improvement of the FoM (defined as the inverse area of the 95.4% parameter contour) with the number of SnIa of these datasets ((C) highest FoM, (U), (G), (D), (E), (S) lowest FoM). Standard rulers (CMB+BAO) have a better FoM by about a factor of 3, compared to the highest FoM SnIa dataset (C). We also find that the ranking sequence based on consistency with ΛCDM is identical with the corresponding ranking based on consistency with standard rulers ((S) most consistent, (D), (C), (E), (U), (G) least consistent). The ranking sequence of the datasets however changes when we consider the consistency with an expansion history corresponding to evolving dark energy (w_0, w_1) = (−1.4, 2) crossing the phantom divide line w = −1 (it is practically reversed to (G), (U), (E), (S), (D), (C)). The SALT2 and MLCS2k2 fitters are also compared and some peculiar features of the SDSS-II dataset when standardized with the MLCS2k2 fitter are pointed out. Finally, we construct a statistic to estimate the internal consistency of a collection of SnIa datasets. We find that even though there is good consistency among most samples taken from the above datasets, this consistency decreases significantly when the Gold06 (G) dataset is included in the sample.
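For concreteness, the FoM defined above (inverse area of the 95.4% contour in the (w_0, w_1) plane) can be sketched from a 2×2 parameter covariance matrix: for a Gaussian posterior, the Δχ² = 6.17 ellipse has area π·6.17·sqrt(det C). The covariance values below are hypothetical:

    import numpy as np

    def fom_from_covariance(cov):
        """Inverse area of the 95.4% (delta chi^2 = 6.17) ellipse in the (w0, w1)
        plane, assuming a Gaussian posterior with 2x2 covariance matrix `cov`."""
        area = np.pi * 6.17 * np.sqrt(np.linalg.det(cov))
        return 1.0 / area

    # Hypothetical covariance of (w0, w1) from a CPL fit, w(a) = w0 + w1*(1 - a).
    cov = np.array([[0.04, -0.10],
                    [-0.10, 0.60]])
    print(f"FoM = {fom_from_covariance(cov):.2f}")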
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong
2018-01-04
Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigation are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database, TranslatomeDB (http://www.translatomedb.net/), which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of the same species and type, on both transcriptome and translatome levels. The translation indices (translation ratio, elongation velocity index and translational efficiency) can be calculated to quantitatively evaluate translation initiation efficiency and elongation velocity. All datasets were analyzed using a unified, robust, accurate and experimentally verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyses. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that TranslatomeDB is a comprehensive platform and knowledgebase for translatome and proteome research, releasing biologists from searching, analyzing and comparing huge sequencing datasets without local computational power. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Bhavnani, Suresh K.; Chen, Tianlong; Ayyaswamy, Archana; Visweswaran, Shyam; Bellala, Gowtham; Divekar, Rohit; Bassler, Kevin E.
2017-01-01
A primary goal of precision medicine is to identify patient subgroups based on their characteristics (e.g., comorbidities or genes) with the goal of designing more targeted interventions. While network visualization methods such as Fruchterman-Reingold have been used to successfully identify such patient subgroups in small to medium sized data sets, they often fail to reveal comprehensible visual patterns in large and dense networks despite having significant clustering. We therefore developed an algorithm called ExplodeLayout, which exploits the existence of significant clusters in bipartite networks to automatically “explode” a traditional network layout with the goal of separating overlapping clusters, while at the same time preserving key network topological properties that are critical for the comprehension of patient subgroups. We demonstrate the utility of ExplodeLayout by visualizing a large dataset extracted from Medicare consisting of readmitted hip-fracture patients and their comorbidities, demonstrate its statistically significant improvement over a traditional layout algorithm, and discuss how the resulting network visualization enabled clinicians to infer mechanisms precipitating hospital readmission in specific patient subgroups. PMID:28815099
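A minimal sketch of the "explode" idea under stated assumptions (networkx for the baseline Fruchterman-Reingold layout, cluster labels given in advance; this is not the authors' implementation): each cluster is displaced radially away from the global layout centroid, separating clusters while preserving within-cluster topology.

    import numpy as np
    import networkx as nx

    def explode_layout(graph, clusters, factor=2.0, seed=0):
        """Push each cluster radially away from the global centroid of a
        Fruchterman-Reingold layout to reduce cluster overlap."""
        pos = nx.spring_layout(graph, seed=seed)  # baseline force-directed layout
        coords = np.array([pos[n] for n in graph.nodes()])
        center = coords.mean(axis=0)
        for label in set(clusters.values()):
            members = [n for n in graph.nodes() if clusters[n] == label]
            centroid = np.mean([pos[n] for n in members], axis=0)
            offset = factor * (centroid - center)  # direction to displace this cluster
            for n in members:
                pos[n] = pos[n] + offset
        return pos

    g = nx.karate_club_graph()                           # stand-in network with two communities
    labels = {n: g.nodes[n]["club"] for n in g.nodes()}  # known community labels
    print(explode_layout(g, labels)[0])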
Biswas, Mithun; Islam, Rafiqul; Shom, Gautam Kumar; Shopon, Md; Mohammed, Nabeel; Momen, Sifat; Abedin, Anowarul
2017-06-01
BanglaLekha-Isolated, a Bangla handwritten isolated character dataset, is presented in this article. This dataset contains 84 different characters, comprising 50 Bangla basic characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting samples for each of the 84 characters were collected, digitized and pre-processed. After discarding mistakes and scribbles, 166,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and the gender of the subjects from whom the samples were collected. This dataset could be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.
Noormohammadpour, Pardis; Tavana, Bahareh; Mansournia, Mohammad Ali; Zeinalizadeh, Mehdi; Mirzashahi, Babak; Rostami, Mohsen; Kordi, Ramin
2018-05-01
Translation and cultural adaptation of the National Institutes of Health (NIH) Task Force's minimal dataset. The purpose of this study was to evaluate the validity and reliability of the Farsi version of the NIH Task Force's recommended multidimensional minimal dataset for research on chronic low back pain (CLBP). Considering the high treatment cost of CLBP and its increasing prevalence, the NIH Pain Consortium developed research standards (including recommendations for definitions, a minimum dataset, and outcome reporting) for studies regarding CLBP. Application of these recommendations could standardize research and improve the comparability of different studies in CLBP. This study had three phases: translation of the dataset into Farsi and its cultural adaptation, assessment of the comprehensibility of the pre-final version via a pilot study, and investigation of the reliability and validity of the final version of the translated dataset. Subjects were 250 patients with CLBP. Test-retest reliability, content validity, and convergent validity (correlations among different dimensions of the dataset and the Farsi versions of the Oswestry Disability Index (ODI), Roland-Morris Disability Questionnaire, Fear-Avoidance Beliefs Questionnaire, and Beck Depression Inventory-II (BDI)) were assessed. The Farsi version demonstrated good/excellent convergent validity (the correlation coefficient between the impact dimension and the ODI was r = 0.75 [P < 0.001], between the impact dimension and the Roland-Morris Disability Questionnaire was r = 0.80 [P < 0.001], and between the psychological dimension and the BDI was r = 0.62 [P < 0.001]). The test-retest reliability was also strong (intraclass correlation coefficients ranged between 0.70 and 0.95) and the internal consistency was good/excellent (Cronbach's alpha coefficients for the two main dimensions, the impact dimension and the psychological dimension, were 0.91 and 0.82 [P < 0.001], respectively). In addition, its face validity and content validity were acceptable. The Farsi version of the minimal dataset for research on CLBP is a reliable and valid instrument for data gathering in patients with CLBP. This minimum dataset can be a step toward the standardization of research regarding CLBP. Level of Evidence: 3.
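The internal-consistency figures above are Cronbach's alpha values, alpha = k/(k − 1) · (1 − Σ s_i² / s_T²), where s_i² are the item variances and s_T² is the variance of the total score. A minimal sketch on hypothetical questionnaire responses:

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha for an (n_subjects x k_items) response matrix."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars / total_var)

    rng = np.random.default_rng(5)
    trait = rng.normal(size=(250, 1))                       # latent severity per subject
    responses = trait + rng.normal(0, 0.8, size=(250, 10))  # 10 correlated items
    print(f"alpha = {cronbach_alpha(responses):.2f}")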
Kissling, Wilm Daniel; Dalby, Lars; Fløjgaard, Camilla; Lenoir, Jonathan; Sandel, Brody; Sandom, Christopher; Trøjelsgaard, Kristian; Svenning, Jens-Christian
2014-01-01
Ecological trait data are essential for understanding the broad-scale distribution of biodiversity and its response to global change. For animals, diet represents a fundamental aspect of species' evolutionary adaptations, ecological and functional roles, and trophic interactions. However, the importance of diet for macroevolutionary and macroecological dynamics remains little explored, partly because of the lack of comprehensive trait datasets. We compiled and evaluated a comprehensive global dataset of diet preferences of mammals ("MammalDIET"). Diet information was digitized from two global and cladewide data sources, and errors of data entry by multiple data recorders were assessed. We then developed a hierarchical extrapolation procedure to fill in diet information for species with missing information. Missing data were extrapolated with information from other taxonomic levels (genus, other species within the same genus, or family), and this extrapolation was subsequently validated both internally (with a jack-knife approach applied to the compiled species-level diet data) and externally (using independent species-level diet information from a comprehensive continentwide data source). Finally, we grouped mammal species into trophic levels and dietary guilds, and their species richness as well as their proportion of total richness were mapped at a global scale for those diet categories with good validation results. The success rate of correctly digitizing data was 94%, indicating that the consistency in data entry among multiple recorders was high. Data sources provided species-level diet information for a total of 2033 species (38% of all 5364 terrestrial mammal species, based on the IUCN taxonomy). For the remaining 3331 species, diet information was mostly extrapolated from genus-level diet information (48% of all terrestrial mammal species), and only rarely from other species within the same genus (6%) or from family level (8%). Internal and external validation showed that: (1) extrapolations were most reliable for primary food items; (2) several diet categories ("Animal", "Mammal", "Invertebrate", "Plant", "Seed", "Fruit", and "Leaf") had high proportions of correctly predicted diet ranks; and (3) the potential of correctly extrapolating specific diet categories varied both within and among clades. Global maps of species richness and proportion showed congruence among trophic levels, but also substantial discrepancies between dietary guilds. MammalDIET provides a comprehensive, unique and freely available dataset on diet preferences for all terrestrial mammals worldwide. It enables broad-scale analyses for specific trophic levels and dietary guilds, and a first assessment of trait conservatism in mammalian diet preferences at a global scale. The digitization, extrapolation and validation procedures could be transferable to other trait data and taxa. PMID:25165528
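A minimal sketch of the hierarchical fill-in procedure described above, on a hypothetical table: a species with a missing diet record inherits the modal diet of its genus and, failing that, of its family.

    import pandas as pd

    df = pd.DataFrame({
        "species": ["A1", "A2", "A3", "B1"],
        "genus":   ["GenA", "GenA", "GenA", "GenB"],
        "family":  ["FamX", "FamX", "FamX", "FamX"],
        "diet":    ["Fruit", "Fruit", None, None],  # None = missing diet record
    })

    def fill_diet(df):
        """Fill missing diets from the most common diet at genus, then family level."""
        for level in ("genus", "family"):
            mode = (df.dropna(subset=["diet"])
                      .groupby(level)["diet"]
                      .agg(lambda s: s.mode().iloc[0]))
            missing = df["diet"].isna()
            df.loc[missing, "diet"] = df.loc[missing, level].map(mode)
        return df

    print(fill_diet(df))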
Global climate shocks to agriculture from 1950 - 2015
NASA Astrophysics Data System (ADS)
Jackson, N. D.; Konar, M.; Debaere, P.; Sheffield, J.
2016-12-01
Climate shocks represent a major disruption to crop yields and agricultural production, yet a consistent and comprehensive database of agriculturally relevant climate shocks does not exist. To this end, we conduct a spatially and temporally disaggregated analysis of climate shocks to agriculture from 1950-2015 using a new gridded dataset. We quantify the occurrence and magnitude of climate shocks for all global agricultural areas during the growing season using a 0.25-degree spatial grid and daily time scale. We include all major crops and both temperature and precipitation extremes in our analysis. Critically, we evaluate climate shocks to all potential agricultural areas to improve projections within our time series. To do this, we use Global Agro-Ecological Zones maps from the Food and Agricultural Organization, the Princeton Global Meteorological Forcing dataset, and crop calendars from Sacks et al. (2010). We trace the dynamic evolution of climate shocks to agriculture, evaluate the spatial heterogeneity in agriculturally relevant climate shocks, and identify the crops and regions that are most prone to climate shocks.
MANTiS: a program for the analysis of X-ray spectromicroscopy data.
Lerotic, Mirna; Mak, Rachel; Wirick, Sue; Meirer, Florian; Jacobsen, Chris
2014-09-01
Spectromicroscopy combines spectral data with microscopy, where typical datasets consist of a stack of images taken across a range of energies over a microscopic region of the sample. Manual analysis of these complex datasets can be time-consuming, and can miss the important traits in the data. With this in mind we have developed MANTiS, an open-source tool developed in Python for spectromicroscopy data analysis. The backbone of the package involves principal component analysis and cluster analysis, classifying pixels according to spectral similarity. Our goal is to provide a data analysis tool which is comprehensive, yet intuitive and easy to use. MANTiS is designed to lead the user through the analysis using story boards that describe each step in detail so that both experienced users and beginners are able to analyze their own data independently. These capabilities are illustrated through analysis of hard X-ray imaging of iron in Roman ceramics, and soft X-ray imaging of a malaria-infected red blood cell.
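A minimal sketch of the analysis backbone described above (not MANTiS itself): pixels of an energy stack are treated as spectra, reduced with principal component analysis, and grouped with k-means so that spectrally similar pixels form clusters.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Hypothetical stack: 50 energies x 64 x 64 pixels -> one spectrum per pixel.
    rng = np.random.default_rng(6)
    stack = rng.random((50, 64, 64))
    spectra = stack.reshape(50, -1).T  # (4096 pixels, 50 energy channels)

    scores = PCA(n_components=5).fit_transform(spectra)  # denoised low-dim representation
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
    cluster_map = labels.reshape(64, 64)  # spatial map of spectrally similar regions
    print(np.bincount(labels))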
A review on cluster estimation methods and their application to neural spike data.
Zhang, James; Nguyen, Thanh; Cogill, Steven; Bhatti, Asim; Luo, Lingkun; Yang, Samuel; Nahavandi, Saeid
2018-06-01
The extracellular action potentials recorded on an electrode result from the collective simultaneous electrophysiological activity of an unknown number of neurons. Identifying and assigning these action potentials to their firing neurons-'spike sorting'-is an indispensable step in studying the function and the response of an individual or ensemble of neurons to certain stimuli. Given the task of neural spike sorting, the determination of the number of clusters (neurons) is arguably the most difficult and challenging issue, due to the existence of background noise and the overlap and interactions among neurons in neighbouring regions. It is not surprising that some researchers still rely on visual inspection by experts to estimate the number of clusters in neural spike sorting. Manual inspection, however, is not suitable to processing the vast, ever-growing amount of neural data. To address this pressing need, in this paper, thirty-three clustering validity indices have been comprehensively reviewed and implemented to determine the number of clusters in neural datasets. To gauge the suitability of the indices to neural spike data, and inform the selection process, we then calculated the indices by applying k-means clustering to twenty widely used synthetic neural datasets and one empirical dataset, and compared the performance of these indices against pre-existing ground truth labels. The results showed that the top five validity indices work consistently well across variations in noise level, both for the synthetic datasets and the real dataset. Using these top performing indices provides strong support for the determination of the number of neural clusters, which is essential in the spike sorting process.
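As a small illustration of the index-based approach reviewed above, three widely used validity indices available in scikit-learn can be scanned over candidate cluster counts; the synthetic data stand in for spike features with three true units:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import (silhouette_score,
                                 calinski_harabasz_score,
                                 davies_bouldin_score)

    # Hypothetical spike-feature data: 3 true units in a 2-D feature space.
    X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=7)

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k,
              round(silhouette_score(X, labels), 3),         # higher is better
              round(calinski_harabasz_score(X, labels), 1),  # higher is better
              round(davies_bouldin_score(X, labels), 3))     # lower is better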
A review on cluster estimation methods and their application to neural spike data
NASA Astrophysics Data System (ADS)
Zhang, James; Nguyen, Thanh; Cogill, Steven; Bhatti, Asim; Luo, Lingkun; Yang, Samuel; Nahavandi, Saeid
2018-06-01
The extracellular action potentials recorded on an electrode result from the collective simultaneous electrophysiological activity of an unknown number of neurons. Identifying and assigning these action potentials to their firing neurons—‘spike sorting’—is an indispensable step in studying the function and the response of an individual or ensemble of neurons to certain stimuli. Given the task of neural spike sorting, the determination of the number of clusters (neurons) is arguably the most difficult and challenging issue, due to the existence of background noise and the overlap and interactions among neurons in neighbouring regions. It is not surprising that some researchers still rely on visual inspection by experts to estimate the number of clusters in neural spike sorting. Manual inspection, however, is not suitable to processing the vast, ever-growing amount of neural data. To address this pressing need, in this paper, thirty-three clustering validity indices have been comprehensively reviewed and implemented to determine the number of clusters in neural datasets. To gauge the suitability of the indices to neural spike data, and inform the selection process, we then calculated the indices by applying k-means clustering to twenty widely used synthetic neural datasets and one empirical dataset, and compared the performance of these indices against pre-existing ground truth labels. The results showed that the top five validity indices work consistently well across variations in noise level, both for the synthetic datasets and the real dataset. Using these top performing indices provides strong support for the determination of the number of neural clusters, which is essential in the spike sorting process.
The Everglades Depth Estimation Network (EDEN) for Support of Ecological and Biological Assessments
Telis, Pamela A.
2006-01-01
The Everglades Depth Estimation Network (EDEN) is an integrated network of real-time water-level monitoring, ground-elevation modeling, and water-surface modeling that provides scientists and managers with current (1999-present), online water-depth information for the entire freshwater portion of the Greater Everglades. Presented on a 400-meter grid spacing, EDEN offers a consistent and documented dataset that can be used by scientists and managers to (1) guide large-scale field operations, (2) integrate hydrologic and ecological responses, and (3) support biological and ecological assessments that measure ecosystem responses to the implementation of the Comprehensive Everglades Restoration Plan.
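The core EDEN computation can be pictured as a grid subtraction; a minimal sketch, with invented elevations in place of the modeled EDEN surfaces:

    import numpy as np

    # Invented surfaces (metres above a common datum), not EDEN data.
    water_surface = np.array([[2.1, 2.0], [2.2, 2.1]])
    ground_elev   = np.array([[1.5, 1.9], [2.3, 1.8]])

    depth = water_surface - ground_elev   # water depth per grid cell
    depth[depth < 0] = 0.0                # negative values mean a dry cell
    print(depth)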
Tempest: Tools for Addressing the Needs of Next-Generation Climate Models
NASA Astrophysics Data System (ADS)
Ullrich, P. A.; Guerra, J. E.; Pinheiro, M. C.; Fong, J.
2015-12-01
Tempest is a comprehensive simulation-to-science infrastructure that tackles the needs of next-generation, high-resolution, data intensive climate modeling activities. This project incorporates three key components: TempestDynamics, a global modeling framework for experimental numerical methods and high-performance computing; TempestRemap, a toolset for arbitrary-order conservative and consistent remapping between unstructured grids; and TempestExtremes, a suite of detection and characterization tools for identifying weather extremes in large climate datasets. In this presentation, the latest advances with the implementation of this framework will be discussed, and a number of projects now utilizing these tools will be featured.
Chalkley, Robert J; Baker, Peter R; Hansen, Kirk C; Medzihradszky, Katalin F; Allen, Nadia P; Rexach, Michael; Burlingame, Alma L
2005-08-01
An in-depth analysis of a multidimensional chromatography-mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight (QqTOF) geometry instrument was carried out. A total of 3269 CID spectra were acquired. Through manual verification of database search results and de novo interpretation of spectra, 2368 spectra were confidently assigned to predicted tryptic peptides. A detailed analysis of the non-matching spectra was also carried out, highlighting what the non-matching spectra in a database search are typically composed of. The results of this comprehensive dataset study demonstrate that QqTOF instruments produce information-rich data, a high percentage of which is readily interpretable.
de Dumast, Priscille; Mirabel, Clément; Cevidanes, Lucia; Ruellas, Antonio; Yatabe, Marilia; Ioshida, Marcos; Ribera, Nina Tubau; Michoud, Loic; Gomes, Liliane; Huang, Chao; Zhu, Hongtu; Muniz, Luciana; Shoukri, Brandon; Paniagua, Beatriz; Styner, Martin; Pieper, Steve; Budin, Francois; Vimort, Jean-Baptiste; Pascal, Laura; Prieto, Juan Carlos
2018-07-01
The purpose of this study is to describe the methodological innovations of a web-based system for storage, integration and computation of biomedical data, using a training imaging dataset to remotely compute a deep neural network classifier of temporomandibular joint osteoarthritis (TMJOA). This study's imaging dataset consisted of three-dimensional (3D) surface meshes of mandibular condyles constructed from cone beam computed tomography (CBCT) scans. The training dataset consisted of 259 condyles, 105 from control subjects and 154 from patients with a diagnosis of TMJOA. For the image analysis classification, 34 right and left condyles from 17 patients (39.9 ± 11.7 years), who had experienced signs and symptoms of the disease for less than 5 years, were included as the testing dataset. For the integrative statistical model of clinical, biological and imaging markers, the sample consisted of the same 17 test OA subjects and 17 age- and sex-matched control subjects (39.4 ± 15.4 years), who did not show any sign or symptom of OA. For these 34 subjects, a standardized clinical questionnaire was administered, and blood and saliva samples were collected. The technological methodologies in this study include a deep neural network classifier of 3D condylar morphology (ShapeVariationAnalyzer, SVA), and a flexible web-based system for data storage, computation and integration (DSCI) of high-dimensional imaging, clinical, and biological data. The DSCI system trained and tested the neural network, indicating 5 stages of structural degenerative changes in condylar morphology in the TMJ with 91% agreement between the clinician consensus and the SVA classifier. The DSCI system also remotely ran a novel statistical analysis, Multivariate Functional Shape Data Analysis, which computed high-dimensional correlations between 3D shape coordinates, clinical pain levels and levels of biological markers, and then graphically displayed the computation results. The findings of this study demonstrate a comprehensive phenotypic characterization of TMJ health and disease at the clinical, imaging and biological levels, using novel, flexible and versatile open-source tools for a web-based system that provides advanced shape statistical analysis and a neural network based classification of temporomandibular joint osteoarthritis. Published by Elsevier Ltd.
National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) scientists have just released a comprehensive dataset of the proteomic analysis of high grade serous ovarian tumor samples, previously genomically analyzed by The Cancer Genome Atlas (TCGA). This is one of the largest public datasets covering the proteome, phosphoproteome and glycoproteome with complementary deep genomic sequencing data on the same tumor.
ISRUC-Sleep: A comprehensive public dataset for sleep researchers.
Khalighi, Sirvan; Sousa, Teresa; Santos, José Moutinho; Nunes, Urbano
2016-02-01
Publicly available datasets with high-quality content are important for comparing the performance of new methods for sleep pattern analysis. We introduce an open-access comprehensive sleep dataset, called ISRUC-Sleep. The data were obtained from human adults, including healthy subjects, subjects with sleep disorders, and subjects under the effect of sleep medication. Each recording was randomly selected from among the polysomnography (PSG) recordings acquired by the Sleep Medicine Centre of the Hospital of Coimbra University (CHUC). The dataset comprises three groups of data: (1) data concerning 100 subjects, with one recording session per subject; (2) data gathered from 8 subjects, with two recording sessions per subject; and (3) data from one recording session for each of 10 healthy subjects. The PSG recordings associated with each subject were visually scored by two human experts. Compared with existing sleep-related public datasets, ISRUC-Sleep provides data from a reasonable number of subjects with different characteristics, such as data useful for studies of changes in the PSG signals over time, and data from healthy subjects useful for comparisons with patients suffering from sleep disorders. This dataset was created to complement existing datasets by providing easy-to-apply data with some characteristics not yet covered. ISRUC-Sleep can be useful for the evaluation of new contributions: (i) in biomedical signal processing; (ii) in the development of automatic sleep stage classification (ASSC) methods; and (iii) in sleep physiology studies. To allow evaluation and comparison of new contributions that use this dataset as a benchmark, results of applying a subject-independent ASSC method to the ISRUC-Sleep dataset are presented. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Patterns, biases and prospects in the distribution and diversity of Neotropical snakes.
Guedes, Thaís B; Sawaya, Ricardo J; Zizka, Alexander; Laffan, Shawn; Faurby, Søren; Pyron, R Alexander; Bérnils, Renato S; Jansen, Martin; Passos, Paulo; Prudente, Ana L C; Cisneros-Heredia, Diego F; Braz, Henrique B; Nogueira, Cristiano de C; Antonelli, Alexandre; Meiri, Shai
2018-01-01
Motivation: We generated a novel database of Neotropical snakes (one of the world's richest herpetofauna) combining the most comprehensive, manually compiled distribution dataset with publicly available data. We assess, for the first time, the diversity patterns for all Neotropical snakes as well as sampling density and sampling biases. Main types of variables contained: We compiled three databases of species occurrences: a dataset downloaded from the Global Biodiversity Information Facility (GBIF), a verified dataset built through taxonomic work and specialized literature, and a combined dataset comprising a cleaned version of the GBIF dataset merged with the verified dataset. Spatial location and grain: Neotropics, Behrmann projection equivalent to 1° × 1°. Time period: Specimens housed in museums during the last 150 years. Major taxa studied: Squamata: Serpentes. Software format: Geographical information system (GIS). Results: The combined dataset provides the most comprehensive distribution database for Neotropical snakes to date. It contains 147,515 records for 886 species across 12 families, representing 74% of all species of snakes, spanning 27 countries in the Americas. Species richness and phylogenetic diversity show overall similar patterns. Amazonia is the least sampled Neotropical region, whereas most well-sampled sites are located near large universities and scientific collections. We provide a list and updated maps of geographical distribution of all snake species surveyed. Main conclusions: The biodiversity metrics of Neotropical snakes reflect patterns previously documented for other vertebrates, suggesting that similar factors may determine the diversity of both ectothermic and endothermic animals. We suggest conservation strategies for high-diversity areas and sampling efforts be directed towards Amazonia and poorly known species.
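A hedged sketch of the dataset-combination step, assuming pandas and hypothetical file and column names rather than the authors' actual schema:

    import pandas as pd

    gbif = pd.read_csv("gbif_snakes.csv")          # hypothetical file name
    verified = pd.read_csv("verified_snakes.csv")  # hypothetical file name

    # Basic cleaning of the GBIF download: drop records lacking a species
    # name or coordinates, and records with impossible coordinates.
    gbif = gbif.dropna(subset=["species", "lat", "lon"])
    gbif = gbif[gbif["lat"].between(-90, 90) & gbif["lon"].between(-180, 180)]

    # Combined dataset: verified records plus cleaned GBIF, de-duplicated.
    combined = (pd.concat([verified, gbif], ignore_index=True)
                  .drop_duplicates(subset=["species", "lat", "lon"]))
    print(len(combined), "records,", combined["species"].nunique(), "species")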
Alcaraz-Segura, Domingo; Liras, Elisa; Tabik, Siham; Paruelo, José; Cabello, Javier
2010-01-01
Successive efforts have processed the Advanced Very High Resolution Radiometer (AVHRR) sensor archive to produce Normalized Difference Vegetation Index (NDVI) datasets (i.e., PAL, FASIR, GIMMS, and LTDR) under different corrections and processing schemes. Since NDVI datasets are used to evaluate carbon gains, differences among them may affect nations’ carbon budgets in meeting international targets (such as the Kyoto Protocol). This study addresses the consistency across AVHRR NDVI datasets in the Iberian Peninsula (Spain and Portugal) by evaluating whether their 1982–1999 NDVI trends show similar spatial patterns. Significant trends were calculated with the seasonal Mann-Kendall trend test and their spatial consistency with partial Mantel tests. Over 23% of the Peninsula (N, E, and central mountain ranges) showed positive and significant NDVI trends across the four datasets and an additional 18% across three datasets. In 20% of Iberia (SW quadrant), the four datasets exhibited an absence of significant trends and an additional 22% across three datasets. Significant NDVI decreases were scarce (croplands in the Guadalquivir and Segura basins, La Mancha plains, and Valencia). Spatial consistency of significant trends across at least three datasets was observed in 83% of the Peninsula, but it decreased to 47% when comparing across the four datasets. FASIR, PAL, and LTDR were the most spatially similar datasets, while GIMMS was the most different. The different performance of each AVHRR dataset in detecting significant NDVI trends (e.g., LTDR detected greater significant trends (both positive and negative) and in 32% more pixels than GIMMS) has great implications for evaluating carbon budgets. The lack of spatial consistency across NDVI datasets derived from the same AVHRR sensor archive makes it advisable to evaluate carbon-gain trends using several satellite datasets and, where possible, to contrast them against independent or additional data sources. PMID:22205868
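A textbook implementation of the Mann-Kendall statistic behind the per-pixel trend tests, as a hedged sketch (the seasonal variant applies the same statistic within each season and sums the results); this is not the authors' code:

    import numpy as np
    from scipy.stats import norm

    def mann_kendall(x):
        """Mann-Kendall trend statistic z and two-sided p for one series."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        s = sum(np.sign(x[j] - x[i])
                for i in range(n - 1) for j in range(i + 1, n))
        var_s = n * (n - 1) * (2 * n + 5) / 18.0   # tie correction omitted
        z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
        return z, 2 * (1 - norm.cdf(abs(z)))

    ndvi = [0.31, 0.33, 0.32, 0.35, 0.36, 0.38, 0.37, 0.40]  # one pixel
    print(mann_kendall(ndvi))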
NASA Astrophysics Data System (ADS)
Jiang, C.; Ryu, Y.; Fang, H.
2016-12-01
Proper usage of global satellite LAI products requires comprehensive evaluation. To address this issue, the Committee on Earth Observation Satellites (CEOS) Land Product Validation (LPV) subgroup proposed a four-stage validation hierarchy. During the past decade, great efforts have been made following this validation framework, mainly focused on the absolute magnitude, seasonal trajectory, and spatial pattern of global satellite LAI products. However, the interannual variability and trends of global satellite LAI products have been investigated only marginally. To address this gap, we made an intercomparison of seven global satellite LAI datasets, including four short-term ones (MODIS C5, MODIS C6, GEOV1, and MERIS) and three long-term ones (LAI3g, GLASS, and GLOBMAP). We calculated global annual LAI time series for each dataset, among which we found substantial differences. During the overlapping period (2003-2011), MODIS C5, GLASS and GLOBMAP are positively correlated (r > 0.6) with each other, while MODIS C6, GEOV1, MERIS, and LAI3g are highly consistent (r > 0.7) in interannual variations. However, the former three datasets show negative trends, all of which use MODIS C5 reflectance data, whereas the latter four show positive trends, using MODIS C6, SPOT/VGT, ENVISAT/MERIS, and NOAA/AVHRR, respectively. During the pre-MODIS era (1982-1999), the three AVHRR-based datasets (LAI3g, GLASS and GLOBMAP) agree well (r > 0.7), yet all of them show oscillations related to NOAA platform changes. In addition, both GLASS and GLOBMAP show clear cut-points around 2000, when they switch from AVHRR to MODIS. Such inconsistency is also visible for GEOV1, which uses SPOT-4 and SPOT-5 before and after 2002. We further investigate the map-to-map deviations among these products. This study highlights that continuous sensor calibration and cross-calibration are essential to obtain reliable global LAI time series.
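A minimal sketch of the intercomparison machinery, assuming pandas and placeholder annual LAI values rather than the actual product data:

    import pandas as pd

    years = range(2003, 2012)
    lai = pd.DataFrame({                      # placeholder global annual LAI
        "MODIS_C5": [1.50, 1.48, 1.49, 1.47, 1.46, 1.47, 1.45, 1.44, 1.45],
        "MODIS_C6": [1.44, 1.45, 1.46, 1.46, 1.47, 1.48, 1.48, 1.49, 1.50],
        "LAI3g":    [1.40, 1.41, 1.42, 1.41, 1.43, 1.44, 1.45, 1.45, 1.46],
    }, index=years)

    print(lai.corr(method="pearson").round(2))    # pairwise r between products
    print(lai.apply(lambda s: s.diff().mean()))   # crude mean annual change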
The Graduate Outcome Project: Using Data from the Integrated Data Infrastructure Project
ERIC Educational Resources Information Center
Rees, Malcolm
2014-01-01
This paper reports on progress to date with a project underway in New Zealand involving the extraction of data from multiple government agencies that is then combined into one comprehensive longitudinal integrated dataset and made available to trial participants in a way never previously thought possible. The dataset includes school leaver…
Bayesian Network Webserver: a comprehensive tool for biological network modeling.
Ziebarth, Jesse D; Bhattacharya, Anindya; Cui, Yan
2013-11-01
The Bayesian Network Webserver (BNW) is a platform for comprehensive network modeling of systems genetics and other biological datasets. It allows users to quickly and seamlessly upload a dataset, learn the structure of the network model that best explains the data and use the model to understand relationships between network variables. Many datasets, including those used to create genetic network models, contain both discrete (e.g. genotype) and continuous (e.g. gene expression traits) variables, and BNW allows for modeling hybrid datasets. Users of BNW can incorporate prior knowledge during structure learning through an easy-to-use structural constraint interface. After structure learning, users are immediately presented with an interactive network model, which can be used to make testable hypotheses about network relationships. BNW, including a downloadable structure learning package, is available at http://compbio.uthsc.edu/BNW. (The BNW interface for adding structural constraints uses HTML5 features that are not supported by current versions of Internet Explorer. We suggest using other browsers (e.g. Google Chrome or Mozilla Firefox) when accessing BNW.) ycui2@uthsc.edu. Supplementary data are available at Bioinformatics online.
Comprehensive decision tree models in bioinformatics.
Stiglic, Gregor; Kocbek, Simon; Pernek, Igor; Kokol, Peter
2012-01-01
Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of the reasoning behind the classification model are possible. This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models by a so-called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during the tuning of the model, which is constrained exclusively by the dimensions of the produced decision tree. The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase of accuracy in less complex visually tuned decision trees. In contrast to classical machine learning benchmarking datasets, we observe higher accuracy gains in bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that the tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree. The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility, but also very good classification performance that does not differ from that of the usually more complex models built using default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes that are very common in bioinformatics.
Comprehensive Decision Tree Models in Bioinformatics
Stiglic, Gregor; Kocbek, Simon; Pernek, Igor; Kokol, Peter
2012-01-01
Purpose Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of the reasoning behind the classification model are possible. Methods This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models by a so-called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during the tuning of the model, which is constrained exclusively by the dimensions of the produced decision tree. Results The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase of accuracy in less complex visually tuned decision trees. In contrast to classical machine learning benchmarking datasets, we observe higher accuracy gains in bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that the tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree. Conclusions The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility, but also very good classification performance that does not differ from that of the usually more complex models built using default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes that are very common in bioinformatics. PMID:22479449
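A hedged sketch of the core idea, size-constrained tree construction without any performance measure during tuning, shown with scikit-learn as a stand-in for the authors' environment:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)   # binary task, many features

    trees = {
        "default": DecisionTreeClassifier(random_state=0),
        # Tuned only by tree dimensions, never by a performance measure.
        "size-constrained": DecisionTreeClassifier(max_depth=4,
                                                   max_leaf_nodes=8,
                                                   random_state=0),
    }
    for name, clf in trees.items():
        print(name, round(cross_val_score(clf, X, y, cv=10).mean(), 3))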
USDA-ARS's Scientific Manuscript database
Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...
Association between COMT Val158Met and psychiatric disorders: A comprehensive meta-analysis.
Taylor, Steven
2018-03-01
Catechol-O-methyltransferase (COMT) Val158Met is widely regarded as potentially important for understanding the genetic etiology of many different psychiatric disorders. The present study appears to be the first comprehensive meta-analysis of COMT genetic association studies to cover all psychiatric disorders for which there were available data, published in any language, and with an emphasis on investigating disorder subtypes (defined clinically or by demographic or other variables). Studies were included if they reported one or more datasets (i.e., some studies examined more than one clinical group) in which there was sufficient information to compute effect sizes. A total of 363 datasets were included, consisting of 56,998 cases and 74,668 healthy controls from case-control studies, and 2,547 trios from family-based studies. Fifteen disorders were included. Attention-deficit hyperactivity disorder and panic disorder were associated with the Val allele for Caucasian samples. Substance-use disorder, defined by DSM-IV criteria, was associated with the Val allele for Asian samples. Bipolar disorder was associated with the Met allele in Asian samples. Obsessive-compulsive disorder tended to be associated with the Met allele only for males. There was suggestive evidence that the Met allele is associated with an earlier age of onset of schizophrenia. Results suggest pleiotropy and underscore the importance of examining subgroups, defined by variables such as age of onset, sex, ethnicity, and diagnostic system, rather than examining disorders as monolithic constructs. © 2017 Wiley Periodicals, Inc.
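A minimal sketch of the inverse-variance pooling that underlies such meta-analytic effect estimates, with invented per-dataset odds ratios; fixed-effect pooling is shown for brevity:

    import numpy as np
    from scipy.stats import norm

    log_or = np.log([1.20, 1.05, 1.35])    # invented per-dataset ORs (Val)
    se     = np.array([0.08, 0.10, 0.15])  # their standard errors

    w = 1.0 / se ** 2                      # inverse-variance weights
    pooled = np.sum(w * log_or) / np.sum(w)
    z = pooled / np.sqrt(1.0 / np.sum(w))
    print("pooled OR:", round(float(np.exp(pooled)), 3),
          "p =", round(2 * (1 - norm.cdf(abs(z))), 4))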
Consolidating drug data on a global scale using Linked Data.
Jovanovik, Milos; Trajanov, Dimitar
2017-01-21
Drug product data is available on the Web in a distributed fashion. The reasons lie within the regulatory domains, which exist on a national level. As a consequence, the drug data available on the Web are independently curated by national institutions from each country, leaving the data in varying languages, with a varying structure, granularity level and format, on different locations on the Web. Therefore, one of the main challenges in the realm of drug data is the consolidation and integration of large amounts of heterogeneous data into a comprehensive dataspace, for the purpose of developing data-driven applications. In recent years, the adoption of the Linked Data principles has enabled data publishers to provide structured data on the Web and contextually interlink them with other public datasets, effectively de-siloing them. Defining methodological guidelines and specialized tools for generating Linked Data in the drug domain, applicable on a global scale, is a crucial step to achieving the necessary levels of data consolidation and alignment needed for the development of a global dataset of drug product data. This dataset would then enable a myriad of new usage scenarios, which can, for instance, provide insight into the global availability of different drug categories in different parts of the world. We developed a methodology and a set of tools which support the process of generating Linked Data in the drug domain. Using them, we generated the LinkedDrugs dataset by seamlessly transforming, consolidating and publishing high-quality, 5-star Linked Drug Data from twenty-three countries, containing over 248,000 drug products, over 99,000,000 RDF triples and over 278,000 links to generic drugs from the LOD Cloud. Using the linked nature of the dataset, we demonstrate its ability to support advanced usage scenarios in the drug domain. The process of generating the LinkedDrugs dataset demonstrates the applicability of the methodological guidelines and the supporting tools in transforming drug product data from various, independent and distributed sources, into a comprehensive Linked Drug Data dataset. The presented user-centric and analytical usage scenarios over the dataset show the advantages of having a de-siloed, consolidated and comprehensive dataspace of drug data available via the existing infrastructure of the Web.
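A minimal sketch of triple generation with rdflib; the namespace, predicates, and example drug are illustrative assumptions, not the LinkedDrugs schema:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    DRUG = Namespace("http://example.org/drug/")   # illustrative namespace
    g = Graph()

    s = URIRef(DRUG["mk-0001"])                    # one hypothetical product
    g.add((s, RDF.type, DRUG.DrugProduct))
    g.add((s, RDFS.label, Literal("Paracetamol 500 mg tablets")))
    g.add((s, DRUG.country, Literal("MK")))
    g.add((s, DRUG.hasGenericDrug,
           URIRef("http://example.org/generic/paracetamol")))

    print(g.serialize(format="turtle"))            # interlinkable RDF output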
A longitudinal dataset of five years of public activity in the Scratch online community.
Hill, Benjamin Mako; Monroy-Hernández, Andrés
2017-01-31
Scratch is a programming environment and an online community where young people can create, share, learn, and communicate. In collaboration with the Scratch Team at MIT, we created a longitudinal dataset of public activity in the Scratch online community during its first five years (2007-2012). The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more. To help researchers understand this dataset, and to establish the validity of the data, we also include the source code of every version of the software that operated the website, as well as the software used to generate this dataset. We believe this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.
Phylogenetic placement of the enigmatic parasite, Polypodium hydriforme, within the Phylum Cnidaria
2008-01-01
Background Polypodium hydriforme is a parasite with an unusual life cycle and peculiar morphology, both of which have made its systematic position uncertain. Polypodium has traditionally been considered a cnidarian because it possesses nematocysts, the stinging structures characteristic of this phylum. However, recent molecular phylogenetic studies using 18S rDNA sequence data have challenged this interpretation, and have shown that Polypodium is a close relative of myxozoans and that together they share a closer affinity to bilaterians than to cnidarians. Due to the variable rates of 18S rDNA sequences, these results have been suggested to be an artifact of long-branch attraction (LBA). A recent study, using multiple protein coding markers, shows that the myxozoan Buddenbrockia is nested within cnidarians. Polypodium was not included in this study. To further investigate the phylogenetic placement of Polypodium, we have performed phylogenetic analyses of metazoans with 18S and partial 28S rDNA sequences in a large dataset that includes Polypodium and a comprehensive sampling of cnidarian taxa. Results Analyses of a combined dataset of 18S and partial 28S sequences, and partial 28S alone, support the placement of Polypodium within Cnidaria. Removal of the long-branched myxozoans from the 18S dataset also results in Polypodium being nested within Cnidaria. These results suggest that previous reports showing that Polypodium and Myxozoa form a sister group to Bilateria were an artifact of long-branch attraction. Conclusion By including 28S rDNA sequences and a comprehensive sampling of cnidarian taxa, we demonstrate that previously conflicting hypotheses concerning the phylogenetic placement of Polypodium can be reconciled. Specifically, the data presented provide evidence that Polypodium is indeed a cnidarian and is either the sister taxon to Hydrozoa, or part of the hydrozoan clade, Leptothecata. The former hypothesis is consistent with the traditional view that Polypodium should be placed in its own cnidarian class, Polypodiozoa. PMID:18471296
Phylogenetic placement of the enigmatic parasite, Polypodium hydriforme, within the Phylum Cnidaria.
Evans, Nathaniel M; Lindner, Alberto; Raikova, Ekaterina V; Collins, Allen G; Cartwright, Paulyn
2008-05-09
Polypodium hydriforme is a parasite with an unusual life cycle and peculiar morphology, both of which have made its systematic position uncertain. Polypodium has traditionally been considered a cnidarian because it possesses nematocysts, the stinging structures characteristic of this phylum. However, recent molecular phylogenetic studies using 18S rDNA sequence data have challenged this interpretation, and have shown that Polypodium is a close relative of myxozoans and that together they share a closer affinity to bilaterians than to cnidarians. Due to the variable rates of 18S rDNA sequences, these results have been suggested to be an artifact of long-branch attraction (LBA). A recent study, using multiple protein coding markers, shows that the myxozoan Buddenbrockia is nested within cnidarians. Polypodium was not included in this study. To further investigate the phylogenetic placement of Polypodium, we have performed phylogenetic analyses of metazoans with 18S and partial 28S rDNA sequences in a large dataset that includes Polypodium and a comprehensive sampling of cnidarian taxa. Analyses of a combined dataset of 18S and partial 28S sequences, and partial 28S alone, support the placement of Polypodium within Cnidaria. Removal of the long-branched myxozoans from the 18S dataset also results in Polypodium being nested within Cnidaria. These results suggest that previous reports showing that Polypodium and Myxozoa form a sister group to Bilateria were an artifact of long-branch attraction. By including 28S rDNA sequences and a comprehensive sampling of cnidarian taxa, we demonstrate that previously conflicting hypotheses concerning the phylogenetic placement of Polypodium can be reconciled. Specifically, the data presented provide evidence that Polypodium is indeed a cnidarian and is either the sister taxon to Hydrozoa, or part of the hydrozoan clade, Leptothecata. The former hypothesis is consistent with the traditional view that Polypodium should be placed in its own cnidarian class, Polypodiozoa.
Zahn, Stephen G.
2015-07-13
LANDFIRE data products are primarily designed and developed to be used at the landscape level to facilitate national and regional strategic planning and reporting of wildland fire and other natural resource management activities. However, LANDFIRE's spatially comprehensive dataset can also be adapted to support a variety of local management applications that need current and comprehensive geospatial data.
Numerical modeling of inorganic aerosol processes is useful in air quality management, but comprehensive evaluation of modeled aerosol processes is rarely possible due to the lack of comprehensive datasets. During the Nitrogen, Aerosol Composition, and Halogens on a Tall Tower (N...
Babajani-Feremi, Abbas
2017-09-01
Comprehension of narratives constitutes a fundamental part of our everyday life experience. Although the neural mechanism of auditory narrative comprehension has been investigated in some studies, the neural correlates underlying this mechanism and its heritability remain poorly understood. We investigated comprehension of naturalistic speech in a large, healthy adult population (n = 429; 176/253 M/F; 22-36 years of age) consisting of 192 twins (49 monozygotic and 47 dizygotic pairs) and 237 of their siblings. We used high-quality functional MRI datasets from the Human Connectome Project (HCP) in which a story-based paradigm was utilized for auditory narrative comprehension. Our results revealed that narrative comprehension was associated with activations of the classical language regions including the superior temporal gyrus (STG), middle temporal gyrus (MTG), and inferior frontal gyrus (IFG) in both hemispheres, though STG and MTG were activated symmetrically and activation in IFG was left-lateralized. Our results further showed that narrative comprehension was associated with activations in areas beyond the classical language regions, e.g. the medial superior frontal gyrus (SFGmed), middle frontal gyrus (MFG), and supplementary motor area (SMA). Of subcortical structures, only the hippocampus was involved. The results of the heritability analysis revealed that oral reading recognition and picture vocabulary comprehension were significantly heritable (h² > 0.56, p < 10⁻¹³). In addition, the extent of activation of five areas in the left hemisphere, i.e. STG, IFG pars opercularis, SFGmed, SMA, and precuneus, and one area in the right hemisphere, i.e. MFG, was significantly heritable (h² > 0.33, p < 0.0004). The current study, to the best of our knowledge, is the first to investigate auditory narrative comprehension and its heritability in a large healthy population. Given the excellent quality of the HCP data, our results help clarify the functional contributions of linguistic and extra-linguistic cortices during narrative comprehension.
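A hedged sketch of the classical twin-design logic behind such heritability figures, using Falconer's formula h² = 2(r_MZ - r_DZ) with assumed correlations; the study itself used formal variance-component modeling:

    # Assumed within-pair correlations of an activation (or score) measure.
    r_mz = 0.62   # monozygotic pairs
    r_dz = 0.35   # dizygotic pairs

    h2 = 2 * (r_mz - r_dz)   # Falconer's heritability estimate
    c2 = 2 * r_dz - r_mz     # shared-environment component
    e2 = 1 - r_mz            # unique environment plus measurement error
    print(f"h2={h2:.2f}, c2={c2:.2f}, e2={e2:.2f}")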
Patterns, biases and prospects in the distribution and diversity of Neotropical snakes
Sawaya, Ricardo J.; Zizka, Alexander; Laffan, Shawn; Faurby, Søren; Pyron, R. Alexander; Bérnils, Renato S.; Jansen, Martin; Passos, Paulo; Prudente, Ana L. C.; Cisneros‐Heredia, Diego F.; Braz, Henrique B.; Nogueira, Cristiano de C.; Antonelli, Alexandre; Meiri, Shai
2017-01-01
Abstract Motivation We generated a novel database of Neotropical snakes (one of the world's richest herpetofauna) combining the most comprehensive, manually compiled distribution dataset with publicly available data. We assess, for the first time, the diversity patterns for all Neotropical snakes as well as sampling density and sampling biases. Main types of variables contained We compiled three databases of species occurrences: a dataset downloaded from the Global Biodiversity Information Facility (GBIF), a verified dataset built through taxonomic work and specialized literature, and a combined dataset comprising a cleaned version of the GBIF dataset merged with the verified dataset. Spatial location and grain Neotropics, Behrmann projection equivalent to 1° × 1°. Time period Specimens housed in museums during the last 150 years. Major taxa studied Squamata: Serpentes. Software format Geographical information system (GIS). Results The combined dataset provides the most comprehensive distribution database for Neotropical snakes to date. It contains 147,515 records for 886 species across 12 families, representing 74% of all species of snakes, spanning 27 countries in the Americas. Species richness and phylogenetic diversity show overall similar patterns. Amazonia is the least sampled Neotropical region, whereas most well‐sampled sites are located near large universities and scientific collections. We provide a list and updated maps of geographical distribution of all snake species surveyed. Main conclusions The biodiversity metrics of Neotropical snakes reflect patterns previously documented for other vertebrates, suggesting that similar factors may determine the diversity of both ectothermic and endothermic animals. We suggest conservation strategies for high‐diversity areas and sampling efforts be directed towards Amazonia and poorly known species. PMID:29398972
Cormier, Nathan; Kolisnik, Tyler; Bieda, Mark
2016-07-05
There has been an enormous expansion of use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) technologies. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and production of several specialized graphical outputs. A number of systems have emphasized custom development of ChIP-seq pipelines. These systems are primarily based on custom programming of a single, complex pipeline or supply libraries of modules and do not produce the full range of outputs commonly produced for ChIP-seq datasets. It is desirable to have more comprehensive pipelines, in particular ones addressing common metadata tasks, such as pathway analysis, and pipelines producing standard complex graphical outputs. It is advantageous if these are highly modular systems, available as both turnkey pipelines and individual modules, that are easily comprehensible, modifiable and extensible to allow rapid alteration in response to new analysis developments in this growing area. Furthermore, it is advantageous if these pipelines allow data provenance tracking. We present a set of 20 ChIP-seq analysis software modules implemented in the Kepler workflow system; most (18/20) were also implemented as standalone, fully functional R scripts. The set consists of four full turnkey pipelines and 16 component modules. The turnkey pipelines in Kepler allow data provenance tracking. Implementation emphasized use of common R packages and widely-used external tools (e.g., MACS for peak finding), along with custom programming. This software presents comprehensive solutions and easily repurposed code blocks for ChIP-seq analysis and pipeline creation. Tasks include mapping raw reads, peakfinding via MACS, summary statistics, peak location statistics, summary plots centered on the transcription start site (TSS), gene ontology, pathway analysis, and de novo motif finding, among others. These pipelines range from those performing a single task to those performing full analyses of ChIP-seq data. The pipelines are supplied as both Kepler workflows, which allow data provenance tracking, and, in the majority of cases, as standalone R scripts. These pipelines are designed for ease of modification and repurposing.
A recipe for consistent 3D management of velocity data and time-depth conversion using Vel-IO 3D
NASA Astrophysics Data System (ADS)
Maesano, Francesco E.; D'Ambrogi, Chiara
2017-04-01
3D geological model production and related basin analyses need large, consistent seismic datasets and, ideally, well logs to support correlation and calibration; the workflow and tools used to manage and integrate different types of data control the soundness of the final 3D model. Even though seismic interpretation is a basic early step in such a workflow, the most critical step in obtaining a comprehensive 3D model useful for further analyses is the construction of an effective 3D velocity model and a well-constrained time-depth conversion. We present a complex workflow that includes comprehensive management of a large seismic dataset and velocity data, the construction of a 3D instantaneous multilayer-cake velocity model, and the time-depth conversion of a highly heterogeneous geological framework, including both depositional and structural complexities. The core of the workflow is the construction of the 3D velocity model using the Vel-IO 3D tool (Maesano and D'Ambrogi, 2017; https://github.com/framae80/Vel-IO3D), which is composed of the following three scripts, written in Python 2.7.11 under the ArcGIS ArcPy environment: i) the 3D instantaneous velocity model builder creates a preliminary 3D instantaneous velocity model using key horizons in the time domain and velocity data obtained from the analysis of well and pseudo-well logs; the script applies spatial interpolation to the velocity parameters and calculates the depth of each point on each horizon bounding the layer-cake velocity model; ii) the velocity model optimizer improves the consistency of the velocity model by adding new velocity data indirectly derived from measured depths, thus reducing the geometrical uncertainties in areas located far from the original velocity data; iii) the time-depth converter runs the time-depth conversion of any object located inside the 3D velocity model. The Vel-IO 3D tool allows one to create 3D geological models consistent with the primary geological constraints (e.g. the depth of markers on wells). The workflow and the Vel-IO 3D tool were developed and tested for the construction of the 3D geological model of a flat region, 5700 km² in area, located in the central part of the Po Plain (Northern Italy), in the frame of the European-funded project GeoMol. The study area was covered by a dense dataset of seismic lines (ca. 12000 km) and exploration wells (130 wells), mainly deriving from oil and gas exploration activities. The interpretation of the seismic dataset led to the construction of a 3D model in the time domain that was depth-converted using Vel-IO 3D, with a 4-layer-cake 3D instantaneous velocity model. The resulting final 3D geological model, composed of 15 horizons and 150 faults, has been used for basin analysis at the regional scale, for geothermal assessment, and for the update of the seismotectonic knowledge of the Po Plain. Vel-IO 3D has been further used for the depth conversion of the accretionary prism of the Calabrian subduction (Southern Italy) and for a basin-scale analysis of the Plio-Pleistocene evolution of the Po Plain. Maesano F.E. and D'Ambrogi C. (2017), Computers and Geosciences, doi: 10.1016/j.cageo.2016.11.013. Vel-IO 3D is available at: https://github.com/framae80/Vel-IO3D
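A hedged sketch of time-depth conversion under a linear instantaneous velocity law v(z) = v0 + k*z, the kind of layer-cake model Vel-IO 3D builds; integrating dz/dt = v0 + k*z gives z(t) = (v0/k)(exp(k*t) - 1) for one-way time t. This is the generic formula, not the tool's code:

    import numpy as np

    def depth_from_twt(twt_s, v0, k):
        """Depth (m) from two-way time (s) with v(z) = v0 + k*z; k in 1/s."""
        t = np.asarray(twt_s) / 2.0      # one-way time
        if abs(k) < 1e-12:               # constant-velocity limit
            return v0 * t
        return (v0 / k) * np.expm1(k * t)

    # Illustrative parameters, not calibrated Po Plain values.
    print(depth_from_twt([0.5, 1.0, 2.0], v0=1800.0, k=0.3))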
Sharma, Priyanka; Saraya, Anoop; Sharma, Rinu
2018-01-30
To evaluate the diagnostic potential of a six-microRNA (miRNA) panel consisting of miR-21, miR-144, miR-107, miR-342, miR-93 and miR-152 for esophageal cancer (EC) detection. The expression of miRNAs was analyzed in EC sera samples using quantitative real-time PCR. Risk score analysis was performed and linear regression models were then fitted to generate the six-miRNA panel. In addition, we made an effort to identify significantly dysregulated miRNAs and mRNAs in EC using the Cancer Genome Atlas (TCGA) miRNAseq and RNAseq datasets, respectively. Further, we identified significantly correlated miRNA-mRNA target pairs by integrating the TCGA EC miRNAseq dataset with the RNAseq dataset. The panel of circulating miRNAs showed enhanced sensitivity (87.5%) and specificity (90.48%) in discriminating EC patients from normal subjects (area under the curve [AUC] = 0.968). Pathway enrichment analysis for potential targets of the six miRNAs revealed 48 significant (P < 0.05) pathways, viz. pathways in cancer, mRNA surveillance, MAPK, Wnt, mTOR signaling, and so on. The expression data for mRNAs and miRNAs, downloaded from the TCGA database, led to the identification of 2309 differentially expressed genes and 189 miRNAs. Gene ontology and pathway enrichment analysis showed that cell-cycle processes were most significantly enriched for the differentially expressed mRNAs. Integrated analysis of the TCGA miRNAseq and RNAseq datasets resulted in the identification of 53,063 significantly and negatively correlated miRNA-mRNA pairs. In summary, a novel and highly sensitive signature of serum miRNAs was identified for EC detection. Moreover, this is the first report identifying miRNA-mRNA target pairs from the EC TCGA dataset, thus providing a comprehensive resource for understanding the interactions existing between miRNAs and their target mRNAs in EC. © 2018 John Wiley & Sons Australia, Ltd.
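A minimal sketch of deriving a multi-marker risk score and its diagnostic AUC, assuming scikit-learn and simulated expression values in place of the serum qPCR data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    # Simulated expression for the six panel miRNAs: 60 cases, 60 controls.
    X = np.vstack([rng.normal(1.0, 1.0, size=(60, 6)),
                   rng.normal(0.0, 1.0, size=(60, 6))])
    y = np.array([1] * 60 + [0] * 60)

    model = LogisticRegression().fit(X, y)
    risk_score = model.predict_proba(X)[:, 1]   # weighted miRNA combination
    print("AUC:", round(roc_auc_score(y, risk_score), 3))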
2018-03-16
comprehensive survey of personality, values, institutional trust, mass media usage, and political attitudes and ideology (including a comprehensive...including China, Russia, and the USA, September 2015). A follow-up survey was administered to the same individuals 6 months later (N=9165), and a behavioral...dataset has been collected in the first year and a half of a three-year research plan.
The TARGET Osteosarcoma (OS) project provides comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of high-risk or hard-to-treat childhood cancers. The OS project has produced comprehensive genomic profiles of nearly 100 clinically annotated patient cases within the discovery dataset. Each fully characterized TARGET OS case includes data from nucleic acid samples extracted from tumor and normal tissue.
Underage alcohol policies across 50 California cities: an assessment of best practices
2012-01-01
Background We pursue two primary goals in this article: (1) to test a methodology and develop a dataset on U.S. local-level alcohol policy ordinances, and (2) to evaluate the presence, comprehensiveness, and stringency of eight local alcohol policies in 50 diverse California cities in relationship to recommended best practices in both public health literature and governmental recommendations to reduce underage drinking. Methods Following best practice recommendations from a wide array of authoritative sources, we selected eight local alcohol policy topics (e.g., conditional use permits, responsible beverage service training, social host ordinances, window/billboard advertising ordinances), and determined the presence or absence as well as the stringency (restrictiveness) and comprehensiveness (number of provisions) of each ordinance in each of the 50 cities in 2009. Following the alcohol policy literature, we created scores for each city on each type of ordinance and its associated components. We used these data to evaluate the extent to which recommendations for best practices to reduce underage alcohol use are being followed. Results (1) Compiling datasets of local-level alcohol policy laws and their comprehensiveness and stringency is achievable, even absent comprehensive, on-line, or other legal research tools. (2) We find that, with some exceptions, most of the 50 cities do not have high scores for presence, comprehensiveness, or stringency across the eight key policies. Critical policies such as responsible beverage service and deemed approved ordinances are uncommon, and, when present, they are generally neither comprehensive nor stringent. Even within policies that have higher adoption rates, central elements are missing across many or most cities’ ordinances. Conclusion This study demonstrates the viability of original legal data collection in the U.S. pertaining to local ordinances and of creating quantitative scores for each policy type to reflect comprehensiveness and stringency. Analysis of the resulting dataset reveals that, although the 50 cities have taken important steps to improve public health with regard to underage alcohol use and abuse, there is a great deal more that needs to be done to bring these cities into compliance with best practice recommendations. PMID:22734468
A Systematic Review of Global Drivers of Ant Elevational Diversity
Szewczyk, Tim; McCain, Christy M.
2016-01-01
Ant diversity shows a variety of patterns across elevational gradients, though the patterns and drivers have not been evaluated comprehensively. In this systematic review and reanalysis, we use published data on ant elevational diversity to detail the observed patterns and to test the predictions and interactions of four major diversity hypotheses: thermal energy, the mid-domain effect, area, and the elevational climate model. Of sixty-seven published datasets from the literature, only those with standardized, comprehensive sampling were used. Datasets included both local and regional ant diversity and spanned 80° in latitude across six biogeographical provinces. We used a combination of simulations, linear regressions, and non-parametric statistics to test multiple quantitative predictions of each hypothesis. We used an environmentally and geometrically constrained model as well as multiple regression to test their interactions. Ant diversity showed three distinct patterns across elevations: most common were hump-shaped mid-elevation peaks in diversity, followed by low-elevation plateaus and monotonic decreases in the number of ant species. The elevational climate model, which proposes that temperature and precipitation jointly drive diversity, and area were partially supported as independent drivers. Thermal energy and the mid-domain effect were not supported as primary drivers of ant diversity globally. The interaction models supported the influence of multiple drivers, though not a consistent set. In contrast to many vertebrate taxa, global ant elevational diversity patterns appear more complex, with the best environmental model contingent on precipitation levels. Differences in ecology and natural history among taxa may be crucial to the processes influencing broad-scale diversity patterns. PMID:27175999
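A minimal sketch of the mid-domain-effect null model tested above: species ranges placed at random within a bounded elevational domain, with expected richness counted per elevation; all parameters are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    domain = (0.0, 3000.0)                       # elevational extent (m)
    range_sizes = rng.uniform(200, 2500, 150)    # one range per species

    elev = np.linspace(domain[0], domain[1], 61)
    richness = np.zeros_like(elev)
    for size in range_sizes:
        low = rng.uniform(domain[0], domain[1] - size)   # random placement
        richness += (elev >= low) & (elev <= low + size)

    print(f"null-model richness peaks at ~{elev[richness.argmax()]:.0f} m")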
Assessment of the NASA-USGS Global Land Survey (GLS) Datasets
Gutman, Garik; Huang, Chengquan; Chander, Gyanesh; Noojipady, Praveen; Masek, Jeffery G.
2013-01-01
The Global Land Survey (GLS) datasets are a collection of orthorectified, cloud-minimized Landsat-type satellite images, providing near complete coverage of the global land area decadally since the early 1970s. The global mosaics are centered on 1975, 1990, 2000, 2005, and 2010, and consist of data acquired from four sensors: Enhanced Thematic Mapper Plus, Thematic Mapper, Multispectral Scanner, and Advanced Land Imager. The GLS datasets have been widely used in land-cover and land-use change studies at local, regional, and global scales. This study evaluates the GLS datasets with respect to their spatial coverage, temporal consistency, geodetic accuracy, radiometric calibration consistency, image completeness, extent of cloud contamination, and residual gaps. In general, the three latest GLS datasets are of a better quality than the GLS-1990 and GLS-1975 datasets, with most of the imagery (85%) having cloud cover of less than 10%, the acquisition years clustered much more tightly around their target years, better co-registration relative to GLS-2000, and better radiometric absolute calibration. Probably the most significant impediment to scientific use of the datasets is the variability of image phenology (i.e., acquisition day of year). This paper provides end-users with an assessment of the quality of the GLS datasets for specific applications, and where possible, suggestions for mitigating their deficiencies.
Gravity, aeromagnetic and rock-property data of the central California Coast Ranges
Langenheim, V.E.
2014-01-01
Gravity, aeromagnetic, and rock-property data were collected to support geologic-mapping, water-resource, and seismic-hazard studies for the central California Coast Ranges. These data are combined with existing data to provide gravity, aeromagnetic, and physical-property datasets for this region. The gravity dataset consists of approximately 18,000 measurements. The aeromagnetic dataset consists of total-field anomaly values from several detailed surveys that have been merged and gridded at an interval of 200 m. The physical property dataset consists of approximately 800 density measurements and 1,100 magnetic-susceptibility measurements from rock samples, in addition to previously published borehole gravity surveys from Santa Maria Basin, density logs from Salinas Valley, and intensities of natural remanent magnetization.
Li, Bo; Tang, Jing; Yang, Qingxia; Cui, Xuejiao; Li, Shuang; Chen, Sijie; Cao, Quanxing; Xue, Weiwei; Chen, Na; Zhu, Feng
2016-12-13
In untargeted metabolomics analysis, several factors (e.g., unwanted experimental and biological variations and technical errors) may hamper the identification of differential metabolic features, which necessitates data-driven normalization approaches before feature selection. So far, ≥16 normalization methods have been widely applied for processing LC/MS based metabolomics data. However, the performance and the sample size dependence of those methods have not yet been exhaustively compared, and no online tool for comparatively and comprehensively evaluating the performance of all 16 normalization methods has been provided. In this study, a comprehensive comparison of these methods was conducted. As a result, the 16 methods were categorized into three groups based on their normalization performances across various sample sizes. The VSN, the Log Transformation and the PQN were identified as the methods with the best normalization performance, while the Contrast consistently underperformed across all sub-datasets of different benchmark data. Moreover, an interactive web tool comprehensively evaluating the performance of the 16 methods specifically for normalizing LC/MS based metabolomics data was constructed and hosted at http://server.idrb.cqu.edu.cn/MetaPre/. In summary, this study could serve as useful guidance for the selection of suitable normalization methods in analyzing LC/MS based metabolomics data.
Li, Bo; Tang, Jing; Yang, Qingxia; Cui, Xuejiao; Li, Shuang; Chen, Sijie; Cao, Quanxing; Xue, Weiwei; Chen, Na; Zhu, Feng
2016-01-01
In untargeted metabolomics analysis, several factors (e.g., unwanted experimental and biological variations and technical errors) may hamper the identification of differential metabolic features, which necessitates data-driven normalization approaches before feature selection. So far, ≥16 normalization methods have been widely applied for processing LC/MS based metabolomics data. However, the performance and the sample size dependence of those methods have not yet been exhaustively compared, and no online tool for comparatively and comprehensively evaluating the performance of all 16 normalization methods has been provided. In this study, a comprehensive comparison of these methods was conducted. As a result, the 16 methods were categorized into three groups based on their normalization performances across various sample sizes. The VSN, the Log Transformation and the PQN were identified as the methods with the best normalization performance, while the Contrast consistently underperformed across all sub-datasets of different benchmark data. Moreover, an interactive web tool comprehensively evaluating the performance of the 16 methods specifically for normalizing LC/MS based metabolomics data was constructed and hosted at http://server.idrb.cqu.edu.cn/MetaPre/. In summary, this study could serve as useful guidance for the selection of suitable normalization methods in analyzing LC/MS based metabolomics data. PMID:27958387
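A minimal sketch of probabilistic quotient normalization (PQN), one of the top-performing methods named above; the matrix orientation (samples x features) and random intensities are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.abs(rng.normal(100, 20, size=(8, 50)))   # 8 samples x 50 features
    X[3] *= 2.5                                     # simulated dilution effect

    X_ta = X / X.sum(axis=1, keepdims=True)         # total-area start
    reference = np.median(X_ta, axis=0)             # median reference spectrum
    quotients = X_ta / reference
    X_pqn = X_ta / np.median(quotients, axis=1, keepdims=True)
    print(X_pqn.sum(axis=1).round(3))               # dilution largely removed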
Kavakiotis, Ioannis; Samaras, Patroklos; Triantafyllidis, Alexandros; Vlahavas, Ioannis
2017-11-01
Single Nucleotide Polymorphisms (SNPs) are nowadays becoming the marker of choice for biological analyses involving a wide range of applications with great medical, biological, economic and environmental interest. Classification tasks, i.e. the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields such as forensic investigations, discrimination between wild and/or farmed populations, and others. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification tasks, especially for Single Nucleotide Polymorphism (SNP) datasets, where the number of features can amount to hundreds of thousands or millions. In this paper, we present a novel data mining approach, called FIFS - Frequent Item Feature Selection, based on the use of frequent items for selection of the most informative markers from population genomic data. It is a modular method, consisting of two main components. The first one identifies the most frequent and unique genotypes for each sampled population. The second one selects the most appropriate among them, in order to create the informative SNP subsets to be returned. The proposed method (FIFS) was tested on a real dataset, which comprised comprehensive coverage of the pig breed types present in Britain. This dataset consisted of 446 individuals divided into 14 sub-populations, genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, our method surpassed the assignment accuracy threshold of 95% needing only half the number of SNPs selected by other methods (FIFS: 28 SNPs; Delta: 70 SNPs; pairwise FST: 70 SNPs; In: 100 SNPs). Conclusion: Our approach successfully deals with the problem of informative marker selection in high dimensional genomic datasets. It offers better results compared to existing approaches and can aid biologists in selecting the most informative markers with maximum discrimination power for optimization of cost-effective panels with applications related to e.g. species identification, wildlife management, and forensics. Copyright © 2017 Elsevier Ltd. All rights reserved.
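A hedged, simplified sketch of the frequent-and-unique-genotype idea behind FIFS, scoring one SNP at a time on toy data; this is not the published two-component algorithm:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # Toy data: 60 individuals x 30 SNPs coded 0/1/2, two populations.
    geno = pd.DataFrame(rng.integers(0, 3, size=(60, 30)))
    pop = np.array(["A"] * 30 + ["B"] * 30)
    geno.iloc[30:, 5] = 2            # make SNP 5 nearly diagnostic for B

    def score(snp, pop):
        best = 0.0
        for g in (0, 1, 2):
            for p in np.unique(pop):
                inside = np.mean(snp[pop == p] == g)    # frequent within pop
                outside = np.mean(snp[pop != p] == g)   # rare elsewhere
                best = max(best, inside - outside)      # frequent AND unique
        return best

    ranking = geno.apply(lambda s: score(s.to_numpy(), pop))
    print(ranking.sort_values(ascending=False).head())  # SNP 5 ranks first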
Multiclass fMRI data decoding and visualization using supervised self-organizing maps.
Hausfeld, Lars; Valente, Giancarlo; Formisano, Elia
2014-08-01
When multivariate pattern decoding is applied to fMRI studies entailing more than two experimental conditions, the most common approach is to transform the multiclass classification problem into a series of binary problems. Furthermore, for decoding analyses, classification accuracy is often the only outcome reported, although the topology of activation patterns in the high-dimensional feature space may provide additional insights into underlying brain representations. Here we propose to decode and visualize voxel patterns of fMRI datasets consisting of multiple conditions with a supervised variant of self-organizing maps (SSOMs). Using simulations and real fMRI data, we evaluated the performance of our SSOM-based approach. Specifically, the analysis of simulated fMRI data with varying signal-to-noise and contrast-to-noise ratios suggested that SSOMs perform better than a k-nearest-neighbor classifier for medium and large numbers of features (i.e., 250 to 1000 or more voxels) and similarly to support vector machines (SVMs) for small and medium numbers of features (i.e., 100 to 600 voxels). However, for a larger number of features (>800 voxels), SSOMs performed worse than SVMs. When applied to a challenging 3-class fMRI classification problem with datasets collected to examine the neural representation of three human voices at the individual speaker level, the SSOM-based algorithm was able to decode speaker identity from auditory cortical activation patterns. Classification performances were similar between SSOMs and other decoding algorithms; however, the ability to visualize decoding models and the underlying data topology of SSOMs promotes a more comprehensive understanding of classification outcomes. We further illustrated this visualization ability of SSOMs with a re-analysis of a dataset examining the representation of visual categories in the ventral visual cortex (Haxby et al., 2001). This analysis showed that SSOMs could retrieve and visualize the topography and neighborhood relations of the brain representation of eight visual categories. We conclude that SSOMs are particularly suited for decoding datasets consisting of more than two classes and are optimally combined with approaches that reduce the number of voxels used for classification (e.g., region-of-interest or searchlight approaches). Copyright © 2014. Published by Elsevier Inc.
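For readers unfamiliar with supervised SOMs, below is a minimal sketch of one common supervised variant, in which one-hot class labels are appended to the inputs during training so the map organizes by both pattern and class; this is a simplified stand-in, not the authors' SSOM algorithm.

```python
import numpy as np

def train_supervised_som(X, y, grid=(6, 6), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Label-augmented SOM: inputs are [features, one-hot label]."""
    rng = np.random.default_rng(seed)
    Y = np.eye(y.max() + 1)[y]                     # one-hot labels
    Z = np.hstack([X, Y])                          # label-augmented inputs
    W = rng.normal(size=(grid[0] * grid[1], Z.shape[1]))
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5    # shrinking neighborhood
        for z in rng.permutation(Z):
            bmu = np.argmin(((W - z) ** 2).sum(axis=1))   # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))            # neighborhood kernel
            W += lr * h[:, None] * (z - W)
    return W, X.shape[1]

def predict(W, n_feat, X):
    """Classify by the label component of each sample's best-matching unit."""
    bmus = np.argmin(((W[:, :n_feat][None] - X[:, None]) ** 2).sum(axis=2), axis=1)
    return W[bmus, n_feat:].argmax(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(30, 10)) for m in (-2, 0, 2)])
y = np.repeat([0, 1, 2], 30)
W, n_feat = train_supervised_som(X, y)
print((predict(W, n_feat, X) == y).mean())         # training accuracy
```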
Internal Consistency of the NVAP Water Vapor Dataset
NASA Technical Reports Server (NTRS)
Suggs, Ronnie J.; Jedlovec, Gary J.; Arnold, James E. (Technical Monitor)
2001-01-01
The NVAP (NASA Water Vapor Project) dataset is a global dataset at 1 x 1 degree spatial resolution consisting of daily, pentad, and monthly atmospheric precipitable water (PW) products. The analysis blends measurements from the Television and Infrared Operational Satellite (TIROS) Operational Vertical Sounder (TOVS), the Special Sensor Microwave/Imager (SSM/I), and radiosonde observations into a daily collage of PW. The original dataset consisted of five years of data from 1988 to 1992. Recent updates have added three additional years (1993-1995) and incorporated procedural and algorithm changes from the original methodology. Since none of the PW sources (TOVS, SSM/I, and radiosonde) provides global coverage on its own, the sources complement one another by providing spatial coverage over regions and during times where the others are not available. For this type of spatial and temporal blending to be successful, each of the source components should have similar or compatible accuracies. If this is not the case, regional and time-varying biases may be manifested in the NVAP dataset. This study examines the consistency of the NVAP source data by comparing daily collocated TOVS and SSM/I PW retrievals with collocated radiosonde PW observations. The daily PW intercomparisons are performed over the time period of the dataset and for various regions.
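A consistency check of this kind reduces to bias and RMS statistics over collocated retrievals; the sketch below shows the computation on synthetic collocations (all values are made up).

```python
import numpy as np

def consistency_stats(retrieved, reference):
    """Bias and RMS difference between collocated PW retrievals (mm)."""
    diff = retrieved - reference
    return diff.mean(), np.sqrt((diff ** 2).mean())

# Toy collocations: radiosonde PW as reference, two satellite sources.
rng = np.random.default_rng(0)
raob = rng.uniform(5, 60, size=500)                  # radiosonde PW
tovs = raob + rng.normal(1.0, 3.0, size=500)         # biased, noisier source
ssmi = raob + rng.normal(-0.5, 2.0, size=500)
for name, src in [("TOVS", tovs), ("SSM/I", ssmi)]:
    bias, rms = consistency_stats(src, raob)
    print(f"{name}: bias={bias:+.2f} mm, rms={rms:.2f} mm")
```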
Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein
2017-01-01
An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled "biosigdata.com." It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while maintaining their privacy (citation and fee). Commenting was also available for all datasets, and an automatic sitemap and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary file (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).
Slab2 - Updated Subduction Zone Geometries and Modeling Tools
NASA Astrophysics Data System (ADS)
Moore, G.; Hayes, G. P.; Portner, D. E.; Furtney, M.; Flamme, H. E.; Hearne, M. G.
2017-12-01
The U.S. Geological Survey database of global subduction zone geometries (Slab1.0) is a highly utilized dataset that has been applied to a wide range of geophysical problems. In 2017, these models were improved and expanded upon as part of the Slab2 modeling effort. With a new data-driven approach that can be applied to a broader range of tectonic settings and geophysical data sets, we have generated a model set that will serve as a more comprehensive, reliable, and reproducible resource for three-dimensional slab geometries at all of the world's convergent margins. The newly developed framework of Slab2 is guided by: (1) a large integrated dataset consisting of a variety of geophysical sources (e.g., earthquake hypocenters, moment tensors, active-source seismic survey images of the shallow slab, tomography models, receiver functions, bathymetry, trench ages, and sediment thickness information); (2) a dynamic filtering scheme aimed at constraining incorporated seismicity to only slab-related events; (3) a 3-D data interpolation approach which captures both high-resolution shallow geometries and instances of slab rollback and overlap at depth; and (4) an algorithm which incorporates uncertainties of contributing datasets to identify the most probable surface depth over the extent of each subduction zone. Further layers will also be added to the base geometry dataset, such as historic moment release, earthquake tectonic provenance, and interface coupling. Along with access to several queryable data formats, all components have been wrapped into an open source library in Python, such that suites of updated models can be released as further data become available. This presentation will discuss the extent of Slab2 development, as well as the current availability of the model and modeling tools.
Using spectral imaging for the analysis of abnormalities for colorectal cancer: When is it helpful?
Awan, Ruqayya; Al-Maadeed, Somaya; Al-Saady, Rafif
2018-01-01
The spectral imaging technique has been shown to provide more discriminative information than RGB images and has been proposed for a range of problems. Many studies demonstrate its potential for the analysis of histopathology images for abnormality detection, but there have been discrepancies among previous studies as well. Many multispectral-based methods have been proposed for histopathology images, but the significance of using the whole multispectral cube versus a subset of bands or a single band is still debatable. We performed a comprehensive analysis using individual bands and different subsets of bands to assess the effectiveness of spectral information for identifying abnormalities in colorectal images. Our multispectral colorectal dataset consists of four classes, each represented by infra-red spectrum bands in addition to the visual spectrum bands. We performed our analysis of spectral imaging by stratifying the abnormalities using both spatial and spectral information. For our experiments, we used a combination of texture descriptors with an ensemble classification approach that performed best on our dataset. We applied our method to another dataset and obtained results comparable with those of the state-of-the-art method and a convolutional neural network based method. Moreover, we explored the relationship between the number of bands and problem complexity and found that a higher number of bands is required for a complex task to achieve improved performance. Our results demonstrate a synergy between the infra-red and visual spectra, improving the classification accuracy by 6% when the infra-red representation is incorporated. We also highlight the importance of how the dataset is divided into training and testing sets when evaluating histopathology image-based approaches, which has not been considered in previous studies on multispectral histopathology images.
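As a sketch of the band-subset analysis described above, the following code scores subsets of bands with a random forest ensemble; crude per-band statistics stand in for the paper's texture descriptors, and the toy data are synthetic.

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def per_band_features(cube):
    """Crude per-band descriptors (mean and std of each band per patch).
    cube: (n_patches, height, width, n_bands) -> (n_patches, n_bands, 2)"""
    return np.stack([cube.mean(axis=(1, 2)), cube.std(axis=(1, 2))], axis=-1)

def score_band_subset(feats, y, bands):
    """Cross-validated accuracy using only the selected bands."""
    X = feats[:, bands, :].reshape(len(feats), -1)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

# Toy multispectral patches: 120 patches, 16x16 pixels, 6 bands, 4 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 30)
cube = rng.normal(size=(120, 16, 16, 6)) + y[:, None, None, None] * 0.3
feats = per_band_features(cube)

for k in (1, 3, 6):   # single band, subset of bands, full "cube"
    best = max(combinations(range(6), k),
               key=lambda b: score_band_subset(feats, y, list(b)))
    print(k, best, round(score_band_subset(feats, y, list(best)), 3))
```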
The interfacial character of antibody paratopes: analysis of antibody-antigen structures.
Nguyen, Minh N; Pradhan, Mohan R; Verma, Chandra; Zhong, Pingyu
2017-10-01
In this study, computational methods are applied to investigate the general properties of the antigen-engaging residues of a paratope from a non-redundant dataset of 403 antibody-antigen complexes, to dissect the contributions of hydrogen bonds, hydrophobic and van der Waals contacts, and ionic interactions, as well as the role of water molecules at the antigen-antibody interface. Consistent with previous reports using smaller datasets, we found that Tyr, Trp, Ser, Asn, Asp, Thr, Arg, Gly, and His contribute substantially to the interactions between antibody and antigen. Furthermore, antibody-antigen interactions can be mediated by interfacial waters. However, there is no reported comprehensive analysis for a large number of structured waters that engage in higher-order structures at the antibody-antigen interface. From our dataset, we have found the presence of interfacial waters in 242 complexes. We present evidence that suggests a compelling role of these interfacial waters in interactions of antibodies with a range of antigens differing in shape complementarity. Finally, we carried out 296,835 pairwise 3D structure comparisons of 771 structures of contact residues of antibodies with their interfacial water molecules from our dataset using the CLICK method. A heuristic clustering algorithm was used to obtain unique structural similarities, separating the structures into 368 different clusters. These clusters are used to identify structural motifs of contact residues of antibodies for epitope binding. This clustering database of contact residues is freely accessible at http://mspc.bii.a-star.edu.sg/minhn/pclick.html. Contact: minhn@bii.a-star.edu.sg, chandra@bii.a-star.edu.sg or zhong_pingyu@immunol.a-star.edu.sg. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
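A minimal sketch of clustering pairwise structural comparisons is shown below; CLICK itself is not reimplemented, and the dissimilarity matrix is a random stand-in for CLICK-derived scores.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Assume an (n x n) matrix of pairwise structural dissimilarities
# (e.g., 1 - structure-overlap score); random toy values here.
rng = np.random.default_rng(0)
n = 40
d = rng.uniform(0.2, 1.0, size=(n, n))
d = (d + d.T) / 2                      # symmetrize
np.fill_diagonal(d, 0.0)

# Average-linkage hierarchical clustering, cut at a dissimilarity threshold.
Z = linkage(squareform(d), method="average")
labels = fcluster(Z, t=0.6, criterion="distance")
print(len(np.unique(labels)), "clusters")
```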
Analyzing and synthesizing phylogenies using tree alignment graphs.
Smith, Stephen A; Brown, Joseph W; Hinchliff, Cody E
2013-01-01
Phylogenetic trees are used to analyze and visualize evolution. However, trees can be imperfect datatypes when summarizing multiple trees. This is especially problematic when accommodating biological phenomena such as horizontal gene transfer, incomplete lineage sorting, and hybridization, as well as topological conflict between datasets. Additionally, researchers may want to combine information from sets of trees that have partially overlapping taxon sets. To address the problem of analyzing sets of trees with conflicting relationships and partially overlapping taxon sets, we introduce methods for aligning, synthesizing and analyzing rooted phylogenetic trees within a graph, called a tree alignment graph (TAG). The TAG can be queried and analyzed to explore uncertainty and conflict. It can also be synthesized to construct trees, presenting an alternative to supertree approaches. We demonstrate these methods with two empirical datasets. In order to explore uncertainty, we constructed a TAG of the bootstrap trees from the Angiosperm Tree of Life project. Analysis of the resulting graph demonstrates that areas of the dataset that are unresolved in majority-rule consensus tree analyses can be understood in more detail within the context of a graph structure, using measures incorporating node degree and adjacency support. As an exercise in synthesis (i.e., summarization of a TAG constructed from the alignment trees), we also construct a TAG consisting of the taxonomy and source trees from a recent comprehensive bird study. We synthesized this graph into a tree that can be reconstructed in a repeatable fashion and where the underlying source information can be updated. The methods presented here are tractable for large-scale analyses and serve as a basis for an alternative to consensus tree and supertree methods. Furthermore, the exploration of these graphs can expose structures and patterns within the dataset that are otherwise difficult to observe.
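The following toy sketch captures the core TAG idea of merging nodes from different rooted trees: nodes are keyed by the taxon set they subtend, and per-node support counts how many input trees contain that node. The trees and representation are illustrative simplifications, not the authors' implementation.

```python
from collections import defaultdict

# Each rooted tree node is identified by the frozenset of taxa it subtends;
# identical taxon sets from different trees merge into one graph node.
trees = [
    {"root": frozenset("ABCD"),
     "edges": [(frozenset("ABCD"), frozenset("AB")),
               (frozenset("ABCD"), frozenset("CD"))]},
    {"root": frozenset("ABC"),
     "edges": [(frozenset("ABC"), frozenset("AB"))]},
]

support = defaultdict(int)     # how many trees contain each node
children = defaultdict(set)    # aligned parent -> child edges
for tree in trees:
    nodes = {tree["root"]} | {n for e in tree["edges"] for n in e}
    for node in nodes:
        support[node] += 1
    for parent, child in tree["edges"]:
        children[parent].add(child)

for node, cnt in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(node), "in", cnt, "trees ->", [sorted(c) for c in children[node]])
```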
Chen, Zhenyu; Li, Jianping; Wei, Liwei
2007-10-01
Recently, gene expression profiling using microarray techniques has been shown to be a promising tool for improving the diagnosis and treatment of cancer. Gene expression data contain a high level of noise and an overwhelming number of genes relative to the number of available samples, which poses a great challenge for machine learning and statistical techniques. Support vector machines (SVMs) have been successfully used to classify gene expression data from cancer tissue. In the medical field, it is crucial to deliver to the user a transparent decision process, and how to explain the computed solutions and present the extracted knowledge has become a main obstacle for SVMs. A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVMs. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple parameter learning problem, and a shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We propose a novel rule extraction approach using the information provided by the separating hyperplane and support vectors to improve the generalization capacity and comprehensibility of the rules and to reduce the computational complexity. Two public gene expression datasets, the leukemia dataset and the colon tumor dataset, are used to demonstrate the performance of this approach. Using a small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% on both datasets. Moreover, very simple rules with linguistic labels are extracted.
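The 1-norm sparsity idea can be illustrated with an L1-penalized linear SVM, which drives most gene weights to exactly zero; this scikit-learn sketch is a stand-in for the paper's LP formulation, and the expression data are synthetic.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler

# Toy expression matrix: 60 samples x 500 genes, 5 truly informative genes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 500))
X[:, :5] += y[:, None] * 1.5

# The L1 penalty zeroes most gene weights; surviving nonzero weights
# act as the selected features.
X = StandardScaler().fit_transform(X)
clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=10000).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])
print("selected genes:", selected, "accuracy:", clf.score(X, y))
```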
The National Hydrography Dataset
1999-01-01
The National Hydrography Dataset (NHD) is a newly combined dataset that provides hydrographic data for the United States. The NHD is the culmination of recent cooperative efforts of the U.S. Environmental Protection Agency (USEPA) and the U.S. Geological Survey (USGS). It combines elements of USGS digital line graph (DLG) hydrography files and the USEPA Reach File (RF3). The NHD supersedes RF3 and DLG files by incorporating them, not by replacing them. Users of RF3 or DLG files will find the same data in a new, more flexible format. They will find that the NHD is familiar but greatly expanded and refined. The DLG files contribute a national coverage of millions of features, including water bodies such as lakes and ponds, linear water features such as streams and rivers, and also point features such as springs and wells. These files provide standardized feature types, delineation, and spatial accuracy. From RF3, the NHD acquires hydrographic sequencing, upstream and downstream navigation for modeling applications, and reach codes. The reach codes provide a way to integrate data from organizations at all levels by linking the data to this nationally consistent hydrographic network. The feature names are from the Geographic Names Information System (GNIS). The NHD provides comprehensive coverage of hydrographic data for the United States. Some of the anticipated end-user applications of the NHD are multiuse hydrographic modeling and water-quality studies of fish habitats. Although based on 1:100,000-scale data, the NHD is planned so that it can incorporate and encourage the development of the higher resolution data that many users require. The NHD can be used to promote the exchange of data between users at the national, State, and local levels. Many users will benefit from the NHD and will want to contribute to the dataset as well.
SEMIPARAMETRIC QUANTILE REGRESSION WITH HIGH-DIMENSIONAL COVARIATES
Zhu, Liping; Huang, Mian; Li, Runze
2012-01-01
This paper is concerned with quantile regression for a semiparametric regression model, in which both the conditional mean and the conditional variance function of the response given the covariates admit a single-index structure. This semiparametric regression model enables us to reduce the dimension of the covariates while retaining the flexibility of nonparametric regression. Under mild conditions, we show that simple linear quantile regression offers a consistent estimate of the index parameter vector. This is a surprising and interesting result because the single-index model is possibly misspecified under the linear quantile regression. With a root-n consistent estimate of the index vector, one may employ a local polynomial regression technique to estimate the conditional quantile function. This procedure is computationally efficient, which is very appealing in high-dimensional data analysis. We show that the resulting estimator of the quantile function performs asymptotically as efficiently as if the true value of the index vector were known. The methodologies are demonstrated through comprehensive simulation studies and an application to a real dataset. PMID:24501536
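A numerical illustration of the two-step procedure, under assumed simulated data: a linear quantile regression estimates the index direction, and a kernel-weighted local fit (a least-squares stand-in for the local polynomial quantile step) estimates the function along the index. The bandwidth and data-generating model are arbitrary choices for the example.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a single-index model: quantiles of y depend on x only via x @ beta.
rng = np.random.default_rng(0)
n, p = 500, 5
beta = np.array([1.0, 0.5, -0.5, 0.0, 0.0])
X = rng.normal(size=(n, p))
index = X @ beta
y = np.sin(index) + (0.5 + 0.2 * index ** 2) * rng.normal(size=n)

# Step 1: linear quantile regression gives a consistent index direction.
fit = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.5)
b_hat = fit.params[1:]
b_hat /= np.linalg.norm(b_hat)       # direction is identified only up to scale
u = X @ b_hat

# Step 2: kernel-weighted local linear fit along the estimated index.
def local_linear(u0, u, y, h=0.4):
    w = np.exp(-0.5 * ((u - u0) / h) ** 2)        # Gaussian kernel weights
    coef = np.polyfit(u - u0, y, deg=1, w=np.sqrt(w))
    return coef[-1]                               # intercept = fit at u0

grid = np.linspace(-2, 2, 9)
print([round(local_linear(g, u, y), 2) for g in grid])
```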
The evolution of parental cooperation in birds.
Remeš, Vladimír; Freckleton, Robert P; Tökölyi, Jácint; Liker, András; Székely, Tamás
2015-11-03
Parental care is one of the most variable social behaviors and it is an excellent model system for understanding cooperation between unrelated individuals. Three major hypotheses have been proposed to explain the extent of parental cooperation: sexual selection, social environment, and environmental harshness. Using the most comprehensive dataset on parental care, which includes 659 bird species from 113 families covering both uniparental and biparental taxa, we show that the degree of parental cooperation is associated with both sexual selection and social environment. Consistent with recent theoretical models, parental cooperation decreases with the intensity of sexual selection and with skewed adult sex ratios. These effects are additive and robust to the influence of life-history variables. However, parental cooperation is unrelated to environmental factors (measured at the scale of whole species ranges), as indicated by the lack of a consistent relationship with ambient temperature, rainfall or their fluctuations within and between years. These results highlight the significance of social effects for parental cooperation and suggest that several parental strategies may coexist in a given ambient environment.
Boyd, Philip W.; Rynearson, Tatiana A.; Armstrong, Evelyn A.; Fu, Feixue; Hayashi, Kendra; Hu, Zhangxi; Hutchins, David A.; Kudela, Raphael M.; Litchman, Elena; Mulholland, Margaret R.; Passow, Uta; Strzepek, Robert F.; Whittaker, Kerry A.; Yu, Elizabeth; Thomas, Mridul K.
2013-01-01
“It takes a village to finish (marine) science these days” (paraphrased from Curtis Huttenhower, the Human Microbiome Project). The rapidity and complexity of climate change and its potential effects on ocean biota are challenging how ocean scientists conduct research. One way in which we can begin to better tackle these challenges is to conduct community-wide scientific studies. This study provides physiological datasets fundamental to understanding functional responses of phytoplankton growth rates to temperature. While physiological experiments are not new, our experiments were conducted in many laboratories using agreed-upon protocols and 25 strains of eukaryotic and prokaryotic phytoplankton isolated across a wide range of marine environments, from polar to tropical, and from nearshore waters to the open ocean. This community-wide approach provides both comprehensive and internally consistent datasets produced over considerably shorter time scales than conventional individual and often uncoordinated lab efforts. Such datasets can be used to parameterise global ocean model projections of environmental change and to provide initial insights into the magnitude of regional biogeographic change in ocean biota in the coming decades. Here, we compare our datasets with a compilation of literature data on phytoplankton growth responses to temperature. A comparison with prior published data suggests that the optimal temperatures of individual species and, to a lesser degree, thermal niches were similar across studies. However, a comparison of the maximum growth rate across studies revealed significant departures between this and previously collected datasets, which may be due to differences in the cultured isolates, temporal changes in the clonal isolates in cultures, and/or differences in culture conditions. Such methodological differences mean that using particular trait measurements from the prior literature might introduce unknown errors and bias into modelling projections. Using our community-wide approach we can reduce such protocol-driven variability in culture studies, and can begin to address more complex issues such as the effect of multiple environmental drivers on ocean biota. PMID:23704890
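As an illustration of fitting a thermal performance curve to growth-rate data, the sketch below uses an assumed Gaussian response form (not necessarily the form used in the study) with synthetic measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def thermal_curve(T, mu_max, T_opt, width):
    """Gaussian thermal performance curve (an assumed functional form):
    growth peaks at T_opt and falls off with niche width `width`."""
    return mu_max * np.exp(-((T - T_opt) / width) ** 2)

# Toy growth-rate measurements for one strain across temperatures (°C).
T = np.array([5, 10, 15, 20, 25, 28, 31], dtype=float)
mu = np.array([0.18, 0.35, 0.62, 0.88, 1.02, 0.75, 0.20])

(mu_max, T_opt, width), _ = curve_fit(thermal_curve, T, mu, p0=[1.0, 22.0, 8.0])
print(f"mu_max={mu_max:.2f}/d, T_opt={T_opt:.1f}°C, niche width≈{width:.1f}°C")
```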
Fan, Qiuyun; Nummenmaa, Aapo; Wichtmann, Barbara; Witzel, Thomas; Mekkaoui, Choukri; Schneider, Walter; Wald, Lawrence L; Huang, Susie Y
2018-06-01
We provide a comprehensive diffusion MRI dataset acquired with a novel biomimetic phantom mimicking human white matter. The fiber substrates in the diffusion phantom were constructed from hollow textile axons ("taxons") with an inner diameter of 11.8±1.2 µm and outer diameter of 33.5±2.3 µm. Data were acquired on the 3 T CONNECTOM MRI scanner with multiple diffusion times and multiple q-values per diffusion time, which is a dedicated acquisition for validation of microstructural imaging methods, such as compartment size and volume fraction mapping. Minimal preprocessing was performed to correct for susceptibility and eddy current distortions. Data were deposited in the XNAT Central database (project ID: dMRI_Phant_MGH).
Weidner, Christopher; Fischer, Cornelius; Sauer, Sascha
2014-12-01
We introduce PHOXTRACK (PHOsphosite-X-TRacing Analysis of Causal Kinases), a user-friendly freely available software tool for analyzing large datasets of post-translational modifications of proteins, such as phosphorylation, which are commonly gained by mass spectrometry detection. In contrast to other currently applied data analysis approaches, PHOXTRACK uses full sets of quantitative proteomics data and applies non-parametric statistics to calculate whether defined kinase-specific sets of phosphosite sequences indicate statistically significant concordant differences between various biological conditions. PHOXTRACK is an efficient tool for extracting post-translational information of comprehensive proteomics datasets to decipher key regulatory proteins and to infer biologically relevant molecular pathways. PHOXTRACK will be maintained over the next years and is freely available as an online tool for non-commercial use at http://phoxtrack.molgen.mpg.de. Users will also find a tutorial at this Web site and can additionally give feedback at https://groups.google.com/d/forum/phoxtrack-discuss. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
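The flavor of a set-based, non-parametric test on phosphosite fold-changes can be sketched as follows; the kinase-substrate sets and the statistic (Mann-Whitney U) are illustrative assumptions, not PHOXTRACK's exact method.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Phosphosite log2 fold-changes between two conditions (toy values), plus an
# assumed mapping from kinases to their substrate phosphosites.
rng = np.random.default_rng(0)
fold_changes = dict((f"site{i}", rng.normal()) for i in range(300))
kinase_sets = {
    "AKT1": [f"site{i}" for i in range(0, 25)],
    "CDK1": [f"site{i}" for i in range(25, 60)],
}
for site in kinase_sets["AKT1"]:
    fold_changes[site] += 1.0      # simulate concerted AKT1 substrate shift

# Non-parametric test: do a kinase's substrates shift relative to the rest?
for kinase, sites in kinase_sets.items():
    in_set = np.array([fold_changes[s] for s in sites])
    rest = np.array([v for s, v in fold_changes.items() if s not in set(sites)])
    stat, p = mannwhitneyu(in_set, rest, alternative="two-sided")
    print(f"{kinase}: U={stat:.0f}, p={p:.2e}")
```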
Climate Model Diagnostic Analyzer
NASA Technical Reports Server (NTRS)
Lee, Seungwon; Pan, Lei; Zhai, Chengxing; Tang, Benyang; Kubar, Terry; Zhang, Zia; Wang, Wei
2015-01-01
The comprehensive and innovative evaluation of climate models with newly available global observations is critically needed for the improvement of climate model current-state representation and future-state predictability. A climate model diagnostic evaluation process requires physics-based multi-variable analyses that typically involve large-volume and heterogeneous datasets, making them both computation- and data-intensive. Given the exploratory nature of climate data analyses and the explosive growth of datasets and service tools, scientists are struggling to keep track of their datasets, tools, and execution/study history, let alone share them with others. In response, we have developed a cloud-enabled, provenance-supported, web-service system called the Climate Model Diagnostic Analyzer (CMDA). CMDA enables physics-based, multivariable model performance evaluations and diagnoses through the comprehensive and synergistic use of multiple observational data, reanalysis data, and model outputs. At the same time, CMDA provides a crowd-sourcing space where scientists can organize their work efficiently and share it with others. CMDA is empowered by many current state-of-the-art software packages in web service, provenance, and semantic search.
Dose coverage calculation using a statistical shape model—applied to cervical cancer radiotherapy
NASA Astrophysics Data System (ADS)
Tilly, David; van de Schoot, Agustinus J. A. J.; Grusell, Erik; Bel, Arjan; Ahnesjö, Anders
2017-05-01
A comprehensive methodology for treatment simulation and evaluation of dose coverage probabilities is presented, in which a population-based statistical shape model (SSM) provides samples of fraction-specific deformations of the patient geometry. The learning data consist of vector fields from deformable image registration of repeated imaging, giving intra-patient deformations which are mapped to an average patient serving as a common frame of reference. The SSM is created by extracting the most dominant eigenmodes through principal component analysis of the deformations from all patients. Sampling a deformation is thus reduced to sampling weights for enough of the most dominant eigenmodes to describe the deformations. For the cervical cancer patient datasets in this work, we found seven eigenmodes to be sufficient to capture 90% of the variance in the deformations, and only three eigenmodes were needed for stability in the simulated dose coverage probabilities. The normality assumption on the eigenmode weights was tested and found to hold for the 20 most dominant eigenmodes except for the first. Individualization of the SSM is demonstrated to improve using two deformation samples from a new patient. The probabilistic evaluation provided additional information about the trade-offs compared to conventional single-dataset treatment planning.
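A minimal sketch of building such an SSM and sampling from it, assuming synthetic flattened deformation fields: PCA via SVD yields the eigenmodes, and a new deformation is the mean plus normally distributed mode weights.

```python
import numpy as np

# Toy SSM of deformations: rows are flattened deformation vector fields
# mapped to a common reference; the values here are synthetic.
rng = np.random.default_rng(0)
n_obs, n_dof = 40, 3 * 500          # 40 registrations, 500 voxels x 3 components
D = rng.normal(size=(n_obs, n_dof)) * rng.uniform(0.1, 2.0, n_dof)

mean = D.mean(axis=0)
U, s, Vt = np.linalg.svd(D - mean, full_matrices=False)   # PCA via SVD
var = s ** 2 / (n_obs - 1)                                # per-mode variance
k = int(np.searchsorted(np.cumsum(var) / var.sum(), 0.90)) + 1
print(f"{k} eigenmodes capture 90% of the variance")

# Sample a fraction-specific deformation: mean + weighted eigenmodes, with
# weights drawn from per-mode normal distributions (the SSM assumption).
w = rng.normal(scale=np.sqrt(var[:k]))
sample = mean + w @ Vt[:k]
print(sample.shape)
```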
NASA Technical Reports Server (NTRS)
Vila, Daniel; deGoncalves, Luis Gustavo; Toll, David L.; Rozante, Jose Roberto
2008-01-01
This paper describes a comprehensive assessment of a new high-resolution, high-quality gauge-satellite-based analysis of daily precipitation over continental South America during 2004. The methodology is based on a combination of additive and multiplicative bias correction schemes in order to obtain the lowest bias when compared with the observed values. Intercomparison and cross-validation tests have been carried out for the control algorithm (the TMPA real-time algorithm) and different merging schemes: additive bias correction (ADD), ratio bias correction (RAT) and the TMPA research version, for months belonging to different seasons and for different network densities. All compared merging schemes produce better results than the control algorithm, but when finer temporal (daily) and spatial scale (regional network) gauge datasets are included in the analysis, the improvement is remarkable. The Combined Scheme (CoSch) consistently presents the best performance among the five techniques. This is also true when a degraded daily gauge network is used instead of the full dataset. This technique appears to be a suitable tool for producing real-time, high-resolution, high-quality gauge-satellite-based analyses of daily precipitation over land in regional domains.
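The additive (ADD) and ratio (RAT) corrections can be sketched as follows; for brevity the sketch corrects point values with scalar offsets/ratios, whereas the actual schemes operate on gridded fields.

```python
import numpy as np

def additive_correction(sat, gauge):
    """ADD scheme: shift the satellite estimate by the mean gauge offset."""
    return sat + (gauge.mean() - sat.mean())

def ratio_correction(sat, gauge, eps=1e-6):
    """RAT scheme: rescale the satellite estimate by the gauge/satellite ratio."""
    return sat * (gauge.mean() / max(sat.mean(), eps))

# Toy daily values at gauge locations (mm/day).
rng = np.random.default_rng(0)
gauge = rng.gamma(2.0, 4.0, size=200)
sat = 0.7 * gauge + rng.normal(0, 2.0, size=200)   # biased, noisy estimate

for name, corr in [("raw", sat),
                   ("ADD", additive_correction(sat, gauge)),
                   ("RAT", ratio_correction(sat, gauge))]:
    print(f"{name}: bias={np.mean(corr - gauge):+.2f} mm/day")
```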
Simulation of Smart Home Activity Datasets
Synnott, Jonathan; Nugent, Chris; Jeffers, Paul
2015-01-01
A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation. PMID:26087371
Data on the interaction between thermal comfort and building control research.
Park, June Young; Nagy, Zoltan
2018-04-01
This dataset contains bibliographic information on thermal comfort and building control research. In addition, instructions for a data-driven literature survey method guide readers in reproducing their own literature survey on related bibliographic datasets. Based on specific search terms, all relevant bibliographic datasets were downloaded. We explain the keyword co-occurrences of historical developments and recent trends, and the citation network which represents the interaction between thermal comfort and building control research. Results and discussions are described in the research article entitled "Comprehensive analysis of the relationship between thermal comfort and building control research - A data-driven literature review" (Park and Nagy, 2018).
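A keyword co-occurrence network of the kind described can be built by counting keyword pairs per record, as in this toy sketch (the records and keyword field are hypothetical).

```python
from collections import Counter
from itertools import combinations

# Toy bibliography records: author keywords per paper (assumed field).
records = [
    ["thermal comfort", "PMV", "HVAC control"],
    ["thermal comfort", "model predictive control", "HVAC control"],
    ["occupant behavior", "thermal comfort", "machine learning"],
]

# Co-occurrence network: edge weight = number of papers sharing both keywords.
edges = Counter()
for kws in records:
    for a, b in combinations(sorted(set(kws)), 2):
        edges[(a, b)] += 1

for (a, b), w in edges.most_common(5):
    print(f"{a} -- {b}: {w}")
```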
V-FOR-WaTer - a new virtual research environment for environmental research
NASA Astrophysics Data System (ADS)
Strobl, Marcus; Azmi, Elnaz; Hassler, Sibylle; Mälicke, Mirko; Meyer, Jörg; Zehe, Erwin
2017-04-01
The preparation of heterogeneous datasets for scientific analysis is still a demanding task. Data preprocessing for hydrological models typically involves gathering datasets from different sources, extensive work within geoinformation systems, data transformation, the generation of computational grids, and the definition of initial and boundary conditions. V-FOR-WaTer, a standardized and scalable data hub with compatible analysis tools, will ease comprehensive studies and significantly reduce data preparation time. The idea behind V-FOR-WaTer is to bring together various datasets (e.g. point measurements, 2D/3D data, time series data) from different sources (e.g. gathered in research projects, or as part of regular monitoring by state offices) and to provide common as well as innovative scaling tools in space and time to generate a coherent data grid. Each dataset holds detailed standardized metadata to ensure usability of the data, offer a comprehensive search function, and provide reference information for appropriate citation of the dataset creators. V-FOR-WaTer includes a base of data and tools, but its purpose is to grow as users extend the virtual research environment with their own tools and research data. Researchers who upload new data or tools can receive a digital object identifier, or protect their data and tools from others until publication. Access to the data and tools provided by V-FOR-WaTer is via an easy-to-use web portal. Due to its modular architecture, the portal is ready to be extended with new tools and features and also offers interfaces to Matlab, Python and R.
Hemsworth, David; Baregheh, Anahita; Aoun, Samar; Kazanjian, Arminee
2018-02-01
This study conducted a comprehensive analysis of the psychometric properties of ProQOL 5, a professional quality of life instrument, among nurses and palliative care workers on the basis of three independent datasets. The goal was to assess the general applicability of this instrument across multiple populations. Although the ProQOL scale has been widely adopted, few attempts have been made to thoroughly analyze this instrument across multiple datasets using multiple populations. A questionnaire was developed and distributed to palliative care workers in Canada and nurses at two hospitals in Australia and Canada; this resulted in 273 responses from the Australian nurses, 303 from the Canadian nurses, and 503 from the Canadian palliative care workers. A comprehensive psychometric property analysis was conducted, including inter-item correlations, tests of reliability, and convergent, discriminant and construct validity analyses. In addition, to test for reverse-coding artifacts in the burnout (BO) scale, exploratory factor analysis was adopted. The psychometric properties of ProQOL 5 were satisfactory for the compassion satisfaction construct. However, there are concerns with respect to the burnout and secondary traumatic stress scales, and recommendations are made regarding the coding and specific items which should improve the reliability and validity of these scales. This research establishes the strengths and weaknesses of the ProQOL instrument and demonstrates how it can be improved. Through specific recommendations, the academic community is invited to revise the burnout and secondary traumatic stress scales in an effort to improve ProQOL 5 measures. Copyright © 2017. Published by Elsevier Inc.
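Reliability testing of the kind reported typically includes Cronbach's alpha; the sketch below computes it for a synthetic 10-item scale (the data are not from the study).

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()     # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total score
    return k / (k - 1) * (1 - item_var / total_var)

# Toy 10-item scale answered by 200 respondents, driven by one latent trait.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
scores = trait + rng.normal(scale=0.8, size=(200, 10))
print(round(cronbach_alpha(scores), 2))
```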
Does a global DNA barcoding gap exist in Annelida?
Kvist, Sebastian
2016-05-01
Accurate identification of unknown specimens by means of DNA barcoding is contingent on the presence of a DNA barcoding gap, among other factors, as its absence may result in dubious specimen identifications - false negatives or positives. Whereas the utility of DNA barcoding would be greatly reduced in the absence of a distinct and sufficiently sized barcoding gap, the limits of intraspecific and interspecific distances are seldom thoroughly inspected across comprehensive sampling. The present study aims to illuminate this aspect of barcoding in a comprehensive manner for the animal phylum Annelida. All cytochrome c oxidase subunit I sequences (cox1 gene; the chosen region for zoological DNA barcoding) present in GenBank for Annelida, as well as for "Polychaeta", "Oligochaeta", and Hirudinea separately, were downloaded and curated for length, coverage and potential contamination. The final datasets consisted of 9782 (Annelida), 5545 ("Polychaeta"), 3639 ("Oligochaeta"), and 598 (Hirudinea) cox1 sequences and these were either (i) used as is in an automated global barcoding gap detection analysis or (ii) further analyzed for genetic distances, separated into bins containing intraspecific and interspecific comparisons and plotted in a graph to visualize any potential global barcoding gap. Over 70 million pairwise genetic comparisons were made and the results suggest that although there is a tendency towards separation, no distinct or sufficiently sized global barcoding gap exists in any of the datasets, rendering future barcoding efforts at risk of erroneous specimen identifications (but local barcoding gaps may still exist, allowing for the identification of specimens at lower taxonomic ranks). This seems to be especially true for earthworm taxa, which account for fully 35% of the total number of interspecific comparisons that show 0% divergence.
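Checking for a barcoding gap amounts to comparing the largest intraspecific distance with the smallest interspecific distance; the sketch below does this on a synthetic distance matrix.

```python
import numpy as np

def barcoding_gap(dist, species):
    """Split pairwise distances into intra- and interspecific sets and
    return (max intraspecific, min interspecific)."""
    n = len(species)
    iu = np.triu_indices(n, k=1)
    same = species[iu[0]] == species[iu[1]]
    intra, inter = dist[iu][same], dist[iu][~same]
    return intra.max(), inter.min()

# Toy pairwise p-distance matrix for 6 specimens of 3 species.
rng = np.random.default_rng(0)
species = np.array(["a", "a", "b", "b", "c", "c"])
base = rng.uniform(0.08, 0.20, size=(6, 6))
dist = (base + base.T) / 2
for i in range(0, 6, 2):                    # small intraspecific distances
    dist[i, i + 1] = dist[i + 1, i] = rng.uniform(0.0, 0.02)
np.fill_diagonal(dist, 0.0)

max_intra, min_inter = barcoding_gap(dist, species)
print(f"max intra={max_intra:.3f}, min inter={min_inter:.3f}, "
      f"gap={'yes' if min_inter > max_intra else 'no'}")
```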
Pongor, Lőrinc S; Vera, Roberto; Ligeti, Balázs
2014-01-01
Next generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community have to balance between speed and sensitivity and, as a result, species- or strain-level identification is often inaccurate and low-abundance pathogens can sometimes be missed. We have developed Taxoner, an open source taxon assignment pipeline that includes a fast aligner (e.g., Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than, BLAST, but requires two orders of magnitude less running time, meaning that it can be run on desktop or laptop computers. Taxoner is slower than approaches that use small marker databases but is more sensitive due to the comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain-level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner. PMID:25077800
GRRATS: A New Approach to Inland Altimetry Processing for Major World Rivers
NASA Astrophysics Data System (ADS)
Coss, S. P.
2016-12-01
Here we present work-in-progress results aimed at generating a new radar altimetry dataset, GRRATS (Global River Radar Altimetry Time Series), extracted over global ocean-draining rivers wider than 900 m. GRRATS was developed as a component of the NASA MEaSUREs project (PI: Dennis Lettenmaier, UCLA) to generate pre-SWOT data products for decadal or longer global river elevation changes from multi-mission satellite radar altimetry data. The dataset at present includes 909 time series from 39 rivers. A new method of filtering VS (virtual station) height time series is presented, in which DEM-based heights are used to establish limits for the ice1-retracked Jason-2 and Envisat heights. While GRRATS is following in the footsteps of several predecessors, it contributes to one of the critical climate data records by generating a validated and comprehensive hydrologic record of river heights. The current data product includes VSs in North and South America, Africa and Eurasia, with the most comprehensive set of Jason-2 and Envisat RA time series available for North America and Eurasia. We present a semi-automated procedure to process returns from river locations, identified with Landsat images and an updated water mask extent. Consistent methodologies for flagging ice cover are presented. DEM heights used in height filtering were retained and can be used as river height profiles. All non-validated VSs have been assigned a letter grade A-D to aid end users in data selection. Validated VSs are accompanied by a suite of fit statistics. Due to the inclusiveness of the dataset, not all VSs could undergo validation (415 of 909), but those that did demonstrate that confidence in the data product is warranted. Validation was accomplished using records from 45 in situ gauges from 12 rivers. Meta-analysis was performed to compare each gauge with each VS by relative height. Preliminary validation results are as follows: 89.3% of the data have positive Nash-Sutcliffe Efficiency (NSE) values, and the median NSE value is 0.73. The median standard deviation of error (STDE) is 0.92 m. GRRATS will soon be publicly available in NetCDF format with CF-compliant metadata.
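The two validation statistics quoted above can be computed as follows; the virtual-station and gauge series here are synthetic.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 is perfect, <0 is worse than the mean."""
    return 1 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def stde(sim, obs):
    """Standard deviation of the error between two height series."""
    return np.std(sim - obs, ddof=1)

# Toy comparison of a VS height time series against a nearby gauge (m).
rng = np.random.default_rng(0)
gauge = 10 + 2 * np.sin(np.linspace(0, 4 * np.pi, 120))   # seasonal stage
vs = gauge + rng.normal(0, 0.5, size=120)                 # altimetry heights
print(f"NSE={nse(vs, gauge):.2f}, STDE={stde(vs, gauge):.2f} m")
```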
Laituri, Tony R; Henry, Scott; El-Jawahri, Raed; Muralidharan, Nirmal; Li, Guosong; Nutt, Marvin
2015-11-01
A provisional, age-dependent thoracic risk equation (or "risk curve") was derived to estimate moderate-to-fatal injury potential (AIS2+) for men, with responses gauged by the advanced mid-sized male test dummy (THOR50). The derivation involved two distinct data sources: cases from real-world crashes (e.g., the National Automotive Sampling System, NASS) and cases involving post-mortem human subjects (PMHS). The derivation was therefore more comprehensive, as NASS datasets generally skew towards younger occupants and PMHS datasets generally skew towards older occupants. However, known deficiencies had to be addressed (e.g., the NASS cases had unknown stimuli, and the PMHS tests required transformation of known stimuli into THOR50 stimuli). For the NASS portion of the analysis, chest-injury outcomes for adult male drivers about the size of the THOR50 were collected from real-world, 11-1 o'clock, full-engagement frontal crashes (NASS, 1995-2012 calendar years, 1985-2012 model-year light passenger vehicles). The screening for THOR50-sized men involved application of a set of newly derived "correction" equations for self-reported height and weight data in NASS. Finally, THOR50 stimuli were estimated via field simulations involving attendant representative restraint systems, and those stimuli were then assigned to corresponding NASS cases (n=508). For the PMHS portion of the analysis, simulation-based closure equations were developed to convert PMHS stimuli into THOR50 stimuli. Specifically, closure equations were derived for the four measurement locations on the THOR50 chest by cross-correlating the results of matched-loading simulations between the test dummy and the age-dependent Ford Human Body Model. The resulting closure equations demonstrated acceptable fidelity (n=75 matched simulations, R2≥0.99). These equations were applied to the THOR50-sized men in the PMHS dataset (n=20). The NASS and PMHS datasets were combined and subjected to survival analysis with event-frequency weighting and arbitrary censoring. The resulting risk curve, a function of peak THOR50 chest compression and age, demonstrated acceptable fidelity for recovering the AIS2+ chest injury rate of the combined dataset (i.e., IR_dataset=1.97% vs. curve-based IR_dataset=1.98%). Additional sensitivity analyses showed that (a) binary logistic regression yielded a risk curve with nearly identical fidelity, (b) there was only a slight advantage to combining the small-sample PMHS dataset with the large-sample NASS dataset, (c) use of the PMHS-based risk curve for risk estimation of the combined dataset yielded relatively poor performance (194% difference), and (d) when controlling for the type of contact (lab-consistent or not), the resulting risk curves were similar.
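Sensitivity analysis (a) above used binary logistic regression; the following sketch fits such a risk curve as a function of chest compression and age on synthetic data (not NASS/PMHS values).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy injury dataset: peak chest compression (mm) and occupant age (years),
# with a binary AIS2+ outcome; all values are synthetic.
rng = np.random.default_rng(0)
n = 500
compression = rng.uniform(10, 60, n)
age = rng.uniform(20, 80, n)
logit = -9.0 + 0.12 * compression + 0.05 * age     # assumed true model
p_true = 1 / (1 + np.exp(-logit))
injured = rng.random(n) < p_true

model = LogisticRegression().fit(np.column_stack([compression, age]), injured)

# Risk curve: AIS2+ probability vs compression, evaluated at two ages.
grid = np.linspace(10, 60, 6)
for a in (30, 70):
    X = np.column_stack([grid, np.full_like(grid, a)])
    print(f"age {a}:", np.round(model.predict_proba(X)[:, 1], 3))
```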
Wu, Wei-Sheng; Jhou, Meng-Jhun
2017-01-13
Missing value imputation is important for microarray data analyses because microarray data with missing values would significantly degrade the performance of downstream analyses. Although many microarray missing value imputation algorithms have been developed, an objective and comprehensive performance comparison framework has been lacking. To solve this problem, we previously proposed a framework which can perform a comprehensive performance comparison of different existing algorithms, by which the performance of a new algorithm can also be evaluated. However, constructing our framework is not an easy task for interested researchers. To save researchers' time and effort, here we present an easy-to-use web tool named MVIAeval (Missing Value Imputation Algorithm evaluator) which implements our performance comparison framework. MVIAeval provides a user-friendly interface allowing users to upload the R code of their new algorithm and to select (i) the test datasets among 20 benchmark microarray (time series and non-time series) datasets, (ii) the compared algorithms among 12 existing algorithms, (iii) the performance indices from three existing ones, (iv) the comprehensive performance scores from two possible choices, and (v) the number of simulation runs. The comprehensive performance comparison results are then generated and shown as both figures and tables. MVIAeval is a useful tool for researchers to easily conduct a comprehensive and objective performance evaluation of their newly developed missing value imputation algorithm for microarray data, or for any data which can be represented in matrix form (e.g., NGS data or proteomics data). Thus, MVIAeval will greatly expedite progress in the research of missing value imputation algorithms.
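The core of such an evaluation framework is masking known entries, imputing them, and scoring recovery; this sketch uses NRMSE and a row-mean baseline imputer (choices assumed for illustration).

```python
import numpy as np

def evaluate_imputer(impute, X, missing_rate=0.1, seed=0):
    """Mask known entries, impute them, and score recovery with NRMSE."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < missing_rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    X_hat = impute(X_missing)
    err = X_hat[mask] - X[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(X[mask])

def row_mean_impute(X):
    """Baseline imputer: replace NaNs with their row (gene) mean."""
    X = X.copy()
    row_means = np.nanmean(X, axis=1)
    idx = np.where(np.isnan(X))
    X[idx] = row_means[idx[0]]
    return X

rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 30))          # toy genes x arrays matrix
print(f"NRMSE={evaluate_imputer(row_mean_impute, expr):.3f}")
```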
Closing the data gap: Creating an open data environment
NASA Astrophysics Data System (ADS)
Hester, J. R.
2014-02-01
Poor data management brought on by increasing volumes of complex data undermines both the integrity of the scientific process and the usefulness of datasets. Researchers should endeavour both to make their data citeable and to cite data whenever possible. The reusability of datasets is improved by community adoption of comprehensive metadata standards and public availability of reversibly reduced data. Where standards are not yet defined, as much information as possible about the experiment and samples should be preserved in datafiles written in a standard format.
NASA Astrophysics Data System (ADS)
Casson, David; Werner, Micha; Weerts, Albrecht; Schellekens, Jaap; Solomatine, Dimitri
2017-04-01
Hydrological modelling in the Canadian Sub-Arctic is hindered by the limited spatial and temporal coverage of local meteorological data. Local watershed modelling often relies on data from a sparse network of meteorological stations, with a rough density of 3 active stations per 100,000 km2. Global datasets hold great promise for application due to their more comprehensive spatial and extended temporal coverage. A key objective of this study is to demonstrate the application of global datasets and data assimilation techniques for hydrological modelling of a data-sparse, Sub-Arctic watershed. Application of available datasets and modelling techniques is currently limited in practice due to a lack of local capacity and understanding of available tools. Due to the importance of snow processes in the region, this study also aims to evaluate the performance of global SWE products for snowpack modelling. The Snare Watershed is a 13,300 km2 snowmelt-driven sub-basin of the Mackenzie River Basin, Northwest Territories, Canada. The Snare watershed is data sparse in terms of meteorological data but is well gauged, with consistent discharge records since the late 1970s. End-of-winter snowpack surveys have been conducted every year from 1978 to present. The application of global re-analysis datasets from the EU FP7 eartH2Observe project is investigated in this study. Precipitation data are taken from Multi-Source Weighted-Ensemble Precipitation (MSWEP) and temperature data from WATCH Forcing Data applied to European Reanalysis (ERA)-Interim data (WFDEI). GlobSnow-2 is a global Snow Water Equivalent (SWE) measurement product funded by the European Space Agency (ESA) and is also evaluated over the local watershed. Downscaled precipitation, temperature and potential evaporation datasets are used as forcing data in a distributed version of the HBV model implemented in the WFLOW framework. Results demonstrate the successful application of global datasets in local watershed modelling, but show that validation of actual frozen precipitation and snowpack conditions is very difficult. The distributed hydrological model shows good streamflow simulation performance based on statistical model evaluation techniques. Results are also promising for inter-annual variability, spring snowmelt onset and time to peak flows. It is expected that data assimilation of streamflow using an Ensemble Kalman Filter will further improve model performance. This study shows that global re-analysis datasets hold great potential for understanding the hydrology and snowpack dynamics of the expansive and data-sparse sub-Arctic. However, global SWE products will require further validation and algorithm improvements, particularly over boreal forest and lake-rich regions.
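The anticipated streamflow assimilation can be sketched with a stochastic Ensemble Kalman Filter analysis step; the two-element state and all values below are illustrative assumptions.

```python
import numpy as np

def enkf_update(ensemble, obs, obs_err, H):
    """Stochastic Ensemble Kalman Filter analysis step.

    ensemble: (n_members, n_state) forecast states
    obs:      observed value(s), here a single streamflow measurement
    obs_err:  observation error standard deviation
    H:        (n_obs, n_state) linear observation operator
    """
    n = ensemble.shape[0]
    rng = np.random.default_rng(0)
    Xf = ensemble - ensemble.mean(axis=0)
    P = Xf.T @ Xf / (n - 1)                          # ensemble covariance
    R = np.atleast_2d(obs_err ** 2)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
    perturbed = obs + rng.normal(0, obs_err, size=(n, H.shape[0]))
    return ensemble + (perturbed - ensemble @ H.T) @ K.T

# Toy example: 2-element state [storage, streamflow]; streamflow observed.
rng = np.random.default_rng(1)
ens = rng.normal([100.0, 8.0], [10.0, 2.0], size=(50, 2))
H = np.array([[0.0, 1.0]])
updated = enkf_update(ens, obs=10.0, obs_err=0.5, H=H)
print(updated.mean(axis=0))
```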
Big Data in Organ Transplantation: Registries and Administrative Claims
Massie, Allan B.; Kucirka, Lauren; Segev, Dorry L.
2015-01-01
The field of organ transplantation benefits from large, comprehensive, transplant-specific national datasets available to researchers. In addition to the widely-used OPTN-based registries (the UNOS and SRTR datasets) and USRDS datasets, there are other publicly available national datasets, not specific to transplantation, which have historically been underutilized in the field of transplantation. Of particular interest are the Nationwide Inpatient Sample (NIS) and State Inpatient Databases (SID), produced by the Agency for Healthcare Research and Quality (AHRQ). The United States Renal Data System (USRDS) database provides extensive data relevant to studies of kidney transplantation. Linkage of publicly available datasets to external data sources such as private claims or pharmacy data provides further resources for registry-based research. Although these resources can transcend some limitations of OPTN-based registry data, they come with their own limitations, which must be understood to avoid biased inference. This review discusses different registry-based data sources available in the United States, as well as the proper design and conduct of registry-based research. PMID:25040084
Pearlstine, Leonard; Higer, Aaron; Palaseanu, Monica; Fujisaki, Ikuko; Mazzotti, Frank
2007-01-01
The Everglades Depth Estimation Network (EDEN) is an integrated network of real-time water-level monitoring, ground-elevation modeling, and water-surface modeling that provides scientists and managers with current (2000-present), online water-stage and water-depth information for the entire freshwater portion of the Greater Everglades. Continuous daily spatial interpolations of the EDEN network stage data are presented on a 400-square-meter grid spacing. EDEN offers a consistent and documented dataset that can be used by scientists and managers to (1) guide large-scale field operations, (2) integrate hydrologic and ecological responses, and (3) support biological and ecological assessments that measure ecosystem responses to the implementation of the Comprehensive Everglades Restoration Plan (CERP). The target users are biologists and ecologists examining trophic level responses to hydrodynamic changes in the Everglades.
The economic demography of passenger intermodal transportation : opportunities and challenges.
DOT National Transportation Integrated Search
2015-12-01
The research on intermodal transportation is vast. However, most efforts have focused on freight transportation. There is much less research on intermodal passenger transportation, largely due to the lack of a comprehensive dataset for effectively study...
Audigier, Chloé; Mansi, Tommaso; Delingette, Hervé; Rapaka, Saikiran; Passerini, Tiziano; Mihalef, Viorel; Jolly, Marie-Pierre; Pop, Raoul; Diana, Michele; Soler, Luc; Kamen, Ali; Comaniciu, Dorin; Ayache, Nicholas
2017-09-01
We aim to develop a framework for the validation of a subject-specific multi-physics model of liver tumor radiofrequency ablation (RFA). The RFA computation becomes subject-specific after several levels of personalization: geometrical and biophysical (hemodynamics, heat transfer and an extended cellular necrosis model). We present a comprehensive experimental setup combining multimodal, pre- and postoperative anatomical and functional images, as well as the interventional monitoring of intra-operative signals: the temperature and delivered power. To exploit this dataset, an efficient processing pipeline is introduced, which copes with image noise, variable resolution and anisotropy. The validation study includes twelve ablations from five healthy pig livers: a mean point-to-mesh error between predicted and actual ablation extent of 5.3 ± 3.6 mm is achieved. This enables an end-to-end preclinical validation framework that considers the available dataset.
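A mean point-to-mesh error like the 5.3 ± 3.6 mm reported above can be approximated as sketched below; the nearest-vertex distance used here is an upper bound on the true point-to-triangle distance, and the synthetic surfaces are illustrative assumptions, since the study's actual meshes and tooling are not specified:

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_point_to_mesh_error(points, mesh_vertices):
    """Approximate point-to-mesh error as distance to the nearest mesh vertex.

    A faithful implementation would project onto triangle faces; the
    vertex-based version is adequate for densely sampled meshes.
    """
    tree = cKDTree(mesh_vertices)
    distances, _ = tree.query(points)
    return distances.mean(), distances.std()

# Toy usage: compare a predicted ablation surface against a segmented one
rng = np.random.default_rng(1)
actual = rng.normal(size=(2000, 3))                    # vertices of actual extent
predicted = actual + rng.normal(0, 0.2, size=actual.shape)
mean_err, std_err = mean_point_to_mesh_error(predicted, actual)
print(f"point-to-mesh error: {mean_err:.2f} ± {std_err:.2f}")
```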
Empirical Studies on the Network of Social Groups: The Case of Tencent QQ
You, Zhi-Qiang; Han, Xiao-Pu; Lü, Linyuan; Yeung, Chi Ho
2015-01-01
Background Participation in social groups is important, but the collective behaviors of humans as a group are difficult to analyze due to the difficulties of quantifying ordinary social relations and group membership, and of collecting a comprehensive dataset. Such difficulties can be circumvented by analyzing online social networks. Methodology/Principal Findings In this paper, we analyze a comprehensive dataset released from Tencent QQ, an instant messenger with the highest market share in China. Specifically, we analyze three derivative networks involving groups and their members—the hypergraph of groups, the network of groups and the user network—to reveal social interactions at the microscopic and mesoscopic levels. Conclusions/Significance Our results uncover interesting behaviors in the growth of user groups, the interactions between groups, and their relationship with member age and gender. These findings lead to insights which are difficult to obtain in social networks based on personal contacts. PMID:26176850
Empirical Studies on the Network of Social Groups: The Case of Tencent QQ.
You, Zhi-Qiang; Han, Xiao-Pu; Lü, Linyuan; Yeung, Chi Ho
2015-01-01
Participation in social groups is important, but the collective behaviors of humans as a group are difficult to analyze due to the difficulties of quantifying ordinary social relations and group membership, and of collecting a comprehensive dataset. Such difficulties can be circumvented by analyzing online social networks. In this paper, we analyze a comprehensive dataset released from Tencent QQ, an instant messenger with the highest market share in China. Specifically, we analyze three derivative networks involving groups and their members (the hypergraph of groups, the network of groups and the user network) to reveal social interactions at the microscopic and mesoscopic levels. Our results uncover interesting behaviors in the growth of user groups, the interactions between groups, and their relationship with member age and gender. These findings lead to insights which are difficult to obtain in social networks based on personal contacts.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shi, CY; Yang, H; Wei, CL
Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from poly(A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as the flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real-time PCR (qRT-PCR). An extensive transcriptome dataset has been obtained from the deep sequencing of the tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis.
2011-01-01
Background Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Results Using high-throughput Illumina RNA-seq, the transcriptome from poly(A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as the flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real-time PCR (qRT-PCR). Conclusions An extensive transcriptome dataset has been obtained from the deep sequencing of the tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis. PMID:21356090
BiPACE 2D--graph-based multiple alignment for comprehensive 2D gas chromatography-mass spectrometry.
Hoffmann, Nils; Wilhelm, Mathias; Doebbe, Anja; Niehaus, Karsten; Stoye, Jens
2014-04-01
Comprehensive 2D gas chromatography-mass spectrometry is an established method for the analysis of complex mixtures in analytical chemistry and metabolomics. It produces large amounts of data that require semiautomatic, but preferably automatic handling. This involves the location of significant signals (peaks) and their matching and alignment across different measurements. To date, there exist only a few openly available algorithms for the retention time alignment of peaks originating from such experiments that scale well with increasing sample and peak numbers, while providing reliable alignment results. We describe BiPACE 2D, an automated algorithm for retention time alignment of peaks from 2D gas chromatography-mass spectrometry experiments and evaluate it on three previously published datasets against the mSPA, SWPA and Guineu algorithms. We also provide a fourth dataset from an experiment studying the H2 production of two different strains of Chlamydomonas reinhardtii that is available from the MetaboLights database together with the experimental protocol, peak-detection results and manually curated multiple peak alignment for future comparability with newly developed algorithms. BiPACE 2D is contained in the freely available Maltcms framework, version 1.3, hosted at http://maltcms.sf.net, under the terms of the L-GPL v3 or Eclipse Open Source licenses. The software used for the evaluation along with the underlying datasets is available at the same location. The C. reinhardtii dataset is freely available at http://www.ebi.ac.uk/metabolights/MTBLS37.
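BiPACE 2D matches peaks across measurements using retention-time similarity; the sketch below shows a much-simplified pairwise, tolerance-based matching in the same spirit. The tolerances and the greedy strategy are illustrative assumptions; the published algorithm additionally enforces mutual best hits across all measurements:

```python
def match_peaks(peaks_a, peaks_b, tol_rt1=5.0, tol_rt2=0.1):
    """Greedy pairwise matching of 2D-GC peaks by both retention times.

    Peaks are (rt1, rt2) tuples; tolerances are in seconds. Each peak in
    peaks_a is matched to its closest unused counterpart in peaks_b that
    falls inside both tolerance windows.
    """
    matches = []
    used_b = set()
    for i, (a1, a2) in enumerate(peaks_a):
        best, best_d = None, None
        for j, (b1, b2) in enumerate(peaks_b):
            if j in used_b or abs(a1 - b1) > tol_rt1 or abs(a2 - b2) > tol_rt2:
                continue
            # Normalized squared distance in retention-time space
            d = ((a1 - b1) / tol_rt1) ** 2 + ((a2 - b2) / tol_rt2) ** 2
            if best_d is None or d < best_d:
                best, best_d = j, d
        if best is not None:
            used_b.add(best)
            matches.append((i, best))
    return matches

print(match_peaks([(100.0, 1.20), (250.0, 2.50)],
                  [(101.5, 1.22), (400.0, 3.90)]))   # -> [(0, 0)]
```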
Highly scalable and robust rule learner: performance evaluation and comparison.
Kurgan, Lukasz A; Cios, Krzysztof J; Dick, Scott
2006-02-01
Business intelligence and bioinformatics applications increasingly require the mining of datasets consisting of millions of data points, or crafting real-time enterprise-level decision support systems for large corporations and drug companies. In all cases, there needs to be an underlying data mining system, and this mining system must be highly scalable. To this end, we describe a new rule learner called DataSqueezer. The learner belongs to the family of inductive supervised rule extraction algorithms. DataSqueezer is a simple, greedy, rule builder that generates a set of production rules from labeled input data. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact, comprehensible, and have accuracy comparable to rules generated by other state-of-the-art rule extraction algorithms. The main advantages of DataSqueezer are very high efficiency, and missing data resistance. DataSqueezer exhibits log-linear asymptotic complexity with the number of training examples, and it is faster than other state-of-the-art rule learners. The learner is also robust to large quantities of missing data, as verified by extensive experimental comparison with the other learners. DataSqueezer is thus well suited to modern data mining and business intelligence tasks, which commonly involve huge datasets with a large fraction of missing data.
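As a rough illustration of the greedy, inductive rule-building family that DataSqueezer belongs to, the toy sketch below grows a single conjunctive rule by repeatedly adding the attribute=value test that most improves precision on the covered examples. It is a generic analogue for illustration, not the published DataSqueezer procedure:

```python
def precision(idx, labels, target):
    """Fraction of examples in idx whose label equals target."""
    return sum(labels[i] == target for i in idx) / len(idx) if idx else 0.0

def greedy_rule(data, labels, target):
    """Grow one conjunctive rule {attribute: value} for class `target`,
    greedily adding the test that maximizes precision over covered rows."""
    rule = {}
    covered = list(range(len(data)))
    while True:
        best, best_prec = None, precision(covered, labels, target)
        for attr in range(len(data[0])):
            if attr in rule:
                continue
            for value in {data[i][attr] for i in covered}:
                subset = [i for i in covered if data[i][attr] == value]
                p = precision(subset, labels, target)
                if subset and p > best_prec:
                    best, best_prec = (attr, value, subset), p
        if best is None:
            return rule          # no test improves precision further
        attr, value, covered = best
        rule[attr] = value

# Toy usage on a tiny categorical table
data = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "yes", "yes", "no"]
print(greedy_rule(data, labels, "yes"))   # -> {1: 'mild'}
```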
ASSISTments Dataset from Multiple Randomized Controlled Experiments
ERIC Educational Resources Information Center
Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil
2016-01-01
In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…
Arend, Daniel; Lange, Matthias; Pape, Jean-Michel; Weigelt-Fischer, Kathleen; Arana-Ceballos, Fernando; Mücke, Ingo; Klukas, Christian; Altmann, Thomas; Scholz, Uwe; Junker, Astrid
2016-01-01
With the implementation of novel automated, high-throughput methods and facilities in recent years, plant phenomics has developed into a highly interdisciplinary research domain integrating biology, engineering and bioinformatics. Here we present a dataset of a non-invasive high-throughput plant phenotyping experiment, which uses image- and image-analysis-based approaches to monitor the growth and development of 484 Arabidopsis thaliana plants (thale cress). The result is a comprehensive dataset of images and extracted phenotypical features. Such datasets require detailed documentation, standardized description of experimental metadata as well as sustainable data storage and publication in order to ensure the reproducibility of experiments, data reuse and comparability among the scientific community. Therefore, the dataset presented here has been annotated using the standardized ISA-Tab format and considering the recently published recommendations for the semantic description of plant phenotyping experiments. PMID:27529152
National Hydropower Plant Dataset, Version 2 (FY18Q3)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Samu, Nicole; Kao, Shih-Chieh; O'Connor, Patrick
The National Hydropower Plant Dataset, Version 2 (FY18Q3) is a geospatially comprehensive point-level dataset containing locations and key characteristics of U.S. hydropower plants that are currently either in the hydropower development pipeline (pre-operational), operational, withdrawn, or retired. These data are provided in GIS and tabular formats with corresponding metadata for each. In addition, we include access to download 2 versions of the National Hydropower Map, which was produced with these data (i.e. Map 1 displays the geospatial distribution and characteristics of all operational hydropower plants; Map 2 displays the geospatial distribution and characteristics of operational hydropower plants with pumped storage and mixed capabilities only). This dataset is a subset of ORNL's Existing Hydropower Assets data series, updated quarterly as part of ORNL's National Hydropower Asset Assessment Program.
Arend, Daniel; Lange, Matthias; Pape, Jean-Michel; Weigelt-Fischer, Kathleen; Arana-Ceballos, Fernando; Mücke, Ingo; Klukas, Christian; Altmann, Thomas; Scholz, Uwe; Junker, Astrid
2016-08-16
With the implementation of novel automated, high-throughput methods and facilities in recent years, plant phenomics has developed into a highly interdisciplinary research domain integrating biology, engineering and bioinformatics. Here we present a dataset of a non-invasive high-throughput plant phenotyping experiment, which uses image- and image-analysis-based approaches to monitor the growth and development of 484 Arabidopsis thaliana plants (thale cress). The result is a comprehensive dataset of images and extracted phenotypical features. Such datasets require detailed documentation, standardized description of experimental metadata as well as sustainable data storage and publication in order to ensure the reproducibility of experiments, data reuse and comparability among the scientific community. Therefore, the dataset presented here has been annotated using the standardized ISA-Tab format and considering the recently published recommendations for the semantic description of plant phenotyping experiments.
Evaluation and inter-comparison of modern day reanalysis datasets over Africa and the Middle East
NASA Astrophysics Data System (ADS)
Shukla, S.; Arsenault, K. R.; Hobbins, M.; Peters-Lidard, C. D.; Verdin, J. P.
2015-12-01
Reanalysis datasets are potentially very valuable for otherwise data-sparse regions such as Africa and the Middle East. They are potentially useful for long-term climate and hydrologic analyses and, given their availability in real-time, they are particularly attractive for real-time hydrologic monitoring purposes (e.g. to monitor flood and drought events). Generally, in data-sparse regions, reanalysis variables such as precipitation, temperature, radiation and humidity are used in conjunction with in-situ and/or satellite-based datasets to generate long-term gridded atmospheric forcing datasets. These atmospheric forcing datasets are used to drive offline land surface models and simulate soil moisture and runoff, which are natural indicators of hydrologic conditions. Therefore, any uncertainty or bias in the reanalysis datasets contributes to uncertainties in hydrologic monitoring estimates. In this presentation, we report on a comprehensive analysis that evaluates several modern-day reanalysis products (such as NASA's MERRA-1 and -2, ECMWF's ERA-Interim and NCEP's CFS Reanalysis) over Africa and the Middle East region. We compare the precipitation and temperature from the reanalysis products with other independent gridded datasets such as GPCC, CRU, and USGS/UCSB's CHIRPS precipitation datasets, and CRU's temperature datasets. The evaluations are conducted at a monthly time scale, since some of these independent datasets are only available at this temporal resolution. The evaluations range from the comparison of the monthly mean climatology to inter-annual variability and long-term changes. Finally, we also present the results of inter-comparisons of radiation and humidity variables from the different reanalysis datasets.
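A monthly-climatology comparison of the kind described can be sketched as follows; the synthetic "gauge" and "reanalysis" series, the imposed bias, and the anomaly-correlation metric are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def monthly_climatology(series):
    """Mean annual cycle (12 values) from a monthly time series."""
    return series.groupby(series.index.month).mean()

# Toy usage: a noisy, biased "reanalysis" vs a "gauge-based" reference
idx = pd.date_range("2001-01-01", "2010-12-01", freq="MS")
rng = np.random.default_rng(2)
truth = pd.Series(50 + 40 * np.sin(2 * np.pi * (idx.month - 1) / 12), index=idx)
reanalysis = truth + rng.normal(0, 5, len(idx)) + 3.0   # +3 units of bias

clim_t = monthly_climatology(truth)
clim_r = monthly_climatology(reanalysis)
bias = float((clim_r - clim_t).mean())
# Remove each series' own annual cycle, then correlate the anomalies
anom_r = reanalysis.values - clim_r.reindex(idx.month).values
anom_t = truth.values - clim_t.reindex(idx.month).values
print(f"mean bias: {bias:.2f}")
print(f"anomaly correlation: {np.corrcoef(anom_r, anom_t)[0, 1]:.2f}")
```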
An iterative approach to optimize change classification in SAR time series data
NASA Astrophysics Data System (ADS)
Boldt, Markus; Thiele, Antje; Schulz, Karsten; Hinz, Stefan
2016-10-01
The detection of changes using remote sensing imagery has become a broad field of research with many approaches for many different applications. Besides the simple detection of changes between at least two images acquired at different times, analyses that aim at the change type or category are at least equally important. In this study, an approach for a semi-automatic classification of change segments is presented. A sparse dataset is considered to ensure fast and simple applicability for practical issues. The dataset is given by 15 high-resolution (HR) TerraSAR-X (TSX) amplitude images acquired over a time period of one year (11/2013 to 11/2014). The scenery contains the airport of Stuttgart (GER) and its surroundings, including urban, rural, and suburban areas. Time series imagery offers the advantage of analyzing the change frequency of selected areas. In this study, the focus is set on the analysis of small-sized, high-frequently changing regions like parking areas, construction sites and collecting points consisting of high-activity (HA) change objects. For each HA change object, suitable features are extracted and k-means clustering is applied as the categorization step. Resulting clusters are finally compared to a previously introduced knowledge-based class catalogue, which is modified until an optimal class description results. In other words, the subjective understanding of the scene semantics is optimized against the reality given by the data. In this way, even a sparse dataset containing only amplitude imagery can be evaluated without requiring comprehensive training datasets. Falsely defined classes might be rejected. Furthermore, classes which were defined too coarsely might be divided into sub-classes. Conversely, classes which were defined too narrowly might be merged. An optimal classification results when the combination of previously defined key indicators (e.g., number of clusters per class) reaches an optimum.
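The feature-extraction-plus-clustering step can be sketched as below; the three per-object features (area, mean backscatter change, change frequency) and all synthetic values are assumptions for illustration, not the study's feature set:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-object features: area (px), mean backscatter change (dB),
# and change frequency over the time series; two synthetic object populations.
rng = np.random.default_rng(3)
parking = rng.normal([80, 4.0, 0.6], [10, 0.5, 0.1], size=(40, 3))
construction = rng.normal([300, 6.5, 0.3], [40, 0.8, 0.05], size=(40, 3))
features = np.vstack([parking, construction])

# Standardize features, then cluster the high-activity change objects
z = (features - features.mean(axis=0)) / features.std(axis=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z)
print(np.bincount(labels))   # cluster sizes, to be compared to the class catalogue
```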
Multimodal Event Detection in Twitter Hashtag Networks
Yilmaz, Yasin; Hero, Alfred O.
2016-07-01
In this study, event detection in a multimodal Twitter dataset is considered. We treat the hashtags in the dataset as instances with two modes: text and geolocation features. The text feature consists of a bag-of-words representation. The geolocation feature consists of geotags (i.e., geographical coordinates) of the tweets. Fusing the multimodal data we aim to detect, in terms of topic and geolocation, the interesting events and the associated hashtags. To this end, a generative latent variable model is assumed, and a generalized expectation-maximization (EM) algorithm is derived to learn the model parameters. The proposed method is computationally efficient, and lends itself to big datasets. Lastly, experimental results on a Twitter dataset from August 2014 show the efficacy of the proposed method.
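The E-step/M-step structure of such a generalized EM algorithm can be illustrated on the geolocation mode alone with a plain Gaussian-mixture EM; the spherical-covariance model and the toy geotag clusters below are simplifying assumptions (the paper's model jointly covers text and geolocation):

```python
import numpy as np

def em_gmm(x, k, iters=100, seed=0):
    """Plain EM for a k-component spherical Gaussian mixture over 2-D points."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    mu = x[rng.choice(n, size=k, replace=False)]   # initial means from the data
    var = np.full(k, x.var())                      # spherical variances
    w = np.full(k, 1.0 / k)                        # mixture weights
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | point i)
        sq = ((x[:, None, :] - mu[None]) ** 2).sum(-1)          # (n, k)
        log_p = -0.5 * sq / var - 0.5 * d * np.log(var) + np.log(w)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ x) / nk[:, None]
        sq = ((x[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * sq).sum(axis=0) / (d * nk) + 1e-9
    return w, mu, var

# Toy usage: two geographic event clusters recovered from geotags
rng = np.random.default_rng(1)
geo = np.vstack([rng.normal([40.7, -74.0], 0.1, size=(200, 2)),
                 rng.normal([34.1, -118.2], 0.1, size=(200, 2))])
w, mu, var = em_gmm(geo, k=2)
print(mu.round(1))   # approximately the two event centers
```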
Enrichment of Data Publications in Earth Sciences - Data Reports as a Missing Link
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Bertelmann, Roland; Haberland, Christian; Evans, Peter L.
2015-04-01
During the past decade, the relevance of research data stewardship has risen significantly. Preservation and publication of scientific data for long-term use, including storage in adequate repositories, has been identified as a key issue by the scientific community as well as by bodies like research agencies. Essential for any kind of re-use is a proper description of the datasets. As a result of the increasing interest, data repositories have been developed and the included research data are accompanied by at least a minimum set of metadata. This metadata is useful for data discovery and a first insight into the content of a dataset, but data re-use often needs more extensive information. Many datasets are accompanied by a small 'readme' file with basic information on the data structure, or by other accompanying documents. A source of additional information could be an article published in one of the newly emerging data journals (e.g. Copernicus's ESSD Earth System Science Data or Nature's Scientific Data). Obviously there is an information gap between a 'readme' file that is only accessible after data download (which often leads to less usage of published datasets than if the information were available beforehand) and the much larger effort of preparing an article for a peer-reviewed data journal. For many years, GFZ German Research Centre for Geosciences has published 'Scientific Technical Reports (STR)' as a report series that is persistently available electronically and citable with assigned DOIs. This series was opened for the description of parallel published datasets as 'STR Data'. These are internally reviewed and offer a flexible publication format describing published data in depth, suitable for different datasets ranging from long-term monitoring time series of observatories to field data, (meta-)databases, and software publications. STR Data reports offer a full and consistent overview and description of all relevant parameters of a linked published dataset. These reports are readable and citable on their own, but are, of course, closely connected to the respective datasets. Therefore, they give full insight into the framework of the data before data download. This is especially relevant for large and often heterogeneous datasets, e.g. controlled-source seismic data gathered with instruments of the 'Geophysical Instrument Pool Potsdam GIPP'. Here, details of the instrumentation, data organization, data format, accuracy, geographical coordinates, timing and data completeness, etc. need to be documented. STR Data reports are also attractive for the publication of historic datasets, e.g. 30- to 40-year-old seismic experiments. It is also possible for one STR Data report to describe several datasets, e.g. from multiple diverse instrument types, or distinct regions of interest. The publication of DOI-assigned data reports is a helpful tool to fill the gap between basic metadata and restricted 'readme' information on the one hand and extended journal articles on the other. They open the way for informed re-use and, with their comprehensive data description, may act as an 'appetizer' for the re-use of published datasets.
Validation project. This report describes the procedure used to generate the noise models' output dataset, and then compares that dataset to the ... benchmark, the Engineer Research and Development Center's Long-Range Sound Propagation dataset. It was found that the models consistently underpredict the ...
Graph theoretic analysis of protein interaction networks of eukaryotes
NASA Astrophysics Data System (ADS)
Goh, K.-I.; Kahng, B.; Kim, D.
2005-11-01
Owing to the recent progress in high-throughput experimental techniques, the datasets of large-scale protein interactions of prototypical multicellular species, the nematode worm Caenorhabditis elegans and the fruit fly Drosophila melanogaster, have been assayed. The datasets are obtained mainly by using the yeast two-hybrid method, which yields false positives and false negatives simultaneously. Accordingly, while it is desirable to test such datasets through further wet experiments, here we invoke recently developed network theory to test such high-throughput datasets in a simple way. Based on the fact that the key biological processes indispensable to maintaining life are conserved across eukaryotic species, and on a comparison of the structural properties of the protein interaction networks (PINs) of the two species with those of the yeast PIN, we find that while the worm and yeast PIN datasets exhibit similar structural properties, the current fly dataset, though the most comprehensively screened ever, does not reflect generic structural properties correctly as it is. The modularity is suppressed and the connectivity correlation is lacking. Addition of interologs to the current fly dataset increases the modularity and enhances the occurrence of triangular motifs as well. The connectivity correlation function of the fly, however, remains distinct under such interolog additions, for which we present a possible scenario through in silico modeling.
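The structural indicators compared across PIN datasets (clustering/triangle motifs and degree-degree correlation) are straightforward to compute with standard graph tooling; a sketch, using a scale-free random graph as a stand-in for a real PIN:

```python
import networkx as nx

def structural_summary(g):
    """Structural indicators of the kind used to compare PIN datasets:
    mean degree, average clustering (a triangle-motif measure), and degree
    assortativity (a scalar stand-in for the connectivity correlation)."""
    return {
        "nodes": g.number_of_nodes(),
        "mean_degree": 2 * g.number_of_edges() / g.number_of_nodes(),
        "clustering": nx.average_clustering(g),
        "assortativity": nx.degree_assortativity_coefficient(g),
    }

# Toy usage: a Barabási-Albert graph as a stand-in for a protein network
g = nx.barabasi_albert_graph(1000, 2, seed=0)
print(structural_summary(g))
```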
Sinfonevada: Dataset of Floristic diversity in Sierra Nevada forests (SE Spain)
Pérez-Luque, Antonio Jesús; Bonet, Francisco Javier; Pérez-Pérez, Ramón; Aspizua, Rut; Lorite, Juan; Zamora, Regino
2014-01-01
The Sinfonevada database is a forest inventory that contains information on the forest ecosystem in the Sierra Nevada mountains (SE Spain). The Sinfonevada dataset contains more than 7,500 occurrence records belonging to 270 taxa (24 of these threatened) from floristic inventories of the Sinfonevada Forest inventory. Expert field workers collected the information. The whole dataset underwent a quality control by botanists with broad expertise in Sierra Nevada flora. This floristic inventory was created to gather useful information for the proper management of Pinus plantations in Sierra Nevada. This is the only dataset that shows a comprehensive view of the forest flora in Sierra Nevada. This is the reason why it is being used to assess the biodiversity in the very dense pine plantations on this massif. With this dataset, managers have improved their ability to decide where to apply forest treatments in order to avoid biodiversity loss. The dataset forms part of the Sierra Nevada Global Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area. PMID:24843285
NASA Technical Reports Server (NTRS)
Armstrong, Edward; Tauer, Eric
2013-01-01
The presentation focused on describing a new dataset lifecycle policy that the NASA Physical Oceanography DAAC (PO.DAAC) has implemented for its new and current datasets to foster improved stewardship and consistency across its archive. The overarching goal is to implement this dataset lifecycle policy for all new GHRSST GDS2 datasets and bridge the mission statements from the GHRSST Project Office and PO.DAAC to provide the best quality SST data in a cost-effective, efficient manner, preserving its integrity so that it will be available and usable to a wide audience.
Validation of MIMGO: a method to identify differentially expressed GO terms in a microarray dataset
2012-01-01
Background We previously proposed an algorithm for the identification of GO terms that commonly annotate genes whose expression is upregulated or downregulated in some microarray data compared with in other microarray data. We call these “differentially expressed GO terms” and have named the algorithm “matrix-assisted identification method of differentially expressed GO terms” (MIMGO). MIMGO can also identify microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. However, MIMGO has not yet been validated on a real microarray dataset using all available GO terms. Findings We combined Gene Set Enrichment Analysis (GSEA) with MIMGO to identify differentially expressed GO terms in a yeast cell cycle microarray dataset. GSEA followed by MIMGO (GSEA + MIMGO) correctly identified (p < 0.05) microarray data in which genes annotated to differentially expressed GO terms are upregulated. We found that GSEA + MIMGO was slightly less effective than, or comparable to, GSEA (Pearson), a method that uses Pearson’s correlation as a metric, at detecting true differentially expressed GO terms. However, unlike other methods including GSEA (Pearson), GSEA + MIMGO can comprehensively identify the microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. Conclusions MIMGO is a reliable method to identify differentially expressed GO terms comprehensively. PMID:23232071
The global compendium of Aedes aegypti and Ae. albopictus occurrence
NASA Astrophysics Data System (ADS)
Kraemer, Moritz U. G.; Sinka, Marianne E.; Duda, Kirsten A.; Mylne, Adrian; Shearer, Freya M.; Brady, Oliver J.; Messina, Jane P.; Barker, Christopher M.; Moore, Chester G.; Carvalho, Roberta G.; Coelho, Giovanini E.; van Bortel, Wim; Hendrickx, Guy; Schaffner, Francis; Wint, G. R. William; Elyazar, Iqbal R. F.; Teng, Hwa-Jen; Hay, Simon I.
2015-07-01
Aedes aegypti and Ae. albopictus are the main vectors transmitting dengue and chikungunya viruses. Despite being pathogens of global public health importance, knowledge of their vectors’ global distribution remains patchy and sparse. A global geographic database of known occurrences of Ae. aegypti and Ae. albopictus between 1960 and 2014 was compiled. Herein we present the database, which comprises occurrence data linked to point or polygon locations, derived from peer-reviewed literature and unpublished studies including national entomological surveys and expert networks. We describe all data collection processes, as well as geo-positioning methods, database management and quality-control procedures. This is the first comprehensive global database of Ae. aegypti and Ae. albopictus occurrence, consisting of 19,930 and 22,137 geo-positioned occurrence records respectively. Both datasets can be used for a variety of mapping and spatial analyses of the vectors and, by inference, the diseases they transmit.
The global compendium of Aedes aegypti and Ae. albopictus occurrence
Kraemer, Moritz U. G.; Sinka, Marianne E.; Duda, Kirsten A.; Mylne, Adrian; Shearer, Freya M.; Brady, Oliver J.; Messina, Jane P.; Barker, Christopher M.; Moore, Chester G.; Carvalho, Roberta G.; Coelho, Giovanini E.; Van Bortel, Wim; Hendrickx, Guy; Schaffner, Francis; Wint, G. R. William; Elyazar, Iqbal R. F.; Teng, Hwa-Jen; Hay, Simon I.
2015-01-01
Aedes aegypti and Ae. albopictus are the main vectors transmitting dengue and chikungunya viruses. Despite being pathogens of global public health importance, knowledge of their vectors’ global distribution remains patchy and sparse. A global geographic database of known occurrences of Ae. aegypti and Ae. albopictus between 1960 and 2014 was compiled. Herein we present the database, which comprises occurrence data linked to point or polygon locations, derived from peer-reviewed literature and unpublished studies including national entomological surveys and expert networks. We describe all data collection processes, as well as geo-positioning methods, database management and quality-control procedures. This is the first comprehensive global database of Ae. aegypti and Ae. albopictus occurrence, consisting of 19,930 and 22,137 geo-positioned occurrence records respectively. Both datasets can be used for a variety of mapping and spatial analyses of the vectors and, by inference, the diseases they transmit. PMID:26175912
Modeling Urban Scenarios & Experiments: Fort Indiantown Gap Data Collections Summary and Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Archer, Daniel E.; Bandstra, Mark S.; Davidson, Gregory G.
This report summarizes experimental radiation detector, contextual sensor, weather, and global positioning system (GPS) data collected to inform and validate a comprehensive, operational radiation transport modeling framework to evaluate radiation detector system and algorithm performance. This framework will be used to study the influence of systematic effects (such as geometry, background activity, background variability, environmental shielding, etc.) on detector responses and algorithm performance using synthetic time series data. This work consists of performing data collection campaigns at a canonical, controlled environment for complete radiological characterization to help construct and benchmark a high-fidelity model with quantified system geometries, detector response functions, and source terms for background and threat objects. This data also provides an archival, benchmark dataset that can be used by the radiation detection community. The data reported here spans four data collection campaigns conducted between May 2015 and September 2016.
NASA Technical Reports Server (NTRS)
Claverie, Martin; Matthews, Jessica L.; Vermote, Eric F.; Justice, Christopher O.
2016-01-01
In land surface models, which are used to evaluate the role of vegetation in the context of global climate change and variability, LAI and FAPAR play a key role, specifically with respect to the carbon and water cycles. The AVHRR-based LAI/FAPAR dataset offers daily temporal resolution, an improvement over previous products. This climate data record is based on a carefully calibrated and corrected land surface reflectance dataset to provide a high-quality, consistent time series suitable for climate studies. It spans from mid-1981 to the present. Further, this operational dataset is available in near real-time, allowing use for monitoring purposes. The algorithm relies on artificial neural networks calibrated using the MODIS LAI/FAPAR dataset. Evaluation based on cross-comparison with MODIS products and in situ data shows the dataset is consistent and reliable, with overall uncertainties of 1.03 and 0.15 for LAI and FAPAR, respectively. However, a clear saturation effect is observed in the broadleaf forest biomes with high LAI (greater than 4.5) and FAPAR (greater than 0.8) values.
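The calibration idea, regressing from AVHRR-style surface reflectances to MODIS-derived LAI with a small neural network, can be sketched as follows; the two-band input, the toy NDVI-based target, and the network size are illustrative assumptions, not the operational algorithm:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in: map red/NIR surface reflectance to LAI via an ANN,
# mimicking calibration of AVHRR reflectances against a MODIS LAI target.
rng = np.random.default_rng(4)
red = rng.uniform(0.02, 0.15, 5000)
nir = rng.uniform(0.10, 0.50, 5000)
ndvi = (nir - red) / (nir + red)
# Toy LAI target via a Beer-Lambert-style relation, clipped to 0-6
lai = np.clip(-2.5 * np.log(np.clip(1 - ndvi, 1e-3, None)), 0, 6)

x = np.column_stack([red, nir])
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000,
                     random_state=0).fit(x, lai)
rmse = np.sqrt(np.mean((model.predict(x) - lai) ** 2))
print(f"training RMSE: {rmse:.3f} LAI units")
```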
Federal standards and procedures for the National Watershed Boundary Dataset (WBD)
2013-01-01
The Watershed Boundary Dataset (WBD) is a comprehensive aggregated collection of hydrologic unit data consistent with the national criteria for delineation and resolution. This document establishes Federal standards and procedures for creating the WBD as seamless and hierarchical hydrologic unit data, based on topographic and hydrologic features at a 1:24,000 scale in the United States, except for Alaska at 1:63,360 scale, and 1:25,000 scale in the Caribbean. The data within the WBD have been reviewed for certification through the 12-digit hydrologic unit for compliance with the criteria outlined in this document. Any edits to certified data will be reviewed against this standard prior to inclusion. Although not required as part of the framework WBD, the guidelines contain details for compiling and delineating the boundaries of two additional levels, the 14- and 16-digit hydrologic units, as well as the use of higher resolution base information to improve delineations. The guidelines presented herein are designed to enable local, regional, and national partners to delineate hydrologic units consistently and accurately. Such consistency improves watershed management through efficient sharing of information and resources and by ensuring that digital geographic data are usable with other related Geographic Information System (GIS) data. Terminology, definitions, and procedural information are provided to ensure uniformity in hydrologic unit boundaries, names, and numerical codes. Detailed standards and specifications for data are included. The document also includes discussion of objectives, communications required for revising the data resolution in the United States and the Caribbean, as well as final review and data-quality criteria. Instances of unusual landforms or artificial features that affect the hydrologic units are described with metadata standards. Up-to-date information and availability of the hydrologic units are listed at http://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/technical/nra/dma/?&cid=nrcs143_021630/.
Federal standards and procedures for the National Watershed Boundary Dataset (WBD)
U.S. Geological Survey and U.S. Department of Agriculture, Natural Resources Conservation Service
2012-01-01
The Watershed Boundary Dataset (WBD) is a comprehensive aggregated collection of hydrologic unit data consistent with the national criteria for delineation and resolution. This document establishes Federal standards and procedures for creating the WBD as seamless and hierarchical hydrologic unit data, based on topographic and hydrologic features at a 1:24,000 scale in the United States, except for Alaska at 1:63,360 scale, and 1:25,000 scale in the Caribbean. The data within the WBD have been reviewed for certification through the 12-digit hydrologic unit for compliance with the criteria outlined in this document. Any edits to certified data will be reviewed against this standard prior to inclusion. Although not required as part of the framework WBD, the guidelines contain details for compiling and delineating the boundaries of two additional levels, the 14- and 16-digit hydrologic units, as well as the use of higher resolution base information to improve delineations. The guidelines presented herein are designed to enable local, regional, and national partners to delineate hydrologic units consistently and accurately. Such consistency improves watershed management through efficient sharing of information and resources and by ensuring that digital geographic data are usable with other related Geographic Information System (GIS) data. Terminology, definitions, and procedural information are provided to ensure uniformity in hydrologic unit boundaries, names, and numerical codes. Detailed standards and specifications for data are included. The document also includes discussion of objectives, communications required for revising the data resolution in the United States and the Caribbean, as well as final review and data-quality criteria. Instances of unusual landforms or artificial features that affect the hydrologic units are described with metadata standards. Up-to-date information and availability of the hydrologic units are listed at http://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/water/watersheds/?cid=nrcs143_021630/.
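One practical consequence of the hierarchical hydrologic unit scheme described above is that parent units are recoverable by digit-prefix truncation of a code; a minimal sketch (the example code below is fabricated):

```python
def huc_ancestors(huc12):
    """Split a 12-digit hydrologic unit code into its nested parent units
    (2-, 4-, 6-, 8-, and 10-digit), reflecting the WBD hierarchy."""
    return {n: huc12[:n] for n in (2, 4, 6, 8, 10)}

print(huc_ancestors("180201251003"))
# {2: '18', 4: '1802', 6: '180201', 8: '18020125', 10: '1802012510'}
```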
Analysis of the IJCNN 2011 UTL Challenge
2012-01-13
This analysis covers the IJCNN 2011 Unsupervised and Transfer Learning (UTL) challenge (http://clopinet.com/ul), for which we made available large datasets from various application domains: handwriting recognition, image recognition, video processing, text processing, and ecology. The evaluation sets consist of 4096 examples each. A recoverable fragment of the dataset table (columns: Dataset, Domain, Features, Sparsity, Development examples, Transfer examples) reads: AVICENNA, handwriting, 120 features, 0% sparsity, 150,205 development, 50,000 transfer; HARRY, video, 5000 features, 98.1% sparsity, ...
ERIC Educational Resources Information Center
Haelermans, Carla; Ghysels, Joris; Prince, Fernao
2015-01-01
This paper describes a dataset with data from three individually randomized educational technology experiments on differentiation, formative testing and feedback during one school year for a group of 8th grade students in the Netherlands, using administrative data and the online motivation questionnaire of Boekaerts. The dataset consists of pre-…
USDA-ARS?s Scientific Manuscript database
Arylamine N-acetyltransferases (NATs) are xenobiotic metabolizing enzymes characterized in several bacteria and eukaryotic organisms. We report a comprehensive phylogenetic analysis employing an exhaustive dataset of NAT-homologous sequences recovered through inspection of 2445 genomes. We describe ...
An overview of results from the GEWEX radiation flux assessment
NASA Astrophysics Data System (ADS)
Raschke, E.; Stackhouse, P.; Kinne, S.; and contributors from Europe and the USA
2013-05-01
Multi-annual radiative flux averages of the International Satellite Cloud Climatology Project (ISCCP), of the GEWEX Surface Radiation Budget (SRB) project and of the Clouds and the Earth's Radiant Energy System (CERES) are compared and analyzed to characterize the Earth's radiative budget, assess differences and identify possible causes. These satellite-based datasets are also compared to the results of a median model, which represents 20 climate models that participated in the 4th IPCC assessment. Consistent distribution patterns and seasonal variations among the satellite datasets demonstrate their scientific value, which would further increase if the datasets were reanalyzed with more accurate and consistent ancillary data.
Dataset of anomalies and malicious acts in a cyber-physical subsystem.
Laso, Pedro Merino; Brosset, David; Puentes, John
2017-10-01
This article presents a dataset produced to investigate how data and information quality estimations enable the detection of anomalies and malicious acts in cyber-physical systems. Data were acquired making use of a cyber-physical subsystem consisting of liquid containers for fuel or water, along with its automated control and data acquisition infrastructure. The described data consist of temporal series representing five operational scenarios (normal, anomalies, breakdown, sabotage, and cyber-attacks) corresponding to 15 different real situations. The dataset is publicly available in the .zip file published with the article, to investigate and compare faulty operation detection and characterization methods for cyber-physical systems.
Zheng, Yalin; Kwong, Man Ting; MacCormick, Ian J. C.; Beare, Nicholas A. V.; Harding, Simon P.
2014-01-01
Capillary non-perfusion (CNP) in the retina is a characteristic feature used in the management of a wide range of retinal diseases. There is no well-established computational tool for assessing the extent of CNP. We propose a novel texture segmentation framework to address this problem. This framework comprises three major steps: pre-processing, unsupervised total variation texture segmentation, and supervised segmentation. It employs a state-of-the-art multiphase total variation texture segmentation model, which is enhanced by new kernel-based region terms. The model can be applied to texture and intensity-based multiphase problems. A supervised segmentation step allows the framework to take expert knowledge into account; an AdaBoost classifier with a weighted cost coefficient is chosen to tackle imbalanced data classification problems. To demonstrate its effectiveness, we applied this framework to 48 images from malarial retinopathy and 10 images from ischemic diabetic maculopathy. The performance of segmentation is satisfactory when compared to a reference standard of manual delineations: accuracy, sensitivity and specificity are 89.0%, 73.0%, and 90.8% respectively for the malarial retinopathy dataset and 80.8%, 70.6%, and 82.1% respectively for the diabetic maculopathy dataset. In terms of region-wise analysis, this method achieved an accuracy of 76.3% (45 out of 59 regions) for the malarial retinopathy dataset and 73.9% (17 out of 26 regions) for the diabetic maculopathy dataset. This comprehensive segmentation framework can quantify capillary non-perfusion in retinopathy from two distinct etiologies, and has the potential to be adopted for wider applications. PMID:24747681
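The imbalance handling described, an AdaBoost classifier with a weighted cost coefficient, can be approximated with per-sample weights as sketched below; the synthetic features and the inverse-frequency weighting are assumptions for illustration, not the paper's exact cost formulation:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Imbalanced toy data standing in for perfused (0) vs non-perfused (1) samples
rng = np.random.default_rng(5)
x = np.vstack([rng.normal(0.0, 1, (950, 4)), rng.normal(1.5, 1, (50, 4))])
y = np.array([0] * 950 + [1] * 50)

# Up-weight the rare class by its inverse frequency during boosting
weights = np.where(y == 1, 950 / 50, 1.0)
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(x, y, sample_weight=weights)

pred = clf.predict(x)
sens = (pred[y == 1] == 1).mean()   # sensitivity on the rare class
spec = (pred[y == 0] == 0).mean()   # specificity on the majority class
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```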
funRiceGenes dataset for comprehensive understanding and application of rice functional genes.
Yao, Wen; Li, Guangwei; Yu, Yiming; Ouyang, Yidan
2018-01-01
As a main staple food, rice is also a model plant for functional genomic studies of monocots. Decoding of every DNA element of the rice genome is essential for genetic improvement to address increasing food demands. The past 15 years have witnessed extraordinary advances in rice functional genomics. Systematic characterization and proper deposition of every rice gene are vital for both functional studies and crop genetic improvement. We built a comprehensive and accurate dataset of ∼2800 functionally characterized rice genes and ∼5000 members of different gene families by integrating data from available databases and reviewing every publication on rice functional genomic studies. The dataset accounts for 19.2% of the 39 045 annotated protein-coding rice genes, which provides the most exhaustive archive for investigating the functions of rice genes. We also constructed 214 gene interaction networks based on 1841 connections between 1310 genes. The largest network, with 762 genes, indicated that pleiotropic genes linked different biological pathways. An increasing degree of conservation of the flowering pathway was observed among more closely related plants, implying substantial value of rice genes for future dissection of flowering regulation in other crops. All data are deposited in the funRiceGenes database (https://funricegenes.github.io/). Functionality for advanced search and continuous updating of the database are provided by a Shiny application (http://funricegenes.ncpgr.cn/). The funRiceGenes dataset would enable further exploration of the crosslink between gene functions and natural variations in rice, which can also facilitate breeding design to improve target agronomic traits of rice. © The Authors 2017. Published by Oxford University Press.
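Deriving interaction networks and their largest connected component from a curated edge list, as described above, is a standard graph computation; a sketch with a hypothetical handful of edges standing in for the 1841 curated connections:

```python
import networkx as nx

# Hypothetical edge list standing in for the curated gene-gene connections;
# the real data live in the funRiceGenes database.
edges = [("Hd1", "Hd3a"), ("Ehd1", "Hd3a"), ("Ghd7", "Ehd1"),
         ("OsMADS14", "Hd3a"), ("SLR1", "GID1")]
g = nx.Graph(edges)

# Each connected component is one interaction network; in the paper the
# largest such network links 762 genes across pathways.
components = sorted(nx.connected_components(g), key=len, reverse=True)
print(len(components), "networks; largest:", sorted(components[0]))
```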
Validating Variational Bayes Linear Regression Method With Multi-Central Datasets.
Murata, Hiroshi; Zangwill, Linda M; Fujino, Yuri; Matsuura, Masato; Miki, Atsuya; Hirasawa, Kazunori; Tanito, Masaki; Mizoue, Shiro; Mori, Kazuhiko; Suzuki, Katsuyoshi; Yamashita, Takehiro; Kashiwagi, Kenji; Shoji, Nobuyuki; Asaoka, Ryo
2018-04-01
To validate the prediction accuracy of variational Bayes linear regression (VBLR) with two datasets external to the training dataset. The training dataset consisted of 7268 eyes of 4278 subjects from the University of Tokyo Hospital. The Japanese Archive of Multicentral Databases in Glaucoma (JAMDIG) dataset consisted of 271 eyes of 177 patients, and the Diagnostic Innovations in Glaucoma Study (DIGS) dataset includes 248 eyes of 173 patients; these were used for validation. Prediction accuracy was compared between VBLR and ordinary least-squares linear regression (OLSLR). First, OLSLR and VBLR were carried out using total deviation (TD) values at each of the 52 test points from the second to fourth visual fields (VFs) (VF2-4) to the second to tenth VF (VF2-10) of each patient in the JAMDIG and DIGS datasets, and the TD values of the 11th VF test were predicted every time. The predictive accuracy of each method was compared through the root mean squared error (RMSE) statistic. OLSLR RMSEs with the JAMDIG and DIGS datasets were between 31 and 4.3 dB, and between 19.5 and 3.9 dB. On the other hand, VBLR RMSEs with the JAMDIG and DIGS datasets were between 5.0 and 3.7, and between 4.6 and 3.6 dB. There was a statistically significant difference between VBLR and OLSLR for both datasets at every series (VF2-4 to VF2-10) (P < 0.01 for all tests). However, there was no statistically significant difference in VBLR RMSEs between the JAMDIG and DIGS datasets at any series of VFs (VF2-2 to VF2-10) (P > 0.05). VBLR outperformed OLSLR in predicting future VF progression, and VBLR has the potential to be a helpful tool in clinical settings.
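The OLSLR baseline, fitting a per-test-point linear trend over visits and extrapolating to the next one, can be sketched as follows; the synthetic TD series and noise levels are illustrative assumptions:

```python
import numpy as np

def olslr_predict(times, series, t_next):
    """Fit y = a*t + b per test point by ordinary least squares and
    extrapolate to the next visit; series has shape (n_visits, n_points)."""
    a, b = np.polyfit(times, series, 1)   # polyfit vectorizes over points
    return a * t_next + b

# Toy usage: 52 TD values over visits 2..10, predicting visit 11
rng = np.random.default_rng(6)
times = np.arange(2, 11, dtype=float)
true_slope = rng.normal(-0.5, 0.2, 52)                       # dB loss per visit
series = -1 + true_slope * times[:, None] + rng.normal(0, 1.5, (9, 52))
actual = -1 + true_slope * 11
pred = olslr_predict(times, series, 11.0)
print(f"RMSE: {np.sqrt(np.mean((pred - actual) ** 2)):.2f} dB")
```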
Griscom, Bronson W; Ellis, Peter W; Baccini, Alessandro; Marthinus, Delon; Evans, Jeffrey S; Ruslandi
2016-01-01
Forest conservation efforts are increasingly being implemented at the scale of sub-national jurisdictions in order to mitigate global climate change and provide other ecosystem services. We see an urgent need for robust estimates of historic forest carbon emissions at this scale, as the basis for credible measures of climate and other benefits achieved. Despite the arrival of a new generation of global datasets on forest area change and biomass, confusion remains about how to produce credible jurisdictional estimates of forest emissions. We demonstrate a method for estimating the relevant historic forest carbon fluxes within the Regency of Berau in eastern Borneo, Indonesia. Our method integrates the best available global and local datasets and includes a comprehensive analysis of uncertainty at the regency scale. We find that Berau generated 8.91 ± 1.99 million tonnes of net CO2 emissions per year during 2000-2010. Berau is an early frontier landscape where gross emissions are 12 times higher than gross sequestration. Yet most (85%) of Berau's original forests are still standing. The majority of net emissions were due to conversion of native forests to unspecified agriculture (43% of total), oil palm (28%), and fiber plantations (9%). Most of the remainder was due to legal commercial selective logging (17%). Our overall uncertainty estimate offers an independent basis for assessing three other estimates for Berau, two of which were above the upper end of our uncertainty range. We emphasize the importance of including an uncertainty range for all parameters of the emissions equation to generate a comprehensive uncertainty estimate, which has not been done before. We believe comprehensive estimates of carbon flux uncertainty are increasingly important as national and international institutions are challenged with comparing alternative estimates and identifying a credible range of historic emissions values.
Development and Applications of a Comprehensive Land Use Classification and Map for the US
Theobald, David M.
2014-01-01
Land cover maps reasonably depict areas that are strongly converted by human activities, but typically are unable to resolve low-density but widespread development patterns. Data products specifically designed to resolve land uses complement land cover datasets and likely improve our ability to understand the extent and complexity of human modification. Methods for developing a comprehensive land use classification system are described, and a map of land use for the conterminous United States is presented to reveal what we are doing on the land. The comprehensive, detailed and high-resolution dataset was developed through spatial analysis of nearly two dozen publicly available, national spatial datasets, predominantly based on census housing, employment, and infrastructure, as well as land cover from satellite imagery. This effort resulted in 79 land use classes that fit within five main land use groups: built-up, production, recreation, conservation, and water. Key findings from this study are that built-up areas occupy 13.6% of the mainland US, but that the majority of this occurs as low-density exurban/rural residential development (9.1% of the US), while more intensive built-up land uses occupy 4.5%. For every acre of urban and suburban residential land, there are 0.13 acres of commercial, 0.07 of industrial, 0.48 of institutional, and 0.29 of interstate/highway land. This database can be used to address a variety of natural resource applications, and I provide three examples here: an entropy index of the diversity of land uses for smart-growth planning, a power-law scaling of metropolitan area population to developed footprint, and identifying potential conflict areas by delineating the urban interface. PMID:24728210
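The entropy index of land-use diversity mentioned as the first application can be computed from class proportions with Shannon entropy; a minimal sketch, with normalization by the log of the number of classes present as an assumed convention:

```python
import numpy as np

def landuse_entropy(proportions):
    """Shannon entropy of land-use shares, normalized by log(k) of the
    classes actually present so the index falls in [0, 1]."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    if len(p) < 2:
        return 0.0
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

# Toy usage: shares of built-up, production, recreation, conservation, water
print(landuse_entropy([0.4, 0.3, 0.15, 0.1, 0.05]))   # mixed use, ~0.87
print(landuse_entropy([0.97, 0.01, 0.01, 0.01]))      # dominated by one use, ~0.12
```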
NASA Astrophysics Data System (ADS)
Blyverket, J.; Hamer, P.; Bertino, L.; Lahoz, W. A.
2017-12-01
The European Space Agency Climate Change Initiative for soil moisture (ESA CCI SM) was initiated in 2012 for a period of six years; the objective for this period was to produce the most complete and consistent global soil moisture data record based on both active and passive sensors. The ESA CCI SM products consist of three surface soil moisture datasets: the ACTIVE product and the PASSIVE product were created by fusing scatterometer and radiometer soil moisture data, respectively, while the COMBINED product is a blended product based on the former two datasets. In this study we assimilate globally both the ACTIVE and PASSIVE products at a 25 km spatial resolution. The different satellite platforms have different overpass times; an observation is mapped to the hours 00:00, 06:00, 12:00 or 18:00 if it falls within a 3-hour window centred at these times. We use the SURFEX land surface model with the ISBA diffusion scheme for the soil hydrology. For the assimilation routine we apply the Ensemble Transform Kalman Filter (ETKF). The land surface model is driven by perturbed MERRA-2 atmospheric forcing data, which has a temporal resolution of one hour and is mapped to the SURFEX model grid. Bias between the land surface model and the ESA CCI product is removed by cumulative distribution function (CDF) matching. This work is a step towards creating a global root-zone soil moisture product from the most comprehensive satellite surface soil moisture product available. As a first step we consider the period 2010-2016. This allows for comparison against other global root-zone soil moisture products (SMAP Level 4, which is independent of the ESA CCI SM product).
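CDF matching, the bias-removal step named above, is commonly implemented as quantile mapping of the satellite series onto the model's distribution; a sketch under that assumption, with synthetic soil-moisture series:

```python
import numpy as np

def cdf_match(src, ref, n_quantiles=100):
    """Quantile-based CDF matching: rescale `src` (satellite SM) so its
    distribution matches `ref` (model SM) via piecewise-linear mapping
    between the two empirical quantile functions."""
    q = np.linspace(0, 1, n_quantiles)
    src_q = np.quantile(src, q)
    ref_q = np.quantile(ref, q)
    return np.interp(src, src_q, ref_q)

# Toy usage: a biased, differently scaled satellite series vs model climatology
rng = np.random.default_rng(7)
model = rng.beta(4, 6, 2000) * 0.5            # volumetric SM, roughly 0.0-0.5
satellite = 0.1 + 0.6 * rng.beta(2, 5, 2000)  # different mean and spread
matched = cdf_match(satellite, model)
print(f"means: sat={satellite.mean():.3f} model={model.mean():.3f} "
      f"matched={matched.mean():.3f}")
```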
NASA Astrophysics Data System (ADS)
Belabbassi, L.; Garzio, L. M.; Smith, M. J.; Knuth, F.; Vardaro, M.; Kerfoot, J.
2016-02-01
The Ocean Observatories Initiative (OOI), funded by the National Science Foundation, provides users with access to long-term datasets from a variety of deployed oceanographic sensors. The Pioneer Array in the Atlantic Ocean off the coast of New England hosts 10 moorings and 6 gliders. Each mooring is outfitted with 6 to 19 different instruments telemetering more than 1000 data streams. These data are available to science users to collaborate on common scientific goals such as water quality monitoring and measures of the scale variability of continental shelf processes and coastal open-ocean exchanges. To serve this purpose, the acquired datasets undergo an iterative multi-step quality assurance and quality control procedure automated to work with all types of data. Data processing involves several stages, including a fundamental pre-processing step in which the data are prepared for processing. This takes a considerable amount of processing time and is often not given enough thought in development initiatives. The volume and complexity of OOI data necessitate the development of a systematic diagnostic tool to enable the management of a comprehensive data information system for the OOI arrays. We present two examples to demonstrate the current OOI pre-processing diagnostic tool. First, data filtering is used to identify incomplete, incorrect, or irrelevant parts of the data and to replace, modify or delete the coarse data. This provides data consistency with similar datasets in the system. Second, data normalization occurs when the database is organized in fields and tables to minimize redundancy and dependency. At the end of this step, the data are stored in one place to reduce the risk of data inconsistency and to promote easy and efficient mapping to the database.
NASA Astrophysics Data System (ADS)
Christensen, C.; Liu, S.; Scorzelli, G.; Lee, J. W.; Bremer, P. T.; Summa, B.; Pascucci, V.
2017-12-01
The creation, distribution, analysis, and visualization of large spatiotemporal datasets is a growing challenge for the study of climate and weather phenomena, in which increasingly massive domains are utilized to resolve finer features, resulting in datasets that are simply too large to be effectively shared. Existing workflows typically consist of pipelines of independent processes that preclude many possible optimizations. As data sizes increase, these pipelines are difficult or impossible to execute interactively and instead simply run as large offline batch processes. Rather than limiting our conceptualization of such systems to pipelines (or dataflows), we propose a new model for interactive data analysis and visualization systems in which we comprehensively consider the processes involved from data inception through analysis and visualization, in order to describe systems composed of these processes in a manner that facilitates interactive implementations of the entire system rather than of only a particular component. We demonstrate the application of this new model with the implementation of an interactive system that supports progressive execution of arbitrary user scripts for the analysis and visualization of massive, disparately located climate data ensembles. It is currently in operation as part of the Earth System Grid Federation server running at Lawrence Livermore National Lab, and accessible through both web-based and desktop clients. Our system facilitates interactive analysis and visualization of massive remote datasets up to petabytes in size, such as the 3.5 PB 7 km NASA GEOS-5 Nature Run simulation, previously only possible offline or at reduced resolution. To support the community, we have enabled general distribution of our application using public frameworks including Docker and Anaconda.
NASA Astrophysics Data System (ADS)
Theologou, I.; Patelaki, M.; Karantzalos, K.
2015-04-01
Assessing and monitoring water quality status in a timely, cost-effective and accurate manner is of fundamental importance for numerous environmental management and policy making purposes. Therefore, there is a current need for validated methodologies which can effectively exploit, in an unsupervised way, the enormous amount of earth observation imaging datasets from various high-resolution satellite multispectral sensors. To this end, many research efforts are based on building concrete relationships and empirical algorithms from concurrent satellite and in-situ data collection campaigns. We have experimented with Landsat 7 and Landsat 8 multi-temporal satellite data, coupled with hyperspectral data from a field spectroradiometer and in-situ ground truth data with several physico-chemical and other key monitoring indicators. All available datasets, covering a 4-year period in our case study, Lake Karla in Greece, were processed and fused under a quantitative evaluation framework. This comprehensive analysis posed certain questions regarding the applicability of single empirical models across multi-temporal, multi-sensor datasets towards the accurate prediction of key water quality indicators for shallow inland systems. Single linear regression models did not establish concrete relations across multi-temporal, multi-sensor observations. Moreover, the shallower parts of the inland system followed, in accordance with the literature, different regression patterns. Landsat 7 and 8 yielded quite promising results, indicating that from the recreation of the lake onward, consistent per-sensor, per-depth prediction models can be successfully established. The highest rates were for chl-a (r2=89.80%), dissolved oxygen (r2=88.53%), conductivity (r2=88.18%), ammonium (r2=87.2%) and pH (r2=86.35%), while total phosphorus (r2=70.55%) and nitrates (r2=55.50%) showed lower correlations.
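The per-sensor, per-depth empirical models referred to above are, in essence, regressions of in-situ indicators on satellite reflectance. A minimal sketch, assuming scikit-learn and illustrative variable names (not the authors' code):

```python
from sklearn.linear_model import LinearRegression

def fit_indicator_model(reflectance, indicator):
    """Fit one empirical model per sensor and depth class: a linear
    regression of an in-situ water quality indicator (e.g. chl-a) on
    multispectral band reflectance, reporting r^2 as in the abstract."""
    model = LinearRegression().fit(reflectance, indicator)
    return model, model.score(reflectance, indicator)

# usage: one model per (sensor, depth-class) subset of the campaign data
# model, r2 = fit_indicator_model(landsat8_bands, chla_measurements)
```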
NASA Astrophysics Data System (ADS)
Chegwidden, O.; Nijssen, B.; Rupp, D. E.; Kao, S. C.; Clark, M. P.
2017-12-01
We describe results from a large hydrologic climate change dataset developed across the Pacific Northwestern United States and discuss how the analysis of those results can serve as a framework for other large hydrologic ensemble investigations. This investigation will better inform future modeling efforts and large ensemble analyses across domains within and beyond the Pacific Northwest. Using outputs from the Coupled Model Intercomparison Project Phase 5 (CMIP5), we provide projections of hydrologic change for the domain through the end of the 21st century. The dataset is based upon permutations of four methodological choices: (1) ten global climate models, (2) two representative concentration pathways, (3) three meteorological downscaling methods, and (4) four unique hydrologic model set-ups (three of which entail the same hydrologic model using independently calibrated parameter sets). All simulations were conducted across the Columbia River Basin and Pacific coastal drainages at a 1/16th-degree (~6 km) resolution and at a daily timestep. In total, the 172 distinct simulations offer an updated, comprehensive view of climate change projections through the end of the 21st century. The results consist of routed streamflow at 400 sites throughout the domain as well as distributed spatial fields of relevant hydrologic variables like snow water equivalent and soil moisture. In this presentation, we discuss the level of agreement with previous hydrologic projections for the study area and how these projections differ with specific methodological choices. By controlling for some methodological choices we can show how each choice affects key climatic change metrics. We discuss how the spread in results varies across hydroclimatic regimes. We will use this large dataset as a case study for distilling a wide range of hydroclimatological projections into useful climate change assessments.
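To make the size of the design concrete, the four methodological choices define a factorial ensemble, sketched below with purely illustrative names (the full factorial gives 240 combinations; the study reports 172 distinct simulations, so not every combination was run):

```python
from itertools import product

gcms = [f"GCM_{i:02d}" for i in range(1, 11)]        # ten global climate models
rcps = ["RCP4.5", "RCP8.5"]                          # two concentration pathways (names illustrative)
downscaling = ["method_A", "method_B", "method_C"]   # three downscaling methods
hydro = ["model1_params1", "model1_params2", "model1_params3", "model2"]  # four set-ups

ensemble = list(product(gcms, rcps, downscaling, hydro))
print(len(ensemble))  # 240 candidate runs; 172 were actually simulated
```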
Banas, Krzysztof; Banas, Agnieszka; Gajda, Mariusz; Kwiatek, Wojciech M; Pawlicki, Bohdan; Breese, Mark B H
2014-07-15
Assessment of the performance and up-to-date diagnostics of scientific equipment is one of the key components in contemporary laboratories. The most reliable checks are performed by real test experiments while varying the experimental conditions (typically, in the case of infrared spectroscopic measurements, the size of the beam aperture, the duration of the experiment, the spectral range, the scanner velocity, etc.). On the other hand, the stability of the instrument response in time is another key element of great value. Source stability (or easily predictable temporal changes, similar to those observed for synchrotron radiation-based sources working in non-top-up mode) and detector stability (especially in the case of liquid nitrogen- or liquid helium-cooled detectors) should be monitored. In these cases, the recorded datasets (spectra) include additional variables such as the time stamp at which a particular spectrum was recorded (in the case of time trial experiments). A favorable approach to evaluating these data is building a hyperspectral object that consists of all spectra and all additional parameters at which these spectra were recorded. Taking into account that these datasets can be considerably large, there is a need for tools for semiautomatic data evaluation and information extraction. The Comprehensive R Archive Network and the open-source R environment, with their flexibility and growing potential, fit these requirements nicely. In this paper, examples of practical implementation of methods available in R for real-life Fourier transform infrared (FTIR) spectroscopic data problems are presented. This approach could, however, easily be adapted to various other laboratory scenarios and other spectroscopic techniques.
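The hyperspectral-object idea, keeping every spectrum together with the acquisition parameters under which it was recorded, can be mimicked outside R as well. A minimal pandas sketch under assumed column names (the authors work in R; this is only an analogous illustration):

```python
import pandas as pd

def build_hyperspectral_object(spectra, wavenumbers, acquisition_meta):
    """Join FTIR spectra (n_spectra x n_wavenumbers) with per-spectrum
    acquisition parameters (time stamp, aperture, scanner velocity, ...)
    into one table, so time-trial drift can be analysed per parameter."""
    data = pd.DataFrame(spectra, columns=[f"wn_{w}" for w in wavenumbers])
    meta = pd.DataFrame(acquisition_meta)  # e.g. {"timestamp": [...], "aperture_um": [...]}
    return pd.concat([meta.reset_index(drop=True), data], axis=1)
```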
Explain the CERES file naming convention
Atmospheric Science Data Center
2014-12-08
... using the dataset name, configuration code and date information which make each file name unique. A Dataset name consists ...
Agricultural land use alters the seasonality and magnitude of stream metabolism
Streams are active processors of organic carbon; however, spatial and temporal variation in the rates and controls on metabolism are not well quantified in streams draining intensively-farmed landscapes. We present a comprehensive dataset of gross primary production (GPP) and ec...
Eckhard, Ulrich; Huesgen, Pitter F; Schilling, Oliver; Bellac, Caroline L; Butler, Georgina S; Cox, Jennifer H; Dufour, Antoine; Goebeler, Verena; Kappelhoff, Reinhild; Auf dem Keller, Ulrich; Klein, Theo; Lange, Philipp F; Marino, Giada; Morrison, Charlotte J; Prudova, Anna; Rodriguez, David; Starr, Amanda E; Wang, Yili; Overall, Christopher M
2016-06-01
The data described provide a comprehensive resource for the family-wide active site specificity portrayal of the human matrix metalloproteinase family. We used the high-throughput proteomic technique PICS (Proteomic Identification of protease Cleavage Sites) to comprehensively assay 9 different MMPs. We identified more than 4300 peptide cleavage sites, spanning both the prime and non-prime sides of the scissile peptide bond, allowing detailed subsite cooperativity analysis. The proteomic cleavage data were expanded by kinetic analysis using a set of 6 quenched-fluorescent peptide substrates designed using these results. These datasets represent one of the largest specificity profiling efforts with subsequent structural follow-up for any protease family, and they put the spotlight on the specificity similarities and differences of the MMP family. A detailed analysis of these data may be found in Eckhard et al. (2015) [1]. The raw mass spectrometry data and the corresponding metadata have been deposited in PRIDE/ProteomeXchange with the accession number PXD002265.
CircadiOmics: circadian omic web portal.
Ceglia, Nicholas; Liu, Yu; Chen, Siwei; Agostinelli, Forest; Eckel-Mahan, Kristin; Sassone-Corsi, Paolo; Baldi, Pierre
2018-06-15
Circadian rhythms play a fundamental role at all levels of biological organization. Understanding the mechanisms and implications of circadian oscillations continues to be the focus of intense research. However, there has been no comprehensive and integrated way to access and mine all circadian omic datasets. The latest release of CircadiOmics (http://circadiomics.ics.uci.edu) fills this gap by providing the most comprehensive web server for studying circadian data. The newly updated version contains 227 high-throughput omic datasets, corresponding to over 74 million measurements sampled over 24 h cycles. Users can visualize and compare oscillatory trajectories across species, tissues and conditions. Periodicity statistics (e.g. period, amplitude, phase, P-value, q-value) obtained from BIO_CYCLE and other methods are provided for all samples in the repository and can easily be downloaded in the form of publication-ready figures and tables. New features and substantial improvements in performance and data volume make CircadiOmics a powerful web portal for the integrated analysis of circadian omic data.
Rapid underway profiling of water quality in Queensland estuaries.
Hodge, Jonathan; Longstaff, Ben; Steven, Andy; Thornton, Phillip; Ellis, Peter; McKelvie, Ian
2005-01-01
We present an overview of a portable underway water quality monitoring system (RUM: Rapid Underway Monitoring), developed by integrating several off-the-shelf water quality instruments to provide rapid, comprehensive, and spatially referenced 'snapshots' of water quality conditions. We demonstrate the utility of the system with studies in the Northern Great Barrier Reef (Daintree River) and the Moreton Bay region. The Brisbane dataset highlights RUM's utility in characterising plumes as well as its ability to identify the smaller-scale structure of large areas. RUM is shown to be particularly useful when measuring indicators with large small-scale variability, such as turbidity and chlorophyll-a. Additionally, the Daintree dataset shows the ability to integrate other technologies, resulting in a more comprehensive analysis, whilst sampling offshore highlights some of the analytical issues involved in sampling low-concentration waters. RUM is a low-cost, highly flexible solution that can be modified for use in any water type, on most vessels, and is limited only by the available monitoring technologies.
CORUM: the comprehensive resource of mammalian protein complexes
Ruepp, Andreas; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Stransky, Michael; Waegele, Brigitte; Schmidt, Thorsten; Doudieu, Octave Noubibou; Stümpflen, Volker; Mewes, H. Werner
2008-01-01
Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The CORUM (http://mips.gsf.de/genre/proj/corum/index.html) database is a collection of experimentally verified mammalian protein complexes. Information is manually derived by critical reading of the scientific literature by expert annotators. Information about protein complexes includes protein complex names, subunits, and literature references, as well as the function of the complexes. For functional annotation, we use the FunCat catalogue, which enables the protein complex space to be organized into biologically meaningful subsets. The database contains more than 1750 protein complexes built from 2400 different genes, thus representing 12% of the protein-coding genes in humans. A web-based system is available to query, view and download the data. CORUM provides a comprehensive dataset of protein complexes for discoveries in systems biology, analyses of protein networks and protein complex-associated diseases. Comparable to the MIPS reference dataset of protein complexes from yeast, CORUM intends to serve as a reference for mammalian protein complexes. PMID:17965090
Drilling informatics: data-driven challenges of scientific drilling
NASA Astrophysics Data System (ADS)
Yamada, Yasuhiro; Kyaw, Moe; Saito, Sanny
2017-04-01
The primary aim of scientific drilling is to precisely understand the dynamic nature of the Earth. This is why we investigate the subsurface materials (rock and fluid, including the microbial community) existing under particular environmental conditions. This requires sample collection and analytical data production from the samples, and in-situ data measurement at boreholes. Currently available data come from cores, cuttings, mud logging, geophysical logging, and exploration geophysics, but these datasets are difficult to integrate because of their different kinds and scales. We are now producing additional datasets to fill the gaps between the existing data, extracting more information from these datasets, and finally integrating the information. In particular, drilling parameters are very useful datasets for deriving geomechanical properties. We believe such an approach, 'drilling informatics', would be the most appropriate way to obtain a comprehensive and dynamic picture of our scientific targets, such as the seismogenic fault zone and the Moho discontinuity surface. This presentation introduces our initiative and the current achievements of drilling informatics.
Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping.
Pinho, Ana Luísa; Amadon, Alexis; Ruest, Torsten; Fabre, Murielle; Dohmatob, Elvis; Denghien, Isabelle; Ginisty, Chantal; Becuwe-Desmidt, Séverine; Roger, Séverine; Laurier, Laurence; Joly-Testault, Véronique; Médiouni-Cloarec, Gaëlle; Doublé, Christine; Martins, Bernadette; Pinel, Philippe; Eger, Evelyn; Varoquaux, Gaël; Pallier, Christophe; Dehaene, Stanislas; Hertz-Pannier, Lucie; Thirion, Bertrand
2018-06-12
Functional Magnetic Resonance Imaging (fMRI) has furthered brain mapping on perceptual, motor, as well as higher-level cognitive functions. However, to date, no data collection has systematically addressed the functional mapping of cognitive mechanisms at a fine spatial scale. The Individual Brain Charting (IBC) project provides a high-resolution multi-task fMRI dataset intended to form the objective basis toward a comprehensive functional atlas of the human brain. The data refer to a cohort of 12 participants performing many different tasks. The large amount of task-fMRI data on the same subjects yields a precise mapping of the underlying functions, free from both inter-subject and inter-site variability. The present article gives a detailed description of the first release of the IBC dataset. It comprises a dozen tasks, addressing both low- and high-level cognitive functions. This openly available dataset is thus intended to become a reference for cognitive brain mapping.
A studyforrest extension, retinotopic mapping and localization of higher visual areas
Sengupta, Ayan; Kaule, Falko R.; Guntupalli, J. Swaroop; Hoffmann, Michael B.; Häusler, Christian; Stadler, Jörg; Hanke, Michael
2016-01-01
The studyforrest (http://studyforrest.org) dataset is likely the largest neuroimaging dataset on natural language and story processing publicly available today. In this article, along with a companion publication, we present an update of this dataset that extends its scope to vision and multi-sensory research. 15 participants of the original cohort volunteered for a series of additional studies: a clinical examination of visual function, a standard retinotopic mapping procedure, and a localization of higher visual areas—such as the fusiform face area. The combination of this update, the previous data releases for the dataset, and the companion publication, which includes neuroimaging and eye tracking data from natural stimulation with a motion picture, form an extremely versatile and comprehensive resource for brain imaging research—with almost six hours of functional neuroimaging data across five different stimulation paradigms for each participant. Furthermore, we describe employed paradigms and present results that document the quality of the data for the purpose of characterising major properties of participants’ visual processing stream. PMID:27779618
Smith, Tanya; Page-Nicholson, Samantha; Morrison, Kerryn; Gibbons, Bradley; Jones, M Genevieve W; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin; Roxburgh, Lizanne
2016-01-01
The International Crane Foundation (ICF) / Endangered Wildlife Trust's (EWT) African Crane Conservation Programme has recorded 26 403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus, Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus. The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. This is the single most comprehensive dataset published on African crane species and adds new information about the distribution of these three threatened species. We hope this will further aid conservation authorities in monitoring and protecting these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal.
NASA Astrophysics Data System (ADS)
Boyer, T.; Sun, L.; Locarnini, R. A.; Mishonov, A. V.; Hall, N.; Ouellet, M.
2016-02-01
The World Ocean Database (WOD) contains systematically quality controlled historical and recent ocean profile data (temperature, salinity, oxygen, nutrients, carbon cycle variables, biological variables) ranging from Captain Cook's second voyage (1773) to this year's Argo floats. The US National Centers for Environmental Information (NCEI) also hosts the Global Temperature and Salinity Profile Program (GTSPP) Continuously Managed Database (CMD), which provides quality controlled near-real-time ocean profile data and higher-level quality controlled temperature and salinity profiles from 1990 to present. Both databases are used extensively for ocean and climate studies. Synchronization of these two databases will allow easier access and use of comprehensive regional and global ocean profile data sets for ocean and climate studies. Synchronizing consists of two distinct phases: 1) a retrospective comparison of data in WOD and GTSPP to ensure that the most comprehensive and highest quality data set is available to researchers without the need to individually combine and contrast the two datasets, and 2) web services to allow the constantly accruing near-real-time data in the GTSPP CMD and the continuous addition and quality control of historical data in WOD to be made available to researchers together, seamlessly.
Late paleozoic fusulinoidean gigantism driven by atmospheric hyperoxia.
Payne, Jonathan L; Groves, John R; Jost, Adam B; Nguyen, Thienan; Moffitt, Sarah E; Hill, Tessa M; Skotheim, Jan M
2012-09-01
Atmospheric hyperoxia, with pO(2) in excess of 30%, has long been hypothesized to account for late Paleozoic (360-250 million years ago) gigantism in numerous higher taxa. However, this hypothesis has not been evaluated statistically because comprehensive size data have not been compiled previously at sufficient temporal resolution to permit quantitative analysis. In this study, we test the hyperoxia-gigantism hypothesis by examining the fossil record of fusulinoidean foraminifers, a dramatic example of protistan gigantism with some individuals exceeding 10 cm in length and exceeding their relatives by six orders of magnitude in biovolume. We assembled and examined comprehensive regional and global, species-level datasets containing 270 and 1823 species, respectively. A statistical model of size evolution forced by atmospheric pO(2) is conclusively favored over alternative models based on random walks or a constant tendency toward size increase. Moreover, the ratios of volume to surface area in the largest fusulinoideans are consistent in magnitude and trend with a mathematical model based on oxygen transport limitation. We further validate the hyperoxia-gigantism model through an examination of modern foraminiferal species living along a measured gradient in oxygen concentration. These findings provide the first quantitative confirmation of a direct connection between Paleozoic gigantism and atmospheric hyperoxia. © 2012 The Author(s). Evolution © 2012 The Society for the Study of Evolution.
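The oxygen-transport argument behind the volume-to-surface-area comparison can be stated compactly. The following is a hedged reconstruction of the scaling logic, not the authors' exact model: diffusive supply scales with surface area, metabolic demand scales with volume, so the maximum sustainable size grows with pO2.

```latex
% supply = J(pO2) * S  (diffusive flux through the test wall; J increases with pO2)
% demand = m * V       (metabolic oxygen consumption per unit volume)
% viability requires supply >= demand:
\[
J(p_{\mathrm{O}_2})\, S \;\ge\; m\, V
\quad\Longrightarrow\quad
\frac{V}{S} \;\le\; \frac{J(p_{\mathrm{O}_2})}{m},
\]
% so the attainable volume-to-surface ratio, and hence maximum body size,
% rises and falls with atmospheric pO2.
```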
Liu, Ming-Qi; Zeng, Wen-Feng; Fang, Pan; Cao, Wei-Qian; Liu, Chao; Yan, Guo-Quan; Zhang, Yang; Peng, Chao; Wu, Jian-Qiang; Zhang, Xiao-Jin; Tu, Hui-Jun; Chi, Hao; Sun, Rui-Xiang; Cao, Yong; Dong, Meng-Qiu; Jiang, Bi-Yun; Huang, Jiang-Ming; Shen, Hua-Li; Wong, Catherine C L; He, Si-Min; Yang, Peng-Yuan
2017-09-05
The precise and large-scale identification of intact glycopeptides is a critical step in glycoproteomics. Owing to the complexity of glycosylation, the current overall throughput, data quality and accessibility of intact glycopeptide identification lag behind those of routine proteomic analyses. Here, we propose a workflow for the precise high-throughput identification of intact N-glycopeptides at the proteome scale using stepped-energy fragmentation and a dedicated search engine. pGlyco 2.0 conducts comprehensive quality control, including false discovery rate evaluation at all three levels of matches (to glycans, peptides and glycopeptides), improving the current level of accuracy of intact glycopeptide identification. The N-glycoproteome of samples metabolically labeled with 15N/13C was analyzed quantitatively and utilized to validate the glycopeptide identification; this could be used as a novel benchmark pipeline to compare different search engines. Finally, we report a large-scale glycoproteome dataset consisting of 10,009 distinct site-specific N-glycans on 1988 glycosylation sites from 955 glycoproteins in five mouse tissues. Protein glycosylation is a heterogeneous post-translational modification that generates great proteomic diversity and is difficult to analyze. Here the authors describe pGlyco 2.0, a workflow for the precise one-step identification of intact N-glycopeptides at the proteome scale.
Comprehensive cellular‐resolution atlas of the adult human brain
Royall, Joshua J.; Sunkin, Susan M.; Ng, Lydia; Facer, Benjamin A.C.; Lesnar, Phil; Guillozet‐Bongaarts, Angie; McMurray, Bergen; Szafer, Aaron; Dolbeare, Tim A.; Stevens, Allison; Tirrell, Lee; Benner, Thomas; Caldejon, Shiella; Dalley, Rachel A.; Dee, Nick; Lau, Christopher; Nyhus, Julie; Reding, Melissa; Riley, Zackery L.; Sandman, David; Shen, Elaine; van der Kouwe, Andre; Varjabedian, Ani; Write, Michelle; Zollei, Lilla; Dang, Chinh; Knowles, James A.; Koch, Christof; Phillips, John W.; Sestan, Nenad; Wohnoutka, Paul; Zielke, H. Ronald; Hohmann, John G.; Jones, Allan R.; Bernard, Amy; Hawrylycz, Michael J.; Hof, Patrick R.; Fischl, Bruce
2016-01-01
Detailed anatomical understanding of the human brain is essential for unraveling its functional architecture, yet current reference atlases have major limitations such as lack of whole‐brain coverage, relatively low image resolution, and sparse structural annotation. We present the first digital human brain atlas to incorporate neuroimaging, high‐resolution histology, and chemoarchitecture across a complete adult female brain, consisting of magnetic resonance imaging (MRI), diffusion‐weighted imaging (DWI), and 1,356 large‐format cellular resolution (1 µm/pixel) Nissl and immunohistochemistry anatomical plates. The atlas is comprehensively annotated for 862 structures, including 117 white matter tracts and several novel cyto‐ and chemoarchitecturally defined structures, and these annotations were transferred onto the matching MRI dataset. Neocortical delineations were done for sulci, gyri, and modified Brodmann areas to link macroscopic anatomical and microscopic cytoarchitectural parcellations. Correlated neuroimaging and histological structural delineation allowed fine feature identification in MRI data and subsequent structural identification in MRI data from other brains. This interactive online digital atlas is integrated with existing Allen Institute for Brain Science gene expression atlases and is publicly accessible as a resource for the neuroscience community. J. Comp. Neurol. 524:3127–3481, 2016. © 2016 The Authors. The Journal of Comparative Neurology Published by Wiley Periodicals, Inc. PMID:27418273
Chen, Ziyi; Quan, Lijun; Huang, Anfei; Zhao, Qiang; Yuan, Yao; Yuan, Xuye; Shen, Qin; Shang, Jingzhe; Ben, Yinyin; Qin, F Xiao-Feng; Wu, Aiping
2018-01-01
The RNA sequencing approach has been broadly used to provide gene-, pathway-, and network-centric analyses for various cell and tissue samples. However, thus far, the rich cellular information carried in tissue samples has not been thoroughly characterized from RNA-Seq data. Incorporating a cell-centric view of the tissue transcriptome would therefore expand our understanding of the biological processes of the body. Here, a computational model named seq-ImmuCC was developed to infer the relative proportions of 10 major immune cells in mouse tissues from RNA-Seq data. The performance of seq-ImmuCC was evaluated among multiple computational algorithms, transcriptional platforms, and simulated and experimental datasets. The test results showed its stable performance and superb consistency with experimental observations under different conditions. With seq-ImmuCC, we generated a comprehensive landscape of immune cell compositions in 27 normal mouse tissues and extracted distinct signatures of immune cell proportions among various tissue types. Furthermore, we quantitatively characterized and compared 18 different types of mouse tumor tissues of distinct cell origins with respect to their immune cell compositions, which provides a comprehensive and informative measurement of the immune microenvironment inside tumor tissues. The online server of seq-ImmuCC is freely available at http://wap-lab.org:3200/immune/.
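Computational deconvolution of bulk expression into cell-type proportions, the general problem seq-ImmuCC addresses, is often posed as a constrained regression against a cell-type signature matrix. A minimal sketch follows; the actual seq-ImmuCC model was selected among several algorithms and may differ from this non-negative least squares formulation.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_cell_fractions(signature, bulk):
    """Non-negative least squares deconvolution: solve
    bulk ≈ signature @ fractions with fractions >= 0 (genes x cell_types),
    then normalise so the immune-cell proportions sum to one."""
    fractions, _residual = nnls(np.asarray(signature), np.asarray(bulk))
    return fractions / fractions.sum()
```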
Vanhove, Maarten P M; Pariselle, Antoine; Van Steenberge, Maarten; Raeymaekers, Joost A M; Hablützel, Pascal I; Gillardin, Céline; Hellemans, Bart; Breman, Floris C; Koblmüller, Stephan; Sturmbauer, Christian; Snoeks, Jos; Volckaert, Filip A M; Huyse, Tine
2015-09-03
The stunning diversity of cichlid fishes has greatly enhanced our understanding of speciation and radiation. Little is known, however, about the evolution of cichlid parasites. Parasites are abundant components of biodiversity, whose diversity typically exceeds that of their hosts. In the first comprehensive phylogenetic parasitological analysis of a vertebrate radiation, we study monogenean parasites infecting tropheine cichlids from Lake Tanganyika. Monogeneans are flatworms usually infecting the body surface and gills of fishes. In contrast to many other parasites, they depend on only a single host species to complete their lifecycle. Our spatially comprehensive combined nuclear-mitochondrial DNA dataset of the parasites, covering almost all tropheine host species (N = 18), reveals species-rich parasite assemblages and shows consistent host-specificity. Statistical comparisons of host and parasite phylogenies based on distance and topology-based tests demonstrate significant congruence and suggest that host-switching is rare. Molecular rate evaluation indicates that species of Cichlidogyrus probably diverged synchronously with the initial radiation of the tropheines. They further diversified through within-host speciation into an overlooked species radiation. The unique life history and specialisation of certain parasite groups have profound evolutionary consequences. Hence, evolutionary parasitology adds a new dimension to the study of biodiversity hotspots like Lake Tanganyika.
Image processing for optical mapping.
Ravindran, Prabu; Gupta, Aditya
2015-01-01
Optical Mapping is an established single-molecule, whole-genome analysis system, which has been used to gain a comprehensive understanding of genomic structure and to study structural variation of complex genomes. A critical component of the Optical Mapping system is the image processing module, which extracts single-molecule restriction maps from image datasets of immobilized, restriction-digested and fluorescently stained large DNA molecules. In this review, we describe robust and efficient image processing techniques to process these massive datasets and extract accurate restriction maps in the presence of noise, ambiguity and confounding artifacts. We also highlight a few applications of the Optical Mapping system.
Development of an inter-professional screening instrument for cancer patients' education process.
Vaartio-Rajalin, Heli; Huumonen, Tuula; Iire, Liisa; Jekunen, Antti; Leino-Kilpi, Helena; Minn, Heikki; Paloniemi, Jenni; Zabalegui, Adelaida
2016-02-01
The aim of this paper is to describe the development of an inter-professional screening instrument for cancer patients' cognitive resources, knowledge expectations and inter-professional collaboration within patient education. Four empirical datasets from 2012-2014 were analyzed in order to identify the main categories, subcategories and items for the inter-professional screening instrument. Our inter-professional screening instrument integrates the critical moments of cancer patient education and the knowledge expectation types obtained from the patient datasets into the assessment of patients' cognitive resources, knowledge expectations and comprehension, as well as intra- and inter-professional collaboration. Copyright © 2015 Elsevier Inc. All rights reserved.
Polling, C; Tulloch, A; Banerjee, S; Cross, S; Dutta, R; Wood, D M; Dargan, P I; Hotopf, M
2015-07-16
Self-harm is a significant public health concern in the UK. This is reflected in the recent addition to the English Public Health Outcomes Framework of rates of attendance at Emergency Departments (EDs) following self-harm. However, there is currently no source of data to measure this outcome. Routinely available data for inpatient admissions following self-harm miss the majority of cases presenting to services. We aimed (i) to investigate whether a dataset of ED presentations could be produced using a combination of routinely collected clinical and administrative data and (ii) to validate this dataset against another produced using methods similar to those used in previous studies. Using the Clinical Record Interactive Search system, the electronic health records (EHRs) used in four EDs were linked to Hospital Episode Statistics to create a dataset of attendances following self-harm. This dataset was compared with an audit dataset of ED attendances created by manual searching of ED records. The proportion of total cases detected by each dataset was compared. There were 1932 attendances detected by the EHR dataset and 1906 by the audit. The EHR and audit datasets detected 77% and 76% of all attendances, respectively, and both detected 82% of individual patients. There were no differences in terms of age, sex, ethnicity or marital status between those detected and those missed using the EHR method. Both datasets revealed more than double the number of self-harm incidents that could be identified from inpatient admission records. It was possible to use routinely collected EHR data to create a dataset of attendances at EDs following self-harm. The dataset detected the same proportion of attendances and individuals as the audit dataset, proved more comprehensive than the use of inpatient admission records, and did not show a systematic bias in the cases it missed.
Screening for High Conductivity/Low Viscosity Ionic Liquids Using Product Descriptors.
Martin, Shawn; Pratt, Harry D; Anderson, Travis M
2017-07-01
We seek to optimize ionic liquids (ILs) for application to redox flow batteries. As part of this effort, we have developed a computational method for suggesting ILs with high conductivity and low viscosity. Since ILs consist of cation-anion pairs, we consider a method for treating ILs as pairs using product descriptors for QSPRs, a concept borrowed from the prediction of protein-protein interactions in bioinformatics. We demonstrate the method by predicting electrical conductivity, viscosity, and melting point on a dataset taken from the ILThermo database on June 18th, 2014. The dataset consists of 4,329 measurements taken from 165 ILs made up of 72 cations and 34 anions. We benchmark our QSPRs on the known values in the dataset, then extend our predictions to screen all 2,448 possible cation-anion pairs in the dataset. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
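The product-descriptor trick for treating an IL as a cation-anion pair can be sketched in a few lines: the pair feature vector is the flattened outer product of the two ions' descriptor vectors, which lets a standard QSPR regressor score any candidate pair. Descriptor values and names below are purely illustrative.

```python
import numpy as np

def product_descriptor(cation_desc, anion_desc):
    """Pair descriptor for a cation-anion ionic liquid: the flattened
    outer product of the two ions' descriptor vectors, borrowed from
    protein-protein interaction prediction."""
    return np.outer(cation_desc, anion_desc).ravel()

# screening all pairs: 72 cations x 34 anions = 2,448 candidate ILs
rng = np.random.default_rng(0)
cations = {f"c{i}": rng.random(5) for i in range(72)}   # illustrative descriptors
anions = {f"a{j}": rng.random(5) for j in range(34)}
X = [product_descriptor(cd, ad) for cd in cations.values() for ad in anions.values()]
print(len(X))  # 2448 pair descriptors, ready for a fitted QSPR model
```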
Screening for High Conductivity/Low Viscosity Ionic Liquids Using Product Descriptors
Martin, Shawn; Pratt, III, Harry D.; Anderson, Travis M.
2017-02-21
We seek to optimize ionic liquids (ILs) for application to redox flow batteries. As part of this effort, we have developed a computational method for suggesting ILs with high conductivity and low viscosity. Since ILs consist of cation-anion pairs, we consider a method for treating ILs as pairs using product descriptors for QSPRs, a concept borrowed from the prediction of protein-protein interactions in bioinformatics. We demonstrate the method by predicting electrical conductivity, viscosity, and melting point on a dataset taken from the ILThermo database on June 18th, 2014. The dataset consists of 4,329 measurements taken from 165 ILs made up of 72 cations and 34 anions. In conclusion, we benchmark our QSPRs on the known values in the dataset, then extend our predictions to screen all 2,448 possible cation-anion pairs in the dataset.
Scheuch, Matthias; Höper, Dirk; Beer, Martin
2015-03-03
Fuelled by the advent and subsequent development of next-generation sequencing technologies, metagenomics became a powerful tool for the analysis of microbial communities both scientifically and diagnostically. The biggest challenge is the extraction of relevant information from the huge sequence datasets generated for metagenomics studies. Although a plethora of tools are available, data analysis is still a bottleneck. To overcome this bottleneck, we developed an automated computational workflow called RIEMS - Reliable Information Extraction from Metagenomic Sequence datasets. RIEMS assigns every individual read sequence within a dataset taxonomically by cascading different sequence analyses with decreasing stringency of the assignments, using various software applications. After completion of the analyses, the results are summarised in a clearly structured result protocol organised taxonomically. The high accuracy and performance of RIEMS analyses were proven in comparison with other tools for metagenomics data analysis using simulated sequencing read datasets. RIEMS has the potential to fill the gap that still exists with regard to data analysis for metagenomics studies. The usefulness and power of RIEMS for the analysis of genuine sequencing datasets was demonstrated with an early version of RIEMS in 2011, when it was used to detect the orthobunyavirus sequences that led to the discovery of Schmallenberg virus.
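The core of RIEMS, cascading analyses of decreasing stringency until each read is assigned, can be paraphrased in a few lines. The classifier interface below is an assumption made for illustration, not the published workflow code.

```python
def cascade_assign(read, classifiers):
    """Try classifiers ordered from most to least stringent; the first
    confident hit assigns the read, so looser (usually more expensive)
    stages only ever see still-unassigned reads."""
    for name, classify, min_score in classifiers:
        taxon, score = classify(read)
        if taxon is not None and score >= min_score:
            return taxon, name
    return None, "unassigned"

def run_workflow(reads, classifiers):
    """Assign every read individually, then tally the result protocol."""
    return {read_id: cascade_assign(seq, classifiers)
            for read_id, seq in reads.items()}
```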
Thermodynamic Data Rescue and Informatics for Deep Carbon Science
NASA Astrophysics Data System (ADS)
Zhong, H.; Ma, X.; Prabhu, A.; Eleish, A.; Pan, F.; Parsons, M. A.; Ghiorso, M. S.; West, P.; Zednik, S.; Erickson, J. S.; Chen, Y.; Wang, H.; Fox, P. A.
2017-12-01
A large number of legacy datasets are contained in geoscience literature published between 1930 and 1980 and are not expressed external to the publication text in digitized formats. Extracting, organizing, and reusing these "dark" datasets is highly valuable for many within the Earth and planetary science community. As a part of the Deep Carbon Observatory (DCO) data legacy missions, the DCO Data Science Team and the Extreme Physics and Chemistry community identified thermodynamic datasets related to carbon, or more specifically datasets about the enthalpy and entropy of chemicals, as a proof-of-principle analysis. The data science team developed a semi-automatic workflow, which includes identifying relevant publications, extracting contained datasets using OCR methods, collaborative reviewing, and registering the datasets via the DCO Data Portal, where the 'Linked Data' feature of the data portal provides a mechanism for connecting rescued datasets beyond their individual data sources to research domains, DCO Communities, and more, making data discovery and retrieval more effective. To date, the team has successfully rescued, deposited and registered additional datasets from publications with thermodynamic sources. These datasets contain 3 main types of data: (1) heat content or enthalpy data determined for a given compound as a function of temperature using high-temperature calorimetry, (2) heat content or enthalpy data determined for a given compound as a function of temperature using adiabatic calorimetry, and (3) direct determination of heat capacity of a compound as a function of temperature using differential scanning calorimetry. The data science team integrated these datasets and delivered a spectrum of data analytics including visualizations, which will lead to a comprehensive characterization of the thermodynamics of carbon and carbon-related materials.
Jumpponen, Ari; Brown, Shawn P.; Trappe, James M.; Cázares, Efrén; Strömmer, Rauni
2015-01-01
Periglacial substrates exposed by retreating glaciers represent extreme and sensitive environments defined by a variety of abiotic stressors that challenge organismal establishment and survival. The simple communities often residing at these sites enable in-depth analyses. We utilized existing data and mined published sporocarp and morphotyped ectomycorrhiza (ECM) records, as well as environmental sequence data of the internal transcribed spacer (ITS) and large subunit (LSU) regions of the ribosomal RNA gene, to identify taxa that occur at a glacier forefront in the North Cascades Mountains in Washington State, USA. The discrete data types consistently identified several common and widely distributed genera, perhaps best exemplified by Inocybe and Laccaria. Although we expected low diversity and richness, our environmental sequence data included 37 ITS and 26 LSU operational taxonomic units (OTUs) that likely form ECM. While environmental surveys of metabarcode markers detected large numbers of targeted ECM taxa, both the fruiting body and the morphotype datasets included genera that were undetected in either of the metabarcode datasets. These included hypogeous (Hymenogaster) and epigeous (Lactarius) taxa, some of which may produce large sporocarps but may possess small and/or spatially patchy genets. We highlight the importance of combining various data types to provide a comprehensive view of a fungal community, even in an environment assumed to host communities of low species richness and diversity. PMID:29376900
Paraboschi, Elvezia Maria; Cardamone, Giulia; Rimoldi, Valeria; Gemmati, Donato; Spreafico, Marta; Duga, Stefano; Soldà, Giulia; Asselta, Rosanna
2015-09-30
Abnormalities in RNA metabolism and alternative splicing (AS) are emerging as important players in complex disease phenotypes. In particular, accumulating evidence suggests the existence of pathogenic links between multiple sclerosis (MS) and altered AS, including functional studies showing that an imbalance in alternatively-spliced isoforms may contribute to disease etiology. Here, we tested whether the altered expression of AS-related genes represents an MS-specific signature. A comprehensive comparative analysis of gene expression profiles of publicly-available microarray datasets (190 MS cases, 182 controls), followed by gene-ontology enrichment analysis, highlighted a significant enrichment for differentially-expressed genes involved in RNA metabolism/AS. In detail, a total of 17 genes were found to be differentially expressed in MS in multiple datasets, with CELF1 being dysregulated in five out of seven studies. We confirmed CELF1 downregulation in MS (p=0.0015) by real-time RT-PCR on RNA extracted from blood cells of 30 cases and 30 controls. As a proof of concept, we experimentally verified the imbalance in alternatively-spliced isoforms in MS of the NFAT5 gene, a putative CELF1 target. In conclusion, for the first time we provide evidence of a consistent dysregulation of splicing-related genes in MS, and we discuss its possible implications in modulating specific AS events in MS susceptibility genes.
Hydrologic Derivatives for Modeling and Analysis—A new global high-resolution database
Verdin, Kristine L.
2017-07-17
The U.S. Geological Survey has developed a new global high-resolution hydrologic derivative database. Loosely modeled on the HYDRO1k database, this new database, entitled Hydrologic Derivatives for Modeling and Analysis, provides comprehensive and consistent global coverage of topographically derived raster layers (digital elevation model data, flow direction, flow accumulation, slope, and compound topographic index) and vector layers (streams and catchment boundaries). The coverage of the data is global, and the underlying digital elevation model is a hybrid of three datasets: HydroSHEDS (Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales), GMTED2010 (Global Multi-resolution Terrain Elevation Data 2010), and the SRTM (Shuttle Radar Topography Mission). For most of the globe south of 60°N., the raster resolution of the data is 3 arc-seconds, corresponding to the resolution of the SRTM. For the areas north of 60°N., the resolution is 7.5 arc-seconds (the highest resolution of the GMTED2010 dataset) except for Greenland, where the resolution is 30 arc-seconds. The streams and catchments are attributed with Pfafstetter codes, based on a hierarchical numbering system, that carry important topological information. This database is appropriate for use in continental-scale modeling efforts. The work described in this report was conducted by the U.S. Geological Survey in cooperation with the National Aeronautics and Space Administration Goddard Space Flight Center.
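One reason the Pfafstetter attribution mentioned above is valuable is that upstream/downstream topology can be tested from the codes alone. Below is a sketch of the commonly cited rule (presented as an assumption drawn from the Pfafstetter literature, not from this database's documentation; consult Verdin-style references for edge cases such as codes of unequal length):

```python
def is_upstream(a: str, b: str) -> bool:
    """Commonly cited Pfafstetter property: catchment `a` drains through
    catchment `b` iff, at the first digit where the codes differ, a's digit
    exceeds b's and every remaining digit of b (odd = main-stem interbasin)
    is odd. Codes are strings of decimal digits at the same level."""
    for i, (da, db) in enumerate(zip(a, b)):
        if da != db:
            return da > db and all(int(d) % 2 == 1 for d in b[i:])
    return False  # identical or nested codes: no strict upstream relation

# e.g. is_upstream("8835", "8833") -> True: interbasin 5 lies above interbasin 3
```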
Acoustic Seabed Characterization of the Porcupine Bank, Irish Margin
NASA Astrophysics Data System (ADS)
O'Toole, Ronan; Monteys, Xavier
2010-05-01
The Porcupine Bank represents a large section of continental shelf situated west of the Irish landmass, located in water depths ranging between 150 and 500 m. Under the Irish National Seabed Survey (INSS, 1999-2006) this area was comprehensively mapped, generating multiple acoustic datasets including high resolution multibeam echosounder data. The unique nature of the area's datasets in terms of data density, consistency and geographic extent has allowed the development of a large-scale integrated physical characterization of the Porcupine Bank for multidisciplinary applications. Integrated analysis of backscatter and bathymetry data has resulted in a baseline delineation of sediment distribution, seabed geology and geomorphological features on the bank, along with an inclusive set of related database information. The methodology used incorporates a variety of statistical techniques which are necessary in isolating sonar system artefacts and addressing sonar geometry related issues. A number of acoustic backscatter parameters at several angles of incidence have been analysed in order to complement the characterization for both surface and subsurface sediments. Acoustic sub-bottom records have also been incorporated in order to investigate the physical characteristics of certain features on the Porcupine Bank. Where available, groundtruthing information in terms of sediment samples, video footage and cores has been applied to add physical descriptors and validation to the characterization. Extensive mapping of different rock outcrops, sediment drifts, seabed features and other geological classes has been achieved using this methodology.
Improving information retrieval in functional analysis.
Rodriguez, Juan C; González, Germán A; Fresno, Cristóbal; Llera, Andrea S; Fernández, Elmer A
2016-12-01
Transcriptome analysis is essential to understand the mechanisms regulating key biological processes and functions. The first step usually consists of identifying candidate genes; to find out which pathways are affected by those genes, however, functional analysis (FA) is mandatory. The most frequently used strategies for this purpose are Gene Set and Singular Enrichment Analysis (GSEA and SEA) over Gene Ontology. Several statistical methods have been developed and compared in terms of computational efficiency and/or statistical appropriateness. However, whether their results are similar or complementary, their sensitivity to parameter settings, and possible bias in the analyzed terms have not been addressed so far. Here, two GSEA and four SEA methods and their parameter combinations were evaluated in six datasets by comparing two breast cancer subtypes with well-known differences in genetic background and patient outcomes. We show that GSEA and SEA lead to different results depending on the chosen statistic, model and/or parameters. Both approaches provide complementary results from a biological perspective. Hence, an Integrative Functional Analysis (IFA) tool is proposed to improve information retrieval in FA. It provides a common gene expression analytic framework that grants a comprehensive and coherent analysis. Only a minimal user parameter setting is required, since the best SEA/GSEA alternatives are integrated. IFA utility was demonstrated by evaluating four prostate cancer and the TCGA breast cancer microarray datasets, which showed its biological generalization capabilities. Copyright © 2016 Elsevier Ltd. All rights reserved.
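For readers unfamiliar with SEA, its statistical core is a one-sided over-representation test per GO term. A minimal sketch, assuming a hypergeometric model (one of several statistics such tools compare; not the IFA implementation itself):

```python
from scipy.stats import hypergeom

def sea_pvalue(universe, term_genes, candidates):
    """One-sided over-representation p-value for a GO term: probability of
    observing at least the seen overlap between candidate genes and
    term-annotated genes under random sampling from the gene universe."""
    overlap = len(candidates & term_genes)
    return hypergeom.sf(overlap - 1, len(universe),
                        len(term_genes), len(candidates))

# usage with sets of gene identifiers:
# p = sea_pvalue(all_genes, go_term_members, differentially_expressed)
```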
Global surface displacement data for assessing variability of displacement at a point on a fault
Hecker, Suzanne; Sickler, Robert; Feigelson, Leah; Abrahamson, Norman; Hassett, Will; Rosa, Carla; Sanquini, Ann
2014-01-01
This report presents a global dataset of site-specific surface-displacement data on faults. We have compiled estimates of successive displacements attributed to individual earthquakes, mainly paleoearthquakes, at sites where two or more events have been documented, as a basis for analyzing inter-event variability in surface displacement on continental faults. An earlier version of this composite dataset was used in a recent study relating the variability of surface displacement at a point to the magnitude-frequency distribution of earthquakes on faults, and to hazard from fault rupture (Hecker and others, 2013). The purpose of this follow-on report is to provide potential data users with an updated comprehensive dataset, largely complete through 2010 for studies in English-language publications, as well as in some unpublished reports and abstract volumes.
Assessment of mangrove forests in the Pacific region using Landsat imagery
NASA Astrophysics Data System (ADS)
Bhattarai, Bibek; Giri, Chandra
2011-01-01
The information on the mangrove forests of the Pacific region is scarce or outdated. A regional assessment based on a consistent methodology and data sources was needed to understand their true extent. Our investigation offers a regionally consistent, high-resolution (30 m), and comprehensive mapping of mangrove forests on the islands of American Samoa, Fiji, French Polynesia, Guam, Hawaii, Kiribati, Marshall Islands, Micronesia, Nauru, New Caledonia, Northern Mariana Islands, Palau, Papua New Guinea, Samoa, Solomon Islands, Tonga, Tuvalu, Vanuatu, and Wallis and Futuna Islands for the year 2000. We employed a hybrid supervised and unsupervised image classification technique on a total of 128 Landsat scenes gathered between 1999 and 2004, and validated the results using existing geographic information system (GIS) datasets, high-resolution imagery, and published literature. We also draw a comparative analysis with the mangrove forest inventory published by the Food and Agriculture Organization (FAO) of the United Nations. Our estimate shows a total of 623,755 hectares of mangrove forests in the Pacific region, an increase of 18% over FAO's estimates. Although mangrove forests are disproportionately distributed toward a few larger islands in the western Pacific, they are also significant on many smaller islands.
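A hybrid supervised and unsupervised classification of this kind typically clusters the imagery first and then labels the clusters with trained samples. A scikit-learn sketch under assumed inputs follows (k-means standing in for an ISODATA-style clustering; this is not the authors' exact procedure):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def hybrid_classify(pixels, train_pixels, train_labels, n_clusters=50):
    """Cluster all pixel spectra, train a supervised classifier on the
    labelled samples, then give each cluster the majority class that its
    member pixels receive (mangrove vs. non-mangrove, coded 1/0)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pixels)
    clf = RandomForestClassifier(n_estimators=100).fit(train_pixels, train_labels)
    pixel_pred = clf.predict(pixels)
    cluster_class = {c: np.bincount(pixel_pred[clusters == c]).argmax()
                     for c in np.unique(clusters)}
    return np.array([cluster_class[c] for c in clusters])
```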
Computer Simulation of Classic Studies in Psychology.
ERIC Educational Resources Information Center
Bradley, Drake R.
This paper describes DATASIM, a comprehensive software package which generates simulated data for actual or hypothetical research designs. DATASIM is primarily intended for use in statistics and research methods courses, where it is used to generate "individualized" datasets for students to analyze, and later to correct their answers.…
NASA Technical Reports Server (NTRS)
Liu, Zhong; Ostrenga, D.; Teng, W. L.; Trivedi, Bhagirath; Kempler, S.
2012-01-01
The NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) is home to global precipitation product archives, in particular, the Tropical Rainfall Measuring Mission (TRMM) products. TRMM is a joint U.S.-Japan satellite mission to monitor tropical and subtropical (40°S - 40°N) precipitation and to estimate its associated latent heating. The TRMM satellite provides the first detailed and comprehensive dataset on the four-dimensional distribution of rainfall and latent heating over vastly undersampled tropical and subtropical oceans and continents. The TRMM satellite was launched on November 27, 1997. TRMM data products are archived at and distributed by the GES DISC. The newly released TRMM Version 7 includes several changes, such as new parameters, new products, metadata, and data structures. For example, hydrometeor profiles in 2A12 now have 28 layers (14 in V6). New parameters have been added to several popular Level-3 products, such as 3B42 and 3B43. Version 2.2 of the Global Precipitation Climatology Project (GPCP) dataset has been added to the TRMM Online Visualization and Analysis System (TOVAS; URL: http://disc2.nascom.nasa.gov/Giovanni/tovas/), allowing online analysis and visualization without downloading data and software. The GPCP dataset extends back to 1979. Version 3 of the Global Precipitation Climatology Centre (GPCC) monitoring product has been updated in TOVAS as well. This product provides global gauge-based monthly rainfall along with the number of gauges per grid. The dataset begins in January 1986. To facilitate data and information access and to support precipitation research and applications, we have developed a Precipitation Data and Information Services Center (PDISC; URL: http://disc.gsfc.nasa.gov/precipitation). In addition to TRMM, PDISC provides current and past observational precipitation data. Users can access precipitation data archives consisting of both remote sensing and in-situ observations, and can use these data products to conduct a wide variety of activities, including case studies, model evaluation, and uncertainty investigation. To support Earth science applications, PDISC provides users near-real-time precipitation products over the Internet. At PDISC, users can access tools and software; documentation, FAQs and assistance are also available. Other capabilities include: 1) Mirador (http://mirador.gsfc.nasa.gov/), a simplified interface for searching, browsing, and ordering Earth science data at the GES DISC, designed to be fast and easy to learn; 2) TOVAS; 3) NetCDF data download for the GIS community; 4) data via OPeNDAP (http://disc.sci.gsfc.nasa.gov/services/opendap/), which provides remote access to individual variables within datasets in a form usable by many tools, such as IDV, McIDAS-V, Panoply, Ferret and GrADS; and 5) the Open Geospatial Consortium (OGC) Web Map Service (WMS) (http://disc.sci.gsfc.nasa.gov/services/wxs_ogc.shtml), an interface that allows the use of data and enables clients to build customized maps with data coming from different networks.
Roon, David A.; Waits, L.P.; Kendall, K.C.
2005-01-01
Non-invasive genetic sampling (NGS) is becoming a popular tool for population estimation. However, multiple NGS studies have demonstrated that polymerase chain reaction (PCR) genotyping errors can bias demographic estimates. These errors can be detected by comprehensive data filters such as the multiple-tubes approach, but this approach is expensive and time consuming as it requires three to eight PCR replicates per locus. Thus, researchers have attempted to correct PCR errors in NGS datasets using non-comprehensive error checking methods, but these approaches have not been evaluated for reliability. We simulated NGS studies with and without PCR error and 'filtered' datasets using non-comprehensive approaches derived from published studies, and calculated mark-recapture estimates using CAPTURE. In the absence of data filtering, simulated error resulted in serious inflations in CAPTURE estimates; some estimates exceeded N by ≥ 200%. When data filters were used, CAPTURE estimate reliability varied with the per-locus error rate (E). At E = 0.01, CAPTURE estimates from filtered data displayed < 5% deviance from error-free estimates. When E was 0.05 or 0.09, some CAPTURE estimates from filtered data displayed biases in excess of 10%. Biases were positive at high sampling intensities; negative biases were observed at low sampling intensities. We caution researchers against using non-comprehensive data filters in NGS studies, unless they can achieve baseline per-locus error rates below 0.05 and, ideally, near 0.01. However, we suggest that data filters can be combined with careful technique and thoughtful NGS study design to yield accurate demographic information. © 2005 The Zoological Society of London.
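The mechanism by which PCR error inflates the estimates is easy to simulate: a mistyped allele creates a "ghost" genotype that looks like a new individual. Below is a toy sketch under assumed genotype encodings (not the authors' simulation code):

```python
import random

def observe_genotype(true_genotype, per_locus_error):
    """Return the recorded multilocus genotype after applying a per-locus
    error rate E: with probability E a locus is mistyped, potentially
    creating a 'ghost' individual in the mark-recapture dataset."""
    observed = []
    for allele_pair in true_genotype:            # e.g. (152, 156) per locus
        if random.random() < per_locus_error:
            a, b = allele_pair
            observed.append((a + random.choice((-2, 2)), b))  # false allele
        else:
            observed.append(allele_pair)
    return tuple(observed)

# counting distinct observed genotypes then overstates the true N when E > 0
```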
Smith, Tanya; Page-Nicholson, Samantha; Gibbons, Bradley; Jones, M. Genevieve W.; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin
2016-01-01
Background The International Crane Foundation (ICF) / Endangered Wildlife Trust’s (EWT) African Crane Conservation Programme has recorded 26 403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus, Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus. The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. New information This is the single most comprehensive dataset published on African crane species and adds new information about the distribution of these three threatened species. We hope this will further aid conservation authorities in monitoring and protecting these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal. PMID:27956850
Comparative analysis and assessment of M. tuberculosis H37Rv protein-protein interaction datasets
2011-01-01
Background M. tuberculosis is a formidable bacterial pathogen. There is thus an increasing demand for understanding the function and relationship of proteins in various strains of M. tuberculosis. Protein-protein interaction (PPI) data are crucial for this kind of knowledge. However, the quality of the main available M. tuberculosis PPI datasets is unclear. This hampers the effectiveness of research works that rely on these PPI datasets. Here, we analyze the two main available M. tuberculosis H37Rv PPI datasets. The first dataset is the high-throughput B2H PPI dataset from Wang et al.'s recent paper in the Journal of Proteome Research. The second dataset is from the STRING database, version 8.3, consisting entirely of H37Rv PPIs predicted using various methods. We find that these two datasets have a surprisingly low level of agreement. We postulate the following causes for this low level of agreement: (i) the H37Rv B2H PPI dataset is of low quality; (ii) the H37Rv STRING PPI dataset is of low quality; and/or (iii) the H37Rv STRING PPIs are predictions of other forms of functional associations rather than direct physical interactions. Results To test the quality of these two datasets, we evaluate them based on correlated gene expression profiles, coherent informative GO term annotations, and conservation in other organisms. We observe a significantly greater portion of PPIs in the H37Rv STRING PPI dataset (with score ≥ 770) having correlated gene expression profiles and coherent informative GO term annotations in both interaction partners than in the H37Rv B2H PPI dataset. Predicted H37Rv interologs derived from non-M. tuberculosis experimental PPIs are much more similar to the H37Rv STRING functional associations dataset (with score ≥ 770) than to the H37Rv B2H PPI dataset. H37Rv predicted physical interologs from IntAct also show extremely low similarity with the H37Rv B2H PPI dataset; this similarity level is much lower than that between the S. aureus MRSA252 predicted physical interologs from IntAct and S. aureus MRSA252 pull-down PPIs. Comparative analysis with several representative two-hybrid PPI datasets in other species further confirms that the H37Rv B2H PPI dataset is of low quality. Next, to test the possibility that the H37Rv STRING PPIs are not purely direct physical interactions, we compare M. tuberculosis H37Rv protein pairs that catalyze adjacent steps in enzymatic reactions to B2H PPIs and predicted PPIs in STRING; these adjacent-step pairs show much lower similarity to the B2H PPIs than to the STRING PPIs. This result strongly suggests that the H37Rv STRING PPIs more likely correspond to indirect relationships between protein pairs than to direct physical interactions. For more precise support, we turn to S. cerevisiae for its comprehensively studied interactome. We compare S. cerevisiae predicted PPIs in STRING to three independent protein relationship datasets which respectively comprise PPIs reported in Y2H assays, protein pairs reported to be in the same protein complexes, and protein pairs that catalyze successive reaction steps in enzymatic reactions. Our analysis reveals that S. cerevisiae predicted STRING PPIs have much higher similarity to the latter two types of protein pairs than to two-hybrid PPIs. As H37Rv STRING PPIs are predicted using similar methods as S. cerevisiae predicted STRING PPIs, this suggests that the H37Rv STRING PPIs are also more likely to correspond to the latter two types of protein pairs than to two-hybrid PPIs.
Conclusions The H37Rv B2H PPI dataset has low quality. It should not be used as the gold standard to assess the quality of other (possibly predicted) H37Rv PPI datasets. The H37Rv STRING PPI dataset also has low quality; nevertheless, a subset consisting of STRING PPIs with score ≥770 has satisfactory quality. However, these STRING “PPIs” should be interpreted as functional associations, which include a substantial portion of indirect protein interactions, rather than direct physical interactions. These two factors cause the strikingly low similarity between these two main H37Rv PPI datasets. The results and conclusions from this comparative analysis provide valuable guidance in using these M. tuberculosis H37Rv PPI datasets in subsequent studies for a wide range of purposes. PMID:22369691
Zhai, Xuetong; Chakraborty, Dev P
2017-06-01
The objective was to design and implement a bivariate extension to the contaminated binormal model (CBM) to fit paired receiver operating characteristic (ROC) datasets (possibly degenerate) with proper ROC curves. Paired datasets yield two correlated ratings per case. Degenerate datasets have no interior operating points, and proper ROC curves do not inappropriately cross the chance diagonal. The existing method, developed more than three decades ago, utilizes a bivariate extension to the binormal model, implemented in the CORROC2 software, which yields improper ROC curves and cannot fit degenerate datasets. CBM can fit proper ROC curves to unpaired (i.e., yielding one rating per case) and degenerate datasets, and there is a clear scientific need to extend it to handle paired datasets. In CBM, nondiseased cases are modeled by a probability density function (pdf) consisting of a unit variance peak centered at zero. Diseased cases are modeled with a mixture distribution whose pdf consists of two unit variance peaks, one centered at positive μ with integrated probability α, the mixing fraction parameter, corresponding to the fraction of diseased cases where the disease was visible to the radiologist, and one centered at zero, with integrated probability (1-α), corresponding to disease that was not visible. It is shown that: (a) for nondiseased cases the bivariate extension is a unit-variance bivariate normal distribution centered at (0,0) with a specified correlation ρ1; (b) for diseased cases the bivariate extension is a mixture distribution with four peaks, corresponding to disease not visible in either condition, disease visible in only one condition (contributing two peaks), and disease visible in both conditions. An expression for the likelihood function is derived. A maximum likelihood estimation (MLE) algorithm, CORCBM, was implemented in the R programming language that yields parameter estimates, the covariance matrix of the parameters, and other statistics. A limited simulation validation of the method was performed. CORCBM and CORROC2 were applied to two datasets containing nine readers each contributing paired interpretations. CORCBM successfully fitted the data for all readers, whereas CORROC2 failed to fit a degenerate dataset. All fits were visually reasonable. All CORCBM fits were proper, whereas all CORROC2 fits were improper. CORCBM and CORROC2 were in agreement (a) in declaring only one of the nine readers as having significantly different performances in the two modalities; (b) in estimating higher correlations for diseased cases than for nondiseased ones; and (c) in finding that the intermodality correlation estimates for nondiseased cases were consistent between the two methods. All CORCBM fits yielded higher area under the curve (AUC) than the CORROC2 fits, consistent with the fact that a proper ROC model like CORCBM is based on a likelihood-ratio-equivalent decision variable and consequently yields higher performance than the binormal model-based CORROC2. The method gave satisfactory fits to four simulated datasets. CORCBM is a robust method for fitting paired ROC datasets, always yielding proper ROC curves, and able to fit degenerate datasets. © 2017 American Association of Physicists in Medicine.
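The univariate CBM densities described in the abstract can be stated compactly; the following LaTeX restates them from the definitions given above, with φ denoting the standard normal pdf:

```latex
% Univariate CBM densities: \phi is the standard normal pdf, \mu the
% separation of visible disease, \alpha the mixing fraction.
f_{\text{nondiseased}}(x) = \phi(x), \qquad
f_{\text{diseased}}(x) = \alpha\,\phi(x-\mu) + (1-\alpha)\,\phi(x)
```

The bivariate extension replaces each peak with a bivariate normal component, yielding the single nondiseased peak at (0,0) and the four diseased peaks enumerated above.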
Improving the discoverability, accessibility, and citability of omics datasets: a case report.
Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J
2017-03-01
Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Micro RNA as a potential blood-based epigenetic biomarker for Alzheimer's disease.
Fransquet, Peter D; Ryan, Joanne
2018-06-06
As the prevalence of Alzheimer's disease (AD) increases, the search for a definitive, easy-to-access diagnostic biomarker has become increasingly important. Micro RNA (miRNA), involved in the epigenetic regulation of protein synthesis, is a biological marker which varies in association with a number of disease states, possibly including AD. Here we comprehensively review methods and findings from 26 studies comparing the measurement of miRNA in blood between AD cases and controls. Thirteen of these studies used receiver operator characteristic (ROC) analysis to determine the diagnostic accuracy of identified miRNA to predict AD, and three studies did this with a machine learning approach. Of 8098 individually measured miRNAs, 23 that were differentially expressed between AD cases and controls were found to be significant in two or more studies. Only six of these were consistent in their direction of expression between studies (miR-107, miR-125b, miR-146a, miR-181c, miR-29b, and miR-342), and they were all shown to be downregulated in individuals with AD compared to controls. Of these directionally concordant miRNAs, the strongest evidence was for miR-107, which has also been shown in previous studies to be involved in the dysregulation of proteins involved in aspects of AD pathology, as well as being consistently downregulated in studies of AD brains. We conclude that standardised methods of measurement, appropriate statistical analysis, the utilization of large datasets with machine learning approaches, and comprehensive reporting of findings are urgently needed for the discovery of reliable and replicable miRNA biomarkers of AD. Copyright © 2017. Published by Elsevier Inc.
Lee, Mikyung; Huang, Ruili; Tong, Weida
2016-01-01
Nuclear receptors (NRs) are ligand-activated transcriptional regulators that play vital roles in key biological processes such as growth, differentiation, metabolism, reproduction, and morphogenesis. Disruption of NRs can result in adverse health effects such as NR-mediated endocrine disruption. A comprehensive understanding of the core transcriptional targets regulated by NRs helps to elucidate their key biological processes in both toxicological and therapeutic aspects. In this study, we applied a probabilistic graphical model to identify the transcriptional targets of NRs and the biological processes they govern. The Tox21 program profiled a collection of approximately 10 000 environmental chemicals and drugs against a panel of human NRs in a quantitative high-throughput screening format for their NR disruption potential. The Japanese Toxicogenomics Project, one of the most comprehensive efforts in the field of toxicogenomics, generated large-scale gene expression profiles on the effects of 131 compounds (in its first phase of study) at various doses and durations, and their combinations. We applied an author-topic model to these two toxicological datasets, which consist of 11 NRs run in agonist and/or antagonist mode (18 assays in total) and 203 in vitro human gene expression profiles connected by 52 shared drugs. As a result, a set of clusters (topics), each consisting of a set of NRs and their associated target genes, was determined. Various transcriptional targets of the NRs were identified by assays run in either agonist or antagonist mode. Our results were validated by functional analysis and compared with TRANSFAC data. In summary, our approach resulted in effective identification of associated/affected NRs and their target genes, providing biologically meaningful hypotheses embedded in their relationships. PMID:26643261
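As an illustration of the author-topic model family applied in this kind of setting, the sketch below uses gensim's AuthorTopicModel with NR assays in the role of "authors" and discretized expression profiles in the role of documents; all names and data are toy placeholders, not the study's actual pipeline.

```python
# Minimal author-topic sketch with gensim: NR assays act as "authors" of the
# expression-profile "documents" they are connected to via shared drugs.
# All names and data below are fabricated placeholders.
from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel

docs = [["cyp1a1", "cyp1b1"], ["abcb1", "cyp3a4"], ["fasn", "scd1"]]
author2doc = {"AhR_agonist": [0, 1], "PPARg_agonist": [2]}  # assay -> docs

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

model = AuthorTopicModel(corpus=corpus, num_topics=2,
                         id2word=dictionary, author2doc=author2doc)
print(model.get_author_topics("AhR_agonist"))  # topic mixture for this assay
```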
NASA Technical Reports Server (NTRS)
Davies, Diane K.; Brown, Molly E.; Green, David S.; Michael, Karen A.; Murray, John J.; Justice, Christopher O.; Soja, Amber J.
2016-01-01
It is widely accepted that time-sensitive remote sensing data serve the needs of decision makers in the applications communities, and yet to date, a comprehensive portfolio of NASA low latency datasets has not been available. This paper will describe the NASA low latency, or Near-Real Time (NRT), portfolio, how it was developed, and plans to make it available online through a portal that leverages existing EOSDIS capabilities such as the Earthdata Search Client (https://search.earthdata.nasa.gov), the Common Metadata Repository (CMR) and the Global Imagery Browse Service (GIBS). This paper will report on the outcomes of a NASA Workshop to Develop a Portfolio of Low Latency Datasets for Time-Sensitive Applications (27-29 September 2016 at NASA Langley Research Center, Hampton, VA). The paper will also summarize findings and recommendations from the meeting, outlining perceived shortfalls and opportunities for low latency research and application science.
Show me the numbers: What data currently exist for non-native species in the USA?
Crall, Alycia W.; Meyerson, Laura A.; Stohlgren, Thomas J.; Jarnevich, Catherine S.; Newman, Gregory J.; Graham, James
2006-01-01
Non-native species continue to be introduced to the United States from other countries via trade and transportation, creating a growing need for early detection and rapid response to new invaders. It is therefore increasingly important to synthesize existing data on non-native species abundance and distributions. However, no comprehensive analysis of existing data has been undertaken for non-native species, and there have been few efforts to improve collaboration. We therefore conducted a survey to determine what datasets currently exist for non-native species in the US at the county, state, multi-state regional, national, and global scales. We identified 319 datasets and collected metadata for 79% of these. Through this study, we provide a better understanding of extant non-native species datasets and identify data gaps (i.e., taxonomic, spatial, and temporal) to help guide future survey, research, and predictive modeling efforts.
Enhanced risk management by an emerging multi-agent architecture
NASA Astrophysics Data System (ADS)
Lin, Sin-Jin; Hsu, Ming-Fu
2014-07-01
Classification in imbalanced datasets has attracted much attention from researchers in the field of machine learning. Most existing techniques tend not to perform well on minority class instances when the dataset is highly skewed because they focus on minimising the forecasting error without considering the relative distribution of each class. This investigation proposes an emerging multi-agent architecture, grounded in cooperative learning, to solve the class-imbalanced classification problem. Additionally, this study deals further with the obscure nature of the multi-agent architecture and expresses comprehensive rules for auditors. The results from this study indicate that the presented model performs satisfactorily in risk management and is able to tackle a highly class-imbalanced dataset comparatively well. Furthermore, the knowledge visualisation process, supported by real examples, can assist both internal and external auditors who must allocate limited detection resources; they can take the rules as roadmaps to modify the auditing programme.
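The abstract does not disclose the agents' internals, so the sketch below only illustrates the underlying imbalance problem with a common single-model remedy, class reweighting in scikit-learn; it is not the paper's multi-agent architecture.

```python
# A common single-model baseline for class-imbalanced data: reweight classes
# inversely to their frequency. This is NOT the paper's multi-agent method,
# only a minimal illustration of the problem it targets.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))  # check minority recall
```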
Automated Analysis of Fluorescence Microscopy Images to Identify Protein-Protein Interactions
Venkatraman, S.; Doktycz, M. J.; Qi, H.; ...
2006-01-01
The identification of protein interactions is important for elucidating biological networks. One obstacle in comprehensive interaction studies is the analysis of large datasets, particularly those containing images. Development of an automated system to analyze an image-based protein interaction dataset is needed. Such an analysis system is described here, to automatically extract features from fluorescence microscopy images obtained from a bacterial protein interaction assay. These features are used to relay quantitative values that aid in the automated scoring of positive interactions. Experimental observations indicate that identifying at least 50% positive cells in an image is sufficient to detect a protein interaction. Based on this criterion, the automated system presents 100% accuracy in detecting positive interactions for a dataset of 16 images. Algorithms were implemented using MATLAB and the software developed is available on request from the authors.
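A minimal sketch of the scoring criterion reported above (at least 50% positive cells implies an interaction) might look as follows in Python; the segmentation and intensity thresholds are hypothetical, and the original system was implemented in MATLAB.

```python
# Illustrative sketch of the scoring criterion: segment cells in a fluorescence
# image and call the image "positive" if at least 50% of cells exceed an
# intensity threshold. Thresholds here are hypothetical placeholders.
import numpy as np
from scipy import ndimage

def fraction_positive(img, cell_thresh, signal_thresh):
    labels, n_cells = ndimage.label(img > cell_thresh)  # segment cells
    if n_cells == 0:
        return 0.0
    means = ndimage.mean(img, labels, index=range(1, n_cells + 1))
    return float(np.mean(np.asarray(means) > signal_thresh))

img = np.random.rand(128, 128)  # stand-in for a microscopy frame
print(fraction_positive(img, 0.7, 0.8) >= 0.5)  # positive-interaction call
```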
Wind and wave dataset for Matara, Sri Lanka
NASA Astrophysics Data System (ADS)
Luo, Yao; Wang, Dongxiao; Priyadarshana Gamage, Tilak; Zhou, Fenghua; Madusanka Widanage, Charith; Liu, Taiwei
2018-01-01
We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset as comprehensive a description as possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m, comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).
A benchmark for comparison of cell tracking algorithms
Maška, Martin; Ulman, Vladimír; Svoboda, David; Matula, Pavel; Matula, Petr; Ederra, Cristina; Urbiola, Ainhoa; España, Tomás; Venkatesan, Subramanian; Balak, Deepak M.W.; Karas, Pavel; Bolcková, Tereza; Štreitová, Markéta; Carthel, Craig; Coraluppi, Stefano; Harder, Nathalie; Rohr, Karl; Magnusson, Klas E. G.; Jaldén, Joakim; Blau, Helen M.; Dzyubachyk, Oleh; Křížek, Pavel; Hagen, Guy M.; Pastor-Escuredo, David; Jimenez-Carretero, Daniel; Ledesma-Carbayo, Maria J.; Muñoz-Barrutia, Arrate; Meijering, Erik; Kozubek, Michal; Ortiz-de-Solorzano, Carlos
2014-01-01
Motivation: Automatic tracking of cells in multidimensional time-lapse fluorescence microscopy is an important task in many biomedical applications. A novel framework for objective evaluation of cell tracking algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2013 Cell Tracking Challenge. In this article, we present the logistics, datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. Results: The main contributions of the challenge include the creation of a comprehensive video dataset repository and the definition of objective measures for comparison and ranking of the algorithms. With this benchmark, six algorithms covering a variety of segmentation and tracking paradigms have been compared and ranked based on their performance on both synthetic and real datasets. Given the diversity of the datasets, we do not declare a single winner of the challenge. Instead, we present and discuss the results for each individual dataset separately. Availability and implementation: The challenge Web site (http://www.codesolorzano.com/celltrackingchallenge) provides access to the training and competition datasets, along with the ground truth of the training videos. It also provides access to Windows and Linux executable files of the evaluation software and most of the algorithms that competed in the challenge. Contact: codesolorzano@unav.es Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24526711
Bayoglu, Riza; Geeraedts, Leo; Groenen, Karlijn H J; Verdonschot, Nico; Koopman, Bart; Homminga, Jasper
2017-06-14
Musculo-skeletal modeling could play a key role in advancing our understanding of the healthy and pathological spine, but the credibility of such models is strictly dependent on the accuracy of the anatomical data incorporated. In this study, we present a complete and coherent musculo-skeletal dataset for the thoracic and cervical regions of the human spine, obtained through detailed dissection of an embalmed male cadaver. We divided the muscles into a number of muscle-tendon elements, digitized their attachments at the bones, and measured morphological muscle parameters. In total, 225 muscle elements were measured over 39 muscles. For every muscle element, we provide the coordinates of its attachments, fiber length, tendon length, sarcomere length, optimal fiber length, pennation angle, mass, and physiological cross-sectional area, together with the skeletal geometry of the cadaver. Results were consistent with similar anatomical studies. Furthermore, we report new data for several muscles, such as the rotatores, multifidus, levatores costarum, spinalis, semispinalis, subcostales, transversus thoracis, and intercostales muscles. This dataset complements our previous study, in which we presented a consistent dataset for the lumbar region of the spine (Bayoglu et al., 2017). When used together, these datasets therefore constitute a complete and coherent dataset for the entire spine. The complete dataset will be used to develop a musculo-skeletal model of the entire human spine to study clinical and ergonomic applications. Copyright © 2017 Elsevier Ltd. All rights reserved.
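As a pointer for users of such a dataset, the sketch below shows how a physiological cross-sectional area (PCSA) is conventionally derived from the parameters listed above; the formula and the muscle-density constant are standard assumptions, not values taken from the paper.

```python
# Conventional PCSA derivation from measured muscle parameters. The formula
# PCSA = m * cos(theta) / (rho * L_opt) and the density constant are standard
# assumptions for illustration, not the paper's exact procedure.
import math

MUSCLE_DENSITY = 1.0564  # g/cm^3, a commonly assumed value

def pcsa(mass_g, pennation_deg, optimal_fiber_length_cm):
    """Physiological cross-sectional area in cm^2."""
    return (mass_g * math.cos(math.radians(pennation_deg))
            / (MUSCLE_DENSITY * optimal_fiber_length_cm))

print(round(pcsa(12.0, 10.0, 6.5), 2))  # toy muscle element
```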
Comprehensive comparison of gap filling techniques for eddy covariance net carbon fluxes
NASA Astrophysics Data System (ADS)
Moffat, A. M.; Papale, D.; Reichstein, M.; Hollinger, D. Y.; Richardson, A. D.; Barr, A. G.; Beckstein, C.; Braswell, B. H.; Churkina, G.; Desai, A. R.; Falge, E.; Gove, J. H.; Heimann, M.; Hui, D.; Jarvis, A. J.; Kattge, J.; Noormets, A.; Stauch, V. J.
2007-12-01
We review fifteen techniques for estimating missing values of net ecosystem CO2 exchange (NEE) in eddy covariance time series and evaluate their performance for different artificial gap scenarios based on a set of ten benchmark datasets from six forested sites in Europe. The goal of gap filling is the reproduction of the NEE time series, and hence this work focuses on estimating missing NEE values, not on editing or removing suspect values in these time series due to systematic errors in the measurements (e.g. nighttime flux, advection). The gap filling was examined by generating fifty secondary datasets with artificial gaps (ranging in length from single half-hours to twelve consecutive days) for each benchmark dataset and evaluating the performance with a variety of statistical metrics. The performance of the gap filling varied among sites and depended on the level of aggregation (native half-hourly time step versus daily); long gaps were more difficult to fill than short gaps, and differences among the techniques were more pronounced during the day than at night. The non-linear regression techniques (NLRs), the look-up table (LUT), marginal distribution sampling (MDS), and the semi-parametric model (SPM) generally showed good overall performance. The artificial neural network based techniques (ANNs) were generally, if only slightly, superior to the other techniques. The simple interpolation technique of mean diurnal variation (MDV) showed a moderate but consistent performance. Several sophisticated techniques, the dual unscented Kalman filter (UKF), the multiple imputation method (MIM), the terrestrial biosphere model (BETHY), but also one of the ANNs and one of the NLRs, showed high biases which resulted in a low reliability of the annual sums, indicating that additional development might be needed. An uncertainty analysis comparing the estimated random error in the ten benchmark datasets with the artificial gap residuals suggested that the techniques are already at or very close to the noise limit of the measurements. Based on the techniques and site data examined here, the effect of gap filling on the annual sums of NEE is modest, with most techniques falling within a range of ±25 g C m-2 y-1.
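Of the techniques compared above, MDV is simple enough to sketch in a few lines; the following Python illustration fills a missing half-hour with the mean of the same time-of-day over a window of neighboring days (the window width is an assumption for illustration).

```python
# Minimal mean-diurnal-variation (MDV) gap filling: a missing half-hour is
# replaced by the mean of the same time-of-day within a +/- 7-day window.
# The window width is an illustrative assumption.
import numpy as np
import pandas as pd

idx = pd.date_range("2007-06-01", periods=48 * 30, freq="30min")
nee = pd.Series(np.random.randn(len(idx)), index=idx)
nee.iloc[100:148] = np.nan  # artificial one-day gap

def mdv_fill(s, days=7):
    out = s.copy()
    for t in s.index[s.isna()]:
        window = s[(s.index >= t - pd.Timedelta(days=days)) &
                   (s.index <= t + pd.Timedelta(days=days))]
        same_tod = window[(window.index.hour == t.hour) &
                          (window.index.minute == t.minute)]
        out.loc[t] = same_tod.mean()  # mean of same half-hour on nearby days
    return out

filled = mdv_fill(nee)
print(filled.isna().sum())  # 0 if every gap found neighbors
```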
PinAPL-Py: A comprehensive web-application for the analysis of CRISPR/Cas9 screens.
Spahn, Philipp N; Bath, Tyler; Weiss, Ryan J; Kim, Jihoon; Esko, Jeffrey D; Lewis, Nathan E; Harismendy, Olivier
2017-11-20
Large-scale genetic screens using CRISPR/Cas9 technology have emerged as a major tool for functional genomics. With its increased popularity, experimental biologists frequently acquire large sequencing datasets for which they often do not have an easy analysis option. While a few bioinformatic tools have been developed for this purpose, their utility is still hindered either due to limited functionality or the requirement of bioinformatic expertise. To make sequencing data analysis of CRISPR/Cas9 screens more accessible to a wide range of scientists, we developed a Platform-independent Analysis of Pooled Screens using Python (PinAPL-Py), which is operated as an intuitive web-service. PinAPL-Py implements state-of-the-art tools and statistical models, assembled in a comprehensive workflow covering sequence quality control, automated sgRNA sequence extraction, alignment, sgRNA enrichment/depletion analysis and gene ranking. The workflow is set up to use a variety of popular sgRNA libraries as well as custom libraries that can be easily uploaded. Various analysis options are offered, suitable to analyze a large variety of CRISPR/Cas9 screening experiments. Analysis output includes ranked lists of sgRNAs and genes, and publication-ready plots. PinAPL-Py helps to advance genome-wide screening efforts by combining comprehensive functionality with user-friendly implementation. PinAPL-Py is freely accessible at http://pinapl-py.ucsd.edu with instructions and test datasets.
CLEX: A Cross-Linguistic Lexical Norms Database
ERIC Educational Resources Information Center
Jorgensen, Rune Norgaard; Dale, Philip S.; Bleses, Dorthe; Fenson, Larry
2010-01-01
Parent report has proven a valid and cost-effective means of evaluating early child language. Norming datasets for these instruments, which provide the basis for standardized comparisons of individual children to a population, can also be used to derive norms for the acquisition of individual words in production and comprehension and also early…
ERIC Educational Resources Information Center
Blankenberger, Bob; Lichtenberger, Eric; Witt, M. Allison; Franklin, Doug
2017-01-01
Illinois education policymakers have adopted the completion agenda that emphasizes increasing postsecondary credential attainment. Meeting completion agenda goals necessitates addressing the achievement gap. To aid in developing policy to support improved completion, this study analyzes a comprehensive statewide dataset of the 2003 Illinois high…
NASA Astrophysics Data System (ADS)
Horton, Pascal; Weingartner, Rolf; Brönnimann, Stefan
2017-04-01
The analogue method is a statistical downscaling method for precipitation prediction. It uses similarity in terms of synoptic-scale predictors with situations in the past in order to provide a probabilistic prediction for the day of interest. It has been used for decades in the context of weather and flood forecasting, and has more recently also been applied in climate studies, whether for the reconstruction of past weather conditions or for future climate impact studies. In order to evaluate the relationship between synoptic-scale predictors and the local weather variable of interest, e.g. precipitation, reanalysis datasets are necessary. Nowadays, the number of available reanalysis datasets is increasing. These are generated by different atmospheric models with different assimilation techniques and offer various spatial and temporal resolutions. A major difference between these datasets is also the length of the archive they provide. While some datasets start at the beginning of the satellite era (1980) and assimilate these data, others aim at homogeneity over a longer period (e.g. the 20th century) and only assimilate conventional observations. The context of the application of analogue methods might drive the choice of an appropriate dataset, for example when the archive length is a leading criterion. However, in many studies, a reanalysis dataset is subjectively chosen, according to the user's preferences or ease of access. The impact of this choice on the results of the downscaling procedure is rarely considered, and no comprehensive comparison has been undertaken so far. In order to fill this gap and to advise on the choice of appropriate datasets, nine different global reanalysis datasets were compared in seven distinct versions of analogue methods, over 300 precipitation stations in Switzerland. Significant differences in terms of prediction performance were identified. Although the impact of the reanalysis dataset on the skill score varies according to the chosen predictor, be it atmospheric circulation or thermodynamic variables, some hierarchy between the datasets is often preserved. This work can thus help in choosing an appropriate dataset for the analogue method, or raise awareness of the consequences of using a certain dataset.
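The core analogue step lends itself to a bare-bones sketch: rank archived days by similarity of a predictor field and read off the precipitation of the closest analogues. The Python below uses random stand-in data and an RMSE criterion purely for illustration.

```python
# Bare-bones analogue principle: rank past days by similarity of a synoptic
# predictor field and use the k best analogues' precipitation as the
# probabilistic prediction. Data here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
archive_z500 = rng.standard_normal((3650, 20, 20))  # 10 yr of predictor maps
archive_precip = rng.gamma(1.0, 2.0, size=3650)     # station precipitation
target_z500 = rng.standard_normal((20, 20))         # field of day to predict

rmse = np.sqrt(((archive_z500 - target_z500) ** 2).mean(axis=(1, 2)))
best = np.argsort(rmse)[:25]                        # 25 closest analogues
print(np.percentile(archive_precip[best], [10, 50, 90]))  # predictive quantiles
```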
Parton Distributions based on a Maximally Consistent Dataset
NASA Astrophysics Data System (ADS)
Rojo, Juan
2016-04-01
The choice of data that enters a global QCD analysis can have a substantial impact on the resulting parton distributions and their predictions for collider observables. One of the main reasons for this has to do with the possible presence of inconsistencies, either internal within an experiment or external between different experiments. In order to assess the robustness of the global fit, different definitions of a conservative PDF set, that is, a PDF set based on a maximally consistent dataset, have been introduced. However, these approaches are typically affected by theory biases in the selection of the dataset. In this contribution, after a brief overview of recent NNPDF developments, we propose a new, fully objective, definition of a conservative PDF set, based on the Bayesian reweighting approach. Using the new NNPDF3.0 framework, we produce various conservative sets, which turn out to be mutually in agreement within the respective PDF uncertainties, as well as with the global fit. We explore some of their implications for LHC phenomenology, finding also good consistency with the global fit result. These results provide a non-trivial validation test of the new NNPDF3.0 fitting methodology, and indicate that possible inconsistencies in the fitted dataset do not affect substantially the global fit PDFs.
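The Bayesian reweighting mentioned above assigns each Monte Carlo replica a weight from its χ² against the new data; a commonly used form (stated here as background, since the contribution does not spell it out) is:

```latex
% Giele-Keller-style replica weights for Bayesian reweighting of a Monte Carlo
% PDF ensemble: N_dat is the number of new data points, \chi^2_k the
% chi-squared of replica k against those data, N_rep the number of replicas.
w_k = \frac{\left(\chi^2_k\right)^{(N_{\mathrm{dat}}-1)/2}\, e^{-\chi^2_k/2}}
           {\frac{1}{N_{\mathrm{rep}}}\sum_{j=1}^{N_{\mathrm{rep}}}
            \left(\chi^2_j\right)^{(N_{\mathrm{dat}}-1)/2}\, e^{-\chi^2_j/2}}
```

Replicas that describe the new data poorly receive vanishing weights, so a conservative set can be defined by the replicas that survive reweighting.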
Griscom, Bronson W.; Ellis, Peter W.; Baccini, Alessandro; Marthinus, Delon; Evans, Jeffrey S.; Ruslandi
2016-01-01
Background Forest conservation efforts are increasingly being implemented at the scale of sub-national jurisdictions in order to mitigate global climate change and provide other ecosystem services. We see an urgent need for robust estimates of historic forest carbon emissions at this scale, as the basis for credible measures of climate and other benefits achieved. Despite the arrival of a new generation of global datasets on forest area change and biomass, confusion remains about how to produce credible jurisdictional estimates of forest emissions. We demonstrate a method for estimating the relevant historic forest carbon fluxes within the Regency of Berau in eastern Borneo, Indonesia. Our method integrates the best available global and local datasets, and includes a comprehensive analysis of uncertainty at the regency scale. Principal Findings and Significance We find that Berau generated 8.91 ± 1.99 million tonnes of net CO2 emissions per year during 2000-2010. Berau is an early frontier landscape where gross emissions are 12 times higher than gross sequestration. Yet most (85%) of Berau's original forests are still standing. The majority of net emissions were due to conversion of native forests to unspecified agriculture (43% of total), oil palm (28%), and fiber plantations (9%). Most of the remainder was due to legal commercial selective logging (17%). Our overall uncertainty estimate offers an independent basis for assessing three other estimates for Berau; two of these were above the upper end of our uncertainty range. We emphasize the importance of including an uncertainty range for all parameters of the emissions equation to generate a comprehensive uncertainty estimate, which has not been done before. We believe comprehensive estimates of carbon flux uncertainty are increasingly important as national and international institutions are challenged with comparing alternative estimates and identifying a credible range of historic emissions values. PMID:26752298
NASA Astrophysics Data System (ADS)
Evans, B. J. K.; Wyborn, L. A.; Druken, K. A.; Richards, C. J.; Trenham, C. E.; Wang, J.
2016-12-01
The Australian National Computational Infrastructure (NCI) manages a large geospatial repository (10+ PBytes) of Earth systems, environmental, water management and geophysics research data, co-located with a petascale supercomputer and an integrated research cloud. NCI has applied the principles of the "Common Framework for Earth-Observation Data" (the Framework) to the organisation of these collections, enabling a diverse range of researchers to explore different aspects of the data and, in particular, enabling seamless programmatic data analysis, both via in-situ access and via data services. NCI provides access to the collections through the National Environmental Research Data Interoperability Platform (NERDIP), a comprehensive and integrated data platform with both common and emerging services designed to enable data accessibility and citability. Applying the Framework across the range of datasets ensures that programmatic access, by both in-situ and network methods, works as uniformly as possible for any dataset, using both APIs and data services. NCI has also created a comprehensive quality assurance framework to regularise compliance checks across the data, library APIs and data services, and to establish a comprehensive set of benchmarks that quantify the Framework from both functionality and performance perspectives. The quality assurance includes organisation of datasets through a data management plan, which anchors the data directory structure, version controls and data information services so that they are kept aligned with operational changes over time. Specific attention has been placed on the way data are packed inside the files. Our experience has shown that complying with standards such as CF and ACDD is still not enough to ensure that all data services or software packages correctly read the data. Further, data may not be optimally organised for the different access patterns, which causes poor CPU and bandwidth utilisation. We will also discuss some gaps in the Framework that have emerged and our approach to resolving them.
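A first-pass QA check in the spirit described above can be scripted easily; the sketch below inspects CF/ACDD-style attributes of a NetCDF file with xarray (the file name is a placeholder, and a production check would use a full compliance-checker suite).

```python
# Quick attribute QA sketch: verify that a NetCDF file advertises its
# conventions and that a variable carries the attributes tools rely on.
# "sample.nc" is a placeholder path, not an NCI dataset.
import xarray as xr

ds = xr.open_dataset("sample.nc")  # placeholder path

print("Conventions:", ds.attrs.get("Conventions", "MISSING"))
for attr in ("title", "summary"):                  # ACDD highly recommended
    print(attr, ":", ds.attrs.get(attr, "MISSING"))

var = ds[list(ds.data_vars)[0]]                    # first data variable
for attr in ("units", "standard_name", "long_name"):
    print(var.name, attr, ":", var.attrs.get(attr, "MISSING"))
```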
Integrative missing value estimation for microarray data.
Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine
2006-10-12
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain less than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking the reference datasets into consideration. To determine whether the given reference datasets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm, by up to 15% in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
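For orientation, the generic neighbor-based imputation that LLS-type methods build on can be sketched with scikit-learn's KNNImputer; this is a baseline illustration, not the iMISS algorithm itself.

```python
# Neighbor-based imputation baseline for an expression matrix with missing
# entries. This illustrates the k-nearest-neighbour idea that LLS and iMISS
# build on, not the iMISS algorithm itself. Data are random stand-ins.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))          # 200 genes x 8 samples
mask = rng.random(X.shape) < 0.10          # ~10% missing entries
X_missing = np.where(mask, np.nan, X)

X_filled = KNNImputer(n_neighbors=10).fit_transform(X_missing)
print(np.abs(X_filled[mask] - X[mask]).mean())  # mean absolute imputation error
```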
The Atlanta Motor Speech Disorders Corpus: Motivation, Development, and Utility.
Laures-Gore, Jacqueline; Russell, Scott; Patel, Rupal; Frankel, Michael
2016-01-01
This paper describes the design and collection of a comprehensive spoken language dataset from speakers with motor speech disorders in Atlanta, Ga., USA. This collaborative project aimed to gather a spoken database consisting of nonmainstream American English speakers residing in the Southeastern US in order to provide a more diverse perspective on motor speech disorders. Ninety-nine adults with an acquired neurogenic disorder resulting in a motor speech disorder were recruited. Stimuli include isolated vowels, single words, sentences with contrastive focus, sentences with emotional content and prosody, sentences with acoustic and perceptual sensitivity to motor speech disorders, as well as 'The Caterpillar' and 'The Grandfather' passages. The utility of these data in understanding the potential interplay of dialect and dysarthria was demonstrated with a subset of the speech samples in the database. The Atlanta Motor Speech Disorders Corpus will enrich our understanding of motor speech disorders through the examination of speech from a diverse group of speakers. © 2016 S. Karger AG, Basel.
Gene transfers can date the tree of life.
Davín, Adrián A; Tannier, Eric; Williams, Tom A; Boussau, Bastien; Daubin, Vincent; Szöllősi, Gergely J
2018-05-01
Biodiversity has always been predominantly microbial, and the scarcity of fossils from bacteria, archaea and microbial eukaryotes has prevented a comprehensive dating of the tree of life. Here, we show that patterns of lateral gene transfer deduced from an analysis of modern genomes encode a novel and abundant source of information about the temporal coexistence of lineages throughout the history of life. We use state-of-the-art species tree-aware phylogenetic methods to reconstruct the history of thousands of gene families and demonstrate that dates implied by gene transfers are consistent with estimates from relaxed molecular clocks in Bacteria, Archaea and Eukarya. We present the order of speciations according to lateral gene transfer data calibrated to geological time for three datasets comprising 40 genomes for Cyanobacteria, 60 genomes for Archaea and 60 genomes for Fungi. An inspection of discrepancies between transfers and clocks and a comparison with mammalian fossils show that gene transfer in microbes is potentially as informative for dating the tree of life as the geological record in macroorganisms.
Rakotoarinivo, Mijoro; Blach-Overgaard, Anne; Baker, William J.; Dransfield, John; Moat, Justin; Svenning, Jens-Christian
2013-01-01
The distribution of rainforest in many regions across the Earth was strongly affected by Pleistocene ice ages. However, the extent to which these dynamics are still important for modern-day biodiversity patterns within tropical biodiversity hotspots has not been assessed. We employ a comprehensive dataset of Madagascan palms (Arecaceae) and climate reconstructions from the last glacial maximum (LGM; 21 000 years ago) to assess the relative role of modern environment and LGM climate in explaining geographical species richness patterns in this major tropical biodiversity hotspot. We found that palaeoclimate exerted a strong influence on palm species richness patterns, with richness peaking in areas with higher LGM precipitation relative to present-day even after controlling for modern environment, in particular in northeastern Madagascar, consistent with the persistence of tropical rainforest during the LGM primarily in this region. Our results provide evidence that diversity patterns in the World's most biodiverse regions may be shaped by long-term climate history as well as contemporary environment. PMID:23427173
Prediction of drug indications based on chemical interactions and chemical similarities.
Huang, Guohua; Lu, Yin; Lu, Changhong; Zheng, Mingyue; Cai, Yu-Dong
2015-01-01
Discovering potential indications of novel or approved drugs is a key step in drug development. Previous computational approaches can be categorized as disease-centric or drug-centric depending on the starting point of the problem, and as small-scale or large-scale applications according to the diversity of the datasets. Here, using a large drug indication dataset, a classifier was constructed to predict the indications of a drug based on the assumption that interactive/associated drugs, or drugs with similar structures, are more likely to target the same diseases. To examine the classifier, it was run five times on a dataset of 1,573 drugs retrieved from the Comprehensive Medicinal Chemistry database and evaluated by 5-fold cross-validation, yielding five 1st-order prediction accuracies that were all approximately 51.48%. Meanwhile, the model yielded an accuracy of 50.00% for the 1st-order prediction in an independent test on a dataset of 32 other drugs for which drug repositioning has been confirmed. Interestingly, some clinically repurposed drug indications that were not included in the datasets were successfully identified by our method. These results suggest that our method may become a useful tool for associating novel molecules with new indications, or alternative indications with existing drugs. PMID:25821813
A large-scale dataset of solar event reports from automated feature recognition modules
NASA Astrophysics Data System (ADS)
Schuh, Michael A.; Angryk, Rafal A.; Martens, Petrus C.
2016-05-01
The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validating of the individual event reports and the entire dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through these important prerequisite analyses presented here, the results of KDD from Solar Big Data will be overall more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.
Towards an effective data peer review
NASA Astrophysics Data System (ADS)
Düsterhus, André; Hense, Andreas
2014-05-01
Peer review is an established procedure to ensure the quality of scientific publications and is currently a prerequisite for the acceptance of papers in the scientific community. In recent years, the publication of raw data and its metadata has received increased attention, which led to the idea of bringing it to the same standards that journals apply to traditional publications. One missing element to achieve this is a comparable peer review scheme. This contribution introduces the idea of a quality evaluation process designed to analyse the technical quality as well as the content of a dataset. It is based on quality tests, whose results are evaluated with the help of expert knowledge. The results of the tests and the expert knowledge are evaluated probabilistically and combined statistically. As a result, the quality of a dataset is estimated by a single value. This approach allows the reviewer to quickly identify the potential weaknesses of a dataset and generate a transparent and comprehensible report. To demonstrate the scheme, an application to a large meteorological dataset will be shown. Furthermore, potentials and risks of such a scheme will be introduced, and practical implications for its possible introduction at data centres will be investigated. In particular, the effects of reducing the quality estimate of a dataset to a single number will be critically discussed.
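One simple way such probabilistic test results could be collapsed into a single value is a weighted geometric mean; the combination rule and the weights in the sketch below are illustrative assumptions, as the contribution does not fix a formula.

```python
# Toy sketch of collapsing several probabilistic quality-test results into a
# single score via a weighted geometric mean. The combination rule and the
# weights are illustrative assumptions, not the scheme's actual formula.
import numpy as np

test_probs = np.array([0.95, 0.80, 0.99, 0.60])   # P(test passed) per check
weights = np.array([2.0, 1.0, 1.0, 3.0])          # expert-assigned importance

quality = np.prod(test_probs ** weights) ** (1.0 / weights.sum())
print(round(float(quality), 3))                   # single quality value
```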
Li, Jia; Xia, Changqun; Chen, Xiaowu
2017-10-12
Image-based salient object detection (SOD) has been extensively studied in past decades. However, video-based SOD is much less explored due to the lack of large-scale video datasets within which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos. In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in a video can be defined as objects that consistently pop out throughout the video, and objects with such attributes can be unambiguously annotated by combining manually annotated object/region masks with the eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at the pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner that automatically infers a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. In experiments, the proposed unsupervised approach is compared with 31 state-of-the-art models on the proposed dataset and outperforms 30 of them, including 19 image-based classic (unsupervised or non-deep learning) models, six image-based deep learning models, and five video-based unsupervised models. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.
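The encoding idea can be illustrated with a single small autoencoder over per-pixel cue vectors; the PyTorch sketch below is a minimal stand-in, not the paper's full saliency-guided stacked, spatiotemporal architecture.

```python
# Minimal autoencoder over per-pixel saliency-cue vectors, illustrating the
# unsupervised encoding idea (not the paper's stacked, spatiotemporal model).
# Dimensions and data are arbitrary stand-ins.
import torch
import torch.nn as nn

class CueAutoencoder(nn.Module):
    def __init__(self, n_cues=32, n_hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_cues, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_cues)

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.rand(1024, 32)                 # 1024 pixels x 32 saliency cues
model = CueAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):                     # learn to reconstruct the cue vectors
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)
    loss.backward()
    opt.step()
code = model.encoder(x)                  # low-dimensional code per pixel
print(code.shape)                        # saliency scores derive from codes
```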
Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis
2014-01-01
Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution dataset mergers, such as the one exemplified here, can serve as a baseline towards comprehensive species distribution datasets.
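The grid-cell agreement measure used above reduces to a Jaccard index over presence/absence rasters; the short numpy sketch below demonstrates it on toy stand-ins for the two atlases.

```python
# Jaccard index between two 50-km presence/absence rasters for one species.
# The arrays are random toy stand-ins for the two atlas datasets.
import numpy as np

rng = np.random.default_rng(2)
atlas_a = rng.random((40, 60)) > 0.7   # presence cells in atlas A
atlas_b = rng.random((40, 60)) > 0.7   # presence cells in atlas B

intersection = np.logical_and(atlas_a, atlas_b).sum()
union = np.logical_or(atlas_a, atlas_b).sum()
print(intersection / union)            # 1.0 = identical distributions
```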
2013-01-01
Background Good quality spatial data on Family Physicians or General Practitioners (GPs) are key to accurately measuring geographic access to primary health care. The validity of computed associations between health outcomes and measures of GP access such as GP density is contingent on geographical data quality. This is especially true in rural and remote areas, where GPs are often small in number and geographically dispersed. However, there has been limited effort in assessing the quality of nationally comprehensive, geographically explicit, GP datasets in Australia or elsewhere. Our objective is to assess the extent of association or agreement between different spatially explicit nationwide GP workforce datasets in Australia. This is important since disagreement would imply differential relationships with primary healthcare relevant outcomes with different datasets. We also seek to enumerate these associations across categories of rurality or remoteness. Method We compute correlations of GP headcounts and workload contributions between four different datasets at two different geographical scales, across varying levels of rurality and remoteness. Results The datasets are in general agreement with each other at two different scales. Small numbers of absolute headcounts, with relatively larger fractions of locum GPs in rural areas cause unstable statistical estimates and divergences between datasets. Conclusion In the Australian context, many of the available geographic GP workforce datasets may be used for evaluating valid associations with health outcomes. However, caution must be exercised in interpreting associations between GP headcounts or workloads and outcomes in rural and remote areas. The methods used in these analyses may be replicated in other locales with multiple GP or physician datasets. PMID:24005003
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sellers, P.J.; Collatz, J.; Koster, R.
1996-09-01
A comprehensive series of global datasets for land-atmosphere models has been collected, formatted to a common grid, and released on a set of CD-ROMs. This paper describes the motivation for and the contents of the dataset. In June of 1992, an interdisciplinary earth science workshop was convened in Columbia, Maryland, to assess progress in land-atmosphere research, specifically in the areas of models, satellite data algorithms, and field experiments. At the workshop, representatives of the land-atmosphere modeling community defined a need for global datasets to prescribe boundary conditions, initialize state variables, and provide near-surface meteorological and radiative forcings for their models. The International Satellite Land Surface Climatology Project (ISLSCP), a part of the Global Energy and Water Cycle Experiment, worked with the Distributed Active Archive Center of the National Aeronautics and Space Administration Goddard Space Flight Center to bring the required datasets together in a usable format. The data have since been released on a collection of CD-ROMs. The datasets on the CD-ROMs are grouped under the following headings: vegetation; hydrology and soils; snow, ice, and oceans; radiation and clouds; and near-surface meteorology. All datasets cover the period 1987-88, and all but a few are spatially continuous over the earth's land surface. All have been mapped to a common 1° x 1° equal-angle grid. The temporal frequency for most of the datasets is monthly. A few of the near-surface meteorological parameters are available both as six-hourly values and as monthly means. 26 refs., 8 figs., 2 tabs.
A shortest-path graph kernel for estimating gene product semantic similarity.
Alvarez, Marco A; Qi, Xiaojun; Yan, Changhui
2011-07-29
Existing methods for calculating semantic similarity between gene products using the Gene Ontology (GO) often rely on external resources, which are not part of the ontology. Consequently, changes in these external resources, such as biased term distributions caused by shifts in hot research topics, will affect the calculation of semantic similarity. One way to avoid this problem is to use semantic methods that are "intrinsic" to the ontology, i.e. independent of external knowledge. We present a shortest-path graph kernel (spgk) method that relies exclusively on the GO and its structure. In spgk, a gene product is represented by an induced subgraph of the GO, which consists of all the GO terms annotating it. A shortest-path graph kernel is then used to compute the similarity between two graphs. In a comprehensive evaluation using a benchmark dataset, spgk compares favorably with other methods that depend on external resources. Compared with simUI, a method that is also intrinsic to the GO, spgk achieves slightly better results on the benchmark dataset. Statistical tests show that the improvement is significant when the resolution and EC similarity correlation coefficient are used to measure the performance, but insignificant when the Pfam similarity correlation coefficient is used. Spgk uses a graph kernel method in polynomial time to exploit the structure of the GO to calculate semantic similarity between gene products. It provides an alternative to both methods that use external resources and "intrinsic" methods, with comparable performance.
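As a rough illustration of the spgk idea, the sketch below represents each gene product as a small induced subgraph of GO terms and scores similarity by counting pairs of shortest paths of equal length in the two graphs. The toy graphs and the simplified unlabeled kernel are my own stand-ins, not the authors' exact formulation.

    import networkx as nx

    def shortest_path_kernel(g1, g2):
        """Simplified shortest-path kernel: count pairs of node pairs
        (one from each graph) whose shortest-path distances match."""
        d1 = [l for _, lengths in nx.all_pairs_shortest_path_length(g1)
              for l in lengths.values() if l > 0]
        d2 = [l for _, lengths in nx.all_pairs_shortest_path_length(g2)
              for l in lengths.values() if l > 0]
        return sum(1 for a in d1 for b in d2 if a == b)

    def normalized_kernel(g1, g2):
        # Cosine-style normalization so that k(g, g) == 1.
        k12 = shortest_path_kernel(g1, g2)
        return k12 / (shortest_path_kernel(g1, g1) ** 0.5
                      * shortest_path_kernel(g2, g2) ** 0.5)

    # Toy "induced GO subgraphs" for two gene products (edges are is-a links).
    gp1 = nx.Graph([("GO:root", "GO:a"), ("GO:a", "GO:b"), ("GO:a", "GO:c")])
    gp2 = nx.Graph([("GO:root", "GO:a"), ("GO:a", "GO:b"), ("GO:b", "GO:d")])
    print(normalized_kernel(gp1, gp2))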
NO and NOy in the upper troposphere: Nine years of CARIBIC measurements onboard a passenger aircraft
NASA Astrophysics Data System (ADS)
Stratmann, G.; Ziereis, H.; Stock, P.; Brenninkmeijer, C. A. M.; Zahn, A.; Rauthe-Schöch, A.; Velthoven, P. V.; Schlager, H.; Volz-Thomas, A.
2016-05-01
Nitrogen oxide (NO and NOy) measurements were performed onboard an in-service aircraft within the framework of CARIBIC (Civil Aircraft for the Regular Investigation of the atmosphere Based on an Instrument Container). A total of 330 flights were completed from May 2005 through April 2013 between Frankfurt/Germany and destination airports in Canada, the USA, Brazil, Venezuela, Chile, Argentina, Colombia, South Africa, China, South Korea, Japan, India, Thailand, and the Philippines. Different regions show differing NO and NOy mixing ratios. In the mid-latitudes, observed NOy and NO generally show clear seasonal cycles in the upper troposphere, with a maximum in summer and a minimum in winter. Mean NOy mixing ratios vary between 1.36 nmol/mol in summer and 0.27 nmol/mol in winter. Mean NO mixing ratios range between 0.05 nmol/mol and 0.22 nmol/mol. Regions south of 40°N show no consistent seasonal dependence. Based on CO observations, low, median and high CO air masses were defined. According to this classification, more data were obtained in high CO air masses in the regions south of 40°N than in the mid-latitudes, indicating that boundary layer emissions are more important in these regions. In general, NOy mixing ratios are highest when measured in high CO air masses. This dataset is one of the most comprehensive NO and NOy datasets available today for the upper troposphere and is therefore highly suitable for the validation of atmospheric chemistry models.
A Biome map for Modelling Global Mid-Pliocene Climate Change
NASA Astrophysics Data System (ADS)
Salzmann, U.; Haywood, A. M.
2006-12-01
The importance of vegetation-climate feedbacks was highlighted by several paleo-climate modelling exercises, but their role as a boundary condition in Tertiary modelling has not been fully recognised or explored. Several paleo-vegetation datasets and maps have been produced for specific time slabs or regions of the Tertiary, but the vegetation classifications used differ, making meaningful comparisons difficult. In order to facilitate further investigations into Tertiary climate and environmental change, we are presently implementing the comprehensive GIS database TEVIS (Tertiary Environment and Vegetation Information System). TEVIS integrates marine and terrestrial vegetation data, taken from fossil pollen, leaf or wood, into an internally consistent classification scheme to produce global Tertiary Biome and Mega-Biome maps for different time slabs (Harrison & Prentice, 2003). In the frame of our ongoing 5-year programme we present a first global vegetation map for the mid-Pliocene time slab, a period of sustained global warmth. Data were synthesised from the PRISM dataset (Thompson and Fleming 1996), after translating them to the Biome classification scheme, and from new literature. The outcomes of the Biome map are compared with modelling results using an advanced numerical general circulation model (HadAM3) and the BIOME 4 vegetation model. Our combined proxy data and modelling approach will provide new palaeoclimate datasets to test models that are used to predict future climate change, and provide a more rigorous picture of climate and environmental changes during the Neogene.
Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.
2017-01-01
Background As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. Methods We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Results Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. Discussion These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools—we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines. PMID:29372115
Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S
2017-01-01
As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
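The GitHub repository provides the authors' own downloading script; the sketch below is only a hypothetical illustration of the general pattern, reading a tab-delimited dataset description and fetching each SRA run with the SRA Toolkit's fasterq-dump (which must be installed separately). The spreadsheet file name and column name here are invented, not the repository's actual format.

    import csv
    import subprocess

    # Hypothetical TSV describing one benchmark dataset: one row per sample,
    # with an SRA run accession column (the column name is an assumption).
    with open("dataset_description.tsv", newline="") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))

    for row in rows:
        acc = row["sra_run_accession"]  # hypothetical column name
        # fasterq-dump is part of the NCBI SRA Toolkit.
        subprocess.run(["fasterq-dump", "--split-files", acc, "-O", "reads"],
                       check=True)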
An algorithm for direct causal learning of influences on patient outcomes.
Rathnam, Chandramouli; Lee, Sanghoon; Jiang, Xia
2017-01-01
This study develops and introduces a new algorithm, called the direct causal learner (DCL), for learning the direct causal influences of a single target. We applied it to both simulated and real clinical and genome-wide association study (GWAS) datasets and compared its performance to classic causal learning algorithms. The DCL algorithm learns the causes of a single target from passive data using Bayesian scoring instead of independence checks, together with a novel deletion algorithm. We generate 14,400 simulated datasets and measure the number of datasets for which DCL correctly and partially predicts the direct causes. We then compare its performance with the constraint-based path consistency (PC) and conservative PC (CPC) algorithms, the Bayesian-score based fast greedy search (FGS) algorithm, and the partial ancestral graph algorithm fast causal inference (FCI). In addition, we extend our comparison of all five algorithms to both a real GWAS dataset and real breast cancer datasets over various time-points in order to observe how effective they are at predicting the causal influences of Alzheimer's disease and breast cancer survival. DCL consistently outperforms FGS, PC, CPC, and FCI in discovering the parents of the target for the datasets simulated using a simple network. Overall, DCL predicts significantly more datasets correctly (McNemar's test significance: p<0.0001) than any of the other algorithms for these network types. For example, when assessing overall performance (simple and complex network results combined), DCL correctly predicts approximately 1400 more datasets than the top FGS method, 1600 more datasets than the top CPC method, 4500 more datasets than the top PC method, and 5600 more datasets than the top FCI method. Although FGS did correctly predict more datasets than DCL for the complex networks, and DCL correctly predicted only a few more datasets than CPC for these networks, there is no significant difference in performance between these three algorithms for this network type. However, when we use a more continuous measure of accuracy, we find that all the DCL methods are able to partially predict more direct causes than FGS and CPC for the complex networks. In addition, DCL consistently had faster runtimes than the other algorithms. In the application to the real datasets, DCL identified rs6784615, located on the NISCH gene, and rs10824310, located on the PRKG1 gene, as direct causes of late onset Alzheimer's disease (LOAD) development. In addition, DCL identified ER category as a direct predictor of breast cancer mortality within 5 years, and HER2 status as a direct predictor of 10-year breast cancer mortality. These predictors have been identified in previous studies to have a direct causal relationship with their respective phenotypes, supporting the predictive power of DCL. When the other algorithms discovered predictors from the real datasets, these predictors were either also found by DCL or could not be supported by previous studies. Our results show that DCL outperforms FGS, PC, CPC, and FCI in almost every case, demonstrating its potential to advance causal learning. Furthermore, our DCL algorithm effectively identifies direct causes in the LOAD and Metabric GWAS datasets, which indicates its potential for clinical applications. Copyright © 2016 Elsevier B.V. All rights reserved.
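To make the score-based idea concrete, here is a minimal sketch of greedy, Bayesian-score-driven parent selection for a single discrete target: a BIC-penalized log-likelihood score is computed for candidate parent sets, and the parent that most improves it is added. This is my own simplification under the assumption of fully discrete data, not the published DCL algorithm (which also includes a deletion step).

    import numpy as np
    import pandas as pd

    def bic_score(df, target, parents):
        """Discrete log-likelihood of target given parents, BIC-penalized."""
        n = len(df)
        if parents:
            joint = df.groupby(parents + [target]).size()
            marg = df.groupby(parents).size()
            ll = 0.0
            for idx, c in joint.items():
                pidx = idx[:-1] if len(parents) > 1 else idx[0]
                ll += c * np.log(c / marg[pidx])
        else:
            ll = sum(c * np.log(c / n) for c in df[target].value_counts())
        k = df[target].nunique()
        n_params = (k - 1) * int(np.prod([df[p].nunique() for p in parents]))
        return ll - 0.5 * n_params * np.log(n)

    def greedy_direct_causes(df, target):
        parents = []
        candidates = [c for c in df.columns if c != target]
        best = bic_score(df, target, parents)
        improved = True
        while improved:
            improved = False
            scores = {c: bic_score(df, target, parents + [c]) for c in candidates}
            if scores and max(scores.values()) > best:
                pick = max(scores, key=scores.get)
                parents.append(pick)
                candidates.remove(pick)
                best = scores[pick]
                improved = True
        return parents

    # Tiny demo with a synthetic binary dataset (X1 causes Y; X2 is noise).
    rng = np.random.default_rng(0)
    x1 = rng.integers(0, 2, 500)
    x2 = rng.integers(0, 2, 500)
    y = (x1 ^ (rng.random(500) < 0.1)).astype(int)
    demo = pd.DataFrame({"X1": x1, "X2": x2, "Y": y})
    print(greedy_direct_causes(demo, "Y"))  # expected: ['X1']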
Arnold, L. Rick
2010-01-01
These datasets were compiled in support of U.S. Geological Survey Scientific-Investigations Report 2010-5082-Hydrogeology and Steady-State Numerical Simulation of Groundwater Flow in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. The datasets were developed by the U.S. Geological Survey in cooperation with the Lost Creek Ground Water Management District and the Colorado Geological Survey. The four datasets are described as follows and methods used to develop the datasets are further described in Scientific-Investigations Report 2010-5082: (1) ds507_regolith_data: This point dataset contains geologic information concerning regolith (unconsolidated sediment) thickness and top-of-bedrock altitude at selected well and test-hole locations in and near the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Data were compiled from published reports, consultant reports, and from lithologic logs of wells and test holes on file with the U.S. Geological Survey Colorado Water Science Center and the Colorado Division of Water Resources. (2) ds507_regthick_contours: This dataset consists of contours showing generalized lines of equal regolith thickness overlying bedrock in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Regolith thickness was contoured manually on the basis of information provided in the dataset ds507_regolith_data. (3) ds507_regthick_grid: This dataset consists of raster-based generalized thickness of regolith overlying bedrock in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Regolith thickness in this dataset was derived from contours presented in the dataset ds507_regthick_contours. (4) ds507_welltest_data: This point dataset contains estimates of aquifer transmissivity and hydraulic conductivity at selected well locations in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. This dataset also contains hydrologic information used to estimate transmissivity from specific capacity at selected well locations. Data were compiled from published reports, consultant reports, and from well-test records on file with the U.S. Geological Survey Colorado Water Science Center and the Colorado Division of Water Resources.
A new method for water quality assessment: by harmony degree equation.
Zuo, Qiting; Han, Chunhui; Liu, Jing; Ma, Junxia
2018-02-22
Water quality assessment is important basic work in the development, utilization, management, and protection of water resources, and a prerequisite for water safety. In this paper, the harmony degree equation (HDE) was introduced into water quality assessment research, and a new method for water quality assessment by harmony degree equation (WQA-HDE) was proposed. First, the calculation steps and ideas of this method are described in detail; then, the method, together with other important water quality assessment methods (the single factor assessment method, the mean-type comprehensive index assessment method, and the multi-level gray correlation assessment method), was used to assess the water quality of the Shaying River (the largest tributary of the Huaihe in China). For this purpose, a 2-year (2013-2014) dataset of nine water quality variables covering seven monitoring sites, comprising approximately 189 observations, was used to compare and analyze the characteristics and advantages of the new method. The results showed that the calculation steps of WQA-HDE are similar to those of comprehensive assessment methods, and that WQA-HDE is more operational compared with the other water quality assessment methods. In addition, the new method shows good flexibility through the setting of the judgment criterion value HD0 of water quality: when HD0 = 0.8, the results are more realistic and reliable. In particular, when HD0 = 1, the results of WQA-HDE are consistent with the single factor assessment method; both methods are then subject to the most stringent "one vote veto" judgment condition. Thus, WQA-HDE is a composite method that combines single factor assessment and comprehensive assessment. This research not only broadens the theoretical method system of harmony theory but also promotes the unification of water quality assessment methods, and it can serve as a reference for other comprehensive assessments.
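The paper's harmony degree equation is not reproduced in the abstract, so the sketch below is only an illustrative stand-in: it computes a hypothetical per-indicator compliance score against a standard limit, aggregates the scores by weighted mean into an overall harmony degree, and applies the judgment threshold HD0 (here 0.8). This mirrors the thresholding logic described above, not the authors' actual equation.

    import numpy as np

    def harmony_degree(conc, limits, weights=None):
        """Illustrative harmony degree: 1 when a concentration is far below
        its standard limit, 0 when it is at or above twice the limit."""
        conc, limits = np.asarray(conc, float), np.asarray(limits, float)
        sub = np.clip(1.0 - conc / (2.0 * limits), 0.0, 1.0)
        w = np.ones_like(sub) if weights is None else np.asarray(weights, float)
        return float(np.average(sub, weights=w))

    # Hypothetical site: three indicators (e.g. COD, NH3-N, TP) and limits.
    hd = harmony_degree(conc=[18.0, 0.8, 0.15], limits=[20.0, 1.0, 0.2])
    HD0 = 0.8  # judgment criterion value
    print(f"HD = {hd:.3f} ->", "acceptable" if hd >= HD0 else "not acceptable")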
NASA Astrophysics Data System (ADS)
Hiebl, Johann; Frei, Christoph
2018-04-01
Spatial precipitation datasets that are long-term consistent, highly resolved and extend over several decades are an increasingly popular basis for modelling and monitoring environmental processes and planning tasks in hydrology, agriculture, energy resources management, etc. Here, we present a grid dataset of daily precipitation for Austria meant to promote such applications. It has a grid spacing of 1 km, extends back to 1961 and is continuously updated. It is constructed with the classical two-tier analysis, involving separate interpolations for mean monthly precipitation and daily relative anomalies. The former was accomplished by kriging, with topographic predictors as external drift, utilising 1249 stations. The latter is based on angular distance weighting and uses 523 stations. The input station network was kept largely stationary over time to avoid artefacts in long-term consistency. Example cases suggest that the new analysis is at least as plausible as previously existing datasets. Cross-validation and comparison against experimental high-resolution observations (WegenerNet) suggest that the accuracy of the dataset depends on interpretation. Users interpreting grid point values as point estimates must expect systematic overestimates for light and underestimates for heavy precipitation, as well as substantial random errors. Grid point estimates are typically within a factor of 1.5 from in situ observations. Interpreting grid point values as area mean values, conditional biases are reduced and the magnitude of random errors is considerably smaller. Together with a similar dataset of temperature, the new dataset (SPARTACUS) is an interesting basis for modelling environmental processes, studying climate change impacts and monitoring the climate of Austria.
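The two-tier scheme separates a smooth monthly background from daily relative anomalies. The sketch below is a toy stand-in: the station data and coordinates are invented, and plain inverse-distance weighting substitutes for both the kriging and the angular distance weighting used in the paper. It shows how a daily grid value is assembled as monthly background times interpolated daily anomaly.

    import numpy as np

    def idw(xy_stations, values, xy_target, power=2.0):
        """Inverse-distance weighting (simplified stand-in for the paper's
        kriging and angular-distance-weighting interpolators)."""
        d = np.linalg.norm(xy_stations - xy_target, axis=1)
        if np.any(d < 1e-9):
            return float(values[np.argmin(d)])
        w = 1.0 / d ** power
        return float(np.sum(w * values) / np.sum(w))

    # Hypothetical stations: coordinates (km), monthly means and daily totals.
    xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    monthly_mean = np.array([100.0, 140.0, 80.0])   # mm/month
    daily = np.array([4.0, 9.0, 2.0])               # mm on the target day

    target = np.array([4.0, 4.0])
    # Tier 1: interpolate the monthly background.
    bg = idw(xy, monthly_mean, target)
    # Tier 2: interpolate daily anomalies relative to the monthly mean,
    # then rescale by the interpolated background.
    anom = idw(xy, daily / monthly_mean, target)
    print(f"daily estimate at grid point: {bg * anom:.2f} mm")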
NASA Astrophysics Data System (ADS)
Lal, Mohan; Mishra, S. K.; Pandey, Ashish; Pandey, R. P.; Meena, P. K.; Chaudhary, Anubhav; Jha, Ranjit Kumar; Shreevastava, Ajit Kumar; Kumar, Yogendra
2017-01-01
The Soil Conservation Service curve number (SCS-CN) method, also known as the Natural Resources Conservation Service curve number (NRCS-CN) method, is popular for computing the volume of direct surface runoff for a given rainfall event. The performance of the SCS-CN method, originally based on large rainfall (P) and runoff (Q) datasets of United States watersheds, is evaluated here using a large dataset of natural storm events from 27 agricultural plots in India. On the whole, the CN estimates from the National Engineering Handbook (chapter 4) tables do not match those derived from the observed P and Q datasets. As a result, runoff prediction using the former CNs was poor for the data of 22 (out of 24) plots. However, the match was a little better for higher CN values, consistent with the general notion that the existing SCS-CN method performs better for high rainfall-runoff (high CN) events. Infiltration capacity (fc) was the main explanatory variable for runoff (or CN) production in the study plots, exhibiting the expected inverse relationship between CN and fc. The plot-data optimization yielded initial abstraction coefficient (λ) values from 0 to 0.659 for the ordered dataset and 0 to 0.208 for the natural dataset (with 0 as the most frequent value). Mean and median λ values were, respectively, 0.030 and 0 for the natural rainfall-runoff dataset and 0.108 and 0 for the ordered rainfall-runoff dataset. Runoff estimation was very sensitive to λ, and it improved consistently as λ changed from 0.2 to 0.03.
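For reference, the standard SCS-CN rainfall-runoff relation is easy to state in code. The sketch below implements the textbook form (in mm), with the initial abstraction coefficient λ exposed so the effect of moving it from the conventional 0.2 towards the 0.03 suggested above can be tried directly; this is the generic method, not the authors' plot-specific calibration.

    def scs_cn_runoff(p_mm, cn, lam=0.2):
        """Direct surface runoff Q (mm) from event rainfall P (mm) via the
        SCS-CN method: S = 25400/CN - 254, Ia = lam*S,
        Q = (P - Ia)^2 / (P - Ia + S) for P > Ia, else 0."""
        s = 25400.0 / cn - 254.0          # potential maximum retention (mm)
        ia = lam * s                      # initial abstraction (mm)
        if p_mm <= ia:
            return 0.0
        return (p_mm - ia) ** 2 / (p_mm - ia + s)

    # Sensitivity to lambda for a 60 mm storm on a CN = 75 surface:
    for lam in (0.2, 0.108, 0.03):
        print(f"lambda={lam:.3f}: Q = {scs_cn_runoff(60.0, 75, lam):.1f} mm")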
Slavinskaya, N. A.; Abbasi, M.; Starcke, J. H.; ...
2017-01-24
An automated data-centric infrastructure, Process Informatics Model (PrIMe), was applied to validation and optimization of a syngas combustion model. The Bound-to-Bound Data Collaboration (B2BDC) module of PrIMe was employed to discover the limits of parameter modifications based on uncertainty quantification (UQ) and consistency analysis of the model-data system and experimental data, including shock-tube ignition delay times and laminar flame speeds. Existing syngas reaction models are reviewed, and the selected kinetic data are described in detail. Empirical rules were developed and applied to evaluate the uncertainty bounds of the literature experimental data. Here, the initial H2/CO reaction model, assembled from 73 reactions and 17 species, was subjected to a B2BDC analysis. For this purpose, a dataset was constructed that included a total of 167 experimental targets and 55 active model parameters. Consistency analysis of the composed dataset revealed disagreement between models and data. Further analysis suggested that removing 45 experimental targets, 8 of which were self-inconsistent, would lead to a consistent dataset. This dataset was subjected to a correlation analysis, which highlights possible directions for parameter modification and model improvement. Additionally, several methods of parameter optimization were applied, some of them unique to the B2BDC framework. The optimized models demonstrated improved agreement with experiments compared to the initially assembled model, and their predictions for experiments not included in the initial dataset (i.e., a blind prediction) were investigated. The results demonstrate benefits of applying the B2BDC methodology for developing predictive kinetic models.
The Multi-Resolution Land Characteristics (MRLC) Consortium is a good example of the national benefits of federal collaboration. It started in the mid-1990s as a small group of federal agencies with the straightforward goal of compiling a comprehensive national Landsat dataset t...
ERIC Educational Resources Information Center
Zhang, Yu
2013-01-01
With the increasing attention on improving student achievement, private tutoring has been expanding rapidly worldwide. However, the evidence on the effect of private tutoring is inconclusive for education researchers and policy makers. Employing a comprehensive dataset collected from China in 2010, this study tries to identify the effect of…
Compiling a Comprehensive EVA Training Dataset for NASA Astronauts
NASA Technical Reports Server (NTRS)
Laughlin, M. S.; Murry, J. D.; Lee, L. R.; Wear, M. L.; Van Baalen, M.
2016-01-01
Training for a spacewalk or extravehicular activity (EVA) is considered hazardous duty for NASA astronauts. This activity places astronauts at risk for decompression sickness as well as various musculoskeletal disorders from working in the spacesuit. As a result, the operational and research communities over the years have requested access to EVA training data to supplement their studies.
The Educational Impact of Online Learning: How Do University Students Perform in Subsequent Courses?
ERIC Educational Resources Information Center
Krieg, John M.; Henson, Steven E.
2016-01-01
Using a large student-level dataset from a medium-sized regional comprehensive university, we measure the impact of taking an online prerequisite course on follow-up course grades. To control for self-selection into online courses, we utilize student, instructor, course, and time fixed effects augmented with an instrumental variable approach. We…
Identifying Psychometric Properties of the Social-Emotional Learning Skills Scale
ERIC Educational Resources Information Center
Esen-Aygun, Hanife; Sahin-Taskin, Cigdem
2017-01-01
This study aims to develop a comprehensive scale of social-emotional learning. After constructing a wide range of item pool and expertise evaluation, validity and reliability studies were carried out through using the data-set of 439 primary school students at 3rd and 4th grade levels. Exploratory and confirmatory factor analysis results revealed…
USDA-ARS?s Scientific Manuscript database
Bees are among the most important pollinators of flowering plants in most ecosystems. This paper describes a large dataset that represents one of the outcomes of a comprehensive, broadly comparative study on the diversity, biology, biogeography, and evolution of Anthidium Fabricius in the Western He...
James, Eric P.; Benjamin, Stanley G.; Marquis, Melinda
2016-10-28
A new gridded dataset for wind and solar resource estimation over the contiguous United States has been derived from hourly updated 1-h forecasts from the National Oceanic and Atmospheric Administration High-Resolution Rapid Refresh (HRRR) 3-km model, composited over a three-year period (approximately 22 000 forecast model runs). The unique dataset features hourly data assimilation and provides physically consistent wind and solar estimates for the renewable energy industry. The wind resource dataset shows strong similarity to that previously provided by a Department of Energy-funded study, and it includes estimates in southern Canada and northern Mexico. The solar resource dataset represents an initial step towards application-specific fields such as global horizontal and direct normal irradiance. This combined dataset will continue to be augmented with new forecast data from the advanced HRRR atmospheric/land-surface model.
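Compositing hourly forecasts into a resource climatology is conceptually simple. The sketch below averages hourly fields with xarray to produce mean-resource grids; the file pattern and the variable names (wind80m, ghi) are hypothetical placeholders, not actual HRRR output conventions.

    import xarray as xr

    # Hypothetical collection of hourly 1-h forecast files, each holding
    # 2-D fields "wind80m" (m/s) and "ghi" (W/m^2) on the model grid.
    ds = xr.open_mfdataset("hrrr_f01_*.nc", combine="by_coords")

    mean_wind = ds["wind80m"].mean(dim="time")   # multi-year mean wind
    mean_ghi = ds["ghi"].mean(dim="time")        # multi-year mean irradiance

    # Diurnal structure matters for solar: average by hour of day instead.
    diurnal_ghi = ds["ghi"].groupby("time.hour").mean()

    xr.Dataset({"mean_wind": mean_wind, "mean_ghi": mean_ghi}).to_netcdf(
        "resource_climatology.nc")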
NASA Astrophysics Data System (ADS)
Griffiths, Thomas; Habler, Gerlinde; Schantl, Philip; Abart, Rainer
2017-04-01
Crystallographic orientation relationships (CORs) between crystalline inclusions and their hosts are commonly used to support particular inclusion origins, but interpretations are often based on a small fraction of all inclusions in a system. The electron backscatter diffraction (EBSD) method allows collection of large COR datasets more quickly than other methods while maintaining high spatial resolution. Large datasets allow analysis of the relative frequencies of different CORs, and identification of 'statistical CORs', where certain limited degrees of freedom exist in the orientation relationship between two neighbour crystals (Griffiths et al. 2016). Statistical CORs exist in addition to completely fixed 'specific' CORs (previously the only type of COR considered). We present a comparison of three EBSD single point datasets (all N > 200 inclusions) of rutile inclusions in garnet hosts, covering three rock systems, each with a different geological history: 1) magmatic garnet in pegmatite from the Koralpe complex, Eastern Alps, formed at temperatures > 600°C and low pressures; 2) granulite facies garnet rims on ultra-high-pressure garnets from the Kimi complex, Rhodope Massif; and 3) a Moldanubian granulite from the southeastern Bohemian Massif, equilibrated at peak conditions of 1050°C and 1.6 GPa. The present study is unique because all datasets have been analysed using the same catalogue of potential CORs; therefore, relative frequencies and other COR properties can be meaningfully compared. In every dataset > 94% of the inclusions analysed exhibit one of the CORs tested for. Certain CORs are consistently among the most common in all datasets. However, the relative abundances of these common CORs show large variations between datasets (varying from 8 to 42% relative abundance in one case). Other CORs are consistently uncommon but nonetheless present in every dataset. Lastly, some CORs are common in one of the datasets and rare in the remainder. These patterns suggest competing influences on relative COR frequencies. Certain CORs seem consistently favourable, perhaps pointing to very stable low energy configurations, whereas some CORs are favoured in only one system, perhaps due to particulars of the formation mechanism, kinetics or conditions. Variations in COR frequencies between datasets seem to correlate with the conditions of host-inclusion system evolution. The two datasets from granulite-facies metamorphic samples show more similarities to each other than to the pegmatite dataset, and the sample inferred to have experienced the highest temperatures (Moldanubian granulite) shows the lowest diversity of CORs, low frequencies of statistical CORs and the highest frequency of specific CORs. These results provide evidence that petrological information is being encoded in COR distributions. They make a strong case for further studies of the factors influencing COR development and for measurements of COR distributions in other systems and between different phases. Griffiths, T.A., Habler, G., Abart, R. (2016): Crystallographic orientation relationships in host-inclusion systems: New insights from large EBSD data sets. Amer. Miner., 101, 690-705.
NASA Astrophysics Data System (ADS)
Sun, Yankui; Li, Shan; Sun, Zhongyang
2017-01-01
We propose a framework for automated detection of dry age-related macular degeneration (AMD) and diabetic macular edema (DME) from retina optical coherence tomography (OCT) images, based on sparse coding and dictionary learning. The study aims to improve the classification performance of state-of-the-art methods. First, our method presents a general approach to automatically align and crop retina regions; then it obtains global representations of images by using sparse coding and a spatial pyramid; finally, a multiclass linear support vector machine classifier is employed for classification. We apply two datasets for validating our algorithm: the Duke spectral domain OCT (SD-OCT) dataset, consisting of volumetric scans acquired from 45 subjects (15 normal subjects, 15 AMD patients, and 15 DME patients); and a clinical SD-OCT dataset, consisting of 678 OCT retina scans acquired from clinics in Beijing (168, 297, and 213 OCT images for AMD, DME, and normal retinas, respectively). For the former dataset, our classifier correctly identifies 100%, 100%, and 93.33% of the volumes with DME, AMD, and normal subjects, respectively, and thus performs much better than the conventional method; for the latter dataset, our classifier leads to a correct classification rate of 99.67%, 99.67%, and 100.00% for DME, AMD, and normal images, respectively.
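A compact version of this pipeline can be put together with scikit-learn. The sketch below learns a patch dictionary, encodes patches sparsely, max-pools codes over spatial pyramid cells, and trains a linear SVM; the patch size, dictionary size, two-level pyramid, and toy random images standing in for aligned OCT scans are all arbitrary choices of this illustration, not the authors' settings.

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning
    from sklearn.feature_extraction.image import extract_patches_2d
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    def pyramid_feature(img, coder, patch=8):
        """Sparse-code patches, then max-pool codes over a two-level
        spatial pyramid (whole image + 2x2 cells)."""
        feats = []
        cells = [img] + [q for row in np.array_split(img, 2, axis=0)
                         for q in np.array_split(row, 2, axis=1)]
        for cell in cells:
            patches = extract_patches_2d(cell, (patch, patch))
            codes = coder.transform(patches.reshape(len(patches), -1))
            feats.append(np.abs(codes).max(axis=0))  # max pooling
        return np.concatenate(feats)

    # Toy stand-ins for aligned/cropped retina images (3 classes).
    images = rng.random((30, 32, 32))
    labels = rng.integers(0, 3, 30)

    # Learn the patch dictionary on patches pooled from all training images.
    all_patches = np.vstack([
        extract_patches_2d(im, (8, 8), max_patches=50, random_state=0)
        .reshape(-1, 64) for im in images])
    coder = MiniBatchDictionaryLearning(
        n_components=32, transform_algorithm="omp",
        transform_n_nonzero_coefs=5, random_state=0).fit(all_patches)

    X = np.array([pyramid_feature(im, coder) for im in images])
    clf = LinearSVC(dual=False).fit(X, labels)
    print("training accuracy:", clf.score(X, labels))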
Longitudinal Data on the Effectiveness of Mathematics Mini-Games in Primary Education
ERIC Educational Resources Information Center
Bakker, Marjoke; Van den Heuvel-Panhuizen, Marja; Robitzsch, Alexander
2015-01-01
This paper describes a dataset consisting of longitudinal data gathered in the BRXXX project. The aim of the project was to investigate the effectiveness of online mathematics mini-games in enhancing primary school students' multiplicative reasoning ability (multiplication and division). The dataset includes data of 719 students from 35 primary…
NASA Astrophysics Data System (ADS)
Hidalgo, A.; González-Rouco, J. F.; Jiménez, P. A.; Navarro, J.; García-Bustamante, E.; Lucio-Eceiza, E. E.; Montávez, J. P.; García, A. Y.; Prieto, L.
2012-04-01
Offshore wind energy is becoming increasingly important as a reliable source of electricity generation. The areas located in the vicinity of the Cantabrian and Mediterranean coasts are areas of interest in this regard. This study targets an assessment of the wind resource focused on the two coastal regions and the strip of land between them, thereby including most of the northeastern part of the Iberian Peninsula (IP) and containing the Ebro basin. The analysis of the wind resource in inland areas is crucial, as the wind channeling through the existing mountains has a direct impact on the sea circulations near the coast. The thermal circulations generated by the topography near the coast also influence the offshore wind resource. This work summarizes the results of the first steps of a Quality Assurance (QA) procedure applied to the surface wind database available over the area of interest. The dataset consists of 752 stations compiled from different sources: 14 buoys distributed over the IP coast provided by Puertos del Estado (1990-2010); and 738 land sites over the area of interest provided by 8 different Spanish institutions (1933-2010) and the National Center of Atmospheric Research (NCAR; 1978-2010). It is worth noting that the variety of institutional observational protocols leads to different temporal resolutions and peculiarities that somewhat complicate the QA. The QA applied to the dataset is structured in three steps that involve the detection and suppression of: 1) manipulation errors (i.e. repetitions); 2) unrealistic values and ranges in wind module and direction; 3) abnormally low variations (e.g. long constant periods) and abnormally high variations (e.g. extreme values and inhomogeneities), to ensure the temporal consistency of the time series. A quality controlled observational network of wind variables with such spatial density and temporal length is not frequent and, specifically for the IP, is not documented in the literature. The final observed dataset will allow for a comprehensive understanding of the wind field climatology and variability and its association with the large scale atmospheric circulation, as well as their dependence on local/regional features like topography, land-sea contrast, etc. In future steps, a high spatial resolution simulation will be accomplished with the WRF mesoscale model in order to improve the knowledge of the wind field in the area of interest. This simulation will be validated by comparison with the observational dataset. In addition, studies will be carried out to analyze the sensitivity of the model to different factors, such as the parameterizations of the most significant physical processes that the model does not solve explicitly, the boundary conditions that feed the model, etc.
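Steps 2 and 3 of such a QA procedure reduce to simple vectorized checks. The sketch below flags out-of-range wind speeds and directions, abnormally long constant runs, and hour-to-hour spikes in a pandas time series; the thresholds and column names are illustrative choices, not the project's actual criteria.

    import numpy as np
    import pandas as pd

    def qa_flags(df, max_speed=60.0, max_constant_hours=24):
        """Return boolean QA flags for an hourly wind record with columns
        'speed' (m/s) and 'direction' (degrees)."""
        flags = pd.DataFrame(index=df.index)
        # Step 2: physically unrealistic values and ranges.
        flags["bad_range"] = ((df["speed"] < 0) | (df["speed"] > max_speed)
                              | (df["direction"] < 0) | (df["direction"] >= 360))
        # Step 3a: abnormally low variation (long constant periods).
        run_id = (df["speed"] != df["speed"].shift()).cumsum()
        run_len = df.groupby(run_id)["speed"].transform("size")
        flags["constant_run"] = (run_len > max_constant_hours) & (df["speed"] > 0)
        # Step 3b: abnormally high variation (spikes between hours).
        flags["spike"] = df["speed"].diff().abs() > 25.0
        return flags

    idx = pd.date_range("2010-01-01", periods=100, freq="h")
    rec = pd.DataFrame(
        {"speed": np.r_[np.full(30, 5.0),
                        np.random.default_rng(1).gamma(2, 2, 70)],
         "direction": np.random.default_rng(2).uniform(0, 360, 100)},
        index=idx)
    print(qa_flags(rec).sum())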
Nasejje, Justine B; Mwambi, Henry; Dheda, Keertan; Lesosky, Maia
2017-07-28
Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model in analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points; hence, conditional inference forests have been suggested for time-to-event data. Conditional inference forests (CIF) are known to correct the bias in RSF models by separating the procedure for selecting the best covariate to split on from the search for the best split point of the selected covariate. In this study, we compare the random survival forest model to the conditional inference forest (CIF) model using twenty-two simulated time-to-event datasets. We also analysed two real time-to-event datasets. The first dataset is based on the survival of children under five years of age in Uganda and consists of categorical covariates, most of which have more than two levels (many split-points). The second dataset is based on the survival of patients with extensively drug-resistant tuberculosis (XDR TB) and consists mainly of categorical covariates with two levels (few split-points). The study findings indicate that the conditional inference forest model is superior to random survival forest models in analysing time-to-event data that consist of covariates with many split-points, based on the bootstrap cross-validated estimates of the integrated Brier score. However, conditional inference forests perform comparably to random survival forest models in analysing time-to-event data consisting of covariates with fewer split-points. Although survival forests are promising methods for analysing time-to-event data, it is important to identify the best forest model for analysis based on the nature of the covariates of the dataset in question.
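As an illustration of the comparison metric, the sketch below fits a random survival forest with scikit-survival and scores it with the integrated Brier score on synthetic data. The package and its API details (RandomSurvivalForest, integrated_brier_score, the unique_times_ attribute) are assumptions about scikit-survival made here and may vary by version; conditional inference forests themselves are typically fit with R's party/partykit rather than Python.

    import numpy as np
    from sksurv.ensemble import RandomSurvivalForest
    from sksurv.metrics import integrated_brier_score
    from sksurv.util import Surv

    rng = np.random.default_rng(0)
    n = 400
    X = rng.normal(size=(n, 5))
    # Synthetic survival times driven by the first covariate; random censoring.
    t_event = rng.exponential(scale=np.exp(-0.7 * X[:, 0]))
    t_cens = rng.exponential(scale=1.5, size=n)
    time = np.minimum(t_event, t_cens)
    event = t_event <= t_cens
    y = Surv.from_arrays(event=event, time=time)

    train, test = np.arange(0, 300), np.arange(300, n)
    rsf = RandomSurvivalForest(n_estimators=100, random_state=0)
    rsf.fit(X[train], y[train])

    # Evaluation grid strictly inside the test follow-up range.
    times = np.quantile(time[test], np.linspace(0.1, 0.7, 10))
    surv = rsf.predict_survival_function(X[test], return_array=True)
    # Columns of `surv` correspond to the training event times
    # (rsf.unique_times_; attribute name is a version-dependent assumption).
    est = np.asarray([np.interp(times, rsf.unique_times_, s) for s in surv])
    print("integrated Brier score:",
          integrated_brier_score(y[train], y[test], est, times))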
Improving the use of environmental diversity as a surrogate for species representation.
Albuquerque, Fabio; Beier, Paul
2018-01-01
The continuous p-median approach to environmental diversity (ED) is a reliable way to identify sites that efficiently represent species. A recently developed maximum dispersion (maxdisp) approach to ED is computationally simpler, does not require the user to reduce environmental space to two dimensions, and performed better than continuous p-median for datasets of South African animals. We tested whether maxdisp performs as well as continuous p-median for 12 datasets that included plants and other continents, and whether particular types of environmental variables produced consistently better models of ED. We selected 12 species inventories and atlases to span a broad range of taxa (plants, birds, mammals, reptiles, and amphibians), spatial extents, and resolutions. For each dataset, we used continuous p-median ED and maxdisp ED in combination with five sets of environmental variables (five combinations of temperature, precipitation, insolation, NDVI, and topographic variables) to select environmentally diverse sites. We used the species accumulation index (SAI) to evaluate the efficiency of ED in representing species for each approach and set of environmental variables. Maxdisp ED represented species better than continuous p-median ED in five of 12 biodiversity datasets, and about the same for the other seven biodiversity datasets. Efficiency of ED also varied with type of variables used to define environmental space, but no particular combination of variables consistently performed best. We conclude that maxdisp ED performs at least as well as continuous p-median ED, and has the advantage of faster and simpler computation. Surprisingly, using all 38 environmental variables was not consistently better than using subsets of variables, nor did any subset emerge as consistently best or worst; further work is needed to identify the best variables to define environmental space. Results can help ecologists and conservationists select sites for species representation and assist in conservation planning.
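The maxdisp idea, selecting sites spread as widely as possible in environmental space, can be illustrated with a simple greedy max-min heuristic. The sketch below is a generic stand-in (standardized toy variables, greedy rather than exact optimization), not the authors' implementation.

    import numpy as np

    def maxdisp_select(env, k):
        """Greedy max-min dispersion: repeatedly add the site farthest (in
        standardized environmental space) from the sites already chosen."""
        z = (env - env.mean(axis=0)) / env.std(axis=0)  # standardize variables
        # Start from the site closest to the centroid, then expand outward.
        chosen = [int(np.argmin(np.linalg.norm(z, axis=1)))]
        while len(chosen) < k:
            d = np.linalg.norm(z[:, None, :] - z[None, chosen, :], axis=2)
            min_d = d.min(axis=1)        # distance to nearest chosen site
            min_d[chosen] = -np.inf      # never re-pick a chosen site
            chosen.append(int(np.argmax(min_d)))
        return chosen

    rng = np.random.default_rng(0)
    env = rng.random((200, 4))  # 200 candidate sites x 4 env. variables
    print(maxdisp_select(env, k=10))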
Dreier, Larissa Alice; Zernikow, Boris; Blankenburg, Markus; Wager, Julia
2018-02-01
Sleep problems are a common and serious issue in children with life-limiting conditions (LLCs) and severe psychomotor impairment (SPMI). The "Sleep Questionnaire for Children with Severe Psychomotor Impairment" (Schlaffragebogen für Kinder mit Neurologischen und Anderen Komplexen Erkrankungen, SNAKE) was developed for this unique patient group. In a proxy rating, the SNAKE assesses five different dimensions of sleep(-associated) problems (disturbances going to sleep, disturbances remaining asleep, arousal and breathing disorders, daytime sleepiness, and daytime behavior disorders). It has been tested with respect to construct validity and some aspects of criterion validity. The present study examined whether the five SNAKE scales are consistent with parents' or other caregivers' global ratings of a child's sleep quality. Data from a comprehensive dataset of children and adolescents with LLCs and SPMI were analyzed through correlation coefficients and Mann-Whitney U testing. The results confirmed the consistency of both sources of information. The highest levels of agreement with the global rating were achieved for disturbances going to sleep and disturbances remaining asleep. The results demonstrate that the scales, and therefore the SNAKE itself, are well-suited for gathering information on different sleep(-associated) problems in this vulnerable population.
3Drefine: an interactive web server for efficient protein structure refinement
Bhattacharya, Debswapna; Nowotny, Jackson; Cao, Renzhi; Cheng, Jianlin
2016-01-01
3Drefine is an interactive web server for consistent and computationally efficient protein structure refinement with the capability to perform web-based statistical and visual analysis. The 3Drefine refinement protocol utilizes iterative optimization of the hydrogen bonding network combined with atomic-level energy minimization on the optimized model, using composite physics- and knowledge-based force fields, for efficient protein structure refinement. The method has been extensively evaluated on blind CASP experiments as well as on large-scale and diverse benchmark datasets, and exhibits consistent improvement over the initial structure in both global and local structural quality measures. The 3Drefine web server allows for convenient protein structure refinement through a text or file input submission, email notification, and a provided example submission, and is freely available without any registration requirement. The server also provides comprehensive analysis of submissions through various energy and statistical feedback and interactive visualization of multiple refined models through the JSmol applet, which is equipped with numerous protein model analysis tools. The web server has been extensively tested and used by many users. As a result, the 3Drefine web server conveniently provides a useful tool easily accessible to the community. The 3Drefine web server has been made publicly available at the URL: http://sysbio.rnet.missouri.edu/3Drefine/. PMID:27131371
Wieczorek, Karina; Lachowska-Cierlik, Dorota; Kajtoch, Łukasz; Kanturski, Mariusz
2017-01-01
The Chaitophorinae is a bionomically diverse Holarctic subfamily of Aphididae. The current classification includes two tribes: the Chaitophorini, associated with deciduous trees and shrubs, and the Siphini, which feed on monocotyledonous plants. We present the first phylogenetic hypothesis for the subfamily, based on molecular and morphological datasets. Molecular analyses were based on the mitochondrial gene cytochrome oxidase subunit I (COI) and the nuclear gene elongation factor-1α (EF-1α). Phylogenetic inferences were obtained for each gene individually and for the joined alignments using Bayesian inference (BI) and maximum likelihood (ML). In phylogenetic trees reconstructed on the basis of nuclear and mitochondrial genes as well as the morphological dataset, the monophyly of Siphini and the genus Chaitophorus was supported. Periphyllus forms lineages independent of Chaitophorus and Siphini. Within this genus, two clades comprising European and Asiatic species, respectively, were indicated. Concerning relationships within the subfamily, the EF-1α and joined COI and EF-1α analyses strongly support the hypothesis that the Chaitophorini do not form a monophyletic clade. Periphyllus is a sister group to a clade containing Chaitophorus and Siphini. The Asiatic unit of Periphyllus also includes Trichaitophorus koyaensis. The analysis of the morphological dataset under equally weighted parsimony also supports the view that Chaitophorini is an artificial taxon, as Lambersaphis pruinosae and Pseudopterocomma hughi, both traditionally included in the Chaitophorini, formed independent lineages. COI analyses support consistent groups within the subfamily, but relationships between groups are poorly resolved. These analyses were extended to include species of the closely related and phylogenetically unstudied subfamily Drepanosiphinae, which produced congruent results. The genera Drepanosiphum and Drepanaphis are monophyletic and sister. The position of Yamatocallis tokyoensis differs between the molecular and morphological analyses, i.e. it is either an independent lineage (EF-1α, COI, joined COI and EF-1α genes) or is nested inside this unit (morphology). Our data also support the separation of Chaitophorinae from Drepanosiphinae. PMID:28288166
Advanced Multivariate Inversion Techniques for High Resolution 3D Geophysical Modeling (Invited)
NASA Astrophysics Data System (ADS)
Maceira, M.; Zhang, H.; Rowe, C. A.
2009-12-01
We focus on the development and application of advanced multivariate inversion techniques to generate a realistic, comprehensive, and high-resolution 3D model of the seismic structure of the crust and upper mantle that satisfies several independent geophysical datasets. Building on previous efforts of joint inversion using surface wave dispersion measurements, gravity data, and receiver functions, we have added a fourth dataset, seismic body wave P and S travel times, to the simultaneous joint inversion method. We present a 3D seismic velocity model of the crust and upper mantle of northwest China resulting from the simultaneous, joint inversion of these four data types. Surface wave dispersion measurements are primarily sensitive to seismic shear-wave velocities, but at shallow depths it is difficult to obtain high-resolution velocities and to constrain the structure due to the depth-averaging of the more easily-modeled, longer-period surface waves. Gravity inversions have the greatest resolving power at shallow depths, and they provide constraints on rock density variations. Moreover, while surface wave dispersion measurements are primarily sensitive to vertical shear-wave velocity averages, body wave receiver functions are sensitive to shear-wave velocity contrasts and vertical travel times. Addition of the fourth dataset, consisting of seismic travel-time data, helps to constrain the shear wave velocities both vertically and horizontally in the model cells crossed by the ray paths. Incorporation of both P and S body wave travel times allows us to invert for both P and S velocity structure, capitalizing on empirical relationships between the seismic velocities of both wave types and rock densities, thus eliminating the need for ad hoc assumptions regarding the Poisson ratios. Our new tomography algorithm is a modification of the Maceira and Ammon joint inversion code, in combination with the Zhang and Thurber TomoDD (double-difference tomography) program.
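In the linearized limit, a simultaneous joint inversion of several datasets amounts to minimizing a weighted sum of data misfits over one shared model vector. The sketch below stacks the weighted systems and solves them by least squares; the random toy matrices stand in for real surface-wave, gravity, receiver-function, and travel-time sensitivity kernels, so this is a generic illustration rather than the authors' algorithm.

    import numpy as np

    def joint_inversion(systems, weights):
        """Solve min_m sum_i w_i * ||A_i m - d_i||^2 by stacking the
        weighted systems into one least-squares problem."""
        a_stack = np.vstack([np.sqrt(w) * a
                             for (a, _), w in zip(systems, weights)])
        d_stack = np.concatenate([np.sqrt(w) * d
                                  for (_, d), w in zip(systems, weights)])
        m, *_ = np.linalg.lstsq(a_stack, d_stack, rcond=None)
        return m

    rng = np.random.default_rng(0)
    m_true = rng.random(20)  # shared velocity/density model vector
    # Toy stand-ins for four sensitivity-kernel matrices and their data.
    systems = []
    for n_obs in (30, 25, 15, 40):   # dispersion, gravity, RF, travel times
        a = rng.normal(size=(n_obs, 20))
        d = a @ m_true + 0.01 * rng.normal(size=n_obs)
        systems.append((a, d))

    m_est = joint_inversion(systems, weights=[1.0, 0.5, 0.5, 1.0])
    print("model recovery error:", np.linalg.norm(m_est - m_true))

The per-dataset weights play the same role as the relative weighting of dispersion, gravity, receiver-function, and travel-time misfits in a real joint inversion.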
NASA Astrophysics Data System (ADS)
Powell, Eric N.; Kuykendall, Kelsey M.; Moreno, Paula
2017-06-01
A comprehensive dataset for the Georges Bank region is used to directly compare the distribution of the death assemblage and the living community at large spatial scales and to assess the application of the death assemblage in tracking changes in species' distributional pattern as a consequence of climate change. Focus is placed on the biomass-dominant clam species of the northwest Atlantic continental shelf: the surfclam Spisula solidissima and the ocean quahog Arctica islandica, for which extensive datasets exist on the distributions of the living population and the death assemblage. For both surfclams and ocean quahogs, the distribution of dead shells, in the main, tracked the distribution of live animals relatively closely. Thus, for both species, the presence of dead shells was a positive indicator of present, recent, or past occupation by live animals. Shell dispersion within habitat was greater for surfclams than for ocean quahogs either due to spatial time averaging, animals not living in all habitable areas all of the time, or parautochthonous redistribution of shell. The regional distribution of dead shell differed from the distribution of live animals, for both species, in a systematic way indicative of range shifts due to climate change. In each case the differential distribution was consistent with warming of the northwest Atlantic. Present-day overlap of live surfclams with live ocean quahogs was consistent with the expectation that the surfclam's range is shifting into deeper water in response to the recent warming trend. The presence of locations devoid of dead shells where live surfclams nevertheless were collected measures the recentness of this event. The presence of dead ocean quahog shells at shallower depths than live ocean quahogs offers good evidence that a range shift has occurred in the past, but prior to the initiation of routine surveys in 1980. Possibly, this range shift tracks initial colonization at the end of the Little Ice Age.
Nie, Zhi; Vairavan, Srinivasan; Narayan, Vaibhav A; Ye, Jieping; Li, Qingqin S
2018-01-01
Identification of risk factors of treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work in predictive modeling of treatment-resistant depression (TRD) by partitioning the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort, RIS-INT-93, as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model the TRD outcome at level 2. For TRD defined using QIDS-C16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and the RIS-INT-93 independent dataset, with areas under the receiver operating characteristic curve (AUC) of 0.70-0.78 and 0.72-0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. A model developed using the top 30 features identified by a feature selection technique (k-means clustering followed by a χ² test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using features overlapping between STAR*D and RIS-INT-93 achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in the STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response or symptom severity at week 2. These results indicate that prediction of TRD prior to a second round of antidepressant treatment could be feasible even in the absence of biomarker data.
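The modeling loop described, selecting a compact feature set and validating AUC on held-out data, is straightforward with scikit-learn. The sketch below uses SelectKBest with the χ² statistic as a simplified stand-in for the k-means-plus-χ² procedure mentioned above, on synthetic data rather than the STAR*D records.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    # Synthetic stand-in: 1000 patients x 200 non-negative features
    # (chi2 scoring requires non-negative inputs), binary TRD outcome.
    X = rng.integers(0, 4, size=(1000, 200)).astype(float)
    y = (X[:, 0] + X[:, 1] + rng.normal(0, 2, 1000) > 3).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0, stratify=y)
    model = make_pipeline(SelectKBest(chi2, k=30),
                          LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"held-out AUC: {auc:.2f}")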
GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets.
Jeong, Seongmun; Kim, Jae-Yoon; Jeong, Soon-Chun; Kang, Sung-Taeg; Moon, Jung-Kyung; Kim, Namshin
2017-01-01
Selecting core subsets from plant genotype datasets is important for enhancing cost-effectiveness and shortening the time required for analyses of genome-wide association studies (GWAS), genomics-assisted breeding of crop species, etc. Recently, a large number of genetic markers (>100,000 single nucleotide polymorphisms) have been identified from high-density single nucleotide polymorphism (SNP) arrays and next-generation sequencing (NGS) data. However, no software has been available for picking out an efficient and consistent core subset from such a huge dataset. Software is needed that can extract genetically important samples from a population with coherence. We here present a new program, GenoCore, which can quickly and efficiently find a core subset representing the entire population. We introduce simple measures of coverage and diversity scores, which reflect genotype errors and genetic variations, and which help to select samples rapidly and accurately for a crop genotype dataset. We compare our method to other core collection software on example datasets to validate its performance with respect to genetic distance, diversity, coverage, required system resources, and number of selected samples. GenoCore selects the smallest, most consistent, and most representative core collection from all samples, using less memory with more efficient scores, and shows greater genetic coverage compared to the other software tested. GenoCore was written in R, and can be accessed online with an example dataset and test results at https://github.com/lovemun/Genocore.
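A coverage-driven core selection can be sketched in a few lines: greedily add the sample that contributes the most not-yet-covered marker alleles until a coverage target is reached. This is a generic greedy set-cover illustration on a toy genotype matrix, not GenoCore's actual coverage and diversity scores.

    import numpy as np

    def greedy_core(geno, target_coverage=0.99):
        """geno: samples x markers matrix of allele codes (0/1/2).
        Greedily pick samples until the core contains the target fraction
        of all (marker, allele) combinations present in the population."""
        n, m = geno.shape
        alleles = {(j, a) for j in range(m) for a in np.unique(geno[:, j])}
        covered, core = set(), []
        while len(covered) < target_coverage * len(alleles):
            gains = []
            for i in range(n):
                if i in core:
                    gains.append(-1)
                    continue
                gains.append(sum((j, geno[i, j]) not in covered
                                 for j in range(m)))
            best = int(np.argmax(gains))
            core.append(best)
            covered |= {(j, geno[best, j]) for j in range(m)}
        return core

    rng = np.random.default_rng(0)
    geno = rng.integers(0, 3, size=(100, 500))  # 100 samples x 500 SNPs
    core = greedy_core(geno)
    print(f"core size: {len(core)} of 100 samples")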
Unbalanced 2 x 2 Factorial Designs and the Interaction Effect: A Troublesome Combination
2015-01-01
In this power study, ANOVAs of unbalanced and balanced 2 x 2 datasets are compared (N = 120). Datasets are created under the assumption that H1 of the effects is true. The effects are constructed in two ways, assuming: 1. contributions to the effects solely in the treatment groups; 2. contrasting contributions in treatment and control groups. The main question is whether the two ANOVA correction methods for imbalance (applying Sums of Squares Type II or III; SS II or SS III) offer satisfactory power in the presence of an interaction. Overall, SS II showed higher power, but results varied strongly. When compared to a balanced dataset, for some unbalanced datasets the rejection rate of H0 of main effects was undesirably higher. SS III showed consistently somewhat lower power. When the effects were constructed with equal contributions from control and treatment groups, the interaction could be re-estimated satisfactorily. When an interaction was present, SS III led consistently to somewhat lower rejection rates of H0 of main effects, compared to the rejection rates found in equivalent balanced datasets, while SS II produced strongly varying results. In data constructed with effects only in the treatment groups and no effects in the control groups, the H0 of moderate and strong interaction effects was often not rejected, and SS II seemed applicable. Even then, SS III provided slightly better results when a true interaction was present. ANOVA did not always allow for a satisfactory re-estimation of the unique interaction effect. Yet SS II worked better only when an interaction effect could be excluded, whereas SS III results were just marginally worse in that case. Overall, SS III consistently provided 1 to 5% lower rejection rates of H0 in comparison with analyses of balanced datasets, while results of SS II varied too widely for general application. PMID:25807514
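For readers who want to reproduce the contrast between the two correction methods, the sketch below fits an unbalanced 2 x 2 design with statsmodels and prints both ANOVA tables. Cell counts and effect sizes are invented for the example; sum-to-zero contrasts are used so that the Type III decomposition is meaningful.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
# Unbalanced cell counts summing to N = 120; effects only in the treatment cells.
cells = [("ctrl", "ctrl", 40), ("ctrl", "trt", 20), ("trt", "ctrl", 20), ("trt", "trt", 40)]
rows = []
for a, b, n in cells:
    mu = 0.5 * (a == "trt") + 0.5 * (b == "trt") + 0.5 * ((a == "trt") and (b == "trt"))
    rows += [{"A": a, "B": b, "y": mu + rng.normal()} for _ in range(n)]
df = pd.DataFrame(rows)

# Sum contrasts so that Type III tests refer to sensible main effects.
model = ols("y ~ C(A, Sum) * C(B, Sum)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # Sums of Squares Type II
print(sm.stats.anova_lm(model, typ=3))  # Sums of Squares Type III
```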
NASA Astrophysics Data System (ADS)
Easterday, K.; Kelly, M.; McIntyre, P. J.
2015-12-01
Climate change is forecasted to have considerable influence on the distribution, structure, and function of California's forests. However, human interactions with forested landscapes (e.g. fire suppression, resource extraction, etc.) have complicated scientific understanding of the relative contributions of climate change and anthropogenic land management practices as drivers of change. Observed changes in forest structure towards smaller, denser forests across California have been attributed to both climate change (e.g. increased temperatures and declining water availability) and management practices (e.g. fire suppression and logging). Disentangling how these drivers of change act both together and apart is important to developing sustainable policy and land management practices as well as enhancing knowledge of human and natural system interactions. To that end, a comprehensive historical dataset - the Vegetation Type Mapping project (VTM) - and a modern forest inventory dataset (FIA) are used to analyze how spatial variations in vegetation composition and structure over a ~100 year period can be explained by land ownership.
Krüger, Angela V; Jelier, Rob; Dzyubachyk, Oleh; Zimmerman, Timo; Meijering, Erik; Lehner, Ben
2015-02-15
Chromatin regulators are widely expressed proteins with diverse roles in gene expression, nuclear organization, cell cycle regulation, pluripotency, physiology and development, and are frequently mutated in human diseases such as cancer. Their inhibition often results in pleiotropic effects that are difficult to study using conventional approaches. We have developed a semi-automated nuclear tracking algorithm to quantify the divisions, movements and positions of all nuclei during the early development of Caenorhabditis elegans and have used it to systematically study the effects of inhibiting chromatin regulators. The resulting high dimensional datasets revealed that inhibition of multiple regulators, including F55A3.3 (encoding FACT subunit SUPT16H), lin-53 (RBBP4/7), rba-1 (RBBP4/7), set-16 (MLL2/3), hda-1 (HDAC1/2), swsn-7 (ARID2), and let-526 (ARID1A/1B) affected cell cycle progression and caused chromosome segregation defects. In contrast, inhibition of cir-1 (CIR1) accelerated cell division timing in specific cells of the AB lineage. The inhibition of RNA polymerase II also accelerated these division timings, suggesting that normal gene expression is required to delay cell cycle progression in multiple lineages in the early embryo. Quantitative analyses of the dataset suggested the existence of at least two functionally distinct SWI/SNF chromatin remodeling complex activities in the early embryo, and identified a redundant requirement for the egl-27 and lin-40 MTA orthologs in the development of endoderm and mesoderm lineages. Moreover, our dataset also revealed a characteristic rearrangement of chromatin to the nuclear periphery upon the inhibition of multiple general regulators of gene expression. Our systematic, comprehensive and quantitative datasets illustrate the power of single cell-resolution quantitative tracking and high dimensional phenotyping to investigate gene function. Furthermore, the results provide an overview of the functions of essential chromatin regulators during the early development of an animal. Copyright © 2014 Elsevier Inc. All rights reserved.
Modes of Brachiopod Body Size Evolution throughout the Phanerozoic Eon
NASA Astrophysics Data System (ADS)
Zhang, Z.; Payne, J.
2012-12-01
Body size correlates with numerous physiological and behavioral traits and is therefore one of the most important influences on the survival prospects of individuals and species. Patterns of body size evolution across taxa can thus complement taxonomic diversity and geochemical proxy data in quantifying controls on long-term trends in the history of life. In contrast to widely available and synoptic taxonomic diversity data for fossil animal families and genera, however, no comprehensive size dataset exists, even for a single fossil animal phylum. For this study, we compiled a comprehensive, genus-level dataset of body sizes spanning the entire Phanerozoic for the phylum Brachiopoda. We use this dataset to examine statistical support for several possible modes of size evolution, in addition to environmental covariates: CO2, O2, and sea level. Brachiopod body size in the Phanerozoic followed two evolutionary modes: a directional trend in the Early Paleozoic (Cambrian - Mississippian), and an unbiased random walk from the Mississippian to the present. We find no convincing correlation between trends in any single environmental parameter and brachiopod body size over time. The Paleozoic size increase follows Cope's Rule and has been documented in many other marine invertebrates, while the Mesozoic size plateau has not been. This interval of size stability correlates with increased competition for resources from bivalves beginning during the Mesozoic Marine Revolution, and may be causally linked. The Late Mesozoic decline in size is an artifact of the improved sampling of smaller genera, many of which are less abundant than their Paleozoic ancestors. The Cenozoic brachiopod dataset is similarly incomplete. Biodiversity is decoupled from size dynamics even within the Paleozoic, when brachiopods were on average becoming larger and more abundant, suggesting the presence of different controls. Our findings reveal that the dynamics of body size evolution changed over time in brachiopods, indicating that no single, simple model is likely to capture the true complexity of their evolutionary dynamics.
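The mode-fitting idea can be illustrated with a toy likelihood comparison: treat the step-to-step changes in mean size as normal draws with either a free mean (directional trend, a general random walk) or a zero mean (unbiased random walk), and compare AIC. The series below is simulated; nothing here uses the brachiopod data or the study's exact method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
size = np.cumsum(rng.normal(loc=0.05, scale=0.2, size=60))  # simulated log-size series
steps = np.diff(size)

# Unbiased random walk: steps ~ N(0, s2); one free parameter.
s2_urw = np.mean(steps**2)
ll_urw = stats.norm.logpdf(steps, 0.0, np.sqrt(s2_urw)).sum()

# Directional trend (general random walk): steps ~ N(mu, s2); two parameters.
ll_grw = stats.norm.logpdf(steps, steps.mean(), steps.std()).sum()

aic = lambda ll, k: 2 * k - 2 * ll
print("AIC unbiased walk:", round(aic(ll_urw, 1), 2))
print("AIC directional trend:", round(aic(ll_grw, 2), 2))
```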
A comprehensive assessment of the musculoskeletal system: The CAMS-Knee data set.
Taylor, William R; Schütz, Pascal; Bergmann, Georg; List, Renate; Postolka, Barbara; Hitz, Marco; Dymke, Jörn; Damm, Philipp; Duda, Georg; Gerber, Hans; Schwachmeyer, Verena; Hosseini Nasab, Seyyed Hamed; Trepczynski, Adam; Kutzner, Ines
2017-12-08
Combined knowledge of the functional kinematics and kinetics of the human body is critical for understanding a wide range of biomechanical processes including musculoskeletal adaptation, injury mechanics, and orthopaedic treatment outcome, but also for validation of musculoskeletal models. Until now, however, no datasets that include internal loading conditions (kinetics), synchronized with advanced kinematic analyses in multiple subjects, have been available. Our goal was to provide such datasets and thereby foster a new understanding of how in vivo knee joint movement and contact forces are interlinked - and how they impact biomechanical interpretation of any new knee replacement design. In this collaborative study, we have created unique kinematic and kinetic datasets of the lower limb musculoskeletal system for worldwide dissemination by assessing a unique cohort of 6 subjects with instrumented knee implants (Charité - Universitätsmedizin Berlin) synchronized with a moving fluoroscope (ETH Zürich) and other measurement techniques (including whole body kinematics, ground reaction forces, video data, and electromyography data) for multiple complete cycles of 5 activities of daily living. Maximal tibio-femoral joint contact forces during walking (mean peak 2.74 BW), sit-to-stand (2.73 BW), stand-to-sit (2.57 BW), squats (2.64 BW), stair descent (3.38 BW), and ramp descent (3.39 BW) were observed. Internal rotation of the tibia ranged from 3° external to 9.3° internal. The greatest range of antero-posterior translation was measured during stair descent (medial 9.3 ± 1.0 mm, lateral 7.5 ± 1.6 mm), and the lowest during stand-to-sit (medial 4.5 ± 1.1 mm, lateral 3.7 ± 1.4 mm). The complete and comprehensive datasets will soon be made available online for public use in biomechanical and orthopaedic research and development. Copyright © 2017 The Authors. Published by Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Song, Y.; Gurney, K. R.; Rayner, P. J.; Asefi-Najafabady, S.
2012-12-01
High resolution quantification of global fossil fuel CO2 emissions has become essential in research aimed at understanding the global carbon cycle and supporting the verification of international agreements on greenhouse gas emission reductions. The Fossil Fuel Data Assimilation System (FFDAS) was used to estimate global fossil fuel carbon emissions at 0.25-degree resolution from 1992 to 2010. FFDAS quantifies CO2 emissions based on areal population density, per capita economic activity, energy intensity and carbon intensity. A critical constraint to this system is the estimation of national-scale fossil fuel CO2 emissions disaggregated into economic sectors. Furthermore, prior uncertainty estimation is an important aspect of the FFDAS, so objective techniques to quantify uncertainty for the national emissions are essential. There are several institutional datasets that quantify national carbon emissions, including British Petroleum (BP), the International Energy Agency (IEA), the Energy Information Administration (EIA), and the Carbon Dioxide Information and Analysis Center (CDIAC). These four datasets have been "harmonized" by Jordan Macknick for inter-comparison purposes (Macknick, Carbon Management, 2011). The harmonization attempted to generate consistency among the different institutional datasets via a variety of techniques, such as reclassifying into consistent emitting categories, recalculating based on consistent emission factors, and converting into consistent units. These harmonized data form the basis of our uncertainty estimation. We summarized the maximum, minimum and mean national carbon emissions for all the datasets from 1992 to 2010 and calculated key statistics highlighting the remaining differences among the harmonized datasets. We combine the span (max - min) of the datasets for each country and year with the standard deviation of the national spans over time, and we use the economic sectoral definitions from the IEA to disaggregate the national total emissions into the specific sectors required by FFDAS. Our results indicate that although the harmonization performed by Macknick generates better agreement among the datasets, significant differences remain at the national total level. For example, the CO2 emission span for most countries ranges from 10% to 12%; BP is generally the highest of the four datasets, while IEA is typically the lowest. The US and China had the highest absolute span values but lower percentage span values compared to other countries, and together they make up nearly one-half of the total global absolute span quantity. The absolute span value for the summation of national differences approaches 1 GtC/year in 2007, almost one-half of the biological "missing sink". The span value is used as a potential bias in a recalculation of global and regional carbon budgets to highlight the importance of fossil fuel CO2 emissions in calculating the missing sink. We conclude that if the harmonized span represents potential bias, calculations of the missing sink through forward budget or inverse approaches may be biased by nearly a factor of two.
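The span statistic itself is simple to compute once the harmonized national totals are tabulated; the snippet below shows absolute and percentage spans for a few hypothetical country rows (all values are invented placeholders, not numbers from the harmonized datasets).

```python
import pandas as pd

# Hypothetical harmonized national totals (MtC/year) from the four datasets.
emissions = pd.DataFrame(
    {"BP": [1600, 1450, 310], "IEA": [1500, 1380, 290],
     "EIA": [1550, 1410, 300], "CDIAC": [1520, 1400, 295]},
    index=["China", "USA", "Germany"])

span = emissions.max(axis=1) - emissions.min(axis=1)   # max - min per country
pct_span = 100 * span / emissions.mean(axis=1)         # span relative to the mean
print(pd.DataFrame({"span_MtC": span, "span_pct": pct_span.round(1)}))
```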
Microbial bebop: creating music from complex dynamics in microbial ecology.
Larsen, Peter; Gilbert, Jack
2013-01-01
In order for society to make effective policy decisions on complex and far-reaching subjects, such as appropriate responses to global climate change, scientists must effectively communicate complex results to the non-scientifically specialized public. However, there are few ways to transform highly complicated scientific data into formats that are engaging to the general community. Taking inspiration from patterns observed in nature and from some of the principles of jazz bebop improvisation, we have generated Microbial Bebop, a method by which microbial environmental data are transformed into music. Microbial Bebop uses meter, pitch, duration, and harmony to highlight the relationships between multiple data types in complex biological datasets. We use a comprehensive microbial ecology time course dataset collected at the L4 marine monitoring station in the Western English Channel as an example of microbial ecological data that can be transformed into music. Four compositions were generated from L4 Station data using Microbial Bebop (www.bio.anl.gov/MicrobialBebop.htm). Each composition, though deriving from the same dataset, is created to highlight different relationships between environmental conditions and microbial community structure. The approach presented here can be applied to a wide variety of complex biological datasets.
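A stripped-down version of the mapping idea is easy to sketch: scale one environmental variable onto a pitch set and a co-varying abundance onto note durations. The pitch set, scaling, and toy data below are our own illustrative assumptions; the actual compositions apply richer rules for meter and harmony.

```python
import numpy as np

temperature = np.array([9.8, 10.4, 12.1, 14.9, 16.2, 15.1, 12.3, 10.0])  # deg C
abundance = np.array([0.12, 0.18, 0.40, 0.75, 0.90, 0.66, 0.30, 0.15])   # relative

c_major = [60, 62, 64, 65, 67, 69, 71, 72]  # one octave of MIDI note numbers

def to_index(x, n):
    """Rescale a series onto integer indices 0..n-1."""
    return np.round((x - x.min()) / (x.max() - x.min()) * (n - 1)).astype(int)

pitches = [c_major[i] for i in to_index(temperature, len(c_major))]
durations = 0.25 + 0.75 * (abundance - abundance.min()) / np.ptp(abundance)

for p, d in zip(pitches, durations):
    print(f"note {p}  duration {d:.2f} beats")
```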
Lardos, Andreas; Heinrich, Michael
2013-10-28
How medicinal plant knowledge changes over time is a question of central importance in modern ethnopharmacological research. However, only a few studies have undertaken a comprehensive exploration of the evolution of plant use in human cultures. In order to understand this dynamic process, we conduct a systematic diachronic investigation to explore continuity and change in two knowledge systems which are closely related but separated in time: historical iatrosophia texts and today's Greek Orthodox monasteries on Cyprus. An ethnobotanical study was conducted in 21 of the island's monasteries involving various types of interviews as well as a written questionnaire survey. Data about medicinal plant use collected in the monasteries were analysed and quantitatively compared to historical iatrosophia texts using data from our pre-existing dataset. We found a core group of plant taxa for which a high consensus exists among the monasteries regarding their medicinal usefulness. Various means and routes of knowledge transmission appear to be involved in the development of this knowledge. The systematic comparison between the monasteries and the iatrosophia shows similarities and differences on various levels. While the plants used by the nuns and monks mostly have a relationship to the iatrosophia and show a remarkable historical consistency in terms of their use for defined groups of ailments, the importance of many of these plants and the use of herbal medicines in general have changed. This is one of the first studies from the Mediterranean region based on a systematic ethnopharmacological analysis involving comprehensive datasets of historical and modern ethnographic data. The example illustrates continuity and change in 'traditional' knowledge as well as the adoption of new knowledge, and provides the opportunity to look beyond the dichotomy between traditional and modern concepts of plant usage. Overall, the study suggests that a systematic diachronic approach can facilitate a better understanding of the complex and dynamic processes involved in the development of medicinal plant knowledge. © 2013 Elsevier Ireland Ltd. All rights reserved.
GAN: a platform of genomics and genetics analysis and application in Nicotiana
Yang, Shuai; Zhang, Xingwei; Li, Huayang; Chen, Yudong
2018-01-01
Nicotiana is an important Solanaceae genus and plays a significant role in modern biological research. Massive Nicotiana biological data have emerged from in-depth genomics and genetics studies. From big data to big discovery, large-scale analysis and application with new platforms are critical. Based on data accumulation, a comprehensive platform of Genomics and Genetics Analysis and Application in Nicotiana (GAN) has been developed, and is publicly available at http://biodb.sdau.edu.cn/gan/. GAN consists of four main sections: (i) Sources, a total of 5267 germplasm lines, along with detailed descriptions of associated characteristics, are all available on the Germplasm page, which can be queried using eight different inquiry modes. Seven fully sequenced species with accompanying sequences and detailed genomic annotation are available on the Genomics page. (ii) Genetics, detailed descriptions of 10 genetic linkage maps constructed from different parents, 2239 KEGG metabolic pathway maps and 209 945 gene families across all catalogued genes, along with two co-linearity maps combining N. tabacum with available tomato and potato linkage maps, are available here. Furthermore, 3 963 119 genome-SSRs, 10 621 016 SNPs, 12 388 PIPs and 102 895 reverse transcription-polymerase chain reaction primers are all available to be used and searched on the Markers page. (iii) Tools, the genome browser JBrowse and five useful online bioinformatics software tools, Blast, Primer3, SSR-detect, Nucl-Protein and E-PCR, are provided on the JBrowse and Tools pages. (iv) Auxiliary, all the datasets are shown on a Statistics page, and are available for download on a Download page. In addition, the user’s manual is provided on a Manual page in English and Chinese. GAN provides a user-friendly Web interface for searching, browsing and downloading the genomics and genetics datasets in Nicotiana. As far as we can ascertain, GAN is the most comprehensive source of bio-data available, and the most applicable resource for breeding, gene mapping, gene cloning, the study of the origin and evolution of polyploidy, and related studies in Nicotiana. Database URL: http://biodb.sdau.edu.cn/gan/ PMID:29688356
The center for expanded data annotation and retrieval
Bean, Carol A; Cheung, Kei-Hoi; Dumontier, Michel; Durante, Kim A; Gevaert, Olivier; Gonzalez-Beltran, Alejandra; Khatri, Purvesh; Kleinstein, Steven H; O’Connor, Martin J; Pouliot, Yannick; Rocca-Serra, Philippe; Sansone, Susanna-Assunta; Wiser, Jeffrey A
2015-01-01
The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical datasets to facilitate data discovery, data interpretation, and data reuse. We take advantage of emerging community-based standard templates for describing different kinds of biomedical datasets, and we investigate the use of computational techniques to help investigators to assemble templates and to fill in their values. We are creating a repository of metadata from which we plan to identify metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the problems of metadata authoring and management that will generalize to other data-management environments. PMID:26112029
A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie.
Hanke, Michael; Baumgartner, Florian J; Ibe, Pierre; Kaule, Falko R; Pollmann, Stefan; Speck, Oliver; Zinke, Wolf; Stadler, Jörg
2014-01-01
Here we present a high-resolution functional magnetic resonance imaging (fMRI) dataset - 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film ("Forrest Gump"). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, and social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures - from stimulus creation to data analysis. In order to facilitate replicative and derived works, only free and open-source software was utilized.
A dataset mapping the potential biophysical effects of vegetation cover change
NASA Astrophysics Data System (ADS)
Duveiller, Gregory; Hooker, Josh; Cescatti, Alessandro
2018-02-01
Changing the vegetation cover of the Earth has impacts on the biophysical properties of the surface and ultimately on the local climate. Depending on the specific type of vegetation change and on the background climate, the resulting competing biophysical processes can have a net warming or cooling effect, which can further vary both spatially and seasonally. Due to uncertain climate impacts and the lack of robust observations, biophysical effects are not yet considered in land-based climate policies. Here we present a dataset based on satellite remote sensing observations that provides the potential changes (i) in the full surface energy balance, (ii) at the global scale, and (iii) for multiple vegetation transitions, as would now be required for the comprehensive evaluation of land-based mitigation plans. We anticipate that this dataset will provide valuable information to benchmark Earth system models, to assess future scenarios of land cover change and to develop the monitoring, reporting and verification guidelines required for the implementation of mitigation plans that account for biophysical land processes.
A dataset mapping the potential biophysical effects of vegetation cover change
Duveiller, Gregory; Hooker, Josh; Cescatti, Alessandro
2018-01-01
Changing the vegetation cover of the Earth has impacts on the biophysical properties of the surface and ultimately on the local climate. Depending on the specific type of vegetation change and on the background climate, the resulting competing biophysical processes can have a net warming or cooling effect, which can further vary both spatially and seasonally. Due to uncertain climate impacts and the lack of robust observations, biophysical effects are not yet considered in land-based climate policies. Here we present a dataset based on satellite remote sensing observations that provides the potential changes (i) in the full surface energy balance, (ii) at the global scale, and (iii) for multiple vegetation transitions, as would now be required for the comprehensive evaluation of land-based mitigation plans. We anticipate that this dataset will provide valuable information to benchmark Earth system models, to assess future scenarios of land cover change and to develop the monitoring, reporting and verification guidelines required for the implementation of mitigation plans that account for biophysical land processes. PMID:29461538
A practical tool for maximal information coefficient analysis.
Albanese, Davide; Riccadonna, Samantha; Donati, Claudio; Franceschi, Pietro
2018-04-01
The ability to find complex associations in large omics datasets, to assess their significance, and to prioritize them according to their strength can be of great help in the data exploration phase. Mutual information-based measures of association are particularly promising, especially after the recent introduction of the TICe and MICe estimators, which combine computational efficiency with superior bias/variance properties. An open-source software implementation of these two measures providing a complete procedure to test their significance would be extremely useful. Here, we present MICtools, a comprehensive and effective pipeline that combines TICe and MICe into a multistep procedure allowing the identification of relationships of various degrees of complexity. MICtools calculates their strength and assesses their statistical significance using a permutation-based strategy. The performance of the proposed approach is assessed by an extensive investigation in synthetic datasets, and an example of a potential application on a metagenomic dataset is also illustrated. We show that MICtools, combining TICe and MICe, is able to highlight associations that would not be captured by conventional strategies.
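A permutation test of the kind MICtools performs can be sketched in a few lines. The snippet assumes the minepy package, whose MINE estimator exposes the MICe statistic via est="mic_e"; if that option is not present in your minepy version, treat this purely as pseudocode for the strategy.

```python
import numpy as np
from minepy import MINE

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.3, 200)  # a nonlinear association

def mic_e(a, b):
    m = MINE(alpha=0.6, c=15, est="mic_e")   # MICe estimator (assumed API)
    m.compute_score(a, b)
    return m.mic()

observed = mic_e(x, y)
# Null distribution: break the pairing by permuting y.
null = np.array([mic_e(x, rng.permutation(y)) for _ in range(999)])
pval = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"MICe = {observed:.3f}, permutation p = {pval:.4f}")
```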
Evaluation of precipitation extremes over the Asian domain: observation and modelling studies
NASA Astrophysics Data System (ADS)
Kim, In-Won; Oh, Jaiho; Woo, Sumin; Kripalani, R. H.
2018-04-01
In this study, a comparison of the precipitation extremes exhibited by seven reference datasets is made to ascertain whether the inferences based on these datasets agree or differ. These seven datasets, roughly grouped into three categories, i.e. rain-gauge based (APHRODITE, CPC-UNI), satellite-based (TRMM, GPCP1DD) and reanalysis based (ERA-Interim, MERRA, and JRA55), share a common data period (1998-2007). The focus is to examine precipitation extremes in the summer monsoon rainfall over South Asia, East Asia and Southeast Asia. Measures of extreme precipitation include percentile thresholds, the frequency of extreme precipitation events and other quantities. Results reveal that the differences in displaying extremes among the datasets are small over South Asia and East Asia, but large differences are displayed over the Southeast Asian region including the maritime continent. Furthermore, precipitation data appear to be more consistent over East Asia among the seven datasets. Decadal trends in extreme precipitation are consistent with known results over South and East Asia. No trends in extreme precipitation events are exhibited over Southeast Asia. Outputs of the Coupled Model Intercomparison Project Phase 5 (CMIP5) simulation data are categorized as high-, medium- and low-resolution models. The regions displaying maximum intensity of extreme precipitation appear to depend on model resolution. High-resolution models simulate maximum intensity of extreme precipitation over the Indian sub-continent, medium-resolution models over northeast India and South China, and the low-resolution models over Bangladesh, Myanmar and Thailand. In summary, there are differences in extreme precipitation statistics among the seven datasets considered here and among the 29 CMIP5 model data outputs.
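One of the simplest measures mentioned above, a wet-day percentile threshold and the count of days exceeding it, can be sketched as follows; the daily series is simulated, not drawn from any of the seven datasets.

```python
import numpy as np

rng = np.random.default_rng(5)
daily = rng.gamma(shape=0.4, scale=12.0, size=3652)  # ~10 years of daily precip (mm)

wet = daily[daily >= 1.0]        # wet days (>= 1 mm), a common convention
p95 = np.percentile(wet, 95)     # extreme threshold from the wet-day climatology

extreme_days = int(np.sum(daily > p95))
print(f"95th wet-day percentile: {p95:.1f} mm; days above it: {extreme_days}")
```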
CLIC, a tool for expanding biological pathways based on co-expression across thousands of datasets
Li, Yang; Liu, Jun S.; Mootha, Vamsi K.
2017-01-01
In recent years, there has been a huge rise in the number of publicly available transcriptional profiling datasets. These massive compendia comprise billions of measurements and provide a special opportunity to predict the function of unstudied genes based on co-expression to well-studied pathways. Such analyses can be very challenging, however, since biological pathways are modular and may exhibit co-expression only in specific contexts. To overcome these challenges we introduce CLIC, CLustering by Inferred Co-expression. CLIC accepts as input a pathway consisting of two or more genes. It then uses a Bayesian partition model to simultaneously partition the input gene set into coherent co-expressed modules (CEMs), while assigning the posterior probability for each dataset in support of each CEM. CLIC then expands each CEM by scanning the transcriptome for additional co-expressed genes, quantified by an integrated log-likelihood ratio (LLR) score weighted for each dataset. As a byproduct, CLIC automatically learns the conditions (datasets) within which a CEM is operative. We implemented CLIC using a compendium of 1774 mouse microarray datasets (28628 microarrays) or 1887 human microarray datasets (45158 microarrays). CLIC analysis reveals that of 910 canonical biological pathways, 30% consist of strongly co-expressed gene modules for which new members are predicted. For example, CLIC predicts a functional connection between protein C7orf55 (FMC1) and the mitochondrial ATP synthase complex that we have experimentally validated. CLIC is freely available at www.gene-clic.org. We anticipate that CLIC will be valuable both for revealing new components of biological pathways as well as the conditions in which they are active. PMID:28719601
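The flavor of dataset-weighted module expansion can be conveyed with a much simpler surrogate: weight each dataset by how coherently the seed module is co-expressed in it, then score candidate genes by their weighted correlation with the module's mean profile. This is a loose simplification of CLIC's Bayesian partition model and integrated LLR score, not the published algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)
datasets = [rng.normal(size=(500, 40)) for _ in range(5)]  # genes x arrays, per dataset
seed = [0, 1, 2, 3]                                        # input pathway gene indices

def module_support(expr, module):
    """Mean pairwise correlation of the module genes within one dataset."""
    c = np.corrcoef(expr[module])
    return c[np.triu_indices(len(module), k=1)].mean()

support = np.array([max(module_support(d, seed), 0.0) for d in datasets])
weights = support / support.sum() if support.sum() > 0 else support

def candidate_score(g):
    """Dataset-weighted correlation of gene g with the mean module profile."""
    return float(sum(w * np.corrcoef(d[g], d[seed].mean(axis=0))[0, 1]
                     for w, d in zip(weights, datasets)))

scores = {g: candidate_score(g) for g in range(500) if g not in seed}
print("top candidates:", sorted(scores, key=scores.get, reverse=True)[:5])
```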
ERIC Educational Resources Information Center
Wine, Jennifer; Bryan, Michael; Siegel, Peter
2013-01-01
The National Postsecondary Student Aid Study (NPSAS) helps fulfill the U.S. Department of Education's National Center for Education Statistics (NCES) mandate to collect, analyze, and publish statistics related to education. The purpose of NPSAS is to compile a comprehensive research dataset, based on student-level records, on financial aid…
The Impact of Private Schools on Educational Attainment in the State of São Paulo
ERIC Educational Resources Information Center
Stern, Jonathan M. B.
2015-01-01
This study uses a comprehensive dataset on secondary school students in Brazil to examine the impact of private school enrollment on educational attainment in São Paulo. The results show that private school students (across all levels of tuition) perform better than their public school counterparts on Brazil's high school exit exam, even after…
ERIC Educational Resources Information Center
Zhang, Yu; Zhou, Xuehan
2017-01-01
The purpose of this study was to examine the effect of household education expenditure on National College Entrance Exam (NCEE) performance in China. Using a comprehensive dataset with a sample size of 5840 students collected in Jinan, China, this study found that the average effect of household education expenditure on NCEE performance is not…
Evaluation of reference evapotranspiration methods in arid, semiarid, and humid regions
Fei Gao; Gary Feng; Ying Ouyang; Huixiao Wang; Daniel Fisher; Ardeshir Adeli; Johnie Jenkins
2017-01-01
It is often necessary to find simpler methods for calculating reference crop evapotranspiration (ETo) in different climatic regions, since application of the FAO-56 Penman-Monteith method is often restricted by the unavailability of a comprehensive weather dataset. Seven ETo methods, namely the standard FAO-56 Penman-Monteith, the FAO-24 Radiation, FAO-24 Blaney...
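For reference, the standard FAO-56 Penman-Monteith equation that the six simpler methods are compared against can be written as a short function for daily data; the example inputs are illustrative values, not data from the study.

```python
import math

def eto_fao56_pm(t_mean, rn, u2, es, ea, g=0.0, elev=0.0):
    """Daily reference evapotranspiration (mm/day) after FAO-56.
    t_mean: air temperature (deg C); rn: net radiation (MJ m-2 day-1);
    u2: wind speed at 2 m (m/s); es/ea: saturation/actual vapour pressure (kPa);
    g: soil heat flux (MJ m-2 day-1); elev: station elevation (m)."""
    p = 101.3 * ((293.0 - 0.0065 * elev) / 293.0) ** 5.26  # atmospheric pressure (kPa)
    gamma = 0.000665 * p                                   # psychrometric constant
    delta = (4098.0 * 0.6108 * math.exp(17.27 * t_mean / (t_mean + 237.3))
             / (t_mean + 237.3) ** 2)                      # slope of saturation curve
    num = 0.408 * delta * (rn - g) + gamma * 900.0 / (t_mean + 273.0) * u2 * (es - ea)
    return num / (delta + gamma * (1.0 + 0.34 * u2))

print(f"ETo = {eto_fao56_pm(t_mean=25, rn=14, u2=2.0, es=3.17, ea=2.06):.2f} mm/day")
```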
Chattree, A; Barbour, J A; Thomas-Gibson, S; Bhandari, P; Saunders, B P; Veitch, A M; Anderson, J; Rembacken, B J; Loughrey, M B; Pullan, R; Garrett, W V; Lewis, G; Dolwani, S; Rutter, M D
2017-01-01
The management of large non-pedunculated colorectal polyps (LNPCPs) is complex, with widespread variation in management and outcome, even amongst experienced clinicians. Variations in the assessment and decision-making processes are likely to be a major factor in this variability. The creation of a standardized minimum dataset to aid decision-making may therefore result in improved clinical management. An official working group of 13 multidisciplinary specialists was appointed by the Association of Coloproctology of Great Britain and Ireland (ACPGBI) and the British Society of Gastroenterology (BSG) to develop a minimum dataset on LNPCPs. The literature review used to structure the ACPGBI/BSG guidelines for the management of LNPCPs was used by a steering subcommittee to identify various parameters pertaining to the decision-making processes in the assessment and management of LNPCPs. A modified Delphi consensus process was then used for voting on proposed parameters over multiple voting rounds with at least 80% agreement defined as consensus. The minimum dataset was used in a pilot process to ensure rigidity and usability. A 23-parameter minimum dataset with parameters relating to patient and lesion factors, including six parameters relating to image retrieval, was formulated over four rounds of voting with two pilot processes to test rigidity and usability. This paper describes the development of the first reported evidence-based and expert consensus minimum dataset for the management of LNPCPs. It is anticipated that this dataset will allow comprehensive and standardized lesion assessment to improve decision-making in the assessment and management of LNPCPs. Colorectal Disease © 2016 The Association of Coloproctology of Great Britain and Ireland.
Comparing apples and oranges: the Community Intercomparison Suite
NASA Astrophysics Data System (ADS)
Schutgens, Nick; Stier, Philip; Kershaw, Philip; Pascoe, Stephen
2015-04-01
Visual representation and comparison of geoscientific datasets presents a huge challenge due to the large variety of file formats and spatio-temporal sampling of data (be they observations or simulations). The Community Intercomparison Suite attempts to greatly simplify these tasks for users by offering an intelligent but simple command line tool for visualisation and colocation of diverse datasets. In addition, CIS can subset and aggregate large datasets into smaller more manageable datasets. Our philosophy is to remove as much as possible the need for specialist knowledge by the user of the structure of a dataset. The colocation of observations with model data is as simple as: "cis col
Modelling land cover change in the Ganga basin
NASA Astrophysics Data System (ADS)
Moulds, S.; Tsarouchi, G.; Mijic, A.; Buytaert, W.
2013-12-01
Over recent decades the green revolution in India has driven substantial environmental change. Modelling experiments have identified northern India as a 'hot spot' of land-atmosphere coupling strength during the boreal summer. However, there is a wide range of sensitivity of atmospheric variables to soil moisture between individual climate models. The lack of a comprehensive land cover change dataset to force climate models has been identified as a major contributor to model uncertainty. In this work a time series dataset of land cover change between 1970 and 2010 is constructed for northern India to improve the quantification of regional hydrometeorological feedbacks. The MODIS instrument on board the Aqua and Terra satellites provides near-continuous remotely sensed datasets from 2000 to the present day. However, the quality of satellite products before 2000 is poor. To complete the dataset MODIS images are extrapolated back in time using the Conversion of Land Use and its Effects at small regional extent (CLUE-s) modelling framework. Non-spatial estimates of land cover area from national agriculture and forest statistics, available on a state-wise, annual basis, are used as a direct model input. Land cover change is allocated spatially as a function of biophysical and socioeconomic drivers identified using logistic regression. This dataset will provide an essential input to a high resolution, physically based land surface model to generate the lower boundary condition to assess the impact of land cover change on regional climate.
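The driver-identification step can be sketched with scikit-learn: fit a logistic regression of observed conversion against candidate drivers, then use the predicted probabilities as the suitability surface that the allocation model consumes. The drivers, coefficients, and data below are invented for illustration (this is not the CLUE-s code).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5000
drivers = np.column_stack([
    rng.normal(500, 200, n),    # elevation (m)
    rng.normal(1000, 300, n),   # annual rainfall (mm)
    rng.exponential(20, n),     # distance to road (km)
])
# Simulate conversion to cropland as a function of rainfall and road access.
logit = 0.002 * drivers[:, 1] - 0.03 * drivers[:, 2] - 2.0
converted = rng.random(n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression(max_iter=1000).fit(drivers, converted)
suitability = model.predict_proba(drivers)[:, 1]   # probability surface for allocation
print("most convertible cells:", np.argsort(suitability)[::-1][:5])
```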
Technical note: Space-time analysis of rainfall extremes in Italy: clues from a reconciled dataset
NASA Astrophysics Data System (ADS)
Libertino, Andrea; Ganora, Daniele; Claps, Pierluigi
2018-05-01
Like other Mediterranean areas, Italy is prone to the development of events with significant rainfall intensity, lasting for several hours. The main triggering mechanisms of these events are quite well known, but the aim of developing rainstorm hazard maps compatible with their actual probability of occurrence is still far from being reached. A systematic frequency analysis of these occasional highly intense events would require a complete countrywide dataset of sub-daily rainfall records, but this kind of information was still lacking for the Italian territory. In this work several sources of data are gathered, for assembling the first comprehensive and updated dataset of extreme rainfall of short duration in Italy. The resulting dataset, referred to as the Italian Rainfall Extreme Dataset (I-RED), includes the annual maximum rainfalls recorded in 1 to 24 consecutive hours from more than 4500 stations across the country, spanning the period between 1916 and 2014. A detailed description of the spatial and temporal coverage of the I-RED is presented, together with an exploratory statistical analysis aimed at providing preliminary information on the climatology of extreme rainfall at the national scale. Due to some legal restrictions, the database can be provided only under certain conditions. Taking into account the potentialities emerging from the analysis, a description of the ongoing and planned future work activities on the database is provided.
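As an example of the frequency analysis such a dataset enables, the sketch below fits a GEV distribution to simulated annual maximum 1-hour rainfalls and reads off a 100-year return level; the numbers are synthetic, not I-RED values.

```python
import numpy as np
from scipy import stats

# Simulated 80-year series of annual maximum 1-hour rainfall (mm).
ann_max_1h = stats.gumbel_r.rvs(loc=30, scale=10, size=80, random_state=1)

shape, loc, scale = stats.genextreme.fit(ann_max_1h)   # GEV maximum-likelihood fit
t = 100                                                # return period in years
level = stats.genextreme.ppf(1 - 1 / t, shape, loc=loc, scale=scale)
print(f"estimated {t}-year 1-hour rainfall: {level:.1f} mm")
```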
Evolution of organogenesis and the origin of altriciality in mammals.
Werneburg, Ingmar; Laurin, Michel; Koyabu, Daisuke; Sánchez-Villagra, Marcelo R
2016-07-01
Mammals feature not only great phenotypic disparity, but also diverse growth and life history patterns, especially in maturity level at birth, ranging from altriciality to precocity. Gestation length, morphology at birth, and other markers of life history are fundamental to our understanding of mammalian evolution. Based on the first synthesis of embryological data and the study of new ontogenetic series, we reconstructed estimates of the ancestral chronology of organogenesis and life-history modes in placental mammals. We found that the ancestor of marsupial and placental mammals was placental-like at birth but had a long, marsupial-like infancy. We hypothesize that mammalian viviparity might have evolved in association with the extension of growth after birth, enabled through lactation, and that mammalian altriciality is inherited from the earliest amniotes. The precocial lifestyle of extant sauropsids and that of many placental mammals were acquired secondarily. We base our conclusions on the best estimates and provide a comprehensive discussion on the methods used and the limitations of our dataset. We provide the most comprehensive embryological dataset ever published, "rescue" old literature sources, and apply available methods, thereby illustrating an approach for the comparative investigation of organogenesis in macroevolution. © 2016 Wiley Periodicals, Inc.
The FaceBase Consortium: a comprehensive resource for craniofacial researchers
Brinkley, James F.; Fisher, Shannon; Harris, Matthew P.; Holmes, Greg; Hooper, Joan E.; Wang Jabs, Ethylin; Jones, Kenneth L.; Kesselman, Carl; Klein, Ophir D.; Maas, Richard L.; Marazita, Mary L.; Selleri, Licia; Spritz, Richard A.; van Bakel, Harm; Visel, Axel; Williams, Trevor J.; Wysocka, Joanna
2016-01-01
The FaceBase Consortium, funded by the National Institute of Dental and Craniofacial Research, National Institutes of Health, is designed to accelerate understanding of craniofacial developmental biology by generating comprehensive data resources to empower the research community, exploring high-throughput technology, fostering new scientific collaborations among researchers and human/computer interactions, facilitating hypothesis-driven research and translating science into improved health care to benefit patients. The resources generated by the FaceBase projects include a number of dynamic imaging modalities, genome-wide association studies, software tools for analyzing human facial abnormalities, detailed phenotyping, anatomical and molecular atlases, global and specific gene expression patterns, and transcriptional profiling over the course of embryonic and postnatal development in animal models and humans. The integrated data visualization tools, faceted search infrastructure, and curation provided by the FaceBase Hub offer flexible and intuitive ways to interact with these multidisciplinary data. In parallel, the datasets also offer unique opportunities for new collaborations and training for researchers coming into the field of craniofacial studies. Here, we highlight the focus of each spoke project and the integration of datasets contributed by the spokes to facilitate craniofacial research. PMID:27287806
DoOR 2.0 - Comprehensive Mapping of Drosophila melanogaster Odorant Responses
NASA Astrophysics Data System (ADS)
Münch, Daniel; Galizia, C. Giovanni
2016-02-01
Odors elicit complex patterns of activated olfactory sensory neurons. Knowing the complete olfactome, i.e. the responses in all sensory neurons for all relevant odorants, is desirable to understand olfactory coding. The DoOR project combines all available Drosophila odorant response data into a single consensus response matrix. Since its first release, many studies have been published: receptors have been deorphanized and several response profiles expanded. In this study, we add unpublished data to the odor-response profiles for four odorant receptors (Or10a, Or42b, Or47b, Or56a). We deorphanize Or69a, showing a broad response spectrum with the best ligands including 3-hydroxyhexanoate, alpha-terpineol, 3-octanol and linalool. We include all of these datasets in DoOR, provide a comprehensive update of both code and data, and add new tools for data analyses and visualizations. The DoOR project has a web interface for quick queries (http://neuro.uni.kn/DoOR) and a downloadable, open source toolbox written in R, including all processed and original datasets. DoOR now gives reliable odorant responses for nearly all responding olfactory units in Drosophila, listing 693 odorants, for a total of 7381 data points.
Müllenbroich, M Caroline; Silvestri, Ludovico; Onofri, Leonardo; Costantini, Irene; Hoff, Marcel Van't; Sacconi, Leonardo; Iannello, Giulio; Pavone, Francesco S
2015-10-01
Comprehensive mapping and quantification of neuronal projections in the central nervous system requires high-throughput imaging of large volumes with microscopic resolution. To this end, we have developed a confocal light-sheet microscope that has been optimized for three-dimensional (3-D) imaging of structurally intact clarified whole-mount mouse brains. We describe the optical and electromechanical arrangement of the microscope and give details on the organization of the microscope management software. The software orchestrates all components of the microscope, coordinates critical timing and synchronization, and has been written in a versatile and modular structure using the LabVIEW language. It can easily be adapted and integrated to other microscope systems and has been made freely available to the light-sheet community. The tremendous amount of data routinely generated by light-sheet microscopy further requires novel strategies for data handling and storage. To complete the full imaging pipeline of our high-throughput microscope, we further elaborate on big data management from streaming of raw images up to stitching of 3-D datasets. The mesoscale neuroanatomy imaged at micron-scale resolution in those datasets allows characterization and quantification of neuronal projections in unsectioned mouse brains.
Modern data science for analytical chemical data - A comprehensive review.
Szymańska, Ewa
2018-10-22
Efficient and reliable analysis of chemical analytical data is a great challenge due to the increase in data size, variety and velocity. New methodologies, approaches and methods are being proposed not only by chemometrics but also by other data science communities to extract relevant information from big datasets and provide their value to different applications. Besides the common goal of big data analysis, different perspectives and terms on big data are being discussed in the scientific literature and public media. The aim of this comprehensive review is to present common trends in the analysis of chemical analytical data across different data scientific fields, together with their data type-specific and generic challenges. Firstly, common data science terms used in different data scientific fields are summarized and discussed. Secondly, systematic methodologies to plan and run big data analysis projects are presented together with their steps. Moreover, different analysis aspects like assessing data quality, selecting data pre-processing strategies, data visualization and model validation are considered in more detail. Finally, an overview of standard and new data analysis methods is provided and their suitability for big analytical chemical datasets briefly discussed. Copyright © 2018 Elsevier B.V. All rights reserved.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-07-01
UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows us to significantly improve on current protein family clusterings, which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools, will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.
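For contrast with the memory-constrained variant, a standard in-memory UPGMA run takes a few lines with SciPy; it needs the full condensed dissimilarity matrix, which is exactly what becomes prohibitive for millions of sequences. The feature vectors here are random stand-ins.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(9)
points = rng.normal(size=(100, 16))       # stand-ins for sequence feature vectors

dists = pdist(points)                     # full condensed all-pairs distance matrix
tree = linkage(dists, method="average")   # UPGMA = average linkage
labels = fcluster(tree, t=4, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```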
Ecosystem-based management of the Laurentian Great Lakes, which spans both the United States and Canada, is hampered by the lack of consistent binational watersheds for the entire Basin. Using comparable data sources and consistent methods we developed spatially equivalent waters...
Egidi, Giovanna; Caramazza, Alfonso
2016-10-01
This research studies the neural systems underlying two integration processes that take place during natural discourse comprehension: consistency evaluation and passive comprehension. Evaluation was operationalized with a consistency judgment task and passive comprehension with a passive listening task. Using fMRI, the experiment examined the integration of incoming sentences with more recent, local context and with more distal, global context in these two tasks. The stimuli were stories in which we manipulated the consistency of the endings with the local context and the relevance of the global context for the integration of the endings. A whole-brain analysis revealed several differences between the two tasks. Two networks previously associated with semantic processing and attention orienting showed more activation during the judgment than the passive listening task. A network previously associated with episodic memory retrieval and construction of mental scenes showed greater activity when global context was relevant, but only during the judgment task. This suggests that evaluation, more than passive listening, triggers the reinstantiation of global context and the construction of a rich mental model for the story. Finally, a network previously linked to fluent updating of a knowledge base showed greater activity for locally consistent endings than inconsistent ones, but only during passive listening, suggesting a mode of comprehension that relies on a local scope approach to language processing. Taken together, these results show that consistency evaluation and passive comprehension weigh differently on distal and local information and are implemented, in part, by different brain networks.
Mehrabi, Saeed; Krishnan, Anand; Roch, Alexandra M; Schmidt, Heidi; Li, DingCheng; Kesterson, Joe; Beesley, Chris; Dexter, Paul; Schmidt, Max; Palakal, Mathew; Liu, Hongfang
2015-01-01
In this study we developed a rule-based natural language processing (NLP) system to identify patients with a family history of pancreatic cancer. The algorithm was developed in an Unstructured Information Management Architecture (UIMA) framework and consisted of section segmentation, relation discovery, and negation detection. The system was evaluated on data from two institutions. The family history identification precision was consistent across the institutions, shifting from 88.9% on the Indiana University (IU) dataset to 87.8% on the Mayo Clinic dataset. Customizing the algorithm on the Mayo Clinic data increased its precision to 88.1%. The family member relation discovery achieved precision, recall, and F-measure of 75.3%, 91.6% and 82.6%, respectively. Negation detection resulted in a precision of 99.1%. The results show that rule-based NLP approaches for specific information extraction tasks are portable across institutions; however, customization of the algorithm on the new dataset improves its performance.
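A toy version of the rule-based pattern conveys the idea: find family-history mentions of pancreatic cancer, extract the relative, and flag negation. The real system is UIMA-based with section segmentation; the regexes and the example note below are illustrative only.

```python
import re

FAMILY = r"(mother|father|brother|sister|aunt|uncle|grandmother|grandfather)"
MENTION = re.compile(FAMILY + r"[^.]*?\bpancreatic cancer\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|denies|negative for|without)\b", re.IGNORECASE)

def family_history(note):
    hits = []
    for sent in re.split(r"(?<=[.])\s+", note):      # naive sentence splitter
        m = MENTION.search(sent)
        if m:
            hits.append({"relative": m.group(1).lower(),
                         "negated": bool(NEGATION.search(sent))})
    return hits

note = ("Family history: Mother with pancreatic cancer at age 62. "
        "Patient denies that his brother had pancreatic cancer.")
print(family_history(note))  # first hit affirmed, second negated
```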
Analysis Of The IJCNN 2011 UTL Challenge
2012-01-13
large datasets from various application domains: handwriting recognition, image recognition, video processing, text processing, and ecology. The goal... validation and final evaluation sets consist of 4096 examples each. [Flattened table fragment; recoverable columns: Dataset, Domain, Features, Sparsity, Devel., Transf. - first row: AVICENNA, Handwriting, 120, 0%, 150205] ...documents [3]. Transfer learning methods could accelerate the application of handwriting recognizers to historical manuscripts by reducing the need
1999-01-01
The National Elevation Dataset (NED) is a new raster product assembled by the U.S. Geological Survey (USGS). The NED is designed to provide national elevation data in a seamless form with a consistent datum, elevation unit, and projection. Data corrections were made in the NED assembly process to minimize artifacts, permit edge matching, and fill sliver areas of missing data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mardirossian, Narbe; Head-Gordon, Martin
Benchmark datasets of non-covalent interactions are essential for assessing the performance of density functionals and other quantum chemistry approaches. In a recent blind test, Taylor et al. benchmarked 14 methods on a new dataset consisting of 10 dimer potential energy curves calculated using coupled cluster with singles, doubles, and perturbative triples (CCSD(T)) at the complete basis set (CBS) limit (80 data points in total). The dataset is particularly interesting because compressed, near-equilibrium, and stretched regions of the potential energy surface are extensively sampled.
Mardirossian, Narbe; Head-Gordon, Martin
2016-11-09
Benchmark datasets of non-covalent interactions are essential for assessing the performance of density functionals and other quantum chemistry approaches. In a recent blind test, Taylor et al. benchmarked 14 methods on a new dataset consisting of 10 dimer potential energy curves calculated using coupled cluster with singles, doubles, and perturbative triples (CCSD(T)) at the complete basis set (CBS) limit (80 data points in total). The dataset is particularly interesting because compressed, near-equilibrium, and stretched regions of the potential energy surface are extensively sampled.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brady Raap, Michaele C.; Lyons, Jennifer A.; Collins, Brian A.
This report documents the FY13 efforts to enhance a dataset of spent nuclear fuel isotopic composition data for use in developing intrinsic signatures for nuclear forensics. A review and collection of data from the open literature was performed in FY10. In FY11, the Spent Fuel COMPOsition (SFCOMPO) Excel-based dataset for nuclear forensics (NF), SFCOMPO/NF, was established; measured data for graphite production reactors, Boiling Water Reactors (BWRs) and Pressurized Water Reactors (PWRs) were added to the dataset, which was expanded to include a consistent set of data simulated by calculations. A test was performed to determine whether the SFCOMPO/NF dataset will be useful for the analysis and identification of reactor types from isotopic ratios observed in interdicted samples.
Prediction of beta-turns with learning machines.
Cai, Yu-Dong; Liu, Xiao-Jun; Li, Yi-Xue; Xu, Xue-biao; Chou, Kuo-Chen
2003-05-01
The support vector machine approach was introduced to predict beta-turns in proteins. The overall self-consistency rate by the re-substitution test for the training or learning dataset reached 100%. Both the training dataset and the independent testing dataset were taken from Chou [J. Pept. Res. 49 (1997) 120]. The prediction success rates by the jackknife test for the beta-turn subset of 455 tetrapeptides and the non-beta-turn subset of 3807 tetrapeptides in the training dataset were 58.1% and 98.4%, respectively. The success rates with the independent dataset test for the beta-turn subset of 110 tetrapeptides and the non-beta-turn subset of 30,231 tetrapeptides were 69.1% and 97.3%, respectively. The results obtained from this study support the conclusion that the residue-coupled effect along a tetrapeptide is important for the formation of a beta-turn.
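A sketch of SVM classification of tetrapeptides with a one-hot residue encoding is shown below; the original study's exact encoding and kernel are not specified here, so treat this purely as an illustration of the setup.

```python
import numpy as np
from sklearn.svm import SVC

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def encode(tetrapeptide):
    """One-hot encode a 4-residue peptide into an 80-dimensional vector."""
    v = np.zeros(4 * len(AMINO))
    for i, aa in enumerate(tetrapeptide):
        v[i * len(AMINO) + AMINO.index(aa)] = 1.0
    return v

# Tiny invented training set: (sequence, is_beta_turn).
data = [("PGDS", 1), ("NGKT", 1), ("DGNS", 1), ("AVIL", 0), ("LLVA", 0), ("IVLF", 0)]
X = np.array([encode(s) for s, _ in data])
y = np.array([label for _, label in data])

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print("prediction for SGPN:", clf.predict([encode("SGPN")])[0])
```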
Earth-Science Data Co-Locating Tool
NASA Technical Reports Server (NTRS)
Lee, Seungwon; Pan, Lei; Block, Gary L.
2012-01-01
This software is used to locate Earth-science satellite data and climate-model analysis outputs in space and time. This enables the direct comparison of any set of data with different spatial and temporal resolutions. It is written as three modules with clearly separated functionality and interfaces, which enables fast development of support for any new dataset. In this updated version of the tool, several new front ends were developed for new products. This software finds co-locatable data pairs for given sets of data products and creates new data products that share the same spatial and temporal coordinates. This facilitates the direct comparison between two heterogeneous datasets and the comprehensive and synergistic use of the datasets.
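A minimal sketch of the co-location idea: for each record of one product, find a record of the other product within assumed time and distance tolerances, yielding directly comparable value pairs on shared coordinates. All records and tolerances below are hypothetical.

```python
import numpy as np

# Hypothetical records: (time [hours], lat, lon, value) for two products
# with different spatial and temporal resolutions.
sat = np.array([(0.0, 10.2, 20.1, 5.0), (3.1, 11.0, 20.5, 6.2)])
model = np.array([(0.5, 10.0, 20.0, 4.8), (2.9, 11.1, 20.4, 6.0),
                  (9.0, 50.0, 60.0, 1.0)])

MAX_DT_H, MAX_DIST_DEG = 1.0, 0.5  # co-location tolerances (assumed values)

pairs = []
for t, lat, lon, v in sat:
    # Find the model record closest in time that also falls inside the
    # spatial tolerance; both products end up on shared coordinates.
    dt = np.abs(model[:, 0] - t)
    dist = np.hypot(model[:, 1] - lat, model[:, 2] - lon)
    ok = (dt <= MAX_DT_H) & (dist <= MAX_DIST_DEG)
    if ok.any():
        j = np.argmin(np.where(ok, dt, np.inf))
        pairs.append((v, model[j, 3]))

print(pairs)  # directly comparable (satellite, model) value pairs
```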
NASA Astrophysics Data System (ADS)
Klosik, David F.; Bornholdt, Stefan; Hütt, Marc-Thorsten
2014-09-01
Following the work of Krumov et al. [Eur. Phys. J. B 84, 535 (2011), 10.1140/epjb/e2011-10746-5], we revisit the question of whether large citation datasets allow for the quantitative assessment of social influence (by means of coauthorship of publications) on the progression of science. Applying a more comprehensive and well-curated dataset containing the publications in the journals of the American Physical Society during the whole 20th century, we find that the measure chosen in the original study, a score based on small induced subgraphs, has to be used with caution, since the obtained results are highly sensitive to the exact implementation of the author disambiguation task.
Thompson, Allyson L.; Hubbard, Bernard E.
2014-01-01
This report summarizes the application of dasymetric methods for mapping the distribution of population throughout Afghanistan. Because Afghanistan's population has constantly changed through decades of war and conflict, existing vector and raster GIS datasets (such as point settlement densities and intensities of lights at night) do not adequately reflect the changes. The purposes of this report are (1) to provide historic population data at the provincial and district levels that can be used to chart population growth and migration trends within the country and (2) to provide baseline information that can be used for other types of spatial analyses of Afghanistan, such as resource and hazard assessments; infrastructure and capacity rebuilding; and assisting with international, regional, and local planning.
NASA Astrophysics Data System (ADS)
Ostrenga, D.; Liu, Z.; Teng, W. L.; Trivedi, B.; Kempler, S.
2011-12-01
The NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) is the home of global precipitation product archives, in particular the Tropical Rainfall Measuring Mission (TRMM) products. TRMM is a joint U.S.-Japan satellite mission to monitor tropical and subtropical (40° S-40° N) precipitation and to estimate its associated latent heating. The TRMM satellite provides the first detailed and comprehensive dataset on the four-dimensional distribution of rainfall and latent heating over vastly undersampled tropical and subtropical oceans and continents. The TRMM satellite was launched on November 27, 1997. TRMM data products are archived at and distributed by GES DISC. The newly released TRMM Version 7 introduces several changes, including new parameters, new products, metadata, and data structures. For example, hydrometeor profiles in 2A12 now have 28 layers (14 in V6). New parameters have been added to several popular Level-3 products, such as 3B42 and 3B43. Version 2.2 of the Global Precipitation Climatology Project (GPCP) dataset has been added to the TRMM Online Visualization and Analysis System (TOVAS; URL: http://disc2.nascom.nasa.gov/Giovanni/tovas/), allowing online analysis and visualization without downloading data and software. The GPCP dataset extends back to 1979. Results of a basic intercomparison between the new and the previous versions of both TRMM and GPCP will be presented to help understand changes in data product characteristics. To facilitate data and information access and support precipitation research and applications, we have developed a Precipitation Data and Information Services Center (PDISC; URL: http://disc.gsfc.nasa.gov/precipitation). In addition to TRMM, PDISC provides current and past observational precipitation data. Users can access precipitation data archives consisting of both remote sensing and in-situ observations. Users can use these data products to conduct a wide variety of activities, including case studies, model evaluation, and uncertainty investigation. To support Earth science applications, PDISC provides users with near-real-time precipitation products over the Internet. At PDISC, users can access tools and software. Documentation, FAQ, and assistance are also available. Other capabilities include: 1) Mirador (http://mirador.gsfc.nasa.gov/), a simplified interface for searching, browsing, and ordering Earth science data at the GES DISC; Mirador is designed to be fast and easy to learn; 2) TOVAS; 3) NetCDF data download for the GIS community; 4) data via OPeNDAP (http://disc.sci.gsfc.nasa.gov/services/opendap/), which provides remote access to individual variables within datasets in a form usable by many tools, such as IDV, McIDAS-V, Panoply, Ferret, and GrADS; 5) the Open Geospatial Consortium (OGC) Web Map Service (WMS) (http://disc.sci.gsfc.nasa.gov/services/wxs_ogc.shtml), an interface that allows the use of data and enables clients to build customized maps with data coming from a different network. More details along with examples will be presented.
NASA Astrophysics Data System (ADS)
Beck, H.; Vergopolan, N.; Pan, M.; Levizzani, V.; van Dijk, A.; Weedon, G. P.; Brocca, L.; Huffman, G. J.; Wood, E. F.; William, L.
2017-12-01
We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000-2016. Twelve non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76,086 gauges worldwide. Another ten gauge-corrected ones were evaluated using hydrological modeling, by calibrating the conceptual model HBV against streamflow records for each of 9053 small to medium-sized (<50,000 km2) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR), the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed those indirectly incorporating gauge data through other multi-source datasets (PERSIANN-CDR V1R1 and PGF). Our results highlight large differences in estimation accuracy, and hence, the importance of P dataset selection in both research and operational applications. The good performance of MSWEP emphasizes that careful data merging can exploit the complementary strengths of gauge-, satellite- and reanalysis-based P estimates.
NASA Astrophysics Data System (ADS)
Beck, Hylke E.; Vergopolan, Noemi; Pan, Ming; Levizzani, Vincenzo; van Dijk, Albert I. J. M.; Weedon, Graham P.; Brocca, Luca; Pappenberger, Florian; Huffman, George J.; Wood, Eric F.
2017-12-01
We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000-2016. Thirteen non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76 086 gauges worldwide. Another nine gauge-corrected datasets were evaluated using hydrological modeling, by calibrating the HBV conceptual model against streamflow records for each of 9053 small to medium-sized ( < 50 000 km2) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR) and the satellite- and reanalysis-based CHIRP V2.0 dataset, the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified, and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed the one indirectly incorporating gauge data through another multi-source dataset (PERSIANN-CDR V1R1). Our results highlight large differences in estimation accuracy, and hence the importance of P dataset selection in both research and operational applications. The good performance of MSWEP emphasizes that careful data merging can exploit the complementary strengths of gauge-, satellite-, and reanalysis-based P estimates.
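For the non-gauge-corrected datasets, the ranking above rests on per-gauge temporal correlations. Below is a minimal sketch with synthetic daily series; the gauge and product values are randomly generated, not from the study.

```python
import numpy as np

# Hypothetical daily series for one gauge and two precipitation products.
rng = np.random.default_rng(2)
gauge = rng.gamma(0.5, 4.0, size=365)
products = {
    "product_A": gauge * rng.uniform(0.7, 1.3, 365),  # tracks the gauge
    "product_B": rng.gamma(0.5, 4.0, size=365),       # unrelated noise
}

# Temporal (Pearson) correlation against the gauge, the statistic used to
# compare the uncorrected datasets above; repeat per gauge and aggregate.
for name, series in products.items():
    r = np.corrcoef(gauge, series)[0, 1]
    print(f"{name}: r = {r:.2f}")
```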
A Benchmark Dataset for SSVEP-Based Brain-Computer Interfaces.
Wang, Yijun; Chen, Xiaogang; Gao, Xiaorong; Gao, Shangkai
2017-10-01
This paper presents a benchmark steady-state visual evoked potential (SSVEP) dataset acquired with a 40-target brain-computer interface (BCI) speller. The dataset consists of 64-channel electroencephalogram (EEG) data from 35 healthy subjects (8 experienced and 27 naïve) recorded while they performed a cue-guided target selection task. The virtual keyboard of the speller was composed of 40 visual flickers, which were coded using a joint frequency and phase modulation (JFPM) approach. The stimulation frequencies ranged from 8 Hz to 15.8 Hz with an interval of 0.2 Hz. The phase difference between two adjacent frequencies was 0.5 π. For each subject, the data included six blocks of 40 trials corresponding to all 40 flickers indicated by a visual cue in a random order. The stimulation duration in each trial was five seconds. The dataset can be used as a benchmark to compare methods for stimulus coding and target identification in SSVEP-based BCIs. Through offline simulation, the dataset can be used to design new system diagrams and evaluate their BCI performance without collecting any new data. The dataset also provides high-quality data for computational modeling of SSVEPs. The dataset is freely available from http://bci.med.tsinghua.edu.cn/download.html.
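The JFPM coding table is easy to reproduce: target k flickers at 8.0 + 0.2k Hz with phase 0.5πk. A short sketch that builds the table and renders one flicker as a sampled sinusoid; the 60 Hz refresh rate is an assumption about the display.

```python
import numpy as np

# 40-target JFPM coding: frequencies 8.0-15.8 Hz in 0.2 Hz steps, with a
# 0.5*pi phase step between adjacent frequencies, as described above.
n_targets, f0, df, dphi = 40, 8.0, 0.2, 0.5 * np.pi
freqs = f0 + df * np.arange(n_targets)
phases = (dphi * np.arange(n_targets)) % (2 * np.pi)

# One five-second trial of target 0 rendered at an assumed 60 Hz refresh:
# luminance follows a sampled sinusoid, the usual way SSVEP flickers are drawn.
t = np.arange(0, 5.0, 1 / 60)
luminance = 0.5 * (1 + np.sin(2 * np.pi * freqs[0] * t + phases[0]))
print(freqs[:5], phases[:5], luminance[:3])
```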
I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chard, Kyle; D'Arcy, Mike; Heavner, Benjamin D.
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified. We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.
The Development of a Noncontact Letter Input Interface “Fingual” Using Magnetic Dataset
NASA Astrophysics Data System (ADS)
Fukushima, Taishi; Miyazaki, Fumio; Nishikawa, Atsushi
We have developed a noncontact letter input interface called “Fingual”. Fingual uses a glove fitted with inexpensive and small magnetic sensors. Using the glove, users can input letters by forming the finger alphabet, a kind of sign language. The proposed method uses a dataset consisting of magnetic field measurements and the corresponding letter information. In this paper, we show two recognition methods using the dataset. The first method uses the Euclidean norm; the second additionally uses a Gaussian function as a weighting function. We then conducted verification experiments on the recognition rate of each method in two situations: in one, subjects used their own dataset; in the other, they used another person's dataset. As a result, the proposed method could recognize letters at a high rate in both situations, even though it is better to use one's own dataset than another person's. Though Fingual needs to collect a magnetic dataset for each letter in advance, its feature is the ability to recognize letters without complicated calculations such as inverse problems. This paper shows the results of the recognition experiments and demonstrates the utility of the proposed system “Fingual”.
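A minimal sketch of the two recognition rules, assuming each letter is represented by one stored magnetic feature vector. In real use several samples per letter would be stored, in which case the Gaussian scores would be summed per letter and the two rules can disagree; with one sample each, they coincide.

```python
import numpy as np

# Hypothetical magnetic dataset: one stored feature vector per letter.
dataset = {
    "A": np.array([0.10, 0.40, 0.90]),
    "B": np.array([0.80, 0.20, 0.10]),
}

def recognize_euclidean(x):
    """Method 1: pick the letter whose stored vector minimizes the Euclidean norm."""
    return min(dataset, key=lambda L: np.linalg.norm(x - dataset[L]))

def recognize_gaussian(x, sigma=0.3):
    """Method 2: score each letter with a Gaussian weighting of the distance
    (sigma is an assumed bandwidth) and pick the highest score."""
    scores = {L: np.exp(-np.linalg.norm(x - v) ** 2 / (2 * sigma ** 2))
              for L, v in dataset.items()}
    return max(scores, key=scores.get)

reading = np.array([0.12, 0.38, 0.85])  # sensor reading for an unknown gesture
print(recognize_euclidean(reading), recognize_gaussian(reading))
```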
Building the European Seismological Research Infrastructure: results from 4 years NERIES EC project
NASA Astrophysics Data System (ADS)
van Eck, T.; Giardini, D.
2010-12-01
The EC Research Infrastructure (RI) project, Network of Research Infrastructures for European Seismology (NERIES), implemented a comprehensive, integrated European RI for earthquake seismological data that is scalable and sustainable. NERIES opened up a significant amount of additional seismological data, integrated different distributed data archives, and implemented advanced analysis tools and software packages. A single seismic data portal provides a single access point and overview for European seismological data available to the earth science research community. Additional data access tools and sites have been implemented to meet user and robustness requirements, notably those at the EMSC and ORFEUS. The datasets compiled in NERIES and available through the portal include, among others: - The expanded Virtual European Broadband Seismic Network (VEBSN), with real-time access to more than 500 stations from more than 53 observatories. These data are continuously monitored, quality controlled, and archived in the European Integrated Distributed waveform Archive (EIDA). - A unique integration of acceleration datasets from seven networks in seven European or associated countries, centrally accessible in a homogeneous format, thus forming the core of a comprehensive European acceleration database. Standardized parameter analysis and the associated software are included in the database. - A Distributed Archive of Historical Earthquake Data (AHEAD) for research purposes, containing among others a comprehensive European Macroseismic Database and Earthquake Catalogue (1000-1963, M ≥ 5.8), including analysis tools. - Data from three one-year OBS deployments at three sites (Atlantic, Ionian, and Ligurian Seas) in the general SEED format, creating the core integrated database for ocean-, sea-, and land-based seismological observatories. Tools to facilitate analysis and data mining of the RI datasets are: - A comprehensive set of European seismological velocity reference models, including a standardized model description with several visualization tools, currently being adapted to a global scale. - An integrated approach to seismic hazard modelling and forecasting, a community-accepted forecast testing and model validation approach, and the core hazard portal, developed with the same technologies as the NERIES data portal. - Homogeneous shakemap estimation tools implemented at several large European observatories, and a complementary new loss estimation software tool. - A comprehensive set of new techniques for geotechnical site characterization, with the relevant software packages documented and maintained (www.geopsy.org). - A set of software packages for data mining, data reduction, data exchange, and information management in seismology, serving as research and observatory analysis tools. NERIES has a long-term impact and is coordinated with the related US initiatives IRIS and EarthScope. The follow-up EC project of NERIES, NERA (2010-2014), is funded and will integrate the seismological and earthquake engineering infrastructures. NERIES also provided the proof of concept for the ESFRI2008 initiative, the European Plate Observing System (EPOS), whose preparatory phase (2010-2014) is also funded by the EC.
The health care and life sciences community profile for dataset descriptions
Alexiev, Vladimir; Ansell, Peter; Bader, Gary; Baran, Joachim; Bolleman, Jerven T.; Callahan, Alison; Cruz-Toledo, José; Gaudet, Pascale; Gombocz, Erich A.; Gonzalez-Beltran, Alejandra N.; Groth, Paul; Haendel, Melissa; Ito, Maori; Jupp, Simon; Juty, Nick; Katayama, Toshiaki; Kobayashi, Norio; Krishnaswami, Kalpana; Laibe, Camille; Le Novère, Nicolas; Lin, Simon; Malone, James; Miller, Michael; Mungall, Christopher J.; Rietveld, Laurens; Wimalaratne, Sarala M.; Yamaguchi, Atsuko
2016-01-01
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. PMID:27602295
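A minimal sketch of such a machine-readable description, assuming a recent rdflib release that ships the DCAT and DCTERMS namespaces; the dataset IRI, creator ORCID, and the property selection are illustrative rather than the guideline's full element set.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("http://example.org/dataset/expression-atlas-v1")  # hypothetical IRI

# Description, attribution, versioning, and licensing elements of the kind
# the guideline covers, expressed with reused vocabularies.
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Example expression dataset")))
g.add((ds, DCTERMS.creator, URIRef("https://orcid.org/0000-0000-0000-0000")))
g.add((ds, DCTERMS.hasVersion, Literal("1.0")))
g.add((ds, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))

print(g.serialize(format="turtle"))
```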
Atlas-guided cluster analysis of large tractography datasets.
Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer
2013-01-01
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to group fiber tracts time-efficiently. Structural information from a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach not only facilitates the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic, anatomically correct, and reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.
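A minimal sketch of the clustering step, assuming each tract has been reduced to a fixed-length feature vector; SciPy's agglomerative linkage stands in for the paper's framework, and the distance threshold is an assumed value.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical fiber-tract descriptors: each tract reduced to a fixed-length
# feature vector (e.g. resampled endpoint and midpoint coordinates).
rng = np.random.default_rng(3)
bundle_a = rng.normal(loc=0.0, scale=0.5, size=(50, 9))
bundle_b = rng.normal(loc=5.0, scale=0.5, size=(50, 9))
features = np.vstack([bundle_a, bundle_b])

# Agglomerative (hierarchical) clustering; cutting the dendrogram at an
# assumed distance threshold groups tracts into bundles.
Z = linkage(features, method="average")
labels = fcluster(Z, t=10.0, criterion="distance")
print(np.unique(labels, return_counts=True))

# An atlas could guide the result by relabeling each cluster with the atlas
# class most of its member tracts pass through (not shown here).
```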
A gridded hourly rainfall dataset for the UK applied to a national physically-based modelling system
NASA Astrophysics Data System (ADS)
Lewis, Elizabeth; Blenkinsop, Stephen; Quinn, Niall; Freer, Jim; Coxon, Gemma; Woods, Ross; Bates, Paul; Fowler, Hayley
2016-04-01
An hourly gridded rainfall product has great potential for use in many hydrological applications that require high temporal resolution meteorological data. One important example of this is flood risk management, with flooding in the UK highly dependent on sub-daily rainfall intensities amongst other factors. Knowledge of sub-daily rainfall intensities is therefore critical to designing hydraulic structures or flood defences to appropriate levels of service. Sub-daily rainfall rates are also essential inputs for flood forecasting, allowing for estimates of peak flows and stage for flood warning and response. In addition, an hourly gridded rainfall dataset has significant potential for practical applications such as better representation of extremes and pluvial flash flooding, validation of high resolution climate models and improving the representation of sub-daily rainfall in weather generators. A new 1km gridded hourly rainfall dataset for the UK has been created by disaggregating the daily Gridded Estimates of Areal Rainfall (CEH-GEAR) dataset using comprehensively quality-controlled hourly rain gauge data from over 1300 observation stations across the country. Quality control measures include identification of frequent tips, daily accumulations and dry spells, comparison of daily totals against the CEH-GEAR daily dataset, and nearest neighbour checks. The quality control procedure was validated against historic extreme rainfall events and the UKCP09 5km daily rainfall dataset. General use of the dataset has been demonstrated by testing the sensitivity of a physically-based hydrological modelling system for Great Britain to the distribution and rates of rainfall and potential evapotranspiration. Of the sensitivity tests undertaken, the largest improvements in model performance were seen when an hourly gridded rainfall dataset was combined with potential evapotranspiration disaggregated to hourly intervals, with 61% of catchments showing an increase in NSE between observed and simulated streamflows as a result of more realistic sub-daily meteorological forcing.
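The core disaggregation step can be sketched in a few lines: a grid cell's daily total is split across 24 hours using the hourly profile of a nearby quality-controlled gauge. All values below are hypothetical.

```python
import numpy as np

# One grid cell's daily total (mm) and the nearest gauge's hourly record.
daily_total_mm = 12.0
gauge_hourly_mm = np.array([0, 0, 0, 1, 3, 4, 2, 1, 1, 0, 0, 0,
                            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)

gauge_day_total = gauge_hourly_mm.sum()
if gauge_day_total > 0:
    fractions = gauge_hourly_mm / gauge_day_total
else:
    fractions = np.full(24, 1 / 24)  # fall back to a uniform split on dry gauges

hourly_mm = daily_total_mm * fractions
assert np.isclose(hourly_mm.sum(), daily_total_mm)  # mass is conserved
print(hourly_mm)
```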
[Research on developing the spectral dataset for Dunhuang typical colors based on color constancy].
Liu, Qiang; Wan, Xiao-Xia; Liu, Zhen; Li, Chan; Liang, Jin-Xing
2013-11-01
The present paper aims to develop a method for reasonably establishing a typical spectral color dataset for different kinds of Chinese cultural heritage in the color rendering process. The world-famous wall paintings in the Dunhuang Mogao Grottoes, dating from more than 1700 years ago, were taken as the typical case in this research. In order to maintain color constancy during the color rendering workflow for the Dunhuang cultural relics, a chromatic-adaptation-based method for developing the spectral dataset of typical colors for those wall paintings was proposed from the viewpoint of human visual perception. With the help and guidance of researchers at the art-research and protection-research institutions of the Dunhuang Academy, and according to existing research achievements on Dunhuang from past years, 48 typical known Dunhuang pigments were chosen, 240 representative color samples were made, and their reflectance spectra from 360 to 750 nm were acquired with a spectrometer. In order to find the typical colors among the above-mentioned color samples, the original dataset was divided into several subgroups by clustering analysis. The number of groups, together with the most typical samples of each subgroup that made up the initially built typical color dataset, was determined by the Wilcoxon signed-rank test according to the color inconstancy index comprehensively calculated under 6 typical illuminating conditions. Considering the completeness of the gamut of the Dunhuang wall paintings, 8 complementary colors were determined, and finally the typical spectral color dataset was built up, containing 100 representative spectral colors. The analytical results show that the median color inconstancy index of the built dataset, at the 99% confidence level of the Wilcoxon signed-rank test, was 3.28, and that the 100 colors are distributed uniformly across the whole gamut, which ensures that this dataset can provide a reasonable reference for choosing the color with the highest color constancy during the color rendering process for the Dunhuang cultural heritage.
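A rough sketch of the subgrouping step: cluster the measured spectra and keep, per subgroup, the member closest to the centroid as its typical color. The spectra here are random stand-ins, the subgroup count is fixed rather than selected by the signed-rank test used in the study, and a recent SciPy is assumed for the seeded k-means.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Random stand-ins for the 240 measured reflectance spectra (40 bands
# spanning 360-750 nm after downsampling).
rng = np.random.default_rng(4)
spectra = rng.uniform(0.0, 1.0, size=(240, 40))

k = 20  # assumed subgroup count; the study chose it via a signed-rank test
centroids, labels = kmeans2(spectra, k, minit="++", seed=4)

# The "typical color" of each subgroup: the sample nearest its centroid.
typical_idx = []
for c in range(k):
    members = np.where(labels == c)[0]
    if members.size:
        d = np.linalg.norm(spectra[members] - centroids[c], axis=1)
        typical_idx.append(members[np.argmin(d)])
print(f"{len(typical_idx)} typical samples selected")
```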
Distributed File System Utilities to Manage Large Datasets, Version 0.5
DOE Office of Scientific and Technical Information (OSTI.GOV)
2014-05-21
FileUtils provides a suite of tools to manage large datasets typically created by large parallel MPI applications. They are written in C and use standard POSIX I/O calls. The current suite consists of tools to copy, compare, remove, and list. The tools provide dramatic speedups over existing Linux tools, which often run as a single process.
Correction of elevation offsets in multiple co-located lidar datasets
Thompson, David M.; Dalyander, P. Soupy; Long, Joseph W.; Plant, Nathaniel G.
2017-04-07
Topographic elevation data collected with airborne light detection and ranging (lidar) can be used to analyze short- and long-term changes to beach and dune systems. Analysis of multiple lidar datasets at Dauphin Island, Alabama, revealed systematic, island-wide elevation differences on the order of tens of centimeters that were not attributable to real-world change and, therefore, were likely to represent systematic sampling offsets. These offsets vary between the datasets, but appear spatially consistent within a given survey. This report describes a method that was developed to identify and correct offsets between lidar datasets collected over the same site at different times so that true elevation changes over time, associated with sediment accumulation or erosion, can be analyzed.
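The correction idea admits a compact sketch: over ground assumed stable between surveys, the median elevation difference estimates the systematic offset, which is then subtracted from the later survey. All elevations below are simulated.

```python
import numpy as np

# Simulated stable-ground elevations (m) sampled by two surveys; the later
# survey carries a 23 cm systematic offset plus per-survey noise.
rng = np.random.default_rng(5)
stable_truth = rng.uniform(1.0, 3.0, size=10_000)
survey_2010 = stable_truth + rng.normal(0, 0.05, 10_000)
survey_2015 = stable_truth + 0.23 + rng.normal(0, 0.05, 10_000)

# The median difference is robust to pixels with real morphologic change.
offset = np.median(survey_2015 - survey_2010)
survey_2015_corrected = survey_2015 - offset
print(f"estimated offset: {offset * 100:.1f} cm")
```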
A new global 1-km dataset of percentage tree cover derived from remote sensing
DeFries, R.S.; Hansen, M.C.; Townshend, J.R.G.; Janetos, A.C.; Loveland, Thomas R.
2000-01-01
Accurate assessment of the spatial extent of forest cover is a crucial requirement for quantifying the sources and sinks of carbon from the terrestrial biosphere. In the more immediate context of the United Nations Framework Convention on Climate Change, implementation of the Kyoto Protocol calls for estimates of carbon stocks for a baseline year as well as for subsequent years. Data sources from country level statistics and other ground-based information are based on varying definitions of 'forest' and are consequently problematic for obtaining spatially and temporally consistent carbon stock estimates. By combining two datasets previously derived from the Advanced Very High Resolution Radiometer (AVHRR) at 1 km spatial resolution, we have generated a prototype global map depicting percentage tree cover and associated proportions of trees with different leaf longevity (evergreen and deciduous) and leaf type (broadleaf and needleleaf). The product is intended for use in terrestrial carbon cycle models, in conjunction with other spatial datasets such as climate and soil type, to obtain more consistent and reliable estimates of carbon stocks. The percentage tree cover dataset is available through the Global Land Cover Facility at the University of Maryland at http://glcf.umiacs.umd.edu.
Assessing the internal consistency of the event-related potential: An example analysis.
Thigpen, Nina N; Kappenman, Emily S; Keil, Andreas
2017-01-01
ERPs are widely and increasingly used to address questions in psychophysiological research. As discussed in this special issue, a renewed focus on questions of reliability and stability marks the need for intuitive, quantitative descriptors that allow researchers to communicate the robustness of ERP measures used in a given study. This report argues that well-established indices of internal consistency and effect size meet this need and can be easily extracted from most ERP datasets, as demonstrated with example analyses using a representative dataset from a feature-based visual selective attention task. We demonstrate how to measure the internal consistency of three aspects commonly considered in ERP studies: voltage measurements for specific time ranges at selected sensors, voltage dynamics across all time points of the ERP waveform, and the distribution of voltages across the scalp. We illustrate methods for quantifying the robustness of experimental condition differences, by calculating effect size for different indices derived from the ERP. The number of trials contributing to the ERP waveform was manipulated to examine the relationship between signal-to-noise ratio (SNR), internal consistency, and effect size. In the present example dataset, satisfactory consistency (Cronbach's alpha > 0.7) of individual voltage measurements was reached at lower trial counts than were required to reach satisfactory effect sizes for differences between experimental conditions. Comparing different metrics of robustness, we conclude that the internal consistency and effect size of ERP findings greatly depend on the quantification strategy, the comparisons and analyses performed, and the SNR. © 2016 Society for Psychophysiological Research.
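Internal consistency in this sense is Cronbach's alpha computed over single-trial measurements. A self-contained sketch on simulated subjects-by-trials amplitudes; the 0.7 threshold follows the abstract.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (subjects x items) score matrix; here the
    'items' are single-trial ERP amplitude measurements per subject."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical data: 30 subjects x 12 trials of a stable per-subject
# component amplitude plus trial-to-trial noise.
rng = np.random.default_rng(6)
true_amp = rng.normal(5.0, 1.0, size=(30, 1))
trials = true_amp + rng.normal(0.0, 1.5, size=(30, 12))
print(f"alpha = {cronbach_alpha(trials):.2f}")  # > 0.7 counts as satisfactory
```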
Digital shaded-relief map of Venezuela
Garrity, Christopher P.; Hackley, Paul C.; Urbani, Franco
2004-01-01
The Digital Shaded-Relief Map of Venezuela is a composite of more than 20 tiles of 90 meter (3 arc second) pixel resolution elevation data, captured during the Shuttle Radar Topography Mission (SRTM) in February 2000. The SRTM, a joint project between the National Geospatial-Intelligence Agency (NGA) and the National Aeronautics and Space Administration (NASA), provides the most accurate and comprehensive international digital elevation dataset ever assembled. The 10-day flight mission aboard the U.S. Space Shuttle Endeavour obtained elevation data for about 80% of the world's landmass at 3-5 meter pixel resolution through the use of synthetic aperture radar (SAR) technology. SAR is desirable because it acquires data along continuous swaths, maintaining data consistency across large areas, independent of cloud cover. Swaths were captured at an altitude of 230 km, and are approximately 225 km wide with varying lengths. Rendering of the shaded-relief image required editing of the raw elevation data to remove numerous holes and anomalously high and low values inherent in the dataset. Customized ArcInfo Arc Macro Language (AML) scripts were written to interpolate areas of null values and generalize irregular elevation spikes and wells. Coastlines and major water bodies used as a clipping mask were extracted from 1:500,000-scale geologic maps of Venezuela (Bellizzia and others, 1976). The shaded-relief image was rendered with an illumination azimuth of 315° and an altitude of 65°. A vertical exaggeration of 2X was applied to the image to enhance land-surface features. Image post-processing techniques were accomplished using conventional desktop imaging software.
NASA Astrophysics Data System (ADS)
Zhang, Z.; Zimmermann, N. E.; Poulter, B.
2015-12-01
Simulation of the spatial-temporal dynamics of wetlands is key to understanding the role of wetland biogeochemistry under past and future climate variability. Hydrologic inundation models, such as TOPMODEL, are based on a fundamental parameter known as the compound topographic index (CTI) and provide a computationally cost-efficient approach to simulating global wetland dynamics. However, there remain large discrepancies among the implementations of TOPMODEL in land-surface models (LSMs), and thus in their performance against observations. This study describes new improvements to the TOPMODEL implementation and estimates of global wetland dynamics using the LPJ-wsl DGVM, and quantifies uncertainties by comparing the effects of three digital elevation model products (HYDRO1k, GMTED, and HydroSHEDS) of different spatial resolution and accuracy on simulated inundation dynamics. We found that calibrating TOPMODEL with a benchmark dataset can help to successfully predict the seasonal and interannual variations of wetlands, as well as improve the spatial distribution of wetlands to be consistent with inventories. The HydroSHEDS DEM, using a river-basin scheme for aggregating the CTI, shows the best accuracy in capturing the spatio-temporal dynamics of wetlands among the three DEM products. This study demonstrates the feasibility of capturing the spatial heterogeneity of inundation and of estimating seasonal and interannual variations in wetlands by coupling a hydrological module in LSMs with appropriate benchmark datasets. It additionally highlights the importance of an adequate understanding of topographic indices for simulating global wetlands and shows the opportunity to converge wetland estimates in LSMs by identifying the uncertainty associated with existing wetland products.
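The parameter at the heart of this comparison is the compound topographic index, CTI = ln(a / tan β), with a the specific upslope contributing area and β the local slope. A tiny sketch; the cell size and input values are illustrative.

```python
import numpy as np

def cti(upslope_area_m2: float, flow_width_m: float, slope_rad: float) -> float:
    """Compound topographic index CTI = ln(a / tan(beta)), where
    a = upslope contributing area per unit contour (flow) width."""
    a = upslope_area_m2 / flow_width_m  # specific catchment area (m)
    return float(np.log(a / np.tan(slope_rad)))

# Example: a 90 m DEM cell draining 100 upstream cells on a 2-degree slope.
print(cti(upslope_area_m2=100 * 90 * 90, flow_width_m=90.0,
          slope_rad=np.deg2rad(2.0)))
# Larger CTI -> flatter, more convergent terrain -> more likely inundated.
```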
EUDAT B2FIND : A Cross-Discipline Metadata Service and Discovery Portal
NASA Astrophysics Data System (ADS)
Widmann, Heinrich; Thiemann, Hannes
2016-04-01
The European Data Infrastructure (EUDAT) project aims at a pan-European environment that supports a variety of multiple research communities and individuals in managing the rising tide of scientific data through advanced data management technologies. This led to the establishment of the community-driven Collaborative Data Infrastructure, which implements common data services and storage resources to tackle the basic requirements and the specific challenges of international and interdisciplinary research data management. The metadata service B2FIND plays a central role in this context by providing a simple and user-friendly discovery portal to find research data collections stored in EUDAT data centers or in other repositories. For this we store the diverse metadata collected from heterogeneous sources in a comprehensive joint metadata catalogue and make them searchable in an open data portal. The implemented metadata ingestion workflow consists of three steps. First the metadata records - provided either by various research communities or via other EUDAT services - are harvested. Afterwards the raw metadata records are converted and mapped to unified key-value dictionaries as specified by the B2FIND schema. The semantic mapping of the non-uniform, community-specific metadata to homogeneous structured datasets is hereby the most subtle and challenging task. To assure and improve the quality of the metadata, this mapping process is accompanied by • iterative and intense exchange with the community representatives, • usage of controlled vocabularies and community-specific ontologies, and • formal and semantic validation. Finally the mapped and checked records are uploaded as datasets to the catalogue, which is based on the open source data portal software CKAN. CKAN provides a rich RESTful JSON API and uses SOLR for dataset indexing, which enables users to query and search the catalogue. The homogenization of the community-specific data models and vocabularies enables not only the uniform presentation of these datasets as tables of field-value pairs but also the faceted, spatial, and temporal search in the B2FIND metadata portal. Furthermore the service provides transparent access to the scientific data objects through the references and identifiers given in the metadata. B2FIND offers support for new communities interested in publishing their data within EUDAT. We present here the functionality and the features of the B2FIND service and give an outlook on further developments, such as interfaces to external libraries and the use of Linked Data.
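The mapping step described above amounts to converting each community's field names into the unified schema's keys. A toy sketch, with field names on both sides invented for illustration (not the actual B2FIND schema):

```python
# Per-community field maps: community-specific field name -> unified key.
# All names here are illustrative stand-ins.
COMMUNITY_MAPS = {
    "seismology": {"eventTitle": "title", "pi": "creator",
                   "shotDate": "publication_year"},
    "climate":    {"dataset_name": "title", "contact": "creator",
                   "year": "publication_year"},
}

def map_record(community: str, raw: dict) -> dict:
    """Convert one harvested record to the unified key-value dictionary."""
    field_map = COMMUNITY_MAPS[community]
    unified = {field_map[k]: v for k, v in raw.items() if k in field_map}
    unified["community"] = community  # provenance facet for the portal
    return unified

print(map_record("climate",
                 {"dataset_name": "AirTemp v2", "contact": "J. Doe", "year": 2015}))
```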
DeWitt, Jessica D.; Chirico, Peter G.; Malpeli, Katherine C.
2015-11-18
This work represents the fourth installment of the series, and publishes a dataset of eight new AOIs and one subarea within Afghanistan. These areas include Dasht-e-Nawar, Farah, North Ghazni, South Ghazni, Chakhansur, Godzareh East, Godzareh West, and Namaksar-e-Herat AOIs and the Central Bamyan subarea of the South Bamyan AOI (datasets for South Bamyan were published previously in Casey and Chirico, 2013). For each AOI and subarea, this dataset collection consists of the areal extent boundaries, elevation contours at 25-, 50-, and 100-m intervals, and an enhanced DEM. Hydrographic datasets covering the extent of four AOIs and one subarea are also included in the collection. The resulting raster and vector layers are intended for use by government agencies, developmental organizations, and private companies in Afghanistan to support mineral assessments, monitoring, management, and investment.
Residential load and rooftop PV generation: an Australian distribution network dataset
NASA Astrophysics Data System (ADS)
Ratnam, Elizabeth L.; Weller, Steven R.; Kellett, Christopher M.; Murray, Alan T.
2017-09-01
Despite the rapid uptake of small-scale solar photovoltaic (PV) systems in recent years, public availability of generation and load data at the household level remains very limited. Moreover, such data are typically measured using bi-directional meters recording only PV generation in excess of residential load, rather than recording generation and load separately. In this paper, we report a publicly available dataset consisting of load and rooftop PV generation for 300 de-identified residential customers in an Australian distribution network, with load centres covering metropolitan Sydney and surrounding regional areas. The dataset spans a 3-year period, with separately reported measurements of load and PV generation at 30-min intervals. Following a detailed description of the dataset, we describe several means by which anomalous records (e.g. due to inverter failure) are identified and excised. With the resulting 'clean' dataset, we identify key customer-specific and aggregated characteristics of rooftop PV generation and residential load.
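One screen of the kind described, sketched with pandas on simulated half-hourly data: days on which a customer's PV generation is zero across all daylight intervals are flagged as likely inverter failures. The daylight window and thresholds are assumptions, not the paper's exact rules.

```python
import pandas as pd

# One week of simulated half-hourly PV generation for a single customer.
idx = pd.date_range("2012-01-01", periods=48 * 7, freq="30min")
gen = pd.Series(1.0, index=idx)
gen[(idx.hour < 7) | (idx.hour >= 19)] = 0.0  # no generation at night
gen["2012-01-03":"2012-01-04"] = 0.0          # simulated inverter outage

# A day with zero total generation over the daylight window is anomalous.
daylight = gen.between_time("07:00", "18:30")
daily_daylight_sum = daylight.resample("D").sum()
failed_days = daily_daylight_sum[daily_daylight_sum == 0].index.date
print(list(failed_days))  # days to excise from the 'clean' dataset
```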
Rafferty, Sharon A.; Arnold, L.R.; Char, Stephen J.
2002-01-01
The U.S. Geological Survey developed this dataset as part of the Colorado Front Range Infrastructure Resources Project (FRIRP). One goal of the FRIRP was to provide information on the availability of those hydrogeologic resources that are either critical to maintaining infrastructure along the northern Front Range or that may become less available because of urban expansion in the northern Front Range. This dataset extends from the Boulder-Jefferson County line on the south to the middle of Larimer and Weld Counties on the north. On the west, this dataset is bounded by the approximate mountain front of the Front Range of the Rocky Mountains; on the east, by an arbitrary north-south line extending through a point about 6.5 kilometers east of Greeley. This digital geospatial dataset consists of digitized contours of unconsolidated-sediment thickness (depth to bedrock).
ERIC Educational Resources Information Center
Sheppard, Sheri; Gilmartin, Shannon; Chen, Helen L.; Donaldson, Krista; Lichtenstein, Gary; Eris, Ozgur; Lande, Micah; Toye, George
2010-01-01
This report is based on data from the Academic Pathways of People Learning Engineering Survey (APPLES), administered to engineering students at 21 U.S. engineering colleges and schools in the spring of 2008. The first comprehensive set of analyses completed on the APPLES dataset presented here looks at how engineering students experience their…
On Burst Detection and Prediction in Retweeting Sequence
2015-05-22
We conduct a comprehensive empirical analysis of a large microblogging dataset collected from Sina Weibo and report our observations of burst... whether and how accurately we can predict bursts using classifiers based on the extracted features. Our empirical study of the Sina Weibo data shows the... feasibility of burst prediction using appropriately extracted features and classic classifiers.
ERIC Educational Resources Information Center
Seipel, Ben; Carlson, Sarah E.; Clinton, Virginia E.
2017-01-01
The purpose of this study was to examine moment-by-moment fluctuations in text comprehension processing and determine how and when poor and good comprehenders differ. To do so, we reanalyzed a dataset of think-aloud protocols from 138 intermediate elementary students. Both good and poor comprehenders used a variety of processing strategies when…
User Guide to the PDS Dataset for the Cassini Composite Infrared Spectrometer (CIRS)
NASA Technical Reports Server (NTRS)
Nixon, Conor A.; Kaelberer, Monte S.; Gorius, Nicolas
2012-01-01
This User Guide to the Cassini Composite Infrared Spectrometer (CIRS) has been written with two communities in mind. First and foremost, scientists external to the Cassini Project who seek to use the CIRS data as archived in the Planetary Data System (PDS). In addition, it is intended to be a comprehensive reference guide for those internal to the CIRS team.
Wang, James K. T.; Langfelder, Peter; Horvath, Steve; Palazzolo, Michael J.
2017-01-01
Huntington's disease (HD) is a progressive and autosomal dominant neurodegeneration caused by CAG expansion in the huntingtin gene (HTT), but the pathophysiological mechanism of mutant HTT (mHTT) remains unclear. To study HD using systems biological methodologies on all published data, we undertook the first comprehensive curation of two key PubMed HD datasets: perturbation genes that impact mHTT-driven endpoints and therefore are putatively linked causally to pathogenic mechanisms, and the protein interactome of HTT that reflects its biology. We perused PubMed articles containing co-citation of gene IDs and MeSH terms of interest to generate mechanistic gene sets for iterative enrichment analyses and rank ordering. The HD Perturbation database of 1,218 genes highly overlaps the HTT Interactome of 1,619 genes, suggesting links between normal HTT biology and mHTT pathology. These two HD datasets are enriched for protein networks of key genes underlying two mechanisms not previously implicated in HD nor in each other: exosome synaptic functions and homeostatic synaptic plasticity. Moreover, proteins, possibly including HTT, and miRNA detected in exosomes from a wide variety of sources also highly overlap the HD datasets, suggesting both mechanistic and biomarker links. Finally, the HTT Interactome highly intersects protein networks of pathogenic genes underlying Parkinson's, Alzheimer's and eight non-HD polyglutamine diseases, ALS, and spinal muscular atrophy. These protein networks in turn highly overlap the exosome and homeostatic synaptic plasticity gene sets. Thus, we hypothesize that HTT and other neurodegeneration pathogenic genes form a large interlocking protein network involved in exosome and homeostatic synaptic functions, particularly where the two mechanisms intersect. Mutant pathogenic proteins cause dysfunctions at distinct points in this network, each altering the two mechanisms in specific fashion that contributes to distinct disease pathologies, depending on the gene mutation and the cellular and biological context. This protein network is rich with drug targets, and exosomes may provide disease biomarkers, thus enabling drug discovery. All the curated datasets are made available for other investigators. Elucidating the roles of pathogenic neurodegeneration genes in exosome and homeostatic synaptic functions may provide a unifying framework for the age-dependent, progressive and tissue selective nature of multiple neurodegenerative diseases. PMID:28611571
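Overlap claims of this kind are typically scored with a hypergeometric test. A minimal sketch using the set sizes quoted above; the genome background size and the observed intersection are assumptions for illustration.

```python
from scipy.stats import hypergeom

# Significance of the overlap between two curated gene sets. Set sizes are
# from the abstract; the background (~20,000 protein-coding genes) and the
# observed intersection are illustrative assumptions.
background = 20_000
perturbation = 1_218   # HD Perturbation database
interactome = 1_619    # HTT Interactome
overlap = 400          # hypothetical observed intersection

# P(X >= overlap) under random draws: survival function at overlap - 1.
p = hypergeom.sf(overlap - 1, background, perturbation, interactome)
print(f"hypergeometric enrichment p = {p:.2e}")
```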
EPA Office of Water (OW): 2002 Impaired Waters Baseline NHDPlus Indexed Dataset
This dataset consists of geospatial and attribute data identifying the spatial extent of state-reported impaired waters (EPA's Integrated Reporting categories 4a, 4b, 4c, and 5) available in EPA's Reach Address Database (RAD) at the time of extraction. For the 2002 baseline reporting year, EPA compiled state-submitted GIS data to create a seamless and nationally consistent picture of the Nation's impaired waters for measuring progress. EPA's Assessment and TMDL Tracking and Implementation System (ATTAINS) is a national compilation of states' 303(d) listings and TMDL development information, spanning several years of tracking over 40,000 impaired waters.
Comparison of Shallow Survey 2012 Multibeam Datasets
NASA Astrophysics Data System (ADS)
Ramirez, T. M.
2012-12-01
The purpose of the Shallow Survey common dataset is a comparison of the different technologies utilized for data acquisition in the shallow-survey marine environment. The common dataset consists of a series of surveys conducted over a common area of seabed using a variety of systems. It provides equipment manufacturers the opportunity to showcase their latest systems while giving hydrographic researchers and scientists a chance to test their latest algorithms on the dataset so that rigorous comparisons can be made. Five companies collected data for the common dataset in the Wellington Harbor area in New Zealand between May 2010 and May 2011: Kongsberg, Reson, R2Sonic, GeoAcoustics, and Applied Acoustics. The Wellington harbor and surrounding coastal area was selected since it has a number of well-defined features, including the HMNZS South Seas and HMNZS Wellington wrecks, an armored seawall constructed of Tetrapods and Akmons, aquifers, wharves, and marinas. The seabed inside the harbor basin is largely fine-grained sediment, with gravel and reefs around the coast. The area outside the harbor on the southern coast is an active environment, with moving sand and exposed reefs. A marine reserve is also in this area. For consistency between datasets, the coastal research vessel R/V Ikatere and crew were used for all surveys conducted for the common dataset. Using Triton's Perspective processing software, the multibeam datasets collected for the Shallow Survey were processed for detailed analysis. Datasets from each sonar manufacturer were processed using the CUBE algorithm developed by the Center for Coastal and Ocean Mapping/Joint Hydrographic Center (CCOM/JHC). Each dataset was gridded at 0.5 and 1.0 meter resolutions for cross comparison and compliance with International Hydrographic Organization (IHO) requirements. Detailed comparisons were made of equipment specifications (transmit frequency, number of beams, beam width), data density, total uncertainty, and IHO compliance. Results from an initial analysis indicate that properly comparing sonar quality from processed results requires considering more factors than simply using the same vessel with the same configuration. Survey techniques such as focusing the beams over a narrower beam width can greatly increase data quality. While each sonar manufacturer was required to meet IHO Special Order specifications, line spacing was not specified, which allowed for greater data density despite the equipment specifications.
EPA Facility Registry Service (FRS): Facility Interests Dataset - Intranet
This web feature service consists of location and facility identification information from EPA's Facility Registry Service (FRS) for all sites that are available in the FRS individual feature layers. The layers comprise the FRS major program databases, including: Assessment Cleanup and Redevelopment Exchange System (ACRES): brownfields sites; Air Facility System (AFS): stationary sources of air pollution; Air Quality System (AQS): ambient air pollution data from monitoring stations; Bureau of Indian Affairs (BIA): schools data on Indian land; Base Realignment and Closure (BRAC) facilities; Clean Air Markets Division Business System (CAMDBS): market-based air pollution control programs; Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS): hazardous waste sites; Integrated Compliance Information System (ICIS): integrated enforcement and compliance information; National Compliance Database (NCDB): Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) and the Toxic Substances Control Act (TSCA); National Pollutant Discharge Elimination System (NPDES) module of ICIS: NPDES surface water permits; Radiation Information Database (RADINFO): radiation and radioactivity facilities; RACT/BACT/LAER Clearinghouse (RBLC): best available air pollution technology requirements; Resource Conservation and Recovery Act Information System (RCRAInfo): tracks generators, transporters, treaters, storers, and disposers of hazardous waste.
EPA Facility Registry Service (FRS): Facility Interests Dataset - Intranet Download
This downloadable data package consists of location and facility identification information from EPA's Facility Registry Service (FRS) for all sites that are available in the FRS individual feature layers. The layers comprise the FRS major program databases, including: Assessment Cleanup and Redevelopment Exchange System (ACRES): brownfields sites; Air Facility System (AFS): stationary sources of air pollution; Air Quality System (AQS): ambient air pollution data from monitoring stations; Bureau of Indian Affairs (BIA): schools data on Indian land; Base Realignment and Closure (BRAC) facilities; Clean Air Markets Division Business System (CAMDBS): market-based air pollution control programs; Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS): hazardous waste sites; Integrated Compliance Information System (ICIS): integrated enforcement and compliance information; National Compliance Database (NCDB): Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) and the Toxic Substances Control Act (TSCA); National Pollutant Discharge Elimination System (NPDES) module of ICIS: NPDES surface water permits; Radiation Information Database (RADINFO): radiation and radioactivity facilities; RACT/BACT/LAER Clearinghouse (RBLC): best available air pollution technology requirements; Resource Conservation and Recovery Act Information System (RCRAInfo): tracks generators, transporters, treaters, storers, and disposers of hazardous waste.
LONI visualization environment.
Dinov, Ivo D; Valentino, Daniel; Shin, Bae Cheol; Konstantinidis, Fotios; Hu, Guogang; MacKenzie-Graham, Allan; Lee, Erh-Fang; Shattuck, David; Ma, Jeff; Schwartz, Craig; Toga, Arthur W
2006-06-01
Over the past decade, the use of informatics to solve complex neuroscientific problems has increased dramatically. Many of these research endeavors involve examining large amounts of imaging, behavioral, genetic, neurobiological, and neuropsychiatric data. Superimposing, processing, visualizing, or interpreting such a complex cohort of datasets frequently becomes a challenge. We developed a new software environment that allows investigators to integrate multimodal imaging data, hierarchical brain ontology systems, on-line genetic and phylogenic databases, and 3D virtual data reconstruction models. The Laboratory of Neuro Imaging visualization environment (LONI Viz) consists of the following components: a sectional viewer for imaging data, an interactive 3D display for surface and volume rendering of imaging data, a brain ontology viewer, and an external database query system. The synchronization of all components according to stereotaxic coordinates, region name, hierarchical ontology, and genetic labels is achieved via a comprehensive BrainMapper functionality, which directly maps between position, structure name, database, and functional connectivity information. This environment is freely available, portable, and extensible, and may prove very useful for neurobiologists, neurogeneticists, brain mappers, and other clinical, pedagogical, and research endeavors.
EPA Facility Registry Service (FRS): Facility Interests Dataset Download
This downloadable data package consists of location and facility identification information from EPA's Facility Registry Service (FRS) for all sites that are available in the FRS individual feature layers. The layers comprise the FRS major program databases, including: Assessment Cleanup and Redevelopment Exchange System (ACRES): brownfields sites; Air Facility System (AFS): stationary sources of air pollution; Air Quality System (AQS): ambient air pollution data from monitoring stations; Bureau of Indian Affairs (BIA): schools data on Indian land; Base Realignment and Closure (BRAC) facilities; Clean Air Markets Division Business System (CAMDBS): market-based air pollution control programs; Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS): hazardous waste sites; Integrated Compliance Information System (ICIS): integrated enforcement and compliance information; National Compliance Database (NCDB): Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) and the Toxic Substances Control Act (TSCA); National Pollutant Discharge Elimination System (NPDES) module of ICIS: NPDES surface water permits; Radiation Information Database (RADINFO): radiation and radioactivity facilities; RACT/BACT/LAER Clearinghouse (RBLC): best available air pollution technology requirements; Resource Conservation and Recovery Act Information System (RCRAInfo): tracks generators, transporters, treaters, storers, and disposers of hazardous waste.
EPA Facility Registry Service (FRS): Facility Interests Dataset
This web feature service consists of location and facility identification information from EPA's Facility Registry Service (FRS) for all sites that are available in the FRS individual feature layers. The layers comprise the FRS major program databases, including: Assessment Cleanup and Redevelopment Exchange System (ACRES): brownfields sites; Air Facility System (AFS): stationary sources of air pollution; Air Quality System (AQS): ambient air pollution data from monitoring stations; Bureau of Indian Affairs (BIA): schools data on Indian land; Base Realignment and Closure (BRAC) facilities; Clean Air Markets Division Business System (CAMDBS): market-based air pollution control programs; Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS): hazardous waste sites; Integrated Compliance Information System (ICIS): integrated enforcement and compliance information; National Compliance Database (NCDB): Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) and the Toxic Substances Control Act (TSCA); National Pollutant Discharge Elimination System (NPDES) module of ICIS: NPDES surface water permits; Radiation Information Database (RADINFO): radiation and radioactivity facilities; RACT/BACT/LAER Clearinghouse (RBLC): best available air pollution technology requirements; Resource Conservation and Recovery Act Information System (RCRAInfo): tracks generators, transporters, treaters, storers, and disposers of hazardous waste
Berninger, Virginia W.; Gebregziabher, Mulugeta; Tsu, Loretta
2016-01-01
Meta-analysis of voxel-based morphometry dyslexia studies and direct analysis of 293 reading disability and control cases from six different research sites were performed to characterize defining gray matter features of reading disability. These analyses demonstrated consistently lower gray matter volume in left posterior superior temporal sulcus/middle temporal gyrus regions and left orbitofrontal gyrus/pars orbitalis regions. Gray matter volume within both of these regions significantly predicted individual variation in reading comprehension after correcting for multiple comparisons. These regional gray matter differences were observed across published studies and in the multisite dataset after controlling for potential age and gender effects, and despite increased anatomical variance in the reading disability group, but were not significant after controlling for total gray matter volume. Thus, the orbitofrontal and posterior superior temporal sulcus gray matter findings are relatively reliable effects that appear to be dependent on cases with low total gray matter volume. The results are considered in the context of genetics studies linking orbitofrontal and superior temporal sulcus regions to alleles that confer risk for reading disability. PMID:26835509
EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats
Ison, Jon; Kalaš, Matúš; Jonassen, Inge; Bolser, Dan; Uludag, Mahmut; McWilliam, Hamish; Malone, James; Lopez, Rodrigo; Pettifer, Steve; Rice, Peter
2013-01-01
Motivation: Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required. Results: EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, datasets and publications within bioinformatics. EDAM applies to organizing and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations. Availability: The latest stable version of EDAM is available in OWL format from http://edamontology.org/EDAM.owl and in OBO format from http://edamontology.org/EDAM.obo. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. For documentation and license please refer to http://edamontology.org. This article describes version 1.2 available at http://edamontology.org/EDAM_1.2.owl. Contact: jison@ebi.ac.uk PMID:23479348
A rapid approach for automated comparison of independently derived stream networks
Stanislawski, Larry V.; Buttenfield, Barbara P.; Doumbouya, Ariel T.
2015-01-01
This paper presents an improved coefficient of line correspondence (CLC) metric for automatically assessing the similarity of two different sets of linear features. Elevation-derived channels at 1:24,000 scale (24K) are generated from a weighted flow-accumulation model and compared to 24K National Hydrography Dataset (NHD) flowlines. The CLC process conflates two vector datasets through a raster line-density differencing approach that is faster and more reliable than earlier methods. Methods are tested on 30 subbasins distributed across different terrain and climate conditions of the conterminous United States. CLC values for the 30 subbasins indicate 44–83% of the features match between the two datasets, with the majority of the mismatching features being first-order features. Relatively lower CLC values result from subbasins with less than about 1.5 degrees of slope. The primary difference between the two datasets may be explained by different data capture criteria. First-order, headwater tributaries derived from the flow-accumulation model are captured more comprehensively through drainage area and terrain conditions, whereas capture of headwater features in the NHD is cartographically constrained by tributary length. The addition of missing headwaters to the NHD, as guided by the elevation-derived channels, can substantially improve the scientific value of the NHD.
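The abstract does not give the CLC computation itself; the following is a minimal sketch of one way to score correspondence between two rasterized line networks, assuming boolean rasters and a pixel tolerance. The function name, the dilation-based matching, and the tolerance parameter are illustrative assumptions, not the authors' exact line-density-differencing method.

    import numpy as np
    from scipy.ndimage import binary_dilation

    def clc(raster_a, raster_b, tolerance_px=2):
        """Toy coefficient of line correspondence for two rasterized line
        networks (boolean arrays). A cell of one network counts as matched
        if the other network passes within tolerance_px cells of it."""
        structure = np.ones((2 * tolerance_px + 1,) * 2, dtype=bool)
        near_a = binary_dilation(raster_a, structure)
        near_b = binary_dilation(raster_b, structure)
        matched = (raster_a & near_b).sum() + (raster_b & near_a).sum()
        total = raster_a.sum() + raster_b.sum()
        return matched / total

A value near 1 would indicate that nearly all cells of each network lie close to the other network; values in the 0.44–0.83 range reported above would indicate partial correspondence.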
The PRIMAP-hist national historical emissions time series
NASA Astrophysics Data System (ADS)
Gütschow, Johannes; Jeffery, M. Louise; Gieseke, Robert; Gebel, Ronja; Stevens, David; Krapp, Mario; Rocha, Marcia
2016-11-01
To assess the history of greenhouse gas emissions and individual countries' contributions to emissions and climate change, detailed historical data are needed. We combine several published datasets to create a comprehensive set of emissions pathways for each country and Kyoto gas, covering the years 1850 to 2014 with yearly values, for all UNFCCC member states and most non-UNFCCC territories. The sectoral resolution is that of the main IPCC 1996 categories. Additional time series of CO2 are available for energy and industry subsectors. Country-resolved data are combined from different sources and supplemented using year-to-year growth rates from regionally resolved sources and numerical extrapolations to complete the dataset. Regional deforestation emissions are downscaled to country level using estimates of the deforested area obtained from potential vegetation and simulations of agricultural land. In this paper, we discuss the data sources and methods used and present the resulting dataset, including its limitations and uncertainties. The dataset is available from doi:10.5880/PIK.2016.003 and can be viewed on the website accompanying this paper (http://www.pik-potsdam.de/primap-live/primap-hist/).
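The growth-rate supplementation step described above can be illustrated with a minimal sketch: extend a country-level series by applying year-to-year growth rates taken from a regionally resolved series. The function and the None-for-missing convention are illustrative assumptions, not the PRIMAP-hist code.

    def extend_with_growth_rates(country, region):
        """Fill gaps (None) in a country emissions series by applying
        year-to-year growth rates from a regional series aligned on the
        same years."""
        out = list(country)
        for t in range(1, len(out)):
            if out[t] is None and out[t - 1] is not None and region[t - 1]:
                out[t] = out[t - 1] * (region[t] / region[t - 1])
        return out

    # extend_with_growth_rates([10.0, 11.0, None, None], [100, 105, 110, 120])
    # -> [10.0, 11.0, ~11.52, ~12.57]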
Solar Irradiance Data Products at the LASP Interactive Solar IRradiance Datacenter (LISIRD)
NASA Astrophysics Data System (ADS)
Lindholm, D. M.; Ware DeWolfe, A.; Wilson, A.; Pankratz, C. K.; Snow, M. A.; Woods, T. N.
2011-12-01
The Laboratory for Atmospheric and Space Physics (LASP) has developed the LASP Interactive Solar IRradiance Datacenter (LISIRD, http://lasp.colorado.edu/lisird/) web site to provide access to a comprehensive set of solar irradiance measurements and related datasets. Current data holdings include products from NASA missions SORCE, UARS, SME, and TIMED-SEE. The data provided covers a wavelength range from soft X-ray (XUV) at 0.1 nm up to the near infrared (NIR) at 2400 nm, as well as Total Solar Irradiance (TSI). Other datasets include solar indices, spectral and flare models, solar images, and more. The LISIRD web site features updated plotting, browsing, and download capabilities enabled by dygraphs, JavaScript, and Ajax calls to the LASP Time Series Server (LaTiS). In addition to the web browser interface, most of the LISIRD datasets can be accessed via the LaTiS web service interface that supports the OPeNDAP standard. OPeNDAP clients and other programming APIs are available for making requests that subset, aggregate, or filter data on the server before it is transported to the user. This poster provides an overview of the LISIRD system, summarizes the datasets currently available, and provides details on how to access solar irradiance data products through LISIRD's interfaces.
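Programmatic access through the LaTiS web service interface might look roughly like the sketch below. The base site comes from the abstract, but the /latis path, dataset name, variable names, and query syntax are illustrative assumptions only; consult the LISIRD site for the actual endpoints.

    import urllib.request

    # Base site from the abstract; everything after it is a hypothetical
    # LaTiS-style query (dataset and parameter names are placeholders).
    BASE = "http://lasp.colorado.edu/lisird/latis"
    query = "/dap/some_tsi_dataset.csv?time,irradiance&time>=2010-01-01"

    with urllib.request.urlopen(BASE + query) as resp:
        for line in resp.read().decode().splitlines()[:5]:
            print(line)  # print the header and first few records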
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oubeidillah, Abdoul A; Kao, Shih-Chieh; Ashfaq, Moetasim
2014-01-01
To extend geographical coverage, refine spatial resolution, and improve modeling efficiency, a computation- and data-intensive effort was conducted to organize a comprehensive hydrologic dataset with post-calibrated model parameters for hydro-climate impact assessment. Several key inputs for hydrologic simulation including meteorologic forcings, soil, land class, vegetation, and elevation were collected from multiple best-available data sources and organized for 2107 hydrologic subbasins (8-digit hydrologic units, HUC8s) in the conterminous United States at refined 1/24° (~4 km) spatial resolution. Using high-performance computing for intensive model calibration, a high-resolution parameter dataset was prepared for the macro-scale Variable Infiltration Capacity (VIC) hydrologic model. The VIC simulation was driven by DAYMET daily meteorological forcing and was calibrated against USGS WaterWatch monthly runoff observations for each HUC8. The results showed that this new parameter dataset may help reasonably simulate runoff at most US HUC8 subbasins. Based on this exhaustive calibration effort, it is now possible to accurately estimate the resources required for further model improvement across the entire conterminous United States. We anticipate that through this hydrologic parameter dataset, the repeated effort of fundamental data processing can be lessened, so that research efforts can emphasize the more challenging task of assessing climate change impacts. The pre-organized model parameter dataset will be provided to interested parties to support further hydro-climate impact assessment.
NASA Astrophysics Data System (ADS)
Gordov, Evgeny; Okladnikov, Igor; Titov, Alexander
2017-04-01
For comprehensive usage of large geospatial meteorological and climate datasets it is necessary to create a distributed software infrastructure based on the spatial data infrastructure (SDI) approach. Currently, it is generally accepted that the development of client applications as integrated elements of such an infrastructure should be based on modern web and GIS technologies. The paper describes a Web GIS for complex processing and visualization of geospatial datasets (mainly in NetCDF and PostGIS formats) as an integral part of a dedicated Virtual Research Environment for the comprehensive study of ongoing and possible future climate change and analysis of its implications, providing full information and computing support for the study of the economic, political and social consequences of global climate change at the global and regional levels. The Web GIS consists of two basic software parts: 1. A server-side part comprising PHP applications of the SDI geoportal, which realizes the functionality of interaction with the computational core backend and with WMS/WFS/WPS cartographical services, and implements an open API for browser-based client software; this part provides a limited set of procedures accessible via a standard HTTP interface. 2. A front-end part: a Web GIS client developed as a "single page application" based on the JavaScript libraries OpenLayers (http://openlayers.org/), ExtJS (https://www.sencha.com/products/extjs) and GeoExt (http://geoext.org/); it implements the application business logic and provides an intuitive user interface similar to that of popular desktop GIS applications such as uDig and QuantumGIS. The Boundless/OpenGeo architecture was used as a basis for the Web GIS client development. In line with general INSPIRE requirements for data visualization, the Web GIS provides such standard functionality as data overview, image navigation, scrolling, scaling and graphical overlay, and the display of map legends and corresponding metadata. The specialized Web GIS client contains three basic tiers: a tier of NetCDF metadata in JSON format; a middleware tier of JavaScript objects implementing methods to work with NetCDF metadata, the XML file of the selected calculation configuration (XML task), and WMS/WFS/WPS cartographical services; and a graphical user interface tier of JavaScript objects realizing the general application business logic. The Web GIS provides the launching of computational processing services to support tasks in the area of environmental monitoring, and presents calculation results in the form of WMS/WFS cartographical layers in raster (PNG, JPG, GeoTIFF), vector (KML, GML, Shape) and binary (NetCDF) formats. It has shown its effectiveness in solving real climate change research problems and disseminating investigation results in cartographical formats. The work is supported by the Russian Science Foundation grant No 16-19-10257.
Maglione, Anton G; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio
2017-01-01
In this study, the cortical activity correlated with the perception and appreciation of different sets of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistically educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style datasets, respectively). Results suggest that the verbal appreciation of the Orig dataset, when compared to the Color and Style ones, was mainly correlated with the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Also in the first 10 s of observation: (1) the Orig dataset induced more emotion and was perceived with more appreciation than the Color and Style datasets; (2) the Style dataset was perceived with more attentional effort than the other investigated datasets. During the whole 30 s period of observation: (1) the emotion induced by the Color and Style datasets increased across time while that induced by the Orig dataset remained stable; (2) the Color and Style datasets were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is consistent with the notion that active perception of the images, with sustained cognitive attention in parietal and central areas, leads to the generation of judgments about their aesthetic appreciation in frontal areas.
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets
Wernisch, Lorenz
2017-01-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190
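To illustrate the local-versus-global cluster structure described above (not the paper's hierarchical Dirichlet mixture inference), a toy sketch can cluster each dataset separately and treat the tuple of local labels as a global, context-dependent cluster; the function name and the use of k-means are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import KMeans

    def local_then_global(datasets, n_local=3, seed=0):
        """Cluster each dataset (same samples, different features)
        separately, then define a sample's global cluster as the tuple of
        its local labels, so clusters may merge in one context and split
        in another."""
        local = [KMeans(n_clusters=n_local, n_init=10, random_state=seed)
                 .fit_predict(X) for X in datasets]
        combos = list(zip(*local))                 # one label tuple per sample
        ids = {c: i for i, c in enumerate(sorted(set(combos)))}
        return local, np.array([ids[c] for c in combos])

Unlike this naive combination, the published model places a hierarchical prior over the local and global assignments, so that the number and composition of global clusters is inferred rather than enumerated.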
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.
Gabasova, Evelina; Reid, John; Wernisch, Lorenz
2017-10-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.
Maglione, Anton G.; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio
2017-01-01
In this study, the cortical activity correlated with the perception and appreciation of different sets of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistically educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style datasets, respectively). Results suggest that the verbal appreciation of the Orig dataset, when compared to the Color and Style ones, was mainly correlated with the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Also in the first 10 s of observation: (1) the Orig dataset induced more emotion and was perceived with more appreciation than the Color and Style datasets; (2) the Style dataset was perceived with more attentional effort than the other investigated datasets. During the whole 30 s period of observation: (1) the emotion induced by the Color and Style datasets increased across time while that induced by the Orig dataset remained stable; (2) the Color and Style datasets were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is consistent with the notion that active perception of the images, with sustained cognitive attention in parietal and central areas, leads to the generation of judgments about their aesthetic appreciation in frontal areas. PMID:28790907
Creating Digital Environments for Multi-Agent Simulation
2003-12-01
foliage on a polygon to represent a tree). Tile: A spatial partition of a coverage that shares the same set of feature classes with the same ... orthophoto datasets can be made from rectified grayscale aerial images. These datasets can support various weapon systems, Command, Control ... Raster Product Format (RPF) Standard. This data consists of unclassified seamless orthophotos, made from rectified grayscale aerial images. DOI 10
Ekeberg, Tomas
2015-05-26
This dataset contains the diffraction patterns that were used for the first three-dimensional reconstruction of a virus using FEL data. The sample was the giant mimivirus particle, which is one of the largest known viruses with a diameter of 450 nm. The dataset consists of the 198 diffraction patterns that were used in the analysis.
Ratz, Joan M.; Conk, Shannon J.
2014-01-01
The Gap Analysis Program (GAP) of the U.S. Geological Survey (USGS) produces geospatial datasets providing information on land cover, predicted species distributions, stewardship (ownership and conservation status), and an analysis dataset which synthesizes the other three datasets. The intent in providing these datasets is to support the conservation of biodiversity. The datasets are made available at no cost. The initial datasets were created at the state level. More recent datasets have been assembled at regional and national levels. GAP entered an agreement with the Policy Analysis and Science Assistance branch of the USGS to conduct an evaluation to describe the effect that using GAP data has on those who utilize the datasets (GAP users). The evaluation project included multiple components: a discussion regarding use of GAP data conducted with participants at a GAP conference, a literature review of publications that cited use of GAP data, and a survey of GAP users. The findings of the published literature search were used to identify topics to include on the survey. This report summarizes the literature search, the characteristics of the resulting set of publications, the emergent themes from statements made regarding GAP data, and a bibliometric analysis of the publications. We cannot claim that this list includes all publications that have used GAP data. Given the time lapse that is common in the publishing process, more recent datasets may be cited less frequently in this list of publications. Reports or products that used GAP data may be produced but never published in print or released online. In that case, our search strategies would not have located those reports. Authors may have used GAP data but failed to cite it in such a way that the search strategies we used would have located those publications. These are common issues when using a literature search as part of an evaluation project. Although the final list of publications we identified is not comprehensive, this set of publications can be considered a sufficient sample of those citing GAP data and suitable for the descriptive analyses we conducted.
Fazio, Simone; Garraín, Daniel; Mathieux, Fabrice; De la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda
2015-01-01
Under the framework of the European Platform on Life Cycle Assessment, the European Reference Life-Cycle Database (ELCD), developed by the Joint Research Centre of the European Commission, provides core Life Cycle Inventory (LCI) data from front-running EU-level business associations and other sources. The ELCD contains energy-related data on power and fuels. This study describes the methods to be used for the quality analysis of energy data for European markets (available in third-party LC databases and from authoritative sources) that are, or could be, used in the context of the ELCD. The methodology was developed and tested on the energy datasets most relevant for the EU context, derived from GaBi (the reference database used to derive datasets for the ELCD), Ecoinvent, E3 and Gemis. The criteria for the database selection were based on the availability of EU-related data, the inclusion of comprehensive datasets on energy products and services, and general acceptance by the LCA community. The proposed approach was based on the quality indicators developed within the International Reference Life Cycle Data System (ILCD) Handbook, further refined to facilitate their use in the analysis of energy systems. The overall Data Quality Rating (DQR) of an energy dataset can be calculated by summing the quality ratings (ranging from 1 to 5, where 1 represents very good and 5 very poor quality) of each of the quality criteria indicators and dividing by the total number of indicators considered. The quality of each dataset can be estimated for each indicator and then compared across the different databases/sources. The results can be used to highlight the weaknesses of each dataset and to guide further improvements to enhance the data quality with regard to the established criteria. This paper describes the application of the methodology to two exemplary datasets in order to show the potential of the methodological approach. The analysis helps LCA practitioners to evaluate the usefulness of the ELCD datasets for their purposes, and dataset developers and reviewers to derive information that will help improve the overall DQR of databases.
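A minimal sketch of the DQR arithmetic described above (the function name and input format are illustrative):

    def data_quality_rating(indicator_scores):
        """Overall DQR as described above: the mean of per-indicator
        quality ratings, each on a 1 (very good) to 5 (very poor) scale."""
        if any(not 1 <= s <= 5 for s in indicator_scores):
            raise ValueError("ratings must be between 1 and 5")
        return sum(indicator_scores) / len(indicator_scores)

    # data_quality_rating([1, 2, 2, 3, 1, 2]) -> about 1.83 (good quality)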
Gilman, Alexey; Laurens, Lieve M.; Puri, Aaron W.; ...
2015-11-16
Methane is a feedstock of interest for the future, both from natural gas and from renewable biogas sources. Methanotrophic bacteria have the potential to enable commercial methane bioconversion to value-added products such as fuels and chemicals. A strain of interest for such applications is Methylomicrobium buryatense 5GB1, due to its robust growth characteristics. But, to take advantage of the potential of this methanotroph, it is important to generate comprehensive bioreactor-based datasets for different growth conditions to compare bioprocess parameters. The datasets of growth parameters, gas utilization rates, and products (total biomass, extracted fatty acids, glycogen, excreted acids) were obtained for cultures of M. buryatense 5GB1 grown in continuous culture under methane limitation and O2 limitation conditions. Additionally, experiments were performed involving unrestricted batch growth conditions with both methane and methanol as substrate. All four growth conditions show significant differences. The most notable changes are the high glycogen content and high formate excretion for cells grown on methanol (batch), and high O2:CH4 utilization ratio for cells grown under methane limitation. The results presented here represent the most comprehensive published bioreactor datasets for a gamma-proteobacterial methanotroph. This information shows that metabolism by M. buryatense 5GB1 differs significantly for each of the four conditions tested. O2 limitation resulted in the lowest relative O2 demand and fed-batch growth on methane the highest. Future studies are needed to understand the metabolic basis of these differences. However, these results suggest that both batch and continuous culture conditions have specific advantages, depending on the product of interest.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-01-01
Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities makes it possible to significantly improve on current protein family clusterings, which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools, will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il PMID:18586742
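For reference, standard in-memory UPGMA on a condensed dissimilarity matrix looks like the sketch below; the O(n^2) memory of that matrix is exactly the bottleneck MC-UPGMA is designed to avoid. The data here are synthetic, and this is the conventional algorithm, not the memory-constrained variant.

    import numpy as np
    from scipy.cluster.hierarchy import average, fcluster
    from scipy.spatial.distance import pdist

    points = np.random.default_rng(0).normal(size=(100, 16))
    condensed = pdist(points)       # O(n^2) memory: the scaling bottleneck
    tree = average(condensed)       # UPGMA (average-linkage) linkage matrix
    labels = fcluster(tree, t=4, criterion="maxclust")  # cut into 4 clusters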
Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study
2015-01-01
Objective This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are “invisible” or not deposited in a known repository. Methods We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article. Results About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects. Conclusion In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a “dataset,” determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets. PMID:26207759
Global and Hemispheric Temperature Anomalies: Land and Marine Instrumental Records (1850 - 2015)
Jones, P. D. [Climatic Research Unit (CRU), University of East Anglia, Norwich, United Kingdom; Parker, D. E. [Hadley Centre for Climate Prediction and Research, Berkshire, United Kingdom; Osborn, T. J. [Climatic Research Unit (CRU), University of East Anglia, Norwich, United Kingdom; Briffa, K. R. [Climatic Research Unit (CRU), University of East Anglia, Norwich, United Kingdom
2016-05-01
These global and hemispheric temperature anomaly time series, which incorporate land and marine data, are continually updated and expanded by P. Jones of the Climatic Research Unit (CRU) with help from colleagues at the CRU and other institutions. Some of the earliest work in producing these temperature series dates back to Jones et al. (1986a,b,c), Jones (1988, 1994), and Jones and Briffa (1992). Most of the discussion of methods given here has been gleaned from the Frequently Asked Questions section of the CRU temperature data web pages. Users are encouraged to visit the CRU Web site for the most comprehensive overview of these data (the "HadCRUT4" dataset), other associated datasets, and the most recent literature references to the work of Jones et al.
Data mining in newt-omics, the repository for omics data from the newt.
Looso, Mario; Braun, Thomas
2015-01-01
Salamanders are an excellent model organism to study regenerative processes due to their unique ability to regenerate lost appendages or organs. Straightforward bioinformatics tools to analyze and take advantage of the growing number of "omics" studies performed in salamanders were lacking so far. To overcome this limitation, we have generated a comprehensive data repository for the red-spotted newt Notophthalmus viridescens, named newt-omics, merging omics style datasets on the transcriptome and proteome level including expression values and annotations. The resource is freely available via a user-friendly Web-based graphical user interface ( http://newt-omics.mpi-bn.mpg.de) that allows access and queries to the database without prior bioinformatical expertise. The repository is updated regularly, incorporating new published datasets from omics technologies.
Depth-varying seismogenesis on an oceanic detachment fault at 13°20′N on the Mid-Atlantic Ridge
NASA Astrophysics Data System (ADS)
Craig, Timothy J.; Parnell-Turner, Ross
2017-12-01
Extension at slow- and intermediate-spreading mid-ocean ridges is commonly accommodated through slip on long-lived faults called oceanic detachments. These curved, convex-upward faults consist of a steeply-dipping section thought to be rooted in the lower crust or upper mantle which rotates to progressively shallower dip-angles at shallower depths. The commonly-observed result is a domed, sub-horizontal oceanic core complex at the seabed. Although it is accepted that detachment faults can accumulate kilometre-scale offsets over millions of years, the mechanism of slip, and their capacity to sustain the shear stresses necessary to produce large earthquakes, remains subject to debate. Here we present a comprehensive seismological study of an active oceanic detachment fault system on the Mid-Atlantic Ridge near 13°20′N, combining the results from a local ocean-bottom seismograph deployment with waveform inversion of a series of larger teleseismically-observed earthquakes. The unique coincidence of these two datasets provides a comprehensive definition of rupture on the fault, from the uppermost mantle to the seabed. Our results demonstrate that although slip on the deep, steeply-dipping portion of detachment faults is accommodated by failure in numerous microearthquakes, the shallow, gently-dipping section of the fault within the upper few kilometres is relatively strong, and is capable of producing large-magnitude earthquakes. This result brings into question the current paradigm that the shallow sections of oceanic detachment faults are dominated by low-friction mineralogies and therefore slip aseismically, but is consistent with observations from continental detachment faults. Slip on the shallow portion of active detachment faults at relatively low angles may therefore account for many more large-magnitude earthquakes at mid-ocean ridges than previously thought, and suggests that the lithospheric strength at slow-spreading mid-ocean ridges may be concentrated at shallow depths.
Kitahara, Marcelo V.; Cairns, Stephen D.; Stolarski, Jarosław; Blair, David; Miller, David J.
2010-01-01
Background Classical morphological taxonomy places the approximately 1400 recognized species of Scleractinia (hard corals) into 27 families, but many aspects of coral evolution remain unclear despite the application of molecular phylogenetic methods. In part, this may be a consequence of such studies focusing on the reef-building (shallow water and zooxanthellate) Scleractinia, and largely ignoring the large number of deep-sea species. To better understand broad patterns of coral evolution, we generated molecular data for a broad and representative range of deep sea scleractinians collected off New Caledonia and Australia during the last decade, and conducted the most comprehensive molecular phylogenetic analysis to date of the order Scleractinia. Methodology Partial (595 bp) sequences of the mitochondrial cytochrome oxidase subunit 1 (CO1) gene were determined for 65 deep-sea (azooxanthellate) scleractinians and 11 shallow-water species. These new data were aligned with 158 published sequences, generating a 234 taxon dataset representing 25 of the 27 currently recognized scleractinian families. Principal Findings/Conclusions There was a striking discrepancy between the taxonomic validity of coral families consisting predominantly of deep-sea or shallow-water species. Most families composed predominantly of deep-sea azooxanthellate species were monophyletic in both maximum likelihood and Bayesian analyses but, by contrast (and consistent with previous studies), most families composed predominantly of shallow-water zooxanthellate taxa were polyphyletic, although Acroporidae, Poritidae, Pocilloporidae, and Fungiidae were exceptions to this general pattern. One factor contributing to this inconsistency may be the greater environmental stability of deep-sea environments, effectively removing taxonomic “noise” contributed by phenotypic plasticity. Our phylogenetic analyses imply that the most basal extant scleractinians are azooxanthellate solitary corals from deep-water, their divergence predating that of the robust and complex corals. Deep-sea corals are likely to be critical to understanding anthozoan evolution and the origins of the Scleractinia. PMID:20628613
Nicholson, Andrew G; Detterbeck, Frank; Marx, Alexander; Roden, Anja C; Marchevsky, Alberto M; Mukai, Kiyoshi; Chen, Gang; Marino, Mirella; den Bakker, Michael A; Yang, Woo-Ick; Judge, Meagan; Hirschowitz, Lynn
2017-03-01
The International Collaboration on Cancer Reporting (ICCR) is a not-for-profit organization formed by the Royal Colleges of Pathologists of Australasia and the United Kingdom, the College of American Pathologists, the Canadian Association of Pathologists-Association canadienne des pathologistes in association with the Canadian Partnership Against Cancer, and the European Society of Pathology. Its goal is to produce standardized, internationally agreed, evidence-based datasets for use throughout the world. This article describes the development of a cancer dataset by the multidisciplinary ICCR expert panel for the reporting of thymic epithelial tumours. The dataset includes 'required' (mandatory) and 'recommended' (non-mandatory) elements, which are validated by a review of current evidence and supported by explanatory text. Seven required elements and 12 recommended elements were agreed by the international dataset authoring committee to represent the essential information for the reporting of thymic epithelial tumours. The use of an internationally agreed, structured pathology dataset for reporting thymic tumours provides all of the necessary information for optimal patient management, facilitates consistent and accurate data collection, and provides valuable data for research and international benchmarking. The dataset also provides a valuable resource for those countries and institutions that are not in a position to develop their own datasets. © 2016 John Wiley & Sons Ltd.
Boer, Annemarie; Dutmer, Alisa L; Schiphorst Preuper, Henrica R; van der Woude, Lucas H V; Stewart, Roy E; Deyo, Richard A; Reneman, Michiel F; Soer, Remko
2017-10-01
Validation study with cross-sectional and longitudinal measurements. To translate the US National Institutes of Health (NIH) minimal dataset for clinical research on chronic low back pain into the Dutch language and to test its validity and reliability among people with chronic low back pain. The NIH developed a minimal dataset to encourage more complete and consistent reporting of clinical research and to enable comparison of studies across countries in patients with low back pain. In the Netherlands, the NIH minimal dataset had not been translated before and its measurement properties were unknown. Cross-cultural validity was tested by a formal forward-backward translation. Structural validity was tested with exploratory factor analyses (comparative fit index, Tucker-Lewis index, and root mean square error of approximation). Hypothesis testing was performed to compare subscales of the NIH dataset with the Pain Disability Index and the EuroQol-5D (Pearson correlation coefficients). Internal consistency was tested with Cronbach α, and test-retest reliability at 2 weeks was calculated in a subsample of patients with intraclass correlation coefficients and weighted kappa (κω). In total, 452 patients were included, of which 52 were included in the test-retest study. Factor analysis for structural validity pointed in the direction of a seven-factor model (Cronbach α = 0.78). Factors and the total score of the NIH minimal dataset showed fair to good correlations with the Pain Disability Index (r = 0.43-0.70) and the EuroQol-5D (r = -0.41 to -0.64). Test-retest reliability per item showed substantial agreement (κω = 0.65). Test-retest reliability per factor was moderate to good (intraclass correlation coefficient = 0.71). The measurement properties of the Dutch language version of the NIH minimal dataset were satisfactory. N/A.
Atlas-Guided Cluster Analysis of Large Tractography Datasets
Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer
2013-01-01
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment. PMID:24386292
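A stripped-down sketch of hierarchical fiber clustering (without the atlas guidance or the multithreaded implementation described above) might pair a mean-closest-point tract distance with average-linkage clustering; all names here are illustrative assumptions, not the authors' framework.

    import numpy as np
    from scipy.cluster.hierarchy import average, fcluster
    from scipy.spatial.distance import cdist, squareform

    def tract_distance(a, b):
        """Mean closest-point distance between two fiber tracts, each an
        (n_points, 3) array of streamline coordinates."""
        d = cdist(a, b)
        return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

    def cluster_tracts(tracts, n_bundles):
        """Group a list of tracts into n_bundles via average linkage."""
        n = len(tracts)
        dm = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                dm[i, j] = dm[j, i] = tract_distance(tracts[i], tracts[j])
        tree = average(squareform(dm, checks=False))
        return fcluster(tree, t=n_bundles, criterion="maxclust")

In an atlas-guided variant, each resulting bundle could additionally be labeled by the atlas region in which most of its member tracts lie.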
Messerich, J.A.; Schilling, S.P.; Thompson, R.A.
2008-01-01
Presented in this report are 27 digital elevation model (DEM) datasets for the crater area of Mount St. Helens. These datasets include pre-eruption baseline data collected in 2000, incremental model subsets collected during the 2004-07 dome building eruption, and associated shaded-relief image datasets. Each dataset was collected photogrammetrically with digital softcopy methods employing a combination of manual collection and iterative compilation of x,y,z coordinate triplets utilizing autocorrelation techniques. DEM data points collected using autocorrelation methods were rigorously edited in stereo and manually corrected to ensure conformity with the ground surface. Data were first collected as a triangulated irregular network (TIN) then interpolated to a grid format. DEM data are based on aerotriangulated photogrammetric solutions for aerial photograph strips flown at a nominal scale of 1:12,000 using a combination of surveyed ground control and photograph-identified control points. The 2000 DEM is based on aerotriangulation of four strips totaling 31 photographs. Subsequent DEMs collected during the course of the eruption are based on aerotriangulation of single aerial photograph strips consisting of between three and seven 1:12,000-scale photographs (two to six stereo pairs). Most datasets were based on three or four stereo pairs. Photogrammetric errors associated with each dataset are presented along with ground control used in the photogrammetric aerotriangulation. The temporal increase in area of deformation in the crater as a result of dome growth, deformation, and translation of glacial ice resulted in continual adoption of new ground control points and abandonment of others during the course of the eruption. Additionally, seasonal snow cover precluded the consistent use of some ground control points.
Mehrabi, Saeed; Krishnan, Anand; Roch, Alexandra M; Schmidt, Heidi; Li, DingCheng; Kesterson, Joe; Beesley, Chris; Dexter, Paul; Schmidt, Max; Palakal, Mathew; Liu, Hongfang
2018-01-01
In this study we have developed a rule-based natural language processing (NLP) system to identify patients with family history of pancreatic cancer. The algorithm was developed in an Unstructured Information Management Architecture (UIMA) framework and consisted of section segmentation, relation discovery, and negation detection. The system was evaluated on data from two institutions. The family history identification precision was consistent across the institutions, shifting from 88.9% on the Indiana University (IU) dataset to 87.8% on the Mayo Clinic dataset. Customizing the algorithm on the Mayo Clinic data increased its precision to 88.1%. The family member relation discovery achieved precision, recall, and F-measure of 75.3%, 91.6% and 82.6%, respectively. Negation detection resulted in a precision of 99.1%. The results show that rule-based NLP approaches for specific information extraction tasks are portable across institutions; however, customization of the algorithm on the new dataset improves its performance. PMID:26262122
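A toy flavor of the rule-based approach (vastly simpler than the UIMA pipeline described above, and not the authors' rules) could combine a family-member pattern with a pre-mention negation check:

    import re

    FAMILY = r"(mother|father|sister|brother|aunt|uncle|grandmother|grandfather)"
    NEGATION = r"\b(no|denies|without|negative for)\b"

    def family_history_pancreatic_ca(sentence):
        """Flag a sentence mentioning a family member followed by
        pancreatic cancer, unless a negation cue precedes the mention.
        Illustrative only; real systems also segment sections and resolve
        relations."""
        s = sentence.lower()
        m = re.search(FAMILY + r".{0,60}pancreatic (cancer|carcinoma)", s)
        if not m:
            return False
        return not re.search(NEGATION, s[: m.start()])

    # family_history_pancreatic_ca("Mother died of pancreatic cancer.")  -> True
    # family_history_pancreatic_ca("No pancreatic cancer in his mother.") -> False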
Status and Preliminary Evaluation for Chinese Re-Analysis Datasets
NASA Astrophysics Data System (ADS)
Zhao, Bin; Shi, Chunxiang; Zhao, Tianbao; Si, Dong; Liu, Jingwei
2016-04-01
Based on the operational T639L60 spectral model combined with a hybrid GSI assimilation system using meteorological observations including radiosondes, buoys, and satellites, a set of Chinese Re-Analysis (CRA) datasets is being developed by the National Meteorological Information Center (NMIC) of the China Meteorological Administration (CMA). The datasets are run at 30 km (0.28° latitude/longitude) resolution, which is higher than that of most existing reanalysis datasets. The reanalysis is being done in an effort to enhance the accuracy of historical synoptic analysis and to aid detailed investigation of various weather and climate systems. The reanalysis is currently at the stage of preliminary experimental analysis. One year of forecast data, from June 2013 to May 2014, has been simulated and used in synoptic and climate evaluation. We first examine the model prediction ability with the new assimilation system and find that it represents a significant improvement in the Northern and Southern hemispheres: owing to the addition of new satellite data, upper-level prediction is improved markedly compared with the operational T639L60 model, and overall prediction stability is enhanced. In climatological analysis, compared with the ERA-40, NCEP/NCAR and NCEP/DOE reanalyses, the results show that simulated surface temperature is slightly lower over land and higher over ocean, 850-hPa specific humidity shows a weakened anomaly, and the zonal wind anomaly is concentrated in the equatorial tropics. Meanwhile, the reanalysis dataset shows good ability to reproduce various climate indices, such as the subtropical high index and the East Asian subtropical summer monsoon index (ESMI), and especially the Indian and western North Pacific monsoon indices. We will further improve the assimilation system and dynamical simulation performance, and produce a 40-year (1979-2018) reanalysis dataset. It will provide a more comprehensive analysis for synoptic and climate diagnosis.
On the Multi-Modal Object Tracking and Image Fusion Using Unsupervised Deep Learning Methodologies
NASA Astrophysics Data System (ADS)
LaHaye, N.; Ott, J.; Garay, M. J.; El-Askary, H. M.; Linstead, E.
2017-12-01
The number of different modalities of remote sensors has been on the rise, resulting in large datasets with different complexity levels. Such complex datasets can provide valuable information separately, yet there is greater value in having a comprehensive view of them combined. As such, hidden information can be deduced by applying data mining techniques to the fused data. The curse of dimensionality of such fused data, due to the potentially vast dimension space, hinders our ability to understand them deeply, because each dataset requires instrument-specific and dataset-specific knowledge for optimum and meaningful usage. Once a user decides to use multiple datasets together, a deeper understanding of how to translate and combine these datasets in a correct and effective manner is needed. Although data-centric techniques exist, generic automated methodologies that could solve this problem completely do not yet exist. Here we are developing a system that aims to gain a detailed understanding of different data modalities; such a system will provide an analysis environment that gives the user useful feedback and can aid in research tasks. In our current work, we show the initial outputs of our system implementation, which leverages unsupervised deep learning techniques so as not to burden the user with the task of labeling input data, while still allowing for a detailed machine understanding of the data. Our goal is to be able to track objects, like cloud systems or aerosols, across different image-like data modalities. The proposed system is flexible, scalable and robust enough to understand complex likenesses within multi-modal data in a similar spatio-temporal range, and also to co-register and fuse these images when needed.
NASA Technical Reports Server (NTRS)
Wang, Weile; Nemani, Ramakrishna R.; Michaelis, Andrew; Hashimoto, Hirofumi; Dungan, Jennifer L.; Thrasher, Bridget L.; Dixon, Keith W.
2016-01-01
The NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset comprises downscaled climate projections derived from 21 General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) and across two of the four greenhouse gas emissions scenarios (RCP4.5 and RCP8.5). Each of the climate projections includes daily maximum temperature, minimum temperature, and precipitation for the period from 1950 through 2100 at a spatial resolution of 0.25 degrees (approximately 25 km x 25 km). The GDDP dataset has been warmly received by the science community for conducting studies of climate change impacts at local to regional scales, but a comprehensive evaluation of its uncertainties is still missing. In this study, we apply the Perfect Model Experiment framework (Dixon et al. 2016) to quantify the key sources of uncertainty from the observational baseline dataset, the downscaling algorithm, and some intrinsic assumptions (e.g., the stationarity assumption) inherent to the statistical downscaling techniques. We developed a set of metrics to evaluate downscaling errors resulting from bias correction ("quantile mapping"), spatial disaggregation, and the temporal-spatial non-stationarity of climate variability. Our results highlight the spatial disaggregation (or interpolation) errors, which dominate the overall uncertainties of the GDDP dataset, especially over heterogeneous and complex terrain (e.g., mountains and coastal areas). In comparison, the temporal errors in the GDDP dataset tend to be more constrained. Our results also indicate that the downscaled daily precipitation has relatively larger uncertainties than the temperature fields, reflecting the rather stochastic nature of precipitation in space. Therefore, our results provide insights for improving statistical downscaling algorithms and products in the future.
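The "quantile mapping" bias-correction step mentioned above can be sketched empirically as follows. This is a generic textbook version under simplifying assumptions (plain empirical quantiles, no seasonal stratification or extrapolation handling), not the NEX-GDDP implementation.

    import numpy as np

    def quantile_map(model_hist, obs_hist, model_future):
        """Map each future model value to the observed value at the same
        quantile of the historical model distribution."""
        q = np.searchsorted(np.sort(model_hist), model_future) / len(model_hist)
        q = np.clip(q, 0.0, 1.0)
        return np.quantile(obs_hist, q)

Spatial disaggregation, the dominant error source identified above, is a separate interpolation step applied after this correction.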
Picking Cell Lines for High-Throughput Transcriptomic Toxicity ...
High throughput, whole genome transcriptomic profiling is a promising approach to comprehensively evaluate chemicals for potential biological effects. To be useful for in vitro toxicity screening, gene expression must be quantified in a set of representative cell types that captures the diversity of potential responses across chemicals. The ideal dataset to select these cell types would consist of hundreds of cell types treated with thousands of chemicals, but does not yet exist. However, basal gene expression data may be useful as a surrogate for representing the relevant biological space necessary for cell type selection. The goal of this study was to identify a small (< 20) number of cell types that capture a large, quantifiable fraction of basal gene expression diversity. Three publicly available collections of Affymetrix U133+2.0 cellular gene expression data were used: 1) 59 cell lines from the NCI60 set; 2) 303 primary cell types from the Mabbott et al (2013) expression atlas; and 3) 1036 cell lines from the Cancer Cell Line Encyclopedia. The data were RMA normalized, log-transformed, and the probe sets mapped to HUGO gene identifiers. The results showed that <20 cell lines capture only a small fraction of the total diversity in basal gene expression when evaluated using either the entire set of 20960 HUGO genes or a subset of druggable genes likely to be chemical targets. The fraction of the total gene expression variation explained was consistent when
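The column-selection problem described above can be illustrated with a greedy sketch that picks cell lines (columns) whose span explains the most basal-expression variance via least-squares projection; this is an illustrative assumption about method, not the study's actual analysis.

    import numpy as np

    def greedy_pick(expr, k):
        """expr: (genes x cell_lines) matrix of log expression. Greedily
        pick k columns whose span explains the largest fraction of total
        variance across all columns."""
        X = expr - expr.mean(axis=1, keepdims=True)   # center each gene
        total = (X ** 2).sum()
        chosen = []
        for _ in range(k):
            best = None
            for j in set(range(X.shape[1])) - set(chosen):
                B = X[:, chosen + [j]]
                resid = X - B @ np.linalg.lstsq(B, X, rcond=None)[0]
                frac = 1 - (resid ** 2).sum() / total
                if best is None or frac > best[0]:
                    best = (frac, j)
            chosen.append(best[1])
        return chosen, best[0]   # indices picked, variance fraction explained

Run on a few hundred cell lines, a low variance fraction at k < 20 would mirror the study's finding that small panels capture only part of basal expression diversity.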
National housing and impervious surface scenarios for integrated climate impact assessments
Bierwagen, Britta G.; Theobald, David M.; Pyke, Christopher R.; Choate, Anne; Groth, Philip; Thomas, John V.; Morefield, Philip
2010-01-01
Understanding the impacts of climate change on people and the environment requires an understanding of the dynamics of both climate and land use/land cover changes. A range of future climate scenarios is available for the conterminous United States that have been developed based on widely used international greenhouse gas emissions storylines. Climate scenarios derived from these emissions storylines have not been matched with logically consistent land use/cover maps for the United States. This gap is a critical barrier to conducting effective integrated assessments. This study develops novel national scenarios of housing density and impervious surface cover that are logically consistent with emissions storylines. Analysis of these scenarios suggests that combinations of climate and land use/cover can be important in determining environmental conditions regulated under the Clean Air and Clean Water Acts. We found significant differences in patterns of habitat loss and the distribution of potentially impaired watersheds among scenarios, indicating that compact development patterns can reduce habitat loss and the number of impaired watersheds. These scenarios are also associated with lower global greenhouse gas emissions and, consequently, the potential to reduce both the drivers of anthropogenic climate change and the impacts of changing conditions. The residential housing and impervious surface datasets provide a substantial first step toward comprehensive national land use/land cover scenarios, which have broad applicability for integrated assessments as these data and tools are publicly available. PMID:21078956
3Drefine: an interactive web server for efficient protein structure refinement.
Bhattacharya, Debswapna; Nowotny, Jackson; Cao, Renzhi; Cheng, Jianlin
2016-07-08
3Drefine is an interactive web server for consistent and computationally efficient protein structure refinement with the capability to perform web-based statistical and visual analysis. The 3Drefine refinement protocol utilizes iterative optimization of the hydrogen bonding network combined with atomic-level energy minimization of the optimized model using composite physics- and knowledge-based force fields. The method has been extensively evaluated in blind CASP experiments as well as on large-scale and diverse benchmark datasets, and it exhibits consistent improvement over the initial structure in both global and local structural quality measures. The 3Drefine web server allows for convenient protein structure refinement through text or file input submission, provides email notification and example submissions, and is freely available without any registration requirement. The server also provides comprehensive analysis of submissions through various energy and statistical feedback and interactive visualization of multiple refined models through the JSmol applet, which is equipped with numerous protein model analysis tools. The web server has been extensively tested and used by many users. As a result, the 3Drefine web server conveniently provides a useful tool easily accessible to the community. The 3Drefine web server has been made publicly available at the URL: http://sysbio.rnet.missouri.edu/3Drefine/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Molecular Targeted Therapies of Childhood Choroid Plexus Carcinoma
2013-10-01
Microarray intensities were analyzed in PGS, using the benign human choroid plexus papilloma (CPP) samples as an expression baseline reference. This...additional human and mouse CPC genomic profiles (timeframe: months 1-5). The goal of these studies is to expand our number of genomic profiles (DNA and...mRNA arrays) of both human and mouse CPCs to provide a comprehensive dataset with which to identify key candidate oncogenes, tumor suppressor genes
Molecular Targeted Therapies of Childhood Choroid Plexus Carcinoma
2012-10-01
Microarray intensities were analyzed in PGS, using the benign human choroid plexus papilloma (CPP) samples as an expression baseline reference...identify candidate drug targets of CPC. Task 1: Generation of additional human and mouse CPC genomic profiles (timeframe: months 1-5). The goal...of these studies is to expand our number of genomic profiles (DNA and mRNA arrays) of both human and mouse CPCs to provide a comprehensive dataset
Molecular Targeted Therapies of Childhood Choroid Plexus Carcinoma
2011-10-01
were analyzed in PGS, using the benign human choroid plexus papilloma (CPP) samples as an expression baseline reference. This analysis highlights...Task 1: Generation of additional human and mouse CPC genomic profiles (timeframe: months 1-5). The goal of these studies is to expand our...number of genomic profiles (DNA and mRNA arrays) of both human and mouse CPCs to provide a comprehensive dataset with which to identify key candidate
Joint Sparse Representation for Robust Multimodal Biometrics Recognition
2014-01-01
comprehensive multimodal dataset and a face database are described in section V. Finally, in section VI, we discuss the computational complexity of...fingerprint, iris, palmprint, hand geometry and voice from subjects of different age, gender and ethnicity as described in Table I. It is a...Taylor, "Constructing nonlinear discriminants from multiple data views," Machine Learning and Knowledge Discovery in Databases, pp. 328-343, 2010
Planform: an application and database of graph-encoded planarian regenerative experiments.
Lobo, Daniel; Malone, Taylor J; Levin, Michael
2013-04-15
Understanding the mechanisms governing the regeneration capabilities of many organisms is a fundamental interest in biology and medicine. An ever-increasing number of manipulation and molecular experiments are attempting to discover a comprehensive model for regeneration, with the planarian flatworm being one of the most important model species. Despite much effort, no comprehensive, constructive, mechanistic models exist yet, and it is now clear that computational tools are needed to mine this huge dataset. However, until now there has been no database of regenerative experiments, and the current genotype-phenotype ontologies and databases are based on textual descriptions, which are not understandable by computers. To overcome these difficulties, we present here Planform (Planarian formalization), a manually curated database and software tool for planarian regenerative experiments, based on a mathematical graph formalism. The database contains more than a thousand experiments from the main publications in the planarian literature. The software tool provides the user with a graphical interface to easily interact with and mine the database. The presented system is a valuable resource for the regeneration community and, more importantly, will pave the way for the application of novel artificial intelligence tools to extract knowledge from this dataset. The database and software tool are freely available at http://planform.daniel-lobo.com.
New generation of hydraulic pedotransfer functions for Europe
Tóth, B; Weynants, M; Nemes, A; Makó, A; Bilas, G; Tóth, G
2015-01-01
A range of continental-scale soil datasets exists in Europe with different spatial representation and based on different principles. We developed comprehensive pedotransfer functions (PTFs) for applications principally on spatial datasets with continental coverage. The PTF development included the prediction of soil water retention at various matric potentials and prediction of parameters to characterize soil moisture retention and the hydraulic conductivity curve (MRC and HCC) of European soils. We developed PTFs with a hierarchical approach, determined by the input requirements. The PTFs were derived by using three statistical methods: (i) linear regression where there were quantitative input variables, (ii) a regression tree for qualitative, quantitative and mixed types of information and (iii) mean statistics of developer-defined soil groups (class PTF) when only qualitative input parameters were available. Data of the recently established European Hydropedological Data Inventory (EU-HYDI), which holds the most comprehensive geographical and thematic coverage of hydro-pedological data in Europe, were used to train and test the PTFs. The applied modelling techniques and the EU-HYDI allowed the development of hydraulic PTFs that are more reliable and applicable for a greater variety of input parameters than those previously available for Europe. Therefore the new set of PTFs offers tailored advanced tools for a wide range of applications in the continent. PMID:25866465
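To make the hierarchical PTF idea concrete, here is a small Python sketch contrasting methods (i) and (ii) on synthetic soil data: a linear regression and a regression tree each predict a hypothetical water content at -33 kPa from texture, organic carbon, and bulk density. All variable names, coefficients, and values are illustrative, not the EU-HYDI-fitted functions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 500
# Columns: sand %, clay %, organic carbon %, bulk density (g/cm3)
X = np.column_stack([
    rng.uniform(5, 90, n),
    rng.uniform(5, 60, n),
    rng.uniform(0.1, 5, n),
    rng.uniform(1.0, 1.8, n),
])
# Hypothetical volumetric water content at -33 kPa, with noise
theta_33 = (0.45 - 0.003 * X[:, 0] + 0.002 * X[:, 1] + 0.01 * X[:, 2]
            - 0.05 * X[:, 3] + rng.normal(0, 0.01, n))

ptf_linear = LinearRegression().fit(X, theta_33)                # method (i)
ptf_tree = DecisionTreeRegressor(max_depth=4).fit(X, theta_33)  # method (ii)

sample = [[40.0, 25.0, 1.5, 1.4]]  # one soil: sand, clay, OC, bulk density
print(ptf_linear.predict(sample)[0], ptf_tree.predict(sample)[0])
```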
DAPAGLOCO - A global daily precipitation dataset from satellite and rain-gauge measurements
NASA Astrophysics Data System (ADS)
Spangehl, T.; Danielczok, A.; Dietzsch, F.; Andersson, A.; Schroeder, M.; Fennig, K.; Ziese, M.; Becker, A.
2017-12-01
The BMBF-funded project framework MiKlip (Mittelfristige Klimaprognosen) develops a global climate forecast system on decadal time scales for operational applications. Herein, the DAPAGLOCO project (Daily Precipitation Analysis for the validation of Global medium-range Climate predictions Operationalized) provides a global precipitation dataset as a combination of microwave-based satellite measurements over ocean and rain gauge measurements over land at daily scale. The DAPAGLOCO dataset is created primarily for the evaluation of the MiKlip forecast system. The HOAPS dataset (Hamburg Ocean Atmosphere Parameters and Fluxes from Satellite data) is used for the derivation of precipitation rates over ocean and is extended by the use of measurements from TMI, GMI, and AMSR-E, in addition to measurements from SSM/I and SSMIS. A 1D-Var retrieval scheme is developed to retrieve rain rates from microwave imager data, which also allows for the determination of uncertainty estimates. Over land, the GPCC (Global Precipitation Climatology Centre) Full Data Daily product is used. It consists of rain gauge measurements that are interpolated onto a regular grid by ordinary kriging. The currently available dataset is based on a neural network approach and consists of 21 years of data from 1988 to 2008; it is currently being extended to 2015 using the 1D-Var scheme and improved sampling. Three dataset versions with different spatial resolutions are available: 1° and 2.5° globally, and 0.5° for Europe. The evaluation of the MiKlip forecast system with DAPAGLOCO is based on the indices of the Expert Team on Climate Change Detection and Indices (ETCCDI). Hindcasts are used for the index-based comparison between model and observations. These indices allow for the evaluation of precipitation extremes, their spatial and temporal distribution, the duration of dry and wet spells, and average precipitation amounts and percentiles on a global scale. In addition, an ETCCDI-based climatology of the DAPAGLOCO precipitation dataset has been derived.
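For readers unfamiliar with the land-side interpolation step, the sketch below shows ordinary kriging of synthetic daily gauge totals onto a regular grid using the third-party pykrige package; the gauge locations, values, and the exponential variogram choice are assumptions made for illustration, not the GPCC configuration.

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

# Hypothetical daily gauge totals: longitude, latitude, rainfall (mm)
rng = np.random.default_rng(2)
lon = rng.uniform(5.0, 15.0, 50)
lat = rng.uniform(47.0, 55.0, 50)
rain = rng.gamma(shape=2.0, scale=3.0, size=50)

# Fit an ordinary-kriging model and interpolate onto a regular 0.5-degree grid
ok = OrdinaryKriging(lon, lat, rain, variogram_model="exponential")
grid_lon = np.arange(5.0, 15.0, 0.5)
grid_lat = np.arange(47.0, 55.0, 0.5)
field, kriging_variance = ok.execute("grid", grid_lon, grid_lat)
print(field.shape)  # gridded precipitation estimate (n_lat x n_lon)
```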
Reference-tissue correction of T2-weighted signal intensity for prostate cancer detection
NASA Astrophysics Data System (ADS)
Peng, Yahui; Jiang, Yulei; Oto, Aytekin
2014-03-01
The purpose of this study was to investigate whether correction of T2-weighted MR image signal intensity (SI) with respect to reference tissue improves its effectiveness for classification of regions of interest (ROIs) as prostate cancer (PCa) or normal prostatic tissue. Two image datasets collected retrospectively were used in this study: 71 cases acquired with GE scanners (dataset A), and 59 cases acquired with Philips scanners (dataset B). Through a consensus histology-MR correlation review, 175 PCa and 108 normal-tissue ROIs were identified and drawn manually. Reference-tissue ROIs were selected in each case from the levator ani muscle, urinary bladder, and pubic bone. T2-weighted image SI was corrected as the ratio of the average T2-weighted image SI within an ROI to that of a reference-tissue ROI. Area under the receiver operating characteristic curve (AUC) was used to evaluate the effectiveness of T2-weighted image SIs for differentiation of PCa from normal-tissue ROIs. AUC (+/- standard error) for uncorrected T2-weighted image SIs was 0.78+/-0.04 (dataset A) and 0.65+/-0.05 (dataset B). AUC for corrected T2-weighted image SIs with respect to muscle, bladder, and bone reference was 0.77+/-0.04 (p=1.0), 0.77+/-0.04 (p=1.0), and 0.75+/-0.04 (p=0.8), respectively, for dataset A; and 0.81+/-0.04 (p=0.002), 0.78+/-0.04 (p<0.001), and 0.79+/-0.04 (p<0.001), respectively, for dataset B. Correction in reference to the levator ani muscle yielded the most consistent results between GE and Philips images. Correction of T2-weighted image SI in reference to three types of extra-prostatic tissue can improve its effectiveness for differentiation of PCa from normal-tissue ROIs, and correction in reference to the levator ani muscle produces consistent T2-weighted image SIs between GE and Philips MR images.
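The reference-tissue correction is simply a per-case ratio of mean signal intensities. The following Python sketch on simulated data (hypothetical SI values and a random per-case scanner gain) illustrates why the ratio can raise AUC when raw intensities are scanner-dependent; it is not the study's code or data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 175 + 108                                   # PCa + normal-tissue ROIs
labels = np.array([1] * 175 + [0] * 108)        # 1 = PCa, 0 = normal
gain = rng.uniform(0.5, 2.0, n)                 # per-case scanner scaling

# Hypothetical "true" T2w intensities: PCa tends to be darker than normal tissue
tissue_si = np.where(labels == 1, rng.normal(300, 40, n), rng.normal(450, 40, n))
muscle_si = rng.normal(150, 15, n)              # levator ani reference ROI

roi_raw = tissue_si * gain                      # what the scanner reports
ref_raw = muscle_si * gain
corrected = roi_raw / ref_raw                   # the gain cancels in the ratio

# Lower SI indicates cancer, so negate the score for AUC
print("AUC raw SI:      ", round(roc_auc_score(labels, -roi_raw), 3))
print("AUC corrected SI:", round(roc_auc_score(labels, -corrected), 3))
```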
MSWEP V2 global 3-hourly 0.1° precipitation: methodology and quantitative appraisal
NASA Astrophysics Data System (ADS)
Beck, H.; Yang, L.; Pan, M.; Wood, E. F.; William, L.
2017-12-01
Here, we present Multi-Source Weighted-Ensemble Precipitation (MSWEP) V2, the first fully global gridded precipitation (P) dataset with a 0.1° spatial resolution. The dataset covers the period 1979-2016, has a 3-hourly temporal resolution, and was derived by optimally merging a wide range of data sources based on gauges (WorldClim, GHCN-D, GSOD, and others), satellites (CMORPH, GridSat, GSMaP, and TMPA 3B42RT), and reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR). MSWEP V2 implements some major improvements over V1, such as (i) the correction of distributional P biases using cumulative distribution function matching, (ii) increasing the spatial resolution from 0.25° to 0.1°, (iii) the inclusion of ocean areas, (iv) the addition of NCEP-CFSR P estimates, (v) the addition of thermal infrared-based P estimates for the pre-TRMM era, (vi) the addition of 0.1° daily interpolated gauge data, (vii) the use of a daily gauge correction scheme that accounts for regional differences in the 24-hour accumulation period of gauges, and (viii) extension of the data record to 2016. The gauge-based assessment of the reanalysis and satellite P datasets, necessary for establishing the merging weights, revealed that the reanalysis datasets strongly overestimate the P frequency for the entire globe, and that the satellite datasets consistently performed better at low latitudes while the reanalysis datasets performed better at high latitudes. Compared to other state-of-the-art P datasets, MSWEP V2 exhibits more plausible global patterns in mean annual P, percentiles, and annual number of dry days, and better resolves the small-scale variability over topographically complex terrain. Other P datasets appear to consistently underestimate P amounts over mountainous regions. Long-term mean P estimates for the global, land, and ocean domains based on MSWEP V2 are 959, 796, and 1026 mm/yr, respectively, in close agreement with the best previously published estimates.
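As a minimal illustration of the weighted-ensemble merging step, the Python sketch below combines co-located precipitation estimates using normalized, skill-derived weights; the sources, values, and weights are invented for illustration, and the real MSWEP weighting scheme is considerably more elaborate.

```python
import numpy as np

def merge_sources(estimates, weights):
    """Weighted-ensemble merge of co-located precipitation estimates;
    weights (e.g. gauge-based skill scores) are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), np.asarray(estimates), axes=1)

# Hypothetical daily estimates (mm) from gauge, satellite, and reanalysis
sources = np.array([4.1, 6.0, 2.5])
skill = np.array([0.9, 0.6, 0.3])  # invented skill-based weights
print(round(float(merge_sources(sources, skill)), 2))
```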
GEO Label: User and Producer Perspectives on a Label for Geospatial Data
NASA Astrophysics Data System (ADS)
Lush, V.; Lumsden, J.; Masó, J.; Díaz, P.; McCallum, I.
2012-04-01
One of the aims of the Science and Technology Committee (STC) of the Group on Earth Observations (GEO) was to establish a GEO Label - a label to certify geospatial datasets and their quality. As proposed, the GEO Label will be used as a value indicator for geospatial data and datasets accessible through the Global Earth Observation System of Systems (GEOSS). It is suggested that the development of such a label will significantly improve user recognition of the quality of geospatial datasets and that its use will help promote trust in datasets that carry the established GEO Label. Furthermore, the GEO Label is seen as an incentive to data providers. At the moment GEOSS contains a large amount of data and is constantly growing. Taking this into account, a GEO Label could assist in searching by providing users with visual cues of dataset quality and possibly relevance; a GEO Label could effectively stand as a decision support mechanism for dataset selection. Currently our project, GeoViQua, together with EGIDA and ID-03, is undertaking research to define and evaluate the concept of a GEO Label. The development and evaluation process will be carried out in three phases. In phase I we have conducted an online survey (GEO Label Questionnaire) to identify the initial user and producer views on a GEO Label or its potential role. In phase II we will conduct a further study presenting some GEO Label examples that will be based on Phase I. We will elicit feedback on these examples under controlled conditions. In phase III we will create physical prototypes which will be used in a human subject study. The most successful prototypes will then be put forward as potential GEO Label options. At the moment we are in phase I, where we developed an online questionnaire to collect the initial GEO Label requirements and to identify the role that a GEO Label should serve from the user and producer standpoint. The GEO Label Questionnaire consists of generic questions to identify whether users and producers believe a GEO Label is relevant to geospatial data; whether they want a single "one-for-all" label or separate labels that will serve a particular role; the function that would be most relevant for a GEO Label to carry; and the functionality that users and producers would like to see from common rating and review systems they use. To distribute the questionnaire, relevant user and expert groups were contacted at meetings or by email. At this stage we successfully collected over 80 valid responses from geospatial data users and producers. This communication will provide a comprehensive analysis of the survey results, indicating to what extent the users surveyed in Phase I value a GEO Label, and suggesting in what directions a GEO Label may develop. Potential GEO Label examples based on the results of the survey will be presented for use in Phase II.
Targeted metabolomics and medication classification data from participants in the ADNI1 cohort.
St John-Williams, Lisa; Blach, Colette; Toledo, Jon B; Rotroff, Daniel M; Kim, Sungeun; Klavins, Kristaps; Baillie, Rebecca; Han, Xianlin; Mahmoudiandehkordi, Siamak; Jack, John; Massaro, Tyler J; Lucas, Joseph E; Louie, Gregory; Motsinger-Reif, Alison A; Risacher, Shannon L; Saykin, Andrew J; Kastenmüller, Gabi; Arnold, Matthias; Koal, Therese; Moseley, M Arthur; Mangravite, Lara M; Peters, Mette A; Tenenbaum, Jessica D; Thompson, J Will; Kaddurah-Daouk, Rima
2017-10-17
Alzheimer's disease (AD) is the most common neurodegenerative disease, presenting major health and economic challenges that continue to grow. Mechanisms of disease are poorly understood, but significant data point to metabolic defects that might contribute to disease pathogenesis. The Alzheimer Disease Metabolomics Consortium (ADMC), in partnership with the Alzheimer Disease Neuroimaging Initiative (ADNI), is creating a comprehensive biochemical database for AD. Using targeted and non-targeted metabolomics and lipidomics platforms, we are mapping metabolic pathway and network failures across the trajectory of disease. In this report we present quantitative metabolomics data generated on serum from 199 control, 356 mild cognitive impairment and 175 AD subjects enrolled in ADNI1 using the AbsoluteIDQ p180 platform, along with the pipeline for data preprocessing and medication classification for confound correction. The dataset presented here is the first of eight metabolomics datasets being generated for broad biochemical investigation of the AD metabolome. We expect that these collective metabolomics datasets will provide valuable resources for researchers to identify novel molecular mechanisms contributing to AD pathogenesis and disease phenotypes.
Lu, Chenqi; Liu, Xiaoqin; Wang, Lin; Jiang, Ning; Yu, Jun; Zhao, Xiaobo; Hu, Hairong; Zheng, Saihua; Li, Xuelian; Wang, Guiying
2017-01-10
Due to genetic heterogeneity and variable diagnostic criteria, genetic studies of polycystic ovary syndrome are particularly challenging. Furthermore, the lack of sufficiently large cohorts limits the identification of susceptibility genes contributing to polycystic ovary syndrome. Here, we carried out a systematic search of studies deposited in the Gene Expression Omnibus database through August 31, 2016. The present analyses included studies with: 1) patients with polycystic ovary syndrome and normal controls, 2) gene expression profiling of messenger RNA, and 3) sufficient data for our analysis. Ultimately, a total of 9 studies with 13 datasets met the inclusion criteria and were included in the subsequent integrated analyses. Through comprehensive analyses, 13 genetic factors overlapped across all datasets and were identified as significant genes specific to polycystic ovary syndrome. After quality control assessment, six datasets remained. Further gene ontology enrichment and pathway analyses suggested that the differentially expressed genes were mainly enriched in oocyte-related pathways. These findings provide potential molecular markers for the diagnosis and prognosis of polycystic ovary syndrome; in-depth studies of their exact functions and mechanisms in polycystic ovary syndrome are still needed.
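The core of the integrated analysis, intersecting per-dataset lists of differentially expressed genes, can be sketched in a few lines of Python; the gene symbols below are placeholders chosen for illustration, not the 13 factors reported by the study.

```python
from functools import reduce

# Placeholder per-dataset sets of differentially expressed genes
deg_sets = [
    {"CYP17A1", "FSHR", "INSR", "LHCGR", "AMH"},
    {"CYP17A1", "FSHR", "INSR", "DENND1A", "AMH"},
    {"CYP17A1", "FSHR", "INSR", "THADA", "AMH"},
]

# Genes significant in every dataset form the candidate disease-specific set
shared = reduce(set.intersection, deg_sets)
print(sorted(shared))
```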
Carlson, Colin J.; Bond, Alexander L.
2018-01-01
Abstract Background Despite much present-day attention on recently extinct North American bird species, little contemporary research has focused on the Carolina parakeet (Conuropsis carolinensis). The last captive Carolina parakeet died 100 years ago this year, and the species was officially declared extinct in 1920, but it likely persisted in small, isolated populations until at least the 1930s, and perhaps longer. How this once wide-ranging and plentiful species went extinct remains a mystery. Here, we present a georeferenced dataset of Carolina parakeet sightings spanning nearly 400 years by combining both written observations and specimen data. New information Because we include both observations and specimen data, the Carolina parakeet occurrence dataset presented here is the most comprehensive and rigorous dataset on this species available. The dataset includes 861 sightings from 1564 to 1944. Each datapoint includes geographic coordinates, a measurement of uncertainty, detailed information about each sighting, and an assessment of the sighting's validity. Given that this species is so poorly understood, we make these data freely available to facilitate more research on this colorful and charismatic species.
National Water Model: Providing the Nation with Actionable Water Intelligence
NASA Astrophysics Data System (ADS)
Aggett, G. R.; Bates, B.
2017-12-01
The National Water Model (NWM) provides national, street-level detail of water movement through time and space. Operating hourly, this flood of information offers enormous benefits in the form of water resource management, natural disaster preparedness, and the protection of life and property. The Geo-Intelligence Division at the NOAA National Water Center supplies forecasters and decision-makers with timely, actionable water intelligence through the processing of billions of NWM data points every hour. These datasets include current streamflow estimates, short and medium range streamflow forecasts, and many other ancillary datasets. The sheer amount of NWM data produced yields a dataset too large to allow for direct human comprehension. As such, the model output must be post-processed, filtered, and ingested by visualization web apps that use cartographic techniques to bring attention to the areas of highest urgency. This poster illustrates NWM output post-processing and cartographic visualization techniques being developed and employed by the Geo-Intelligence Division at the NOAA National Water Center to provide national actionable water intelligence.
Ten years of maintaining and expanding a microbial genome and metagenome analysis system.
Markowitz, Victor M; Chen, I-Min A; Chu, Ken; Pati, Amrita; Ivanova, Natalia N; Kyrpides, Nikos C
2015-11-01
Launched in March 2005, the Integrated Microbial Genomes (IMG) system is a comprehensive data management system that supports multidimensional comparative analysis of genomic data. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets sequenced at the Joint Genome Institute or provided by scientific users, as well as public genome datasets available at the National Center for Biotechnology Information GenBank sequence data archive. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are integrated into the data warehouse using IMG's data integration toolkits. Microbial genome and metagenome application-specific data marts and user interfaces provide access to different subsets of IMG's data and analysis toolkits. This review article revisits IMG's original aims, highlights key milestones reached by the system during the past 10 years, and discusses the main challenges faced by a rapidly expanding system, in particular the complexity of maintaining such a system in an academic setting with limited budgets and computing and data management infrastructure. Copyright © 2015 Elsevier Ltd. All rights reserved.
A data discovery index for the social sciences
Krämer, Thomas; Klas, Claus-Peter; Hausstein, Brigitte
2018-01-01
This paper describes a novel search index for social and economic research data, one that enables users to search up-to-date references for data holdings in these disciplines. The index can be used for comparative analysis of the publication of datasets in different areas of social science. The core of the index is the da|ra registration agency’s database for social and economic data, which contains high-quality searchable metadata from registered data publishers. Metadata records for research data are harvested from data providers around the world and included in the index. In this paper, we describe the currently available indices of social science datasets and their shortcomings. Next, we describe the motivation behind and the purpose of the data discovery index as a dedicated and curated platform for finding social science research data, and gesisDataSearch, its user interface. Further, we explain the harvesting, filtering and indexing procedure and give usage instructions for the dataset index. Lastly, we show that the index is currently the most comprehensive and most accessible collection of social science data descriptions available. PMID:29633988
A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie
Hanke, Michael; Baumgartner, Florian J.; Ibe, Pierre; Kaule, Falko R.; Pollmann, Stefan; Speck, Oliver; Zinke, Wolf; Stadler, Jörg
2014-01-01
Here we present a high-resolution functional magnetic resonance imaging (fMRI) dataset – 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film (“Forrest Gump”). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, and social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures – from stimulus creation to data analysis. In order to facilitate replicative and derived works, only free and open-source software was utilized. PMID:25977761
Big Data in HEP: A comprehensive use case study
NASA Astrophysics Data System (ADS)
Gutsche, Oliver; Cremonesi, Matteo; Elmer, Peter; Jayatilaka, Bo; Kowalkowski, Jim; Pivarski, Jim; Sehrish, Saba; Mantilla Suárez, Cristina; Svyatkovskiy, Alexey; Tran, Nhan
2017-10-01
Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems collectively called Big Data technologies have emerged to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and promise a fresh look at analysis of very large datasets and could potentially reduce the time-to-physics with increased interactivity. In this talk, we present an active LHC Run 2 analysis, searching for dark matter with the CMS detector, as a testbed for Big Data technologies. We directly compare the traditional NTuple-based analysis with an equivalent analysis using Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the analysis with the official experiment data formats and produce publication-ready physics plots. We will discuss advantages and disadvantages of each approach and give an outlook on further studies needed.
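A minimal sketch of the Spark-side pattern described above, assuming a hypothetical flat event table with invented column names rather than the CMS data format:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dm-search-sketch").getOrCreate()

# Hypothetical flat event table (one row per collision event) standing in
# for the experiment-specific NTuple format.
events = spark.createDataFrame(
    [(1, 250.3, 0.4, 2), (2, 95.1, 2.8, 0), (3, 410.7, 1.1, 1)],
    ["event_id", "met_gev", "jet1_eta", "n_btags"],
)

# The familiar filter-then-histogram NTuple loop, expressed as declarative
# transformations that Spark distributes over the cluster.
selected = (events
            .filter(F.col("met_gev") > 200)           # missing-ET cut
            .filter(F.abs(F.col("jet1_eta")) < 2.4))  # central leading jet
selected.groupBy("n_btags").count().show()
```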
Gesch, Dean B.; Oimoen, Michael J.; Evans, Gayla A.
2014-01-01
The National Elevation Dataset (NED) is the primary elevation data product produced and distributed by the U.S. Geological Survey. The NED provides seamless raster elevation data of the conterminous United States, Alaska, Hawaii, U.S. island territories, Mexico, and Canada. The NED is derived from diverse source datasets that are processed to a specification with consistent resolutions, coordinate system, elevation units, and horizontal and vertical datums. The NED serves as the elevation layer of The National Map, and it provides basic elevation information for earth science studies and mapping applications in the United States and most of North America. An important part of supporting scientific and operational use of the NED is provision of thorough dataset documentation including data quality and accuracy metrics. The focus of this report is on the vertical accuracy of the NED and on comparison of the NED with other similar large-area elevation datasets, namely data from the Shuttle Radar Topography Mission (SRTM) and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER).
Barth, Amy E.; Barnes, Marcia; Francis, David J.; Vaughn, Sharon; York, Mary
2015-01-01
Separate mixed model analyses of variance (ANOVA) were conducted to examine the effect of textual distance on the accuracy and speed of text consistency judgments among adequate and struggling comprehenders across grades 6–12 (n = 1203). Multiple regressions examined whether accuracy in text consistency judgments uniquely accounted for variance in comprehension. Results suggest that there is considerable growth across the middle and high school years, particularly for adequate comprehenders in those text integration processes that maintain local coherence. Accuracy in text consistency judgments accounted for significant unique variance for passage-level, but not sentence-level comprehension, particularly for adequate comprehenders. PMID:26166946
Giambartolomei, Claudia; Vukcevic, Damjan; Schadt, Eric E; Franke, Lude; Hingorani, Aroon D; Wallace, Chris; Plagnol, Vincent
2014-05-01
Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
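Below is a compact Python re-implementation of the core of the published calculation, under the standard assumptions of Wakefield approximate Bayes factors (effect prior N(0, 0.15^2) for quantitative traits) and the method's default priors p1 = p2 = 1e-4, p12 = 1e-5; consult the authors' software for the definitive version.

```python
import numpy as np
from scipy.special import logsumexp

def labf(beta, se, w=0.15 ** 2):
    """Wakefield log approximate Bayes factor per SNP (effect prior N(0, w))."""
    z2 = (np.asarray(beta) / np.asarray(se)) ** 2
    r = w / (w + np.asarray(se) ** 2)
    return 0.5 * (np.log(1.0 - r) + r * z2)

def coloc_pp(b1, se1, b2, se2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of hypotheses H0..H4 for one genomic region."""
    l1, l2 = labf(b1, se1), labf(b2, se2)
    a, b, c = logsumexp(l1), logsumexp(l2), logsumexp(l1 + l2)
    lh = np.array([
        0.0,                                                 # H0: no association
        np.log(p1) + a,                                      # H1: trait 1 only
        np.log(p2) + b,                                      # H2: trait 2 only
        np.log(p1) + np.log(p2) + c + np.log(np.expm1(a + b - c)),  # H3: distinct
        np.log(p12) + c,                                     # H4: shared variant
    ])
    return np.exp(lh - logsumexp(lh))

# Toy region of 100 SNPs where SNP 42 drives both traits
rng = np.random.default_rng(4)
b1, b2 = rng.normal(0, 0.02, 100), rng.normal(0, 0.02, 100)
b1[42], b2[42] = 0.5, 0.4
pp = coloc_pp(b1, np.full(100, 0.05), b2, np.full(100, 0.05))
print("PP(H4, shared causal variant) =", round(pp[4], 3))
```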
Dual African Origins of Global Aedes aegypti s.l. Populations Revealed by Mitochondrial DNA
Moore, Michelle; Sylla, Massamba; Goss, Laura; Burugu, Marion Warigia; Sang, Rosemary; Kamau, Luna W.; Kenya, Eucharia Unoma; Bosio, Chris; Munoz, Maria de Lourdes; Sharakova, Maria; Black, William Cormack
2013-01-01
Background Aedes aegypti is the primary global vector to humans of yellow fever and dengue flaviviruses. Over the past 50 years, many population genetic studies have documented large genetic differences among global populations of this species. These studies initially used morphological polymorphisms, followed later by allozymes, and most recently various molecular genetic markers including microsatellites and mitochondrial markers. In particular, since 2000, fourteen publications and four unpublished datasets have used sequence data from the NADH dehydrogenase subunit 4 mitochondrial gene to compare Ae. aegypti collections, and collectively 95 unique mtDNA haplotypes have been found. Phylogenetic analyses in these many studies consistently resolved two clades, but no comprehensive study of mtDNA haplotypes has been made in Africa, the continent in which the species originated. Methods and Findings ND4 haplotypes were sequenced in 426 Ae. aegypti s.l. from Senegal, West Africa and Kenya, East Africa. Fifteen new haplotypes were discovered in Senegal and seven in Kenya. When added to the 95 published haplotypes and including 6 African Aedes species as outgroups, phylogenetic analyses showed that all but one Senegal haplotype occurred in a basal clade, while most East African haplotypes occurred in a second clade arising from the basal clade. Globally distributed haplotypes occurred in both clades, demonstrating that populations outside Africa consist of mixtures of mosquitoes from both clades. Conclusions Populations of Ae. aegypti outside Africa consist of mosquitoes arising from one of two ancestral clades. One clade is basal and primarily associated with West Africa, while the second arises from the first and contains primarily mosquitoes from East Africa. PMID:23638196
NASA Astrophysics Data System (ADS)
Mercier, Lény; Panfili, Jacques; Paillon, Christelle; N'diaye, Awa; Mouillot, David; Darnaude, Audrey M.
2011-05-01
Accurate knowledge of fish age and growth is crucial for species conservation and management of exploited marine stocks. In exploited species, age estimation based on otolith reading is routinely used for building growth curves that are used to implement fishery management models. However, the universal fit of the von Bertalanffy growth function (VBGF) on data from commercial landings can lead to uncertainty in growth parameter inference, preventing accurate comparison of growth-based life history traits between fish populations. In the present paper, we used a comprehensive annual sample of wild gilthead seabream (Sparus aurata L.) in the Gulf of Lions (France, NW Mediterranean) to test a methodology improving growth modelling for exploited fish populations. After validating the timing of otolith annual increment formation for all life stages, a comprehensive set of growth models (including the VBGF) was fitted to the obtained age-length data, used as a whole or sub-divided between group 0 individuals and those coming from commercial landings (ages 1-6). Comparisons of growth model accuracy based on the Akaike Information Criterion allowed assessment of the best model for each dataset and, when no model correctly fitted the data, multi-model inference (MMI) based on model averaging was carried out. The results provided evidence that growth parameters inferred with the VBGF must be used with great caution. The VBGF turned out to be among the least accurate models for growth prediction irrespective of the dataset, and its fits to the whole population, the juveniles, and the adults provided different growth parameters. The best models for growth prediction were the Tanaka model, for group 0 juveniles, and the MMI, for the older fish, confirming that growth differs substantially between juveniles and adults. All asymptotic models failed to correctly describe the growth of adult S. aurata, probably because of the poor representation of old individuals in the dataset. Multi-model inference associated with separate analysis of juveniles and adult fish is therefore advised to obtain objective estimations of growth parameters when sampling cannot be corrected towards older fish.
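As a sketch of the model-comparison workflow (not the paper's code), the following Python snippet fits the VBGF, L(t) = Linf(1 - e^(-K(t - t0))), to invented age-length pairs and computes a least-squares AIC that could likewise be computed for each competing growth model.

```python
import numpy as np
from scipy.optimize import curve_fit

def vbgf(t, linf, k, t0):
    """von Bertalanffy growth function: expected length at age t."""
    return linf * (1.0 - np.exp(-k * (t - t0)))

def aic_ls(y, y_hat, n_params):
    """AIC for a least-squares fit assuming Gaussian residuals."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * n_params

# Invented age-length pairs (age in years, length in cm)
age = np.array([0.5, 1, 1, 2, 2, 3, 3, 4, 5, 6], dtype=float)
length = np.array([12, 18, 20, 27, 29, 33, 35, 38, 41, 43], dtype=float)

popt, _ = curve_fit(vbgf, age, length, p0=[50.0, 0.3, -0.5])
print("Linf=%.1f  K=%.2f  t0=%.2f" % tuple(popt))
print("AIC(VBGF) =", round(aic_ls(length, vbgf(age, *popt), 3), 1))
# Repeating the fit and AIC for Gompertz, logistic, Tanaka, etc. and
# converting AICs to Akaike weights yields the multi-model average.
```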
A PDB-wide, evolution-based assessment of protein-protein interfaces.
Baskaran, Kumaran; Duarte, Jose M; Biyani, Nikhil; Bliven, Spencer; Capitani, Guido
2014-10-18
Thanks to the growth in sequence and structure databases, more than 50 million sequences are now available in UniProt and 100,000 structures in the PDB. Rich information about protein-protein interfaces can be obtained by a comprehensive study of protein contacts in the PDB, their sequence conservation and geometric features. An automated computational pipeline was developed to run our Evolutionary Protein-Protein Interface Classifier (EPPIC) software on the entire PDB and store the results in a relational database, currently containing > 800,000 interfaces. This allows the analysis of interface data on a PDB-wide scale. Two large benchmark datasets of biological interfaces and crystal contacts, each containing about 3000 entries, were automatically generated based on criteria thought to be strong indicators of interface type. The BioMany set of biological interfaces includes NMR dimers solved as crystal structures and interfaces that are preserved across diverse crystal forms, as catalogued by the Protein Common Interface Database (ProtCID) from Xu and Dunbrack. The second dataset, XtalMany, is derived from interfaces that would lead to infinite assemblies and are therefore crystal contacts. BioMany and XtalMany were used to benchmark the EPPIC approach. The performance of EPPIC was also compared to classifications from the Protein Interfaces, Surfaces, and Assemblies (PISA) program on a PDB-wide scale, finding that the two approaches give the same call in about 88% of PDB interfaces. By comparing our safest predictions to the PDB author annotations, we provide a lower-bound estimate of the error rate of biological unit annotations in the PDB. Additionally, we developed a PyMOL plugin for direct download and easy visualization of EPPIC interfaces for any PDB entry. Both the datasets and the PyMOL plugin are available at http://www.eppic-web.org/ewui/\\#downloads. Our computational pipeline allows us to analyze protein-protein contacts and their sequence conservation across the entire PDB. Two new benchmark datasets are provided, which are over an order of magnitude larger than existing manually curated ones. These tools enable the comprehensive study of several aspects of protein-protein contacts in the PDB and represent a basis for future, even larger scale studies of protein-protein interactions.
ERIC Educational Resources Information Center
Polanin, Joshua R.; Wilson, Sandra Jo
2014-01-01
The purpose of this project is to demonstrate the practical methods developed to utilize a dataset consisting of both multivariate and multilevel effect size data. The context for this project is a large-scale meta-analytic review of the predictors of academic achievement. This project is guided by three primary research questions: (1) How do we…
Rachel Riemann; Ty Wilson; Andrew Lister
2012-01-01
We recently developed an assessment protocol that provides information on the magnitude, location, frequency and type of error in geospatial datasets of continuous variables (Riemann et al. 2010). The protocol consists of a suite of assessment metrics which include an examination of data distributions and areas estimates, at several scales, examining each in the form...
Updated population metadata for United States historical climatology network stations
Owen, T.W.; Gallo, K.P.
2000-01-01
The United States Historical Climatology Network (HCN) serial temperature dataset comprises 1221 high-quality, long-term climate observing stations. The HCN dataset is available in several versions, one of which includes population-based temperature modifications to adjust urban temperatures for the "heat-island" effect. Unfortunately, the decennial population metadata file is not complete, as values are missing for 17.6% of the 12,210 population values associated with the 1221 individual stations during the 1900-90 interval. Retrospective grid-based populations, within a fixed distance of each HCN station, were estimated through the use of a gridded population density dataset and historically available U.S. Census county data. The grid-based populations for the HCN stations provide values derived from a consistent methodology, in contrast to the current HCN populations, which can vary as definitions of the area associated with a city change over time. The use of grid-based populations may, at a minimum, be appropriate for augmenting populations for HCN climate stations that lack any population data, and is recommended when consistent and complete population data are required. The recommended urban temperature adjustments based on the HCN and grid-based methods of estimating station population can be significantly different for individual stations within the HCN dataset.
NASA Astrophysics Data System (ADS)
Khan, S.; Salas, F.; Sampson, K. M.; Read, L. K.; Cosgrove, B.; Li, Z.; Gochis, D. J.
2017-12-01
The representation of inland surface water bodies in distributed hydrologic models at the continental scale is a challenge. The National Water Model (NWM) utilizes the National Hydrography Dataset Plus Version 2 (NHDPlusV2) "waterbody" dataset to represent lakes and reservoirs. The "waterbody" layer is a comprehensive dataset that represents surface water bodies using common features like lakes, ponds, reservoirs, estuaries, playas and swamps/marshes. However, a major issue that remains unresolved even in the latest revision of NHDPlusV2 is the inconsistency in waterbody digitization and delineation errors. Manually correcting the waterbody polygons becomes tedious and quickly impossible for continental-scale hydrologic models such as the NWM. In this study, we improved the spatial representation of 6,802 lakes and reservoirs by analyzing 379,110 waterbodies in the contiguous United States (excluding the Laurentian Great Lakes). We performed a step-by-step process that integrates a set of geospatial analyses to identify, track, and correct the extent of lake and reservoir features that are larger than 0.75 km2. The following assumptions were applied while developing the new dataset: a) lakes and reservoirs cannot directly feed into each other; b) each waterbody must have one outlet; and c) a single lake or reservoir feature cannot have multiple parts. The majority of the NHDPlusV2 waterbody features in the original dataset are delineated correctly; however, approximately 3% of the lake and reservoir polygons were found to contain topological errors and were corrected accordingly. It is important to fix these digitizing errors because the waterbody features are closely linked to the river topology. This new waterbody dataset will ensure that model-simulated water is directed into and through the lakes and reservoirs in a manner that supports the NWM code base and assumptions. The improved dataset will facilitate more effective integration of lakes and reservoirs, with correct spatial features, into the updated NWM.
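A hedged geopandas sketch of two of the geospatial rules described above, assuming a hypothetical input file name and layer; the actual NWM processing chain involves additional topology checks against the river network.

```python
import geopandas as gpd

# Hypothetical NHDPlusV2-style waterbody layer (path is illustrative)
wb = gpd.read_file("nhdplusv2_waterbody.gpkg")

# Work in an equal-area projection (CONUS Albers) so areas are meaningful
wb = wb.to_crs("EPSG:5070")

# Rule: keep lake/reservoir polygons larger than 0.75 km^2 (areas in m^2)
wb = wb[wb.geometry.area > 0.75e6]

# Rule: a feature cannot have multiple parts -- split multipolygons and
# keep only the largest part per original feature
parts = wb.explode(index_parts=False)
parts["part_area"] = parts.geometry.area
largest = parts.sort_values("part_area").groupby(level=0).tail(1)

print(len(wb), "features over 0.75 km^2;", len(largest), "after de-multiparting")
```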
Copes, Lynn E.; Lucas, Lynn M.; Thostenson, James O.; Hoekstra, Hopi E.; Boyer, Doug M.
2016-01-01
A dataset of high-resolution microCT scans of primate skulls (crania and mandibles) and certain postcranial elements was collected to address questions about primate skull morphology. The sample consists of 489 scans taken from 431 specimens, representing 59 species from most primate families. These data have transformative reuse potential, as such datasets are necessary for conducting high-power research into primate evolution but require significant time and funding to collect. Similar datasets were previously only available to select research groups across the world. The physical specimens are vouchered at Harvard’s Museum of Comparative Zoology. The data collection took place at the Center for Nanoscale Systems at Harvard. The dataset is archived on MorphoSource.org. Though this is the largest high-fidelity comparative dataset yet available, its provisioning on a web archive that allows unlimited researcher contributions promises a future with vastly increased digital collections available at researchers’ fingertips. PMID:26836025
Federal Register 2010, 2011, 2012, 2013, 2014
2010-10-18
... the MSRB's Real-time Transaction Reporting System (``RTRS''). The proposed rule change consists of fee changes to the MSRB's Real-Time Transaction Price Service and Comprehensive Transaction Price Service of... Consisting of Fee Changes to Its Real-Time Transaction Price Service and Comprehensive Transaction Price...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-11-26
... relating to the MSRB's Real-time Transaction Reporting System (``RTRS''). The proposed rule change was... Change Consisting of Fee Changes to Its Real-Time Transaction Price Service and Comprehensive Transaction... change consists of fee changes to the MSRB's Real-Time Transaction Price Service and Comprehensive...
ERIC Educational Resources Information Center
Jacobucci, Leanne; Richert, Judy; Ronan, Susan; Tanis, Ariana
This report describes a program for improving inconsistent reading comprehension. The targeted population consisted of first, third, and fifth grade classrooms in a diverse middle class community located in Illinois. The problems of low academic achievement were documented through teacher observation, reading comprehension test scores, and low…
Television Literacy: Comprehension of Program Content Using Closed Captions for the Deaf.
ERIC Educational Resources Information Center
Lewis, Margaret S. Jelinek; Jackson, Dorothy W.
2001-01-01
This study assessed deaf and hearing students' comprehension of captions with and without visuals/video. Results indicate that reading grade level is highly correlated with caption comprehension test scores. Comprehension of the deaf students was consistently below that of hearing students. The captioned video produced significantly better…
Ronan, Lisa; Voets, Natalie L.; Hough, Morgan; Mackay, Clare; Roberts, Neil; Suckling, John; Bullmore, Edward; James, Anthony; Fletcher, Paul C.
2012-01-01
Several studies have sought to test the neurodevelopmental hypothesis of schizophrenia through analysis of cortical gyrification. However, to date, results have been inconsistent. A possible reason for this is that gyrification measures at the centimeter scale may be insensitive to subtle morphological changes at smaller scales. The lack of consistency in such studies may impede further interpretation of cortical morphology as an aid to understanding the etiology of schizophrenia. In this study we developed a new approach, examining whether millimeter-scale measures of cortical curvature are sensitive to changes in fundamental geometric properties of the cortical surface in schizophrenia. We determined and compared millimeter-scale and centimeter-scale curvature in three separate case–control studies; specifically two adult groups and one adolescent group. The datasets were of different sizes, with different age ranges and gender distributions. The results clearly show that millimeter-scale intrinsic curvature measures were more robust and consistent in identifying reduced gyrification in patients across all three datasets. To further interpret this finding we quantified the ratio of expansion in the upper and lower cortical layers. The results suggest that reduced gyrification in schizophrenia is driven by a reduction in the expansion of upper cortical layers. This may plausibly be related to a reduction in short-range connectivity. PMID:22743195
Near-real-time cheatgrass percent cover in the Northern Great Basin, USA, 2015
Boyte, Stephen; Wylie, Bruce K.
2016-01-01
Cheatgrass (Bromus tectorum L.) dramatically changes shrub steppe ecosystems in the Northern Great Basin, United States. Current-season cheatgrass location and percent cover are difficult to estimate rapidly. We explain the development of a near-real-time cheatgrass percent cover dataset and map in the Northern Great Basin for the current year (2015), display the current year’s map, provide analysis of the map, and provide a website link to download the map (as a PDF) and the associated dataset. The near-real-time cheatgrass percent cover dataset and map were consistent with non-expedited, historical cheatgrass percent cover datasets and maps. Having cheatgrass maps available mid-summer can help land managers, policy makers, and Geographic Information Systems personnel as they work to protect socially relevant areas such as critical wildlife habitats.
NHDPlusHR: A national geospatial framework for surface-water information
Viger, Roland; Rea, Alan H.; Simley, Jeffrey D.; Hanson, Karen M.
2016-01-01
The U.S. Geological Survey is developing a new geospatial hydrographic framework for the United States, called the National Hydrography Dataset Plus High Resolution (NHDPlusHR), that integrates a diversity of the best-available information, robustly supports ongoing dataset improvements, enables hydrographic generalization to derive alternate representations of the network while maintaining feature identity, and supports modern scientific computing and Internet accessibility needs. This framework is based on the High Resolution National Hydrography Dataset, the Watershed Boundaries Dataset, and elevation from the 3-D Elevation Program, and will provide an authoritative, high precision, and attribute-rich geospatial framework for surface-water information for the United States. Using this common geospatial framework will provide a consistent basis for indexing water information in the United States, eliminate redundancy, and harmonize access to, and exchange of water information.
Embedded sparse representation of fMRI data via group-wise dictionary optimization
NASA Astrophysics Data System (ADS)
Zhu, Dajiang; Lin, Binbin; Faskowitz, Joshua; Ye, Jieping; Thompson, Paul M.
2016-03-01
Sparse learning enables dimension reduction and efficient modeling of high dimensional signals and images, but it may need to be tailored to best suit specific applications and datasets. Here we used sparse learning to efficiently represent functional magnetic resonance imaging (fMRI) data from the human brain. We propose a novel embedded sparse representation (ESR), to identify the most consistent dictionary atoms across different brain datasets via an iterative group-wise dictionary optimization procedure. In this framework, we introduced additional criteria to make the learned dictionary atoms more consistent across different subjects. We successfully identified four common dictionary atoms that follow the external task stimuli with very high accuracy. After projecting the corresponding coefficient vectors back into the 3-D brain volume space, the spatial patterns are also consistent with traditional fMRI analysis results. Our framework reveals common features of brain activation in a population, as a new, efficient fMRI analysis method.
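The per-dataset building block of such an approach, learning a sparse temporal dictionary whose coefficient maps act as spatial patterns, can be sketched with scikit-learn on synthetic data; the group-wise, cross-subject consistency iteration that defines ESR is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Toy stand-in for one subject's fMRI run: two task-locked time courses
# mixed into 1000 "voxels" plus noise (200 timepoints).
rng = np.random.default_rng(5)
t = np.arange(200)
task = np.vstack([np.sin(2 * np.pi * t / 40), (t % 50 < 25).astype(float)])
mixing = rng.normal(0, 1, (2, 1000))
X = task.T @ mixing + rng.normal(0, 0.5, (200, 1000))

# Learn temporal dictionary atoms; each voxel gets a sparse code vector,
# and each atom's codes form a spatial map in the volume space.
dico = MiniBatchDictionaryLearning(n_components=10, alpha=1.0, random_state=0)
codes = dico.fit_transform(X.T)   # (n_voxels, n_atoms) sparse spatial maps
atoms = dico.components_          # (n_atoms, n_timepoints) time courses
print(atoms.shape, codes.shape)
```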
NASA Astrophysics Data System (ADS)
Tamminen, J.; Sofieva, V.; Kyrölä, E.; Laine, M.; Degenstein, D. A.; Bourassa, A. E.; Roth, C.; Zawada, D.; Weber, M.; Rozanov, A.; Rahpoe, N.; Stiller, G. P.; Laeng, A.; von Clarmann, T.; Walker, K. A.; Sheese, P.; Hubert, D.; Van Roozendael, M.; Zehner, C.; Damadeo, R. P.; Zawodny, J. M.; Kramarova, N. A.; Bhartia, P. K.
2017-12-01
We present a merged dataset of ozone profiles from several satellite instruments: SAGE II on ERBS, GOMOS, SCIAMACHY and MIPAS on Envisat, OSIRIS on Odin, ACE-FTS on SCISAT, and OMPS on Suomi-NPP. The merged dataset is created in the framework of the European Space Agency Climate Change Initiative (Ozone_cci) with the aim of analyzing stratospheric ozone trends. For the merged dataset, we used the latest versions of the original ozone datasets. The datasets from the individual instruments have been extensively validated and inter-compared; only those datasets which are in good agreement and do not exhibit significant drifts with respect to collocated ground-based observations and with respect to each other are used for merging. The long-term SAGE-CCI-OMPS dataset is created by computing and merging deseasonalized anomalies from the individual instruments. The merged SAGE-CCI-OMPS dataset consists of deseasonalized anomalies of ozone in 10° latitude bands from 90°S to 90°N and from 10 to 50 km in steps of 1 km, covering the period from October 1984 to July 2016. This newly created dataset is used for evaluating ozone trends in the stratosphere through multiple linear regression. Negative ozone trends in the upper stratosphere are observed before 1997 and positive trends are found after 1997. The trends are statistically significant at mid-latitudes in the upper stratosphere and indicate ozone recovery, as expected from the decrease of stratospheric halogens that started in the middle of the 1990s.
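A minimal Python sketch of the anomaly-then-trend logic on a synthetic monthly series: deseasonalize by removing long-term monthly means, then fit a trend. The real analysis uses multiple linear regression with additional explanatory terms; the single-slope fit below is a simplification.

```python
import numpy as np

def deseasonalize(values, months):
    """Anomalies relative to the long-term mean of each calendar month."""
    anom = np.array(values, dtype=float)
    for m in range(1, 13):
        anom[months == m] -= anom[months == m].mean()
    return anom

# Synthetic 30-year monthly series: seasonal cycle plus a weak negative trend
n = 12 * 30
months = np.arange(n) % 12 + 1
years = np.arange(n) / 12.0
rng = np.random.default_rng(6)
series = 5.0 + 0.5 * np.sin(2 * np.pi * years) - 0.01 * years \
         + rng.normal(0, 0.1, n)

anom = deseasonalize(series, months)
slope, _ = np.polyfit(years, anom, 1)  # simple linear trend term
print("fitted trend: %.3f per decade (true: -0.100)" % (slope * 10.0))
```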
Siebert, Tobias; Leichsenring, Kay; Rode, Christian; Wick, Carolin; Stutzig, Norman; Schubert, Harald; Blickhan, Reinhard; Böl, Markus
2015-01-01
The vastly increasing number of neuro-muscular simulation studies (with increasing numbers of muscles used per simulation) is in sharp contrast to a narrow database of necessary muscle parameters. Simulation results depend heavily on rough parameter estimates often obtained by scaling of one muscle parameter set. However, in vivo muscles differ in their individual properties and architecture. Here we provide a comprehensive dataset of dynamic (n = 6 per muscle) and geometric (three-dimensional architecture, n = 3 per muscle) muscle properties of the rabbit calf muscles gastrocnemius, plantaris, and soleus. For completeness we provide the dynamic muscle properties for further important shank muscles (flexor digitorum longus, extensor digitorum longus, and tibialis anterior; n = 1 per muscle). Maximum shortening velocity (normalized to optimal fiber length) of the gastrocnemius is about twice that of soleus, while plantaris showed an intermediate value. The force-velocity relation is similar for gastrocnemius and plantaris but is much more bent for the soleus. Although the muscles vary greatly in their three-dimensional architecture their mean pennation angle and normalized force-length relationships are almost similar. Forces of the muscles were enhanced in the isometric phase following stretching and were depressed following shortening compared to the corresponding isometric forces. While the enhancement was independent of the ramp velocity, the depression was inversely related to the ramp velocity. The lowest effect strength for soleus supports the idea that these effects adapt to muscle function. The careful acquisition of typical dynamical parameters (e.g. force-length and force-velocity relations, force elongation relations of passive components), enhancement and depression effects, and 3D muscle architecture of calf muscles provides valuable comprehensive datasets for e.g. simulations with neuro-muscular models, development of more realistic muscle models, or simulation of muscle packages. PMID:26114955
Chen, Wen; Zhang, Xuan; Li, Jing; Huang, Shulan; Xiang, Shuanglin; Hu, Xiang; Liu, Changning
2018-05-09
Zebrafish is a fully developed model system for studying developmental processes and human disease. Recent deep-sequencing studies have discovered a large number of long non-coding RNAs (lncRNAs) in zebrafish, yet only a few of them have been functionally characterized. How to leverage the mature zebrafish system to deeply investigate lncRNA function and conservation is therefore an intriguing question. We systematically collected and analyzed a series of zebrafish RNA-seq data and combined them with resources from known databases and the literature. As a result, we obtained by far the most complete dataset of zebrafish lncRNAs, containing 13,604 lncRNA genes (21,128 transcripts) in total. Based on this dataset, we constructed and analyzed a co-expression network of zebrafish coding and lncRNA genes and used it to predict Gene Ontology (GO) and KEGG annotations for lncRNAs. Meanwhile, we performed a conservation analysis of zebrafish lncRNAs, identifying 1828 conserved zebrafish lncRNA genes (1890 transcripts) that have putative mammalian orthologs. We also found that zebrafish lncRNAs play important roles in regulating the development and function of the nervous system; these conserved lncRNAs show significant sequence and functional conservation with their mammalian counterparts. By integrative data analysis and construction of a coding-lncRNA gene co-expression network, we obtained the most comprehensive dataset of zebrafish lncRNAs to date, together with systematic annotations and comprehensive analyses of function and conservation. Our study provides a reliable zebrafish-based platform for deeply exploring lncRNA function and mechanism, as well as the lncRNA commonality between zebrafish and human.
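As a toy illustration of co-expression-based annotation transfer of the sort described above (hypothetical data; not the authors' pipeline), a lncRNA can inherit the GO terms of its most strongly co-expressed coding genes:

```python
import numpy as np

rng = np.random.default_rng(5)
coding_expr = rng.normal(size=(5, 30))            # 5 coding genes x 30 samples
lnc_expr = coding_expr[0] + 0.2 * rng.normal(size=30)   # lncRNA tracks gene 0
go_terms = ["neuron development", "cell cycle", "immune response",
            "cell cycle", "metabolism"]

# Correlate the lncRNA with every coding gene, keep the top partners,
# and transfer their GO annotations ("guilt by association").
r = [np.corrcoef(lnc_expr, g)[0, 1] for g in coding_expr]
partners = np.argsort(r)[::-1][:2]
print("predicted GO terms:", {go_terms[i] for i in partners})
```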
NASA Astrophysics Data System (ADS)
Ma, Yingzhao; Yang, Yuan; Han, Zhongying; Tang, Guoqiang; Maguire, Lane; Chu, Zhigang; Hong, Yang
2018-01-01
The objective of this study is to comprehensively evaluate the new Ensemble Multi-Satellite Precipitation Dataset using the Dynamic Bayesian Model Averaging scheme (EMSPD-DBMA) at daily and 0.25° scales from 2001 to 2015 over the Tibetan Plateau (TP). Error analysis against gauge observations revealed that EMSPD-DBMA captured the spatiotemporal pattern of daily precipitation with an acceptable Correlation Coefficient (CC) of 0.53 and a Relative Bias (RB) of -8.28%. Moreover, EMSPD-DBMA outperformed IMERG and GSMaP-MVK in almost all metrics in the summers of 2014 and 2015, with the lowest RB and Root Mean Square Error (RMSE) values of -2.88% and 8.01 mm/d, respectively. It also better reproduced the Probability Density Function (PDF) of daily rainfall amounts and estimated moderate and heavy rainfall better than both IMERG and GSMaP-MVK. Further, hydrological evaluation with the Coupled Routing and Excess STorage (CREST) model in the Upper Yangtze River region indicated that the EMSPD-DBMA-forced simulation showed satisfactory hydrological performance for streamflow prediction, with Nash-Sutcliffe coefficient of Efficiency (NSE) values of 0.82 and 0.58, compared with the gauge-forced simulation (0.88 and 0.60) for the calibration and validation periods, respectively. EMSPD-DBMA also fitted peak flows better than the new Multi-Source Weighted-Ensemble Precipitation Version 2 (MSWEP V2) product, indicating a promising prospect of hydrological utility for the ensemble satellite precipitation data. This study is among the first comprehensive evaluations of blended multi-satellite precipitation data across the TP and should help improve the DBMA algorithm in regions with complex terrain.
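For reference, the evaluation metrics quoted above (CC, RB, RMSE, NSE) can be computed as in the following generic sketch; the array names and sample values are invented, not the study's code:

```python
import numpy as np

def evaluate(sat, gauge):
    """Compare satellite estimates against gauge observations."""
    sat, gauge = np.asarray(sat, float), np.asarray(gauge, float)
    cc = np.corrcoef(sat, gauge)[0, 1]                     # correlation coefficient
    rb = 100.0 * (sat.sum() - gauge.sum()) / gauge.sum()   # relative bias, %
    rmse = np.sqrt(np.mean((sat - gauge) ** 2))            # root mean square error
    nse = 1.0 - np.sum((sat - gauge) ** 2) / np.sum((gauge - gauge.mean()) ** 2)
    return cc, rb, rmse, nse

cc, rb, rmse, nse = evaluate([2.1, 0.0, 5.3, 1.2], [2.4, 0.1, 4.8, 1.0])
print(f"CC={cc:.2f}  RB={rb:.2f}%  RMSE={rmse:.2f} mm/d  NSE={nse:.2f}")
```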
Trainor, Patrick J; DeFilippis, Andrew P; Rai, Shesh N
2017-06-21
Statistical classification is a critical component of utilizing metabolomics data for examining the molecular determinants of phenotypes. Despite this, a comprehensive and rigorous evaluation of the accuracy of classification techniques for phenotype discrimination given metabolomics data has not been conducted. We conducted such an evaluation using both simulated and real metabolomics datasets, comparing Partial Least Squares-Discriminant Analysis (PLS-DA), Sparse PLS-DA, Random Forests, Support Vector Machines (SVM), Artificial Neural Networks, k-Nearest Neighbors (k-NN), and Naïve Bayes classification techniques for discrimination. We evaluated the techniques on simulated data generated to mimic global untargeted metabolomics data by incorporating realistic block-wise correlation and partial correlation structures for mimicking the correlations and metabolite clustering generated by biological processes. Over the simulation studies, covariance structures, means, and effect sizes were stochastically varied to provide consistent estimates of classifier performance over a wide range of possible scenarios. The effects of the presence of non-normal error distributions, the introduction of biological and technical outliers, unbalanced phenotype allocation, missing values due to abundances below a limit of detection, and the effect of prior-significance filtering (dimension reduction) were evaluated via simulation. In each simulation, classifier parameters, such as the number of hidden nodes in a Neural Network, were optimized by cross-validation to minimize the probability of detecting spurious results due to poorly tuned classifiers. Classifier performance was then evaluated using real metabolomics datasets of varying sample medium, sample size, and experimental design. We report that in the most realistic simulation studies that incorporated non-normal error distributions, unbalanced phenotype allocation, outliers, missing values, and dimension reduction, classifier performance (least to greatest error) was ranked as follows: SVM, Random Forest, Naïve Bayes, sPLS-DA, Neural Networks, PLS-DA, and k-NN. When non-normal error distributions were introduced, the performance of the PLS-DA and k-NN classifiers deteriorated further relative to the remaining techniques. Over the real datasets, a trend of better performance by the SVM and Random Forest classifiers was observed.
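A minimal sketch of this kind of classifier benchmark, using scikit-learn with synthetic data as a stand-in for a metabolomics matrix; the generator settings and classifier choices below are illustrative assumptions, not the authors' simulation design:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 120 samples x 200 features, 20 informative.
X, y = make_classification(n_samples=120, n_features=200, n_informative=20,
                           random_state=0)

classifiers = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
    print(f"{name:>13}: mean accuracy {scores.mean():.3f}")
```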
NASA Astrophysics Data System (ADS)
Tsontos, V. M.; Arms, S. C.; Thompson, C. K.; Quach, N.; Lam, T.
2016-12-01
Earth science applications increasingly rely on the integration of multivariate data from diverse observational platforms. Whether for satellite mission cal/val, science, or decision support, the coupling of remote sensing and in-situ field data is integral also to oceanographic workflows. This has prompted archives such as the PO.DAAC, NASA's physical oceanographic data archive, which historically has had a remote sensing focus, to adapt to better accommodate complex field campaign datasets. However, the inherent heterogeneity of in-situ datasets and their variable adherence to meta/data standards pose a significant impediment to interoperability, a problem originating early in the data lifecycle and significantly impacting the stewardship and usability of these data long-term. Here we introduce a new initiative underway at PO.DAAC that seeks to catalyze efforts to address these challenges. It involves the enhancement and integration of available high TRL (Technology Readiness Level) components for improved interoperability and support of in-situ data, with a focus on a novel yet representative class of oceanographic field data: data from electronic tags deployed on a variety of marine species as biological sampling platforms in support of fisheries management and ocean observation efforts. This project seeks to demonstrate, deliver, and ultimately sustain operationally a reusable and accessible set of tools to: 1) mediate reconciliation of heterogeneous source data into a tractable number of standardized formats consistent with earth science data standards; 2) harmonize existing metadata models for satellite and field datasets; 3) demonstrate the value added of integrated data access via a range of available tools and services hosted at the PO.DAAC, including a web-based visualization tool for comprehensive mapping of satellite and in-situ data. An innovative part of our project plan involves partnering with the leading electronic tag manufacturer to promote the adoption of appropriate data standards in their processing software. The proposed project thus adopts a model lifecycle approach complemented by broadly applicable technologies to address key data management and interoperability issues for in-situ data.
James Wickham; Collin Homer; James Vogelmann; Alexa McKerrow; Rick Mueler; Nate Herold; John Coulston
2014-01-01
The Multi-Resolution Land Characteristics (MRLC) Consortium demonstrates the national benefits of USA Federal collaboration. Starting in the mid-1990s as a small group with the straightforward goal of compiling a comprehensive national Landsat dataset that could be used to meet agencies' needs, MRLC has grown into a group of 10 USA Federal Agencies that coordinate the...
Soft Clustering Criterion Functions for Partitional Document Clustering
2004-05-26
...in the cluster that it already belongs to. The refinement phase ends as soon as we perform an iteration in which no documents moved between clusters. ... it with the one obtained by the hard criterion functions. We present a comprehensive experimental evaluation involving twelve different datasets.
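The surviving fragment describes an incremental refinement phase that terminates once an iteration moves no documents between clusters. A generic sketch of such a loop follows; cosine similarity to the cluster centroid is one plausible criterion chosen for illustration, whereas the paper studies several soft criterion functions:

```python
import numpy as np

def refine(docs, assign, k, max_iter=100):
    """docs: row-normalized document vectors; assign: initial cluster labels."""
    for _ in range(max_iter):
        # Recompute centroids (fall back to a document row if a cluster empties).
        centroids = np.array([
            docs[assign == j].mean(axis=0) if np.any(assign == j) else docs[j]
            for j in range(k)
        ])
        best = (docs @ centroids.T).argmax(axis=1)   # most similar centroid
        if np.array_equal(best, assign):             # no document moved: stop
            break
        assign = best
    return assign

rng = np.random.default_rng(6)
docs = rng.random((20, 5))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
print(refine(docs, rng.integers(0, 2, 20), k=2))
```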
Advanced Multidimensional Separations in Mass Spectrometry: Navigating the Big Data Deluge
May, Jody C.; McLean, John A.
2017-01-01
Hybrid analytical instruments constructed around mass spectrometry (MS) are becoming preferred techniques for addressing many grand challenges in science and medicine. From the omics sciences to drug discovery and synthetic biology, multidimensional separations based on MS provide the high peak capacity and high measurement throughput necessary to obtain large-scale measurements from which systems-level information is inferred. In this review, we describe multidimensional MS configurations as technologies which are big data drivers and discuss some new and emerging strategies for mining information from large-scale datasets. A discussion is included on the information content which can be obtained from individual dimensions, as well as the unique information which can be derived by comparing different levels of data. Finally, we discuss some emerging data visualization strategies that seek to make high-dimensional datasets both accessible and comprehensible. PMID:27306312
Network Anomaly Detection Based on Wavelet Analysis
NASA Astrophysics Data System (ADS)
Lu, Wei; Ghorbani, Ali A.
2008-12-01
Signal processing techniques have recently been applied to analyzing and detecting network anomalies due to their potential to find novel or unknown intrusions. In this paper, we propose a new network signal modelling technique for detecting network anomalies, combining wavelet approximation and system identification theory. To characterize network traffic behaviors, we present fifteen features and use them as the input signals in our system. We then evaluate our approach with the 1999 DARPA intrusion detection dataset and conduct a comprehensive analysis of the intrusions in the dataset. Evaluation results show that the approach achieves high detection rates in terms of both attack instances and attack types. Furthermore, we conduct a full day's evaluation in a real large-scale WiFi ISP network, where five attack types were successfully detected from over 30 million flows.
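A minimal sketch of the wavelet-approximation idea on a single traffic feature: keep the coarse approximation as a baseline and flag large residuals. The thresholding rule here is an illustrative stand-in for the paper's system-identification model, and the injected burst is synthetic:

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(1)
flows = rng.poisson(100, 512).astype(float)   # e.g. flows per second
flows[300:310] += 400                         # injected burst "attack"

# Zero the detail coefficients, reconstruct the slowly varying baseline.
coeffs = pywt.wavedec(flows, "db4", level=4)
coeffs[1:] = [np.zeros_like(c) for c in coeffs[1:]]
baseline = pywt.waverec(coeffs, "db4")[: len(flows)]

residual = flows - baseline
alarm = np.abs(residual) > 3 * residual.std()   # simple 3-sigma rule
print("anomalous seconds:", np.flatnonzero(alarm))
```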
A large dataset of protein dynamics in the mammalian heart proteome.
Lau, Edward; Cao, Quan; Ng, Dominic C M; Bleakley, Brian J; Dincer, T Umut; Bot, Brian M; Wang, Ding; Liem, David A; Lam, Maggie P Y; Ge, Junbo; Ping, Peipei
2016-03-15
Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding of the underlying mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegeneration. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-life of 3,228 and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems.
Miao, Zhichao; Westhof, Eric
2016-07-08
RBscore&NBench combines a web server, RBscore, and a database, NBench. RBscore predicts RNA-/DNA-binding residues in proteins and visualizes the prediction scores and features on protein structures. The scoring scheme of RBscore directly links feature values to nucleic acid binding probabilities and illustrates the nucleic acid binding energy funnel on the protein surface. To avoid biases from the dataset, the binding site definition and the assessment metric, we compared RBscore with 18 web servers and 3 stand-alone programs on 41 datasets, which demonstrated the high and stable accuracy of RBscore. A comprehensive comparison led us to develop a benchmark database named NBench. The web server is available at: http://ahsoka.u-strasbg.fr/rbscorenbench/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Xu, Huayong; Yu, Hui; Tu, Kang; Shi, Qianqian; Wei, Chaochun; Li, Yuan-Yuan; Li, Yi-Xue
2013-01-01
We are witnessing rapid progress in the development of methodologies for building combinatorial gene regulatory networks involving both TFs (Transcription Factors) and miRNAs (microRNAs). There are a few tools available to do these jobs, but most of them are not easy to use and are not accessible online. A web server is especially needed to allow users to upload experimental expression datasets and build combinatorial regulatory networks corresponding to their particular contexts. In this work, we compiled putative TF-gene, miRNA-gene and TF-miRNA regulatory relationships from forward-engineering pipelines and curated them as built-in data libraries. We streamlined the R code of our two separate forward- and reverse-engineering algorithms for combinatorial gene regulatory network construction and formalized them as two major functional modules. As a result, we released the cGRNB (combinatorial Gene Regulatory Networks Builder): a web server for constructing combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. The cGRNB enables two major network-building modules, one for MPGE (miRNA-perturbed gene expression) datasets and the other for parallel miRNA/mRNA expression datasets. A miRNA-centered two-layer combinatorial regulatory cascade is the output of the first module, and a comprehensive genome-wide network involving all three types of combinatorial regulation (TF-gene, TF-miRNA, and miRNA-gene) is the output of the second module. In this article we propose cGRNB, a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. Since parallel miRNA/mRNA expression datasets are rapidly accumulating with the advance of next-generation sequencing techniques, cGRNB will be a very useful tool for researchers to build combinatorial gene regulatory networks based on expression datasets. The cGRNB web server is free and available online at http://www.scbit.org/cgrnb.
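As a toy illustration of the combinatorial network the second module outputs, the three edge types can be assembled into one directed graph; the node names here are fictitious and this is not cGRNB code:

```python
import networkx as nx

# Hypothetical curated relationships of the three types the server integrates.
tf_gene    = [("TF1", "geneA"), ("TF1", "geneB")]
tf_mirna   = [("TF1", "miR-x")]
mirna_gene = [("miR-x", "geneB")]

g = nx.DiGraph()
for u, v in tf_gene + tf_mirna + mirna_gene:
    g.add_edge(u, v)

# geneB is regulated both directly by TF1 and indirectly via miR-x.
print("regulators of geneB:", list(g.predecessors("geneB")))
```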
BESST--efficient scaffolding of large fragmented assemblies.
Sahlin, Kristoffer; Vezzi, Francesco; Nystedt, Björn; Lundeberg, Joakim; Arvestad, Lars
2014-08-15
The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed, but many of them use the number of read pairs supporting a link between two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance. We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST compares favorably with the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when the library insert size distribution is wide. We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
One tree to link them all: a phylogenetic dataset for the European tetrapoda.
Roquet, Cristina; Lavergne, Sébastien; Thuiller, Wilfried
2014-08-08
With the ever-increasing availability of phylogenetically informative data, the last decade has seen an upsurge of ecological studies incorporating information on evolutionary relationships among species. However, detailed species-level phylogenies are still lacking for many large groups and regions, and these are necessary for comprehensive large-scale eco-phylogenetic analyses. Here, we provide a dataset of 100 dated phylogenetic trees for all European tetrapods based on a mixture of supermatrix and supertree approaches. Phylogenetic inference was performed separately for each of the main Tetrapoda groups of Europe except mammals (i.e. amphibians, birds, squamates and turtles) by means of maximum likelihood (ML) analyses of supermatrices, applying a tree constraint at the family (amphibians and squamates) or order (birds and turtles) level based on consensus knowledge. For each group, we inferred 100 ML trees to be able to provide a phylogenetic dataset that accounts for phylogenetic uncertainty, and assessed node support with bootstrap analyses. Each tree was dated using penalized likelihood and fossil calibration. The trees obtained were well supported by existing knowledge and previous phylogenetic studies. For mammals, we modified the most complete supertree dataset available in the literature to include a recent update of the Carnivora clade. As a final step, we merged the phylogenetic trees of all groups to obtain a set of 100 phylogenetic trees for all European Tetrapoda species for which data were available (91%). We provide this phylogenetic dataset (100 chronograms) for the purpose of comparative analyses, macro-ecological or community ecology studies aiming to incorporate phylogenetic information while accounting for phylogenetic uncertainty.
Harnessing Diversity towards the Reconstructing of Large Scale Gene Regulatory Networks
Yamanaka, Ryota; Kitano, Hiroaki
2013-01-01
Elucidating gene regulatory networks (GRNs) from large-scale experimental data remains a central challenge in systems biology. Recently, numerous techniques, particularly consensus-driven approaches combining different algorithms, have become a potentially promising strategy to infer accurate GRNs. Here, we develop a novel consensus inference algorithm, TopkNet, that can integrate multiple algorithms to infer GRNs. Comprehensive performance benchmarking on a cloud computing framework demonstrated that (i) a simple strategy of combining many algorithms does not always lead to performance improvement relative to the cost of consensus and (ii) TopkNet, integrating only high-performance algorithms, provides significant performance improvement over the best individual algorithms and community prediction. These results suggest that a priori determination of high-performance algorithms is key to reconstructing an unknown regulatory network. Similarity among gene-expression datasets can be useful for determining potential optimal algorithms for reconstruction of unknown regulatory networks, i.e., if expression data associated with a known regulatory network are similar to those associated with an unknown regulatory network, the optimal algorithms determined for the known regulatory network can be repurposed to infer the unknown regulatory network. Based on this observation, we developed a quantitative measure of similarity among gene-expression datasets and demonstrated that, if the similarity between the two expression datasets is high, TopkNet integrating the algorithms that are optimal for the known dataset performs well on the unknown dataset. The consensus framework, TopkNet, together with the similarity measure proposed in this study, provides a powerful strategy towards harnessing the wisdom of the crowds in the reconstruction of unknown regulatory networks. PMID:24278007
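A rough sketch of a top-k consensus in the spirit of TopkNet: convert each selected algorithm's edge confidences to ranks and average them. The score matrix is invented for illustration, and rank averaging is one common aggregation choice rather than the paper's exact rule:

```python
import numpy as np

# Rows: candidate regulatory edges; columns: confidence from each of the
# k selected high-performance inference algorithms.
scores = np.array([[0.9, 0.7, 0.8],
                   [0.2, 0.9, 0.3],
                   [0.6, 0.6, 0.7],
                   [0.1, 0.2, 0.1]])

# Per-algorithm ranks (1 = most confident), then the consensus is their mean.
ranks = scores.shape[0] - scores.argsort(axis=0).argsort(axis=0)
consensus = ranks.mean(axis=1)
print("edges ordered by consensus confidence:", consensus.argsort())
```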
NASA Technical Reports Server (NTRS)
Chen, Junye; DelGenio, Anthony D.; Carlson, Barbara e.; Bosilovich, Michael G.
2007-01-01
The dominant interannual El Nino-Southern Oscillation (ENSO) phenomenon and the short length of climate observation records make it difficult to study long-term climate variations in the spatiotemporal domain. Based on the fact that the ENSO signal spreads to remote regions and induces delayed climate variation through atmospheric teleconnections, we develop an ENSO-removal method through which the ENSO signal can be approximately removed at the grid box level from the spatiotemporal field of a climate parameter. After this signal is removed, long-term climate variations, namely the global warming trend (GW) and the Pacific pan-decadal variability (PDV), are isolated at middle and low latitudes in the climate parameter fields from observed and reanalysis datasets. Apart from known GW characteristics, the warming that occurs in the Pacific basin (approximately 0.4 K in the 20th century) is much weaker than in surrounding regions and the other two ocean basins (approximately 0.8 K). The modest warming in the Pacific basin is likely due to its dynamic nature on the interannual and decadal time scales and/or the leakage of upper ocean water through the Indonesian Throughflow. Based on the NCEP/NCAR and ERA-40 reanalyses, a comprehensive atmospheric structure associated with GW is given. Significant discrepancies exist between the two datasets, especially in the tightly coupled dynamic and water vapor fields. The dynamic field based on the NCEP/NCAR reanalysis, which shows a change in the Walker Circulation, is consistent with the GW change in the surface temperature field. However, an intensification of the Hadley Circulation is associated with the GW trend in the ERA-40 reanalysis.
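A hedged sketch of grid-box-level ENSO removal: regress the local anomaly on a lagged ENSO index and subtract the fit, with the lag search standing in for the paper's handling of teleconnection delays. The index, the coefficients, and the data below are all synthetic:

```python
import numpy as np

def remove_enso(anomaly, enso_index, max_lag=12):
    """Regress out the best-lag ENSO signal; return residual series and lag."""
    best_lag, best_r2, best_fit = 0, -np.inf, np.zeros_like(anomaly)
    for lag in range(max_lag + 1):
        x = np.roll(enso_index, lag)
        x[:lag] = 0.0                         # drop wrapped-around values
        beta = np.polyfit(x, anomaly, 1)
        fit = np.polyval(beta, x)
        r2 = 1 - np.var(anomaly - fit) / np.var(anomaly)
        if r2 > best_r2:
            best_lag, best_r2, best_fit = lag, r2, fit
    return anomaly - best_fit, best_lag

rng = np.random.default_rng(2)
enso = np.sin(np.linspace(0, 20, 240)) + 0.3 * rng.normal(size=240)
grid_box = 0.8 * np.roll(enso, 3) + 0.1 * rng.normal(size=240)  # 3-month delay
residual, lag = remove_enso(grid_box, enso)
print("selected lag (months):", lag)
```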
Aoki, Yuta; Aoki, Ai; Suwa, Hiroshi
2012-01-01
Structural and functional neuroimaging findings suggest that disturbance of the cortico–striato–thalamo–cortical (CSTC) circuits may underlie obsessive-compulsive disorder (OCD). However, although some studies using 1H-magnetic resonance spectroscopy (1H-MRS) have reported altered levels of N-acetylaspartate (NAA), they are inconsistent in the direction and location of the abnormality within the CSTC circuits. We conducted a comprehensive literature search and a meta-analysis of 1H-MRS studies in OCD. Seventeen studies met the inclusion criteria for the meta-analysis. Data were separated by frontal cortex region: medial prefrontal cortex (mPFC), dorsolateral prefrontal cortex, orbitofrontal cortex, basal ganglia and thalamus. The mean and s.d. of the NAA measure were calculated for each region. A random effects model integrating 16 separate datasets with 225 OCD patients and 233 healthy comparison subjects demonstrated that OCD patients exhibit decreased NAA levels in the frontal cortex (P=0.025), but no significant changes in the basal ganglia (P=0.770) or thalamus (P=0.466). Sensitivity analysis in an anatomically specified subgroup consisting of datasets examining the mPFC demonstrated a marginally significant reduction of NAA (P=0.061). Meta-regression revealed that the NAA reduction in the mPFC was positively correlated with symptom severity measured by the Yale–Brown Obsessive Compulsive Scale (P=0.011). The specific reduction of NAA in the mPFC and the significant relationship between neurochemical alteration in the mPFC and symptom severity indicate that the mPFC is one of the brain regions directly related to abnormal behavior in the pathophysiology of OCD. The current meta-analysis indicates that cortices and sub-cortices contribute in different ways to the etiology of OCD. PMID:22892718
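For the random effects model mentioned above, a minimal DerSimonian-Laird pooling looks as follows; the effect sizes and variances are invented for illustration and are not the paper's NAA data:

```python
import numpy as np

def random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate and standard error."""
    effects, variances = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / variances                                   # fixed-effect weights
    mean_fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - mean_fixed) ** 2)           # heterogeneity statistic
    df = len(effects) - 1
    tau2 = max(0.0, (q - df) / (w.sum() - (w ** 2).sum() / w.sum()))
    w_star = 1.0 / (variances + tau2)                     # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, se

pooled, se = random_effects([-0.4, -0.2, -0.5, 0.1], [0.04, 0.09, 0.05, 0.08])
print(f"pooled effect {pooled:.3f} +/- {1.96 * se:.3f} (95% CI half-width)")
```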
Wear simulation of total knee prostheses using load and kinematics waveforms from stair climbing.
Abdel-Jaber, Sami; Belvedere, Claudio; Leardini, Alberto; Affatato, Saverio
2015-11-05
Knee wear simulators are meant to perform load cycles on knee implants under physiological conditions, matching exactly, if possible, those experienced at the replaced joint during daily living activities. Unfortunately, only conditions of low-demand level walking, specified in ISO-14243, are conventionally used during such tests. A recent study provided a consistent knee kinematics and load dataset measured during stair climbing in patients implanted with a specific modern total knee prosthesis design. In the present study, wear simulation tests were performed for the first time using this dataset on the same prosthesis design. It was hypothesised that more demanding tasks would result in wear rates that differ from those observed in retrievals. Four prostheses for total knee arthroplasty were tested using a displacement-controlled knee wear simulator for two million cycles at 1.1 Hz, under kinematics and load conditions typical of stair climbing. After simulation, the corresponding damage scars on the bearings were qualified and compared with equivalent explanted prostheses. An average mass loss of 20.2±1.5 mg was found. Scanning digital microscopy revealed similar features, though the explants had a greater variety of damage modes, including a high prevalence of adhesive wear damage and burnishing over the overall articulating surface. This study confirmed that the results from wear simulation machines are strongly affected by the kinematics and loads applied during simulations. Based on the present results, a more comprehensive series of conditions is necessary for equivalent in vitro simulations to fully understand the current clinical failures of knee implants. Copyright © 2015 Elsevier Ltd. All rights reserved.
Development of the Large-Scale Forcing Data to Support MC3E Cloud Modeling Studies
NASA Astrophysics Data System (ADS)
Xie, S.; Zhang, Y.
2011-12-01
The large-scale forcing fields (e.g., vertical velocity and advective tendencies) are required to run single-column and cloud-resolving models (SCMs/CRMs), the two key modeling frameworks widely used to link field data to climate model development. In this study, we use an advanced objective analysis approach to derive the required forcing data from the soundings collected by the Midlatitude Continental Convective Cloud Experiment (MC3E) in support of its cloud modeling studies. MC3E is the latest major field campaign, conducted from 22 April 2011 to 6 June 2011 in south-central Oklahoma through a joint effort between the DOE ARM program and the NASA Global Precipitation Measurement program. One of its primary goals is to provide a comprehensive dataset that can be used to describe the large-scale environment of convective cloud systems and evaluate model cumulus parameterizations. The objective analysis used in this study is the constrained variational analysis method. A unique feature of this approach is the use of domain-averaged surface and top-of-the-atmosphere (TOA) observations (e.g., precipitation and radiative and turbulent fluxes) as constraints to adjust atmospheric state variables from soundings by the smallest possible amount to conserve column-integrated mass, moisture, and static energy, so that the final analysis data are dynamically and thermodynamically consistent. To address potential uncertainties in the surface observations, an ensemble forcing dataset will be developed. Multi-scale forcing will also be created for simulating convective systems at various scales. At the meeting, we will provide more details about the forcing development and present a preliminary analysis of the characteristics of the large-scale forcing structures for several selected convective systems observed during MC3E.
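Schematically, the constrained variational analysis described above can be written as a minimum weighted adjustment of the state away from the soundings, subject to the column budget constraints closed by the surface and TOA observations. This is a sketch of the general form, not the exact formulation used for the MC3E product:

```latex
% x: atmospheric state (winds, temperature, humidity); x_o: sounding-based state;
% W: weighting matrix; F_j: column-integrated budget residuals fixed by
% surface/TOA fluxes.
\min_{x}\; (x - x_o)^{\mathsf{T}}\, W \,(x - x_o)
\qquad \text{subject to} \qquad
F_j(x) = 0,\quad j \in \{\text{mass},\ \text{moisture},\ \text{static energy}\}
```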
Pereira, Anieli G; Sterli, Juliana; Moreira, Filipe R R; Schrago, Carlos G
2017-08-01
Despite their complex evolutionary history and the rich fossil record, the higher level phylogeny and historical biogeography of living turtles have not been investigated in a comprehensive and statistical framework. To tackle these issues, we assembled a large molecular dataset, maximizing both taxonomic and gene sampling. As different models provide alternative biogeographical scenarios, we have explicitly tested such hypotheses in order to reconstruct a robust biogeographical history of Testudines. We scanned publicly available databases for nucleotide sequences and composed a dataset comprising 13 loci for 294 living species of Testudines, which accounts for all living genera and 85% of their extant species diversity. Phylogenetic relationships and species divergence times were estimated using a thorough evaluation of fossil information as calibration priors. We then carried out the analysis of historical biogeography of Testudines in a fully statistical framework. Our study recovered the first large-scale phylogeny of turtles with well-supported relationships following the topology proposed by phylogenomic works. Our dating result consistently indicated that the origin of the main clades, Pleurodira and Cryptodira, occurred in the early Jurassic. The phylogenetic and historical biogeographical inferences permitted us to clarify how geological events affected the evolutionary dynamics of crown turtles. For instance, our analyses support the hypothesis that the breakup of Pangaea would have driven the divergence between the cryptodiran and pleurodiran lineages. The reticulated pattern in the ancestral distribution of the cryptodiran lineage suggests a complex biogeographic history for the clade, which was supposedly related to the complex paleogeographic history of Laurasia. On the other hand, the biogeographical history of Pleurodira indicated a tight correlation with the paleogeography of the Gondwanan landmasses. Copyright © 2017 Elsevier Inc. All rights reserved.
Teo, Guoshou; Kim, Sinae; Tsou, Chih-Chiang; Collins, Ben; Gingras, Anne-Claude; Nesvizhskii, Alexey I; Choi, Hyungwon
2015-11-03
Data independent acquisition (DIA) mass spectrometry is an emerging technique that offers more complete detection and quantification of peptides and proteins across multiple samples. DIA allows fragment-level quantification, which can be considered as repeated measurements of the abundance of the corresponding peptides and proteins in the downstream statistical analysis. However, few statistical approaches are available for aggregating these complex fragment-level data into peptide- or protein-level statistical summaries. In this work, we describe a software package, mapDIA, for statistical analysis of differential protein expression using DIA fragment-level intensities. The workflow consists of three major steps: intensity normalization, peptide/fragment selection, and statistical analysis. First, mapDIA offers normalization of fragment-level intensities by total intensity sums as well as a novel alternative normalization by local intensity sums in retention time space. Second, mapDIA removes outlier observations and selects peptides/fragments that preserve the major quantitative patterns across all samples for each protein. Last, using the selected fragments and peptides, mapDIA performs model-based statistical significance analysis of protein-level differential expression between specified groups of samples. Using a comprehensive set of simulation datasets, we show that mapDIA detects differentially expressed proteins with accurate control of the false discovery rates. We also describe the analysis procedure in detail using two recently published DIA datasets generated for the 14-3-3β dynamic interaction network and the prostate cancer glycoproteome. The software was written in C++ and the source code is available for free through the SourceForge website: http://sourceforge.net/projects/mapdia/. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015 Elsevier B.V. All rights reserved.
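An illustrative sketch of the first step named above, normalization by per-sample total intensity sums; the log2 transform is a common convention assumed here, not a statement about mapDIA's internals:

```python
import numpy as np

def normalize_total_intensity(intensity):
    """intensity: fragments x samples matrix of raw DIA intensities."""
    intensity = np.asarray(intensity, float)
    totals = np.nansum(intensity, axis=0)     # per-sample total intensity
    scale = totals.mean() / totals            # bring every sample to the mean total
    return np.log2(intensity * scale + 1.0)

raw = np.array([[1e5, 2e5],
                [3e4, 7e4],
                [9e3, 1.5e4]])
print(normalize_total_intensity(raw))
```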
OpenFIRE - A Web GIS Service for Distributing the Finnish Reflection Experiment Datasets
NASA Astrophysics Data System (ADS)
Väkevä, Sakari; Aalto, Aleksi; Heinonen, Aku; Heikkinen, Pekka; Korja, Annakaisa
2017-04-01
The Finnish Reflection Experiment (FIRE) is a land-based deep seismic reflection survey conducted between 2001 and 2003 by a research consortium of the Universities of Helsinki and Oulu, the Geological Survey of Finland, and a Russian state-owned enterprise, SpetsGeofysika. The dataset consists of 2100 kilometers of high-resolution profiles across the Archaean and Proterozoic nuclei of the Fennoscandian Shield. Although FIRE data have been available on request since 2009, the data have remained underused outside the original research consortium. The original FIRE data have been quality-controlled. The shot gathers have been cross-checked and a comprehensive errata document has been created. The brute stacks provided by the Russian seismic contractor have been reprocessed into seismic sections and replotted. Complete documentation of the intermediate processing steps is provided, together with guidelines for setting up a computing environment and plotting the data. An open-access web service, "OpenFIRE", for visualizing and downloading FIRE data has been created. The service includes a mobile-responsive map application capable of enriching seismic sections with data from other sources, such as open data from the National Land Survey and the Geological Survey of Finland. The AVAA team of the Finnish Open Science and Research Initiative has provided a tailored Liferay portal with the necessary web components, such as an API (Application Programming Interface) for download requests. INSPIRE (Infrastructure for Spatial Information in Europe) -compliant discovery metadata have been produced and geospatial data will be exposed as Open Geospatial Consortium standard services. The technical guidelines of the European Plate Observing System have been followed and the service can be considered a reference application for sharing reflection seismic data. The OpenFIRE web service is available at www.seismo.helsinki.fi/openfire
17 years of aerosol and clouds from the ATSR Series of Instruments
NASA Astrophysics Data System (ADS)
Poulsen, C. A.
2015-12-01
Aerosols play a significant role in Earth's climate by scattering and absorbing incoming sunlight and affecting the formation and radiative properties of clouds. The extent to which aerosols affect clouds remains one of the largest sources of uncertainty amongst all influences on climate change. Now, a new comprehensive dataset has been developed under the ESA Climate Change Initiative (CCI) programme to quantify how changes in aerosol levels affect these clouds. The unique dataset is constructed from (A)ATSR (Along Track Scanning Radiometer) retrievals: aerosols from the Optimal Retrieval of Aerosol and Cloud (ORAC) algorithm generated in the Aerosol CCI, and clouds from CC4CL (Community Code for CLimate) in the Cloud CCI. The ATSR instrument is a dual-viewing instrument with on-board visible and infrared calibration systems, making it an ideal instrument for studying trends in aerosols and clouds and their interactions. The dataset begins in 1995 and ends in 2012. A new instrument in the series, SLSTR (Sea and Land Surface Temperature Radiometer), will be launched in 2015. The aerosols and clouds are retrieved using similar algorithms to maximise the consistency of the results. These state-of-the-art retrievals have been merged to quantify the susceptibility of cloud properties to changes in aerosol concentration. Aerosol-cloud susceptibilities are calculated from several thousand samples in each 1x1 degree globally gridded region. Two-dimensional histograms of the aerosol and cloud properties are also included to facilitate seamless comparisons with other satellite and modelling data sets. The analysis of these two long-term records will be discussed individually, and initial comparisons between these new joint products and models will be presented.
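One common way to quantify such a susceptibility, sketched here with synthetic data, is the slope of a log-log regression of a cloud property on aerosol loading within a grid cell; the exact definition used in the CCI analysis may differ:

```python
import numpy as np

rng = np.random.default_rng(4)
aod = 10 ** rng.uniform(-1.5, 0, 2000)                  # aerosol optical depth
droplet_re = 12 * aod ** -0.15 * np.exp(0.05 * rng.normal(size=2000))  # cloud property

# Susceptibility as the log-log regression slope within one 1x1 degree cell.
slope, _ = np.polyfit(np.log(aod), np.log(droplet_re), 1)
print(f"susceptibility d ln(re)/d ln(AOD) = {slope:.2f}")
```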
NASA Astrophysics Data System (ADS)
Anwer, Rao Muhammad; Khan, Fahad Shahbaz; van de Weijer, Joost; Molinier, Matthieu; Laaksonen, Jorma
2018-04-01
Designing discriminative powerful texture features robust to realistic imaging conditions is a challenging computer vision problem with many applications, including material recognition and analysis of satellite or aerial imagery. In the past, most texture description approaches were based on dense orderless statistical distribution of local features. However, most recent approaches to texture recognition and remote sensing scene classification are based on Convolutional Neural Networks (CNNs). The de facto practice when learning these CNN models is to use RGB patches as input with training performed on large amounts of labeled data (ImageNet). In this paper, we show that Local Binary Patterns (LBP) encoded CNN models, codenamed TEX-Nets, trained using mapped coded images with explicit LBP based texture information provide complementary information to the standard RGB deep models. Additionally, two deep architectures, namely early and late fusion, are investigated to combine the texture and color information. To the best of our knowledge, we are the first to investigate Binary Patterns encoded CNNs and different deep network fusion architectures for texture recognition and remote sensing scene classification. We perform comprehensive experiments on four texture recognition datasets and four remote sensing scene classification benchmarks: UC-Merced with 21 scene categories, WHU-RS19 with 19 scene classes, RSSCN7 with 7 categories and the recently introduced large scale aerial image dataset (AID) with 30 aerial scene types. We demonstrate that TEX-Nets provide complementary information to standard RGB deep model of the same network architecture. Our late fusion TEX-Net architecture always improves the overall performance compared to the standard RGB network on both recognition problems. Furthermore, our final combination leads to consistent improvement over the state-of-the-art for remote sensing scene classification.
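A compact sketch of the two ingredients named above: mapping an image to LBP codes to feed a texture stream, and "late fusion" as prediction-level averaging of two streams. The two stream functions below are placeholders for trained CNNs, and all values are invented:

```python
import numpy as np
from skimage.feature import local_binary_pattern

rng = np.random.default_rng(3)
rgb = rng.random((64, 64, 3))
gray = (rgb.mean(axis=2) * 255).astype(np.uint8)

# Map the grayscale image to LBP codes (8 neighbours, radius 1).
lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")

def rgb_stream(x):   # placeholder for a CNN trained on RGB input
    return np.array([0.6, 0.3, 0.1])

def tex_stream(x):   # placeholder for a CNN trained on LBP-coded input
    return np.array([0.4, 0.5, 0.1])

# Late fusion: combine the two streams at the prediction level.
fused = (rgb_stream(rgb) + tex_stream(lbp)) / 2.0
print("fused class scores:", fused, "-> class", int(fused.argmax()))
```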
Wang, Ze-Huan; Peng, Hua; Kilian, Norbert
2013-01-01
The first comprehensive molecular phylogenetic reconstruction of the Cichorieae subtribe Lactucinae is provided. Sequences for two datasets, one of the nuclear rDNA ITS region, the other of five concatenated non-coding chloroplast DNA markers including the petD region and the psbA-trnH, 5′trnL(UAA)-trnF, rpl32-trnL(UAG) and trnQ(UUG)-5′rps16 spacers, were, with few exceptions, newly generated for 130 samples of 78 species. The sampling spans the entire subtribe Lactucinae while focusing on its Chinese centre of diversity; more than 3/4 of the Chinese Lactucinae species are represented. The nuclear and plastid phylogenies inferred from the two independent datasets show various hard topological incongruences. They concern the internal topology of major lineages, in one case the placement of taxa in major lineages, the relationships between major lineages and even the circumscription of the subtribe, indicating potential events of ancient as well as of more recent reticulation and chloroplast capture in the evolution of the subtribe. The core of the subtribe is clearly monophyletic, consisting of the six lineages, Cicerbita, Cicerbita II, Lactuca, Melanoseris, Notoseris and Paraprenanthes. The Faberia lineage and the monospecific Prenanthes purpurea lineage are part of a monophyletic subtribe Lactucinae only in the nuclear or plastid phylogeny, respectively. Morphological and karyological support for their placement is considered. In the light of the molecular phylogenetic reconstruction and of additional morphological data, the conflicting taxonomies of the Chinese Lactuca alliance are discussed and it is concluded that the major lineages revealed are best treated at generic rank. An improved species level taxonomy of the Chinese Lactucinae is outlined; new synonymies and some new combinations are provided. PMID:24376566
NASA Astrophysics Data System (ADS)
Zhang, Z.; Zimmermann, N. E.; Poulter, B.
2015-11-01
Simulations of the spatial-temporal dynamics of wetlands are key to understanding the role of wetland biogeochemistry under past and future climate variability. Hydrologic inundation models, such as TOPMODEL, are based on a fundamental parameter known as the compound topographic index (CTI) and provide a computationally cost-efficient approach to simulate wetland dynamics at global scales. However, there remain large discrepancies in the implementations of TOPMODEL in land-surface models (LSMs) and thus in their performance against observations. This study describes new improvements to the TOPMODEL implementation and estimates of global wetland dynamics using the LPJ-wsl dynamic global vegetation model (DGVM), and quantifies uncertainties by comparing three digital elevation model products (HYDRO1k, GMTED, and HydroSHEDS) of different spatial resolution and accuracy in terms of simulated inundation dynamics. In addition, we found that calibrating TOPMODEL with a benchmark wetland dataset can help to successfully delineate the seasonal and interannual variations of wetlands, as well as improve the spatial distribution of wetlands to be consistent with inventories. The HydroSHEDS DEM, using a river-basin scheme for aggregating the CTI, shows the best accuracy for capturing the spatio-temporal dynamics of wetlands among the three DEM products. The estimate of the global wetland potential/maximum is ∼10.3 Mkm2 (1 Mkm2 = 10^6 km2), with a mean annual maximum of ∼5.17 Mkm2 for 1980-2010. This study demonstrates the feasibility of capturing the spatial heterogeneity of inundation and estimating seasonal and interannual variations in wetlands by coupling a hydrological module in LSMs with appropriate benchmark datasets. It additionally highlights the importance of an adequate investigation of topographic indices for simulating global wetlands and shows the opportunity to converge wetland estimates across LSMs by identifying the uncertainty associated with existing wetland products.
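For reference, the CTI underlying TOPMODEL is the standard topographic wetness index ln(a / tan(beta)), where a is the specific upslope contributing area and beta the local slope. A per-cell computation, assuming precomputed upslope-area and slope grids (values below are illustrative), might look like:

```python
import numpy as np

def compound_topographic_index(upslope_area, slope_rad, eps=1e-6):
    """upslope_area: specific catchment area (m); slope_rad: slope in radians."""
    return np.log(upslope_area / (np.tan(slope_rad) + eps))

area = np.array([[120.0, 45.0], [800.0, 15.0]])
slope = np.deg2rad(np.array([[2.0, 10.0], [0.5, 25.0]]))
print(compound_topographic_index(area, slope))   # flat, high-area cells score high
```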
Matott, L Shawn; Jiang, Zhengzheng; Rabideau, Alan J; Allen-King, Richelle M
2015-01-01
Numerous isotherm expressions have been developed for describing sorption of hydrophobic organic compounds (HOCs), including "dual-mode" approaches that combine nonlinear behavior with a linear partitioning component. Choosing among these alternative expressions for describing a given dataset is an important task that can significantly influence subsequent transport modeling and/or mechanistic interpretation. In this study, a series of numerical experiments were undertaken to identify "best-in-class" isotherms by refitting 10 alternative models to a suite of 13 previously published literature datasets. The corrected Akaike Information Criterion (AICc) was used for ranking these alternative fits and distinguishing between plausible and implausible isotherms for each dataset. The occurrence of multiple plausible isotherms was inversely correlated with dataset "richness", such that datasets with fewer observations and/or a narrow range of aqueous concentrations resulted in a greater number of plausible isotherms. Overall, only the Polanyi-partition dual-mode isotherm was classified as "plausible" across all 13 of the considered datasets, indicating substantial statistical support consistent with current advances in sorption theory. However, these findings are predicated on the use of the AICc measure as an unbiased ranking metric and the adoption of a subjective, but defensible, threshold for separating plausible and implausible isotherms. Copyright © 2015 Elsevier B.V. All rights reserved.
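A sketch of AICc-based ranking of alternative isotherm fits of the kind described above, using two illustrative models and invented data; the least-squares AICc form below is the common textbook one and is an assumption, not the paper's exact implementation:

```python
import numpy as np
from scipy.optimize import curve_fit

def linear(c, kd):                 # linear partitioning
    return kd * c

def freundlich(c, kf, n):          # one nonlinear alternative
    return kf * c ** n

c = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])   # aqueous concentration
q = np.array([0.4, 1.6, 2.7, 4.4, 8.0, 12.5])   # sorbed concentration

def aicc(model, p0):
    """Fit the model and return its corrected Akaike Information Criterion."""
    popt, _ = curve_fit(model, c, q, p0=p0, maxfev=10000)
    rss = np.sum((q - model(c, *popt)) ** 2)
    n, k = len(c), len(popt) + 1               # +1 for the error variance
    return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

for name, model, p0 in [("linear", linear, [1.0]),
                        ("Freundlich", freundlich, [1.0, 0.8])]:
    print(f"{name:>10}: AICc = {aicc(model, p0):.2f}")   # lower is better
```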
The Berkeley SuperNova Ia Program (BSNIP): Dataset and Initial Analysis
NASA Astrophysics Data System (ADS)
Silverman, Jeffrey; Ganeshalingam, M.; Kong, J.; Li, W.; Filippenko, A.
2012-01-01
I will present spectroscopic data from the Berkeley SuperNova Ia Program (BSNIP), their initial analysis, and the results of attempts to use spectral information to improve cosmological distance determinations to Type Ia supernovae (SNe Ia). The dataset consists of 1298 low-redshift (z < 0.2) optical spectra of 582 SNe Ia observed from 1989 through the end of 2008. Many of the SNe have well-calibrated light curves with measured distance moduli as well as spectra that have been corrected for host-galaxy contamination. I will also describe the spectral classification scheme employed (using the SuperNova Identification code, SNID; Blondin & Tonry 2007), which utilizes a newly constructed set of SNID spectral templates. The sheer size of the BSNIP dataset and the consistency of the observation and reduction methods make this sample unique among all other published SN Ia datasets. I will also discuss measurements of the spectral features of about one-third of the spectra, which were obtained within 20 days of maximum light. I will briefly describe the adopted method of automated, robust spectral-feature definition and measurement, which expands upon similar previous studies. Comparisons of these measurements of SN Ia spectral features to photometric observables will be presented with an eye toward using spectral information to calculate more accurate cosmological distances. Finally, I will comment on related projects which also utilize the BSNIP dataset that are planned for the near future. This research was supported by NSF grant AST-0908886 and the TABASGO Foundation. I am grateful to Marc J. Staley for a Graduate Fellowship.
SPAR: small RNA-seq portal for analysis of sequencing experiments.
Kuksa, Pavel P; Amlie-Wolf, Alexandre; Katanic, Živadin; Valladares, Otto; Wang, Li-San; Leung, Yuk Yee
2018-05-04
The introduction of new high-throughput small RNA sequencing protocols that generate large-scale genomics datasets along with increasing evidence of the significant regulatory roles of small non-coding RNAs (sncRNAs) have highlighted the urgent need for tools to analyze and interpret large amounts of small RNA sequencing data. However, it remains challenging to systematically and comprehensively discover and characterize sncRNA genes and specifically-processed sncRNA products from these datasets. To fill this gap, we present Small RNA-seq Portal for Analysis of sequencing expeRiments (SPAR), a user-friendly web server for interactive processing, analysis, annotation and visualization of small RNA sequencing data. SPAR supports sequencing data generated from various experimental protocols, including smRNA-seq, short total RNA sequencing, microRNA-seq, and single-cell small RNA-seq. Additionally, SPAR includes publicly available reference sncRNA datasets from our DASHR database and from ENCODE across 185 human tissues and cell types to produce highly informative small RNA annotations across all major small RNA types and other features such as co-localization with various genomic features, precursor transcript cleavage patterns, and conservation. SPAR allows the user to compare the input experiment against reference ENCODE/DASHR datasets. SPAR currently supports analyses of human (hg19, hg38) and mouse (mm10) sequencing data. SPAR is freely available at https://www.lisanwanglab.org/SPAR.
OperomeDB: A Database of Condition-Specific Transcription Units in Prokaryotic Genomes.
Chetal, Kashish; Janga, Sarath Chandra
2015-01-01
Background. In prokaryotic organisms, a substantial fraction of adjacent genes are organized into operons: codirectionally organized genes in prokaryotic genomes sharing a common promoter and terminator. Although several available operon databases provide information with varying levels of reliability, very few resources provide experimentally supported results. Therefore, we believe that the biological community could benefit from having a new operon prediction database with operons predicted using next-generation RNA-seq datasets. Description. We present operomeDB, a database which provides an ensemble of all the predicted operons for bacterial genomes using available RNA-sequencing datasets across a wide range of experimental conditions. Although several studies have recently confirmed that prokaryotic operon structure is dynamic, with significant alterations across environmental and experimental conditions, there are no comprehensive databases for studying such variations across prokaryotic transcriptomes. Currently our database contains nine bacterial organisms and 168 transcriptomes for which we predicted operons. The user interface is simple and easy to use for visualizing, downloading, and querying data. In addition, because of its ability to load custom datasets, users can also compare their datasets with the publicly available transcriptomic data of an organism. Conclusion. OperomeDB as a database should not only aid experimental groups working on transcriptome analysis of specific organisms but also enable studies related to computational and comparative operomics.
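A hedged sketch of the kind of condition-specific operon call such predictions build on: chain adjacent same-strand genes whose intergenic gaps retain RNA-seq coverage. The gene tuples and the coverage threshold are invented for illustration:

```python
# (name, strand, start, end, mean RNA-seq coverage of the gap to the next gene)
genes = [
    ("g1", "+", 100, 700, 25.0),
    ("g2", "+", 750, 1500, 30.0),
    ("g3", "+", 1550, 2000, 0.5),
    ("g4", "+", 2100, 2600, None),
]

def call_operons(genes, min_cov=5.0):
    """Chain genes while strand matches and intergenic coverage stays high."""
    operons, current = [], [genes[0][0]]
    for (name, strand, s, e, cov), nxt in zip(genes, genes[1:]):
        if strand == nxt[1] and cov is not None and cov >= min_cov:
            current.append(nxt[0])
        else:
            operons.append(current)
            current = [nxt[0]]
    operons.append(current)
    return operons

print(call_operons(genes))   # [['g1', 'g2', 'g3'], ['g4']]
```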
Understanding Depressive Symptoms and Psychosocial Stressors on Twitter: A Corpus-Based Study.
Mowery, Danielle; Smith, Hilary; Cheney, Tyler; Stoddard, Greg; Coppersmith, Glen; Bryan, Craig; Conway, Mike
2017-02-28
With a lifetime prevalence of 16.2%, major depressive disorder is the fifth biggest contributor to the disease burden in the United States. The aim of this study, building on previous work qualitatively analyzing depression-related Twitter data, was to describe the development of a comprehensive annotation scheme (ie, coding scheme) for manually annotating Twitter data with Diagnostic and Statistical Manual of Mental Disorders, Edition 5 (DSM-5) major depressive symptoms (eg, depressed mood, weight change, psychomotor agitation, or retardation) and Diagnostic and Statistical Manual of Mental Disorders, Edition IV (DSM-IV) psychosocial stressors (eg, educational problems, problems with primary support group, housing problems). Using this annotation scheme, we developed an annotated corpus, Depressive Symptom and Psychosocial Stressors Acquired Depression, the SAD corpus, consisting of 9300 tweets randomly sampled from the Twitter application programming interface (API) using depression-related keywords (eg, depressed, gloomy, grief). An analysis of our annotated corpus yielded several key results. First, 72.09% (6829/9473) of tweets containing relevant keywords were nonindicative of depressive symptoms (eg, "we're in for a new economic depression"). Second, the most prevalent symptoms in our dataset were depressed mood and fatigue or loss of energy. Third, less than 2% of tweets contained more than one depression-related category (eg, diminished ability to think or concentrate, depressed mood). Finally, we found very high positive correlations between some depression-related symptoms in our annotated dataset (eg, fatigue or loss of energy and educational problems; educational problems and diminished ability to think). We successfully developed an annotation scheme and an annotated corpus, the SAD corpus, consisting of 9300 tweets randomly selected from the Twitter application programming interface using depression-related keywords. Our analyses suggest that keyword queries alone might not be suitable for public health monitoring because context can change the meaning of a keyword in a statement. However, postprocessing approaches could be useful for reducing the noise and improving the signal needed to detect depression symptoms using social media. ©Danielle Mowery, Hilary Smith, Tyler Cheney, Greg Stoddard, Glen Coppersmith, Craig Bryan, Mike Conway. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 28.02.2017.
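As a toy illustration of the correlation analysis mentioned above, the association between two binary symptom annotations across tweets can be computed as a phi coefficient; the 0/1 vectors below are invented, not drawn from the SAD corpus:

```python
import numpy as np

fatigue     = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # per-tweet annotation flags
edu_problem = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Pearson correlation on 0/1 indicators equals the phi coefficient.
phi = np.corrcoef(fatigue, edu_problem)[0, 1]
print(f"phi correlation: {phi:.2f}")
```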
Chan, Wing Cheuk; Jackson, Gary; Wright, Craig Shawe; Orr-Walker, Brandon; Drury, Paul L; Boswell, D Ross; Lee, Mildred Ai Wei; Papa, Dean; Jackson, Rod
2014-01-01
Objectives To determine the diabetes screening levels and known glycaemic status of all individuals by age, gender and ethnicity within a defined geographic location in a timely and consistent way to potentially facilitate systematic disease prevention and management. Design Retrospective observational study. Setting Auckland region of New Zealand. Participants 1 475 347 people who had utilised publicly funded health services in New Zealand and were domiciled in the Auckland region of New Zealand in 2010. The health service utilisation population was individually linked to a comprehensive regional laboratory repository dating back to 2004. Outcome measures The two outcome measures were glycaemia-related blood testing coverage (glycated haemoglobin (HbA1c), fasting and random glucose and glucose tolerance tests), and the proportions and number of people with known dysglycaemia in 2010 using modified American Diabetes Association (ADA) and WHO criteria. Results Within the health service utilisation population, 792 560 people had had at least one glucose or HbA1c blood test in the previous 5.5 years. Overall, 81% of males (n=198 086) and 87% of females (n=128 982) in the recommended age groups for diabetes screening had a blood test to assess their glycaemic status. The estimated age-standardised prevalence of dysglycaemia was highest in people of Pacific Island ethnicity at 11.4% (95% CI 11.2% to 11.5%) for males and 11.6% (11.4% to 11.8%) for females, followed closely by people of Indian ethnicity at 10.8% (10.6% to 11.1%) and 9.3% (9.1% to 9.6%), respectively. Among the indigenous Maori population, the prevalence was 8.2% (7.9% to 8.4%) and 7% (6.8% to 7.2%), while for ‘Others’ (mainly Europeans) it was 3% (3% to 3.1%) and 2.2% (2.1% to 2.2%), respectively. Conclusions We have demonstrated that the data linkage between a laboratory repository and national administrative datasets has the potential to provide systematic and consistent individual-level clinical information that is relevant to medical auditing for a large geographically defined population. PMID:24776708
Chan, Wing Cheuk; Jackson, Gary; Wright, Craig Shawe; Orr-Walker, Brandon; Drury, Paul L; Boswell, D Ross; Lee, Mildred Ai Wei; Papa, Dean; Jackson, Rod
2014-04-28
To determine the diabetes screening levels and known glycaemic status of all individuals by age, gender and ethnicity within a defined geographic location in a timely and consistent way to potentially facilitate systematic disease prevention and management. Retrospective observational study. Auckland region of New Zealand. 1 475 347 people who had utilised publicly funded health services in New Zealand and were domiciled in the Auckland region of New Zealand in 2010. The health service utilisation population was individually linked to a comprehensive regional laboratory repository dating back to 2004. The two outcome measures were glycaemia-related blood testing coverage (glycated haemoglobin (HbA1c), fasting and random glucose and glucose tolerance tests), and the proportions and number of people with known dysglycaemia in 2010 using modified American Diabetes Association (ADA) and WHO criteria. Within the health service utilisation population, 792 560 people had had at least one glucose or HbA1c blood test in the previous 5.5 years. Overall, 81% of males (n=198 086) and 87% of females (n=128 982) in the recommended age groups for diabetes screening had a blood test to assess their glycaemic status. The estimated age-standardised prevalence of dysglycaemia was highest in people of Pacific Island ethnicity at 11.4% (95% CI 11.2% to 11.5%) for males and 11.6% (11.4% to 11.8%) for females, followed closely by people of Indian ethnicity at 10.8% (10.6% to 11.1%) and 9.3% (9.1% to 9.6%), respectively. Among the indigenous Maori population, the prevalence was 8.2% (7.9% to 8.4%) and 7% (6.8% to 7.2%), while for 'Others' (mainly Europeans) it was 3% (3% to 3.1%) and 2.2% (2.1% to 2.2%), respectively. We have demonstrated that the data linkage between a laboratory repository and national administrative datasets has the potential to provide systematic and consistent individual-level clinical information that is relevant to medical auditing for a large geographically defined population.
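For orientation, confidence intervals of the kind quoted above (e.g. 11.4%, 95% CI 11.2% to 11.5%) follow from the normal approximation for a proportion; the counts below are invented to illustrate the arithmetic, not taken from the study:

```python
import math

def prevalence_ci(cases, population, z=1.96):
    """Normal-approximation 95% CI for a prevalence proportion."""
    p = cases / population
    se = math.sqrt(p * (1 - p) / population)
    return p - z * se, p + z * se

lo, hi = prevalence_ci(11400, 100000)
print(f"prevalence 11.4%, 95% CI {100 * lo:.1f}% to {100 * hi:.1f}%")
```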
A land cover change detection and classification protocol for updating Alaska NLCD 2001 to 2011
Jin, Suming; Yang, Limin; Zhu, Zhe; Homer, Collin G.
2017-01-01
Monitoring and mapping land cover changes are important ways to support evaluation of the status and transition of ecosystems. The Alaska National Land Cover Database (NLCD) 2001 was the first 30-m resolution baseline land cover product of the entire state derived from circa 2001 Landsat imagery and geospatial ancillary data. We developed a comprehensive approach named AKUP11 to update Alaska NLCD from 2001 to 2011 and provide a 10-year cyclical update of the state's land cover and land cover changes. Our method is designed to characterize the main land cover changes associated with different drivers, including the conversion of forests to shrub and grassland primarily as a result of wildland fire and forest harvest, the vegetation successional processes after disturbance, and changes of surface water extent and glacier ice/snow associated with weather and climate changes. For natural vegetated areas, a component named AKUP11-VEG was developed for updating the land cover that involves four major steps: 1) identify the disturbed and successional areas using Landsat images and ancillary datasets; 2) update the land cover status for these areas using a SKILL model (System of Knowledge-based Integrated-trajectory Land cover Labeling); 3) perform decision tree classification; and 4) develop a final land cover and land cover change product through postprocessing modeling. For water and ice/snow areas, another component named AKUP11-WIS was developed for initial land cover change detection, removal of the terrain shadow effects, and exclusion of ephemeral snow changes using a 3-year MODIS snow extent dataset from 2010 to 2012. The overall approach was tested in three pilot study areas in Alaska, with each area consisting of four Landsat image footprints. The results from the pilot study show that the overall accuracy in detecting change and no-change is 90% and the overall accuracy of the updated land cover label for 2011 is 86%. The method provided a robust, consistent, and efficient means for capturing major disturbance events and updating land cover for Alaska. The method has subsequently been applied to generate the land cover and land cover change products for the entire state of Alaska.
Understanding Depressive Symptoms and Psychosocial Stressors on Twitter: A Corpus-Based Study
Smith, Hilary; Cheney, Tyler; Stoddard, Greg; Coppersmith, Glen; Bryan, Craig; Conway, Mike
2017-01-01
Background With a lifetime prevalence of 16.2%, major depressive disorder is the fifth biggest contributor to the disease burden in the United States. Objective The aim of this study, building on previous work qualitatively analyzing depression-related Twitter data, was to describe the development of a comprehensive annotation scheme (ie, coding scheme) for manually annotating Twitter data with Diagnostic and Statistical Manual of Mental Disorders, Edition 5 (DSM-5) major depressive symptoms (eg, depressed mood, weight change, psychomotor agitation or retardation) and Diagnostic and Statistical Manual of Mental Disorders, Edition IV (DSM-IV) psychosocial stressors (eg, educational problems, problems with primary support group, housing problems). Methods Using this annotation scheme, we developed an annotated corpus, Depressive Symptom and Psychosocial Stressors Acquired Depression (the SAD corpus), consisting of 9300 tweets randomly sampled from the Twitter application programming interface (API) using depression-related keywords (eg, depressed, gloomy, grief). An analysis of our annotated corpus yielded several key results. Results First, 72.09% (6829/9473) of tweets containing relevant keywords were nonindicative of depressive symptoms (eg, “we’re in for a new economic depression”). Second, the most prevalent symptoms in our dataset were depressed mood and fatigue or loss of energy. Third, less than 2% of tweets contained more than one depression-related category (eg, diminished ability to think or concentrate, depressed mood). Finally, we found very high positive correlations between some depression-related symptoms in our annotated dataset (eg, fatigue or loss of energy and educational problems; educational problems and diminished ability to think). Conclusions We successfully developed an annotation scheme and an annotated corpus, the SAD corpus, consisting of 9300 tweets randomly selected from the Twitter application programming interface using depression-related keywords. Our analyses suggest that keyword queries alone might not be suitable for public health monitoring because context can change the meaning of a keyword in a statement. However, postprocessing approaches could be useful for reducing the noise and improving the signal needed to detect depression symptoms using social media. PMID:28246066
Baumsteiger, Jason; Kinziger, Andrew P; Aguilar, Andres
2012-12-01
The west coast of North America contains a number of biogeographic freshwater provinces which reflect an ever-changing aquatic landscape. Clues to understanding this complex structure are often encapsulated genetically in the ichthyofauna, though frequently as unresolved evolutionary relationships and putative cryptic species. Advances in molecular phylogenetics through species tree analyses now allow for improved exploration of these relationships. Using a comprehensive approach, we analyzed two mitochondrial and nine nuclear loci for a group of endemic freshwater fish (sculpin-Cottus) known for a wide ranging distribution and complex species structure in this region. Species delimitation techniques identified three novel cryptic lineages, all well supported by phylogenetic analyses. Comparative phylogenetic analyses consistently found five distinct clades reflecting a number of unique biogeographic provinces. Some internal node relationships varied by species tree reconstruction method, and were associated with either Bayesian or maximum likelihood statistical approaches or between mitochondrial, nuclear, and combined datasets. Limited cases of mitochondrial capture were also evident, suggestive of putative ancestral hybridization between species. Biogeographic diversification was associated with four major regions and revealed historical faunal exchanges across regions. Mapping of an important life-history character (amphidromy) revealed two separate instances of trait evolution, a transition that has occurred repeatedly in Cottus. This study demonstrates the power of current phylogenetic methods, the need for a comprehensive phylogenetic approach, and the potential for sculpin to serve as an indicator of biogeographic history for native ichthyofauna in the region. Copyright © 2012 Elsevier Inc. All rights reserved.
HydroSHEDS: A global comprehensive hydrographic dataset
NASA Astrophysics Data System (ADS)
Wickel, B. A.; Lehner, B.; Sindorf, N.
2007-12-01
The Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales (HydroSHEDS) is an innovative product that, for the first time, provides hydrographic information in a consistent and comprehensive format for regional and global-scale applications. HydroSHEDS offers a suite of geo-referenced data sets, including stream networks, watershed boundaries, drainage directions, and ancillary data layers such as flow accumulations, distances, and river topology information. The goal of developing HydroSHEDS was to generate key data layers to support regional and global watershed analyses, hydrological modeling, and freshwater conservation planning at a quality, resolution and extent that had previously been unachievable. Available resolutions range from 3 arc-second (approx. 90 meters at the equator) to 5 minute (approx. 10 km at the equator) with seamless near-global extent. HydroSHEDS is derived from elevation data of the Shuttle Radar Topography Mission (SRTM) at 3 arc-second resolution. The original SRTM data have been hydrologically conditioned using a sequence of automated procedures. Existing methods of data improvement and newly developed algorithms have been applied, including void filling, filtering, stream burning, and upscaling techniques. Manual corrections were made where necessary. Preliminary quality assessments indicate that the accuracy of HydroSHEDS significantly exceeds that of existing global watershed and river maps. HydroSHEDS was developed by the Conservation Science Program of the World Wildlife Fund (WWF) in partnership with the U.S. Geological Survey (USGS), the International Centre for Tropical Agriculture (CIAT), The Nature Conservancy (TNC), and the Center for Environmental Systems Research (CESR) of the University of Kassel, Germany.
Oakton Community College Comprehensive Annual Financial Report, Fiscal Year Ended June 30, 1996.
ERIC Educational Resources Information Center
Hilquist, David E.
Consisting primarily of tables, this report provides financial data on Oakton Community College in Illinois for the fiscal year ending on June 30, 1996. This comprehensive annual financial report consists of an introductory section, financial section, statistical section, and special reports section. The introductory section includes a transmittal…
34 CFR 303.321 - Comprehensive child find system.
Code of Federal Regulations, 2011 CFR
2011-07-01
§ 303.321 Comprehensive child find system. (a) General. (1) Each system must include a comprehensive child find system that is consistent with part B of the Act (see 34...
Gordon, Phillip V; Swanson, Jonathan R; MacQueen, Brianna C; Christensen, Robert D
2017-02-01
In recent decades, the reported incidence of preterm necrotizing enterocolitis (NEC) has been declining, in large part due to the implementation of comprehensive NEC prevention initiatives (including breast milk feeding, standardized feeding protocols, transfusion guidelines, and antibiotic stewardship) and to improved rigor in excluding non-NEC cases from NEC data. However, after more than 60 years of NEC research in animal models, the promise of a "magic bullet" to prevent NEC has yet to materialize. There are also serious issues involving clinical NEC research. There is no common, comprehensive definition of NEC: national datasets each have their own case and staging definitions, and even within academia, randomized trials and single-center studies use widely disparate definitions. This makes NEC metadata of very limited value. The world of neonatology needs a comprehensive, universal, consensus definition of NEC. It also needs a de-identified, international data warehouse. Copyright © 2016 Elsevier Inc. All rights reserved.
Fuzzy neural network technique for system state forecasting.
Li, Dezhi; Wang, Wilson; Ismail, Fathy
2013-10-01
In many system state forecasting applications, the prediction is performed based on multiple datasets, each corresponding to a distinct system condition. Traditional methods for dealing with multiple datasets (e.g., vector autoregressive moving average models and neural networks) have shortcomings, such as limited modeling capability and opaque reasoning operations. To tackle these problems, a novel fuzzy neural network (FNN) is proposed in this paper to effectively extract information from multiple datasets and thereby improve forecasting accuracy. The proposed predictor consists of both autoregressive (AR) nodes and nonlinear nodes: the AR nodes capture the linear correlation of the datasets, while the nonlinear correlation of the datasets is modeled with nonlinear neuron nodes. A novel particle swarm technique [i.e., the Laplace particle swarm (LPS) method] is proposed to facilitate parameter estimation for the predictor and improve modeling accuracy. The effectiveness of the developed FNN predictor and the associated LPS method is verified by a series of tests related to Mackey-Glass data forecast, exchange rate data prediction, and gear system prognosis. Test results show that the developed FNN predictor and the LPS method can capture the dynamics of multiple datasets effectively and track system characteristics accurately.
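The hybrid structure described above (linear AR nodes plus nonlinear neuron nodes) can be illustrated with a minimal sketch. The code below is not the authors' FNN or their LPS optimizer; it simply fits an AR node by least squares and lets a small off-the-shelf neural network model the residual nonlinearity, on a toy series standing in for the Mackey-Glass benchmark.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_lagged(series, p):
    """Build a lagged design matrix X of shape (n-p, p) and target y of shape (n-p,)."""
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    return X, series[p:]

# Toy series standing in for Mackey-Glass data.
rng = np.random.default_rng(0)
t = np.arange(1000)
series = np.sin(0.05 * t) + 0.3 * np.sin(0.13 * t) + 0.05 * rng.standard_normal(t.size)

p = 8
X, y = make_lagged(series, p)

# Linear AR node: ordinary least squares on the lagged values.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ar_pred = X @ coef

# Nonlinear node: a small neural network fitted to what the AR part misses.
nonlin = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
nonlin.fit(X, y - ar_pred)

pred = ar_pred + nonlin.predict(X)
print("AR-only RMSE:     ", np.sqrt(np.mean((y - ar_pred) ** 2)))
print("AR+nonlinear RMSE:", np.sqrt(np.mean((y - pred) ** 2)))
```

Fitting the nonlinear node to the AR residual, rather than to the raw target, keeps the two parts complementary, which mirrors the linear-plus-nonlinear decomposition in the abstract.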
Full-motion video analysis for improved gender classification
NASA Astrophysics Data System (ADS)
Flora, Jeffrey B.; Lochtefeld, Darrell F.; Iftekharuddin, Khan M.
2014-06-01
The ability of computer systems to perform gender classification using the dynamic motion of the human subject has important applications in medicine, human factors, and human-computer interface systems. Previous works in motion analysis have used data from sensors (including gyroscopes, accelerometers, and force plates), radar signatures, and video. However, full-motion video and motion-capture range data provide a dataset of higher temporal and spatial resolution for the analysis of dynamic motion. Works using motion capture data have been limited by small datasets collected in a controlled environment. In this paper, we apply machine learning techniques to a new dataset that has a larger number of subjects. Additionally, these subjects move unrestricted through a capture volume, representing a more realistic, less controlled environment. We conclude that existing linear classification methods are insufficient for gender classification on this larger dataset captured in a relatively uncontrolled environment. A method based on a nonlinear support vector machine classifier is proposed to obtain gender classification for the larger dataset. In experimental testing with a dataset consisting of 98 trials (49 subjects, 2 trials per subject), classification rates using leave-one-out cross-validation are improved from 73% using linear discriminant analysis to 88% using the nonlinear support vector machine classifier.
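The evaluation protocol reported above, leave-one-out cross-validation comparing linear discriminant analysis against a nonlinear SVM, is easy to sketch. The features and labels below are random stand-ins, not the paper's motion descriptors; with two trials per subject, a leave-one-subject-out split would be the stricter variant.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in features: 98 trials (49 subjects, 2 trials each), 20 motion features.
rng = np.random.default_rng(1)
X = rng.standard_normal((98, 20))
y = rng.integers(0, 2, size=98)          # toy gender labels (0/1)

loo = LeaveOneOut()
lda = LinearDiscriminantAnalysis()
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

for name, clf in [("LDA", lda), ("RBF-SVM", svm)]:
    acc = cross_val_score(clf, X, y, cv=loo).mean()
    print(f"{name}: leave-one-out accuracy = {acc:.2%}")
```

On real motion features, the RBF kernel gives the SVM the nonlinear decision boundary that the abstract credits for the jump from 73% to 88%.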
A Self-Directed Method for Cell-Type Identification and Separation of Gene Expression Microarrays
Zuckerman, Neta S.; Noam, Yair; Goldsmith, Andrea J.; Lee, Peter P.
2013-01-01
Gene expression analysis is generally performed on heterogeneous tissue samples consisting of multiple cell types. Current methods developed to separate heterogeneous gene expression rely on prior knowledge of the cell-type composition and/or signatures, which are not available in most public datasets. We present a novel method to identify the cell-type composition, signatures, and proportions per sample without the need for a priori information. The method was successfully tested on controlled and semi-controlled datasets and performed as accurately as current methods that do require additional information. As such, this method enables the analysis of cell-type-specific gene expression using existing large pools of publicly available microarray datasets. PMID:23990767
Maswadeh, Waleed M; Snyder, A Peter
2015-05-30
Variable responses are fundamental to all experiments, and they can include information-rich, redundant, and low-intensity signals. A dataset can consist of a collection of variable responses over multiple classes or groups. Usually, variables that contain very little information are removed from a dataset; sometimes all the variables are used in the data analysis phase. It is common practice to discriminate between two distributions of data; however, there is no formal algorithm for arriving at a degree of separation (DS) between two distributions of data. The DS is defined herein as the average of the sum of the areas from the probability density functions (PDFs) of A and B that contain at least a given percentage of A and/or B. Thus, DS90 is the average of the sum of the PDF areas of A and B that contain ≥90% of A and/or B. To arrive at a DS value, two synthesized PDFs or very large experimental datasets are required. Experimentally, it is common practice to generate relatively small datasets. Therefore, the challenge was to find a statistical parameter that can be used on small datasets to estimate, and highly correlate with, the DS90 parameter. Established statistical methods include the overlap area of the two data distribution profiles, Welch's t-test, the Kolmogorov-Smirnov (K-S) test, the Mann-Whitney-Wilcoxon test, and the area under the receiver operating characteristic (ROC) curve (AUC). The area between the ROC curve and the diagonal (ACD) and the length of the ROC curve (LROC) are introduced. The established, ACD, and LROC methods were correlated with the DS90 when applied to many pairs of synthesized PDFs. The LROC method provided the best linear correlation with, and estimation of, the DS90. As an example, the estimated DS90 from the LROC (DS90-LROC) is applied to a database of three Italian wines consisting of thirteen variable responses for variable-ranking consideration. An important highlight of the DS90-LROC method is utilizing the LROC curve methodology to test all variables one at a time with all pairs of classes in a dataset. Copyright © 2015 Elsevier B.V. All rights reserved.
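The two newly introduced statistics above have simple concrete forms: LROC is the arc length of the ROC curve, which runs from about √2 for fully overlapping distributions to 2 for complete separation, and ACD is the area between the ROC curve and the diagonal. A minimal sketch for two 1-D distributions follows, built on kernel-smoothed PDFs; it illustrates the definitions and is not the authors' DS90 calibration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def lroc_acd(a, b, grid_size=1000):
    """LROC (arc length of the ROC curve) and ACD (area between curve and diagonal).

    The ROC is built from kernel-smoothed PDFs, so LROC is ~sqrt(2) (~1.414)
    when A and B fully overlap and approaches 2 as they separate completely.
    """
    kde_a, kde_b = gaussian_kde(a), gaussian_kde(b)
    lo = min(a.min(), b.min()) - 3.0
    hi = max(a.max(), b.max()) + 3.0
    thr = np.linspace(lo, hi, grid_size)
    # Survival functions of the smoothed distributions give FPR and TPR.
    fpr = np.array([kde_a.integrate_box_1d(t, np.inf) for t in thr])
    tpr = np.array([kde_b.integrate_box_1d(t, np.inf) for t in thr])
    lroc = np.hypot(np.diff(fpr), np.diff(tpr)).sum()
    acd = abs(abs(np.trapz(tpr, fpr)) - 0.5)      # |AUC - 0.5|
    return lroc, acd

rng = np.random.default_rng(2)
for shift in (0.0, 1.0, 3.0):
    a, b = rng.normal(0.0, 1.0, 300), rng.normal(shift, 1.0, 300)
    lroc, acd = lroc_acd(a, b)
    print(f"shift={shift}: LROC={lroc:.3f}  ACD={acd:.3f}")
```

Smoothing matters here: on raw samples the empirical ROC is a staircase whose length is always 2, so the arc-length statistic is only informative on a smooth (PDF-based) curve.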
Data publication, documentation and user friendly landing pages - improving data discovery and reuse
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland
2016-04-01
Research data are the basis of scientific research and are often irreplaceable (e.g. observational data). Storage of such data in appropriate, theme-specific or institutional repositories is an essential part of ensuring their long-term preservation and access. Free and open access to research data for reuse and scrutiny has been identified as a key issue by the scientific community as well as by research agencies and the public. To ensure that datasets are intelligible and usable by others, they must be accompanied by a comprehensive data description and standardized metadata for data discovery, and they should ideally be published with digital object identifiers (DOIs). DOIs make datasets citable, ensure their long-term accessibility, and are accepted in reference lists of journal articles (http://www.copdess.org/statement-of-commitment/). The GFZ German Research Centre for Geosciences is the national laboratory for Geosciences in Germany and part of the Helmholtz Association, Germany's largest scientific organization. The development and maintenance of data systems is a key component of 'GFZ Data Services' to support state-of-the-art research. The datasets archived in and published by the GFZ Data Repository cover all geoscientific disciplines and range from large dynamic datasets deriving from global monitoring seismic or geodetic networks with real-time data acquisition, to remotely sensed satellite products, to automatically generated data publications from a database of micrometeorological station data, to various model results, to geochemical and rock-mechanical analyses from various labs, and field observations. The user-friendly presentation of published datasets via a DOI landing page is as important for reuse as the storage itself, and the required information is highly specific to each scientific discipline. If dataset descriptions are too general, or if a dataset must be downloaded before its suitability can be judged, researchers often decide not to reuse a published dataset. In contrast to large data repositories without thematic specification, theme-specific data repositories have considerable expertise in data discovery and the opportunity to develop usable, discipline-specific formats and layouts for specific datasets, including consultation on different formats for the data description (e.g., via a Data Report or an article in a Data Journal) with full consideration of international metadata standards.
101 Labeled Brain Images and a Consistent Human Cortical Labeling Protocol
Klein, Arno; Tourville, Jason
2012-01-01
We introduce the Mindboggle-101 dataset, the largest and most complete set of free, publicly accessible, manually labeled human brain images. To manually label the macroscopic anatomy in magnetic resonance images of 101 healthy participants, we created a new cortical labeling protocol that relies on robust anatomical landmarks and minimal manual edits after initialization with automated labels. The “Desikan–Killiany–Tourville” (DKT) protocol is intended to improve the ease, consistency, and accuracy of labeling human cortical areas. Given how difficult it is to label brains, the Mindboggle-101 dataset is intended to serve as a set of brain atlases for use in labeling other brains, as a normative dataset establishing morphometric variation in a healthy population for comparison against clinical populations, and as a resource for the development, training, testing, and evaluation of automated registration and labeling algorithms. To this end, we also introduce benchmarks for the evaluation of such algorithms by comparing our manual labels with labels automatically generated by probabilistic and multi-atlas registration-based approaches. All data, related software, and updated information are available on the http://mindboggle.info/data website. PMID:23227001
Check your biosignals here: a new dataset for off-the-person ECG biometrics.
da Silva, Hugo Plácido; Lourenço, André; Fred, Ana; Raposo, Nuno; Aires-de-Sousa, Marta
2014-02-01
The Check Your Biosignals Here initiative (CYBHi) was developed as a way of creating a dataset and consistently repeatable acquisition framework, to further extend research in electrocardiographic (ECG) biometrics. In particular, our work targets the novel trend towards off-the-person data acquisition, which opens a broad new set of challenges and opportunities both for research and industry. While datasets with ECG signals collected using medical grade equipment at the chest can be easily found, for off-the-person ECG data the solution is generally for each team to collect their own corpus at considerable expense of resources. In this paper we describe the context, experimental considerations, methods, and preliminary findings of two public datasets created by our team, one for short-term and another for long-term assessment, with ECG data collected at the hand palms and fingers. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Willemse, Elias J; Joubert, Johan W
2016-09-01
In this article we present benchmark datasets for the Mixed Capacitated Arc Routing Problem under Time restrictions with Intermediate Facilities (MCARPTIF). The problem is a generalisation of the Capacitated Arc Routing Problem (CARP), and closely represents waste collection routing. Four different test sets are presented, each consisting of multiple instance files, which can be used to benchmark different solution approaches for the MCARPTIF. An in-depth description of the datasets can be found in "Constructive heuristics for the Mixed Capacity Arc Routing Problem under Time Restrictions with Intermediate Facilities" (Willemse and Joubert, 2016) [2] and "Splitting procedures for the Mixed Capacitated Arc Routing Problem under Time restrictions with Intermediate Facilities" (Willemse and Joubert, in press) [4]. The datasets are publicly available from "Library of benchmark test sets for variants of the Capacitated Arc Routing Problem under Time restrictions with Intermediate Facilities" (Willemse and Joubert, 2016) [3].
Adaptive ingredients against food spoilage in Japanese cuisine.
Ohtsubo, Yohsuke
2009-12-01
Billing and Sherman proposed the antimicrobial hypothesis to explain the worldwide spice use pattern. The present study explored whether two antimicrobial ingredients (i.e. spices and vinegar) are used in ways consistent with the antimicrobial hypothesis. Four specific predictions were tested: meat-based recipes would call for more spices/vinegar than vegetable-based recipes; summer recipes would call for more spices/vinegar than winter recipes; recipes in hotter regions would call for more spices/vinegar; and recipes including unheated ingredients would call for more spices/vinegar. Spice/vinegar use patterns were compiled from two types of traditional Japanese cookbooks. Dataset I included recipes provided by elderly Japanese housewives. Dataset II included recipes provided by experts in traditional Japanese foods. The analyses of Dataset I revealed that the vinegar use pattern conformed to the predictions. In contrast, analyses of Dataset II generally supported the predictions in terms of spices, but not vinegar.
NASA Astrophysics Data System (ADS)
Gummeson, Anna; Arvidsson, Ida; Ohlsson, Mattias; Overgaard, Niels C.; Krzyzanowska, Agnieszka; Heyden, Anders; Bjartell, Anders; Aström, Kalle
2017-03-01
Prostate cancer is the most frequently diagnosed cancer in men. The diagnosis is confirmed by pathologists based on ocular inspection of prostate biopsies in order to classify them according to Gleason score. The main goal of this paper is to automate the classification using convolutional neural networks (CNNs). The introduction of CNNs has broadened the field of pattern recognition: it replaces the classical approach of designing and extracting hand-crafted features for classification with the substantially different strategy of letting the computer itself decide which features are of importance. For automated classification of prostate cancer into the classes benign, Gleason grade 3, 4, and 5, we propose a CNN with small convolutional filters that is trained from scratch using stochastic gradient descent with momentum. The input consists of microscopic images of haematoxylin and eosin stained tissue; the output is a coarse segmentation into regions of the four classes. The dataset used consists of 213 images, each considered to be of one class only. Using four-fold cross-validation we obtained an error rate of 7.3%, which is significantly better than the previous state of the art using the same dataset. Although the dataset was rather small, good results were obtained. From this we conclude that CNNs are a promising method for this problem. Future work includes obtaining a larger dataset, which could potentially diminish the error margin.
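A network in the spirit described above, small convolutional filters trained from scratch with stochastic gradient descent plus momentum and four output classes, can be sketched in a few lines. The layer widths, patch size, and hyperparameters below are illustrative assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

# Four classes: benign, Gleason grade 3, 4, 5. Input: RGB H&E patches, 64x64 assumed.
class SmallCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 8 * 8, n_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SmallCNN()
# Trained from scratch with SGD plus momentum, as the abstract describes.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# One toy training step on random tensors standing in for tissue patches.
x = torch.randn(8, 3, 64, 64)
y = torch.randint(0, 4, (8,))
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
print(f"loss after one step: {loss.item():.3f}")
```

Applying such a patch classifier in a sliding window over a whole-slide image yields the coarse four-class segmentation the abstract mentions.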
A methodological investigation of hominoid craniodental morphology and phylogenetics.
Bjarnason, Alexander; Chamberlain, Andrew T; Lockwood, Charles A
2011-01-01
The evolutionary relationships of extant great apes and humans have been largely resolved by molecular studies, yet morphology-based phylogenetic analyses continue to provide conflicting results. In order to further investigate this discrepancy we present bootstrap clade support of morphological data based on two quantitative datasets, one dataset consisting of linear measurements of the whole skull from 5 hominoid genera and the second dataset consisting of 3D landmark data from the temporal bone of 5 hominoid genera, including 11 sub-species. Using similar protocols for both datasets, we were able to 1) compare distance-based phylogenetic methods to cladistic parsimony of quantitative data converted into discrete character states, 2) vary outgroup choice to observe its effect on phylogenetic inference, and 3) analyse male and female data separately to observe the effect of sexual dimorphism on phylogenies. Phylogenetic analysis was sensitive to methodological decisions, particularly outgroup selection, where designation of Pongo as an outgroup and removal of Hylobates resulted in greater congruence with the proposed molecular phylogeny. The performance of distance-based methods also justifies their use in phylogenetic analysis of morphological data. It is clear from our analyses that hominoid phylogenetics ought not to be used as an example of conflict between the morphological and molecular, but as an example of how outgroup and methodological choices can affect the outcome of phylogenetic analysis. Copyright © 2010 Elsevier Ltd. All rights reserved.
Data assimilation and model evaluation experiment datasets
NASA Technical Reports Server (NTRS)
Lai, Chung-Cheng A.; Qian, Wen; Glenn, Scott M.
1994-01-01
The Institute for Naval Oceanography, in cooperation with Naval Research Laboratories and universities, executed the Data Assimilation and Model Evaluation Experiment (DAMEE) for the Gulf Stream region during fiscal years 1991-1993. Enormous effort went into the preparation of several high-quality and consistent datasets for model initialization and verification. This paper describes the preparation process, the temporal and spatial scopes, the contents, and the structure of these datasets. The goal of DAMEE and the need for data in the four phases of the experiment are briefly stated. The preparation of the DAMEE datasets consisted of a series of processes: (1) collection of observational data; (2) analysis and interpretation; (3) interpolation using the Optimum Thermal Interpolation System package; (4) quality control and re-analysis; and (5) data archiving and software documentation. The data products from these processes included a time series of 3D fields of temperature and salinity, 2D fields of surface dynamic height and mixed-layer depth, analyses of the Gulf Stream and rings system, and bathythermograph profiles. To date, these are the most detailed and high-quality data for mesoscale ocean modeling, data assimilation, and forecasting research. Feedback from ocean modeling groups that tested the data was incorporated into its refinement. Suggested DAMEE data usages include (1) ocean modeling and data assimilation studies, (2) diagnostic and theoretical studies, and (3) comparisons with locally detailed observations.
ERIC Educational Resources Information Center
Lin, Sheau-Wen; Liu, Yu
2017-01-01
The purpose of this study was to explore elementary students' listening comprehension changes using a Web-based teaching system that can diagnose and remediate students' science listening comprehension problems during scientific inquiry. The 3-component system consisted of a 9-item science listening comprehension test, a 37-item diagnostic test,…
Emotion word comprehension from 4 to 16 years old: a developmental survey.
Baron-Cohen, Simon; Golan, Ofer; Wheelwright, Sally; Granader, Yael; Hill, Jacqueline
2010-01-01
Whilst previous studies have examined comprehension of the emotional lexicon at different ages in typically developing children, no survey has been conducted looking at this across different ages from childhood to adolescence. To report how the emotion lexicon grows with age. Comprehension of 336 emotion words was tested in n = 377 children and adolescents, aged 4-16 years old, divided into 6 age-bands. Parents or teachers of children under 12, or adolescents themselves, were asked to indicate which words they knew the meaning of. Between 4 and 11 years old, the size of the emotional lexicon doubled every 2 years, but between 12 and 16 years old, developmental rate of growth of the emotional lexicon leveled off. This survey also allows emotion words to be ordered in terms of difficulty. Studies using emotion terms in English need to be developmentally sensitive, since during childhood there is considerable change. The absence of change after adolescence may be an artifact of the words included in this study. This normative developmental data-set for emotion vocabulary comprehension may be useful when testing for delays in this ability, as might arise for environmental or neurodevelopmental reasons.
NASA Astrophysics Data System (ADS)
Li, W.; Shao, H.
2017-12-01
For geospatial cyberinfrastructure-enabled web services, the ability to rapidly transmit and share spatial data over the Internet plays a critical role in meeting the demands of real-time change detection, response, and decision-making. Vector datasets in particular serve as irreplaceable and concrete material in data-driven geospatial applications: their rich geometry and property information facilitates the development of interactive, efficient, and intelligent data analysis and visualization applications. However, big-data issues have hindered the wide adoption of vector datasets in web services. In this research, we propose a comprehensive optimization strategy to enhance the performance of vector data transmission and processing. This strategy combines: 1) pre-computed and on-the-fly generalization, which automatically determines the proper simplification level by introducing an appropriate distance tolerance (ADT) to meet various visualization requirements, and at the same time speeds up simplification; 2) a progressive attribute transmission method to reduce data size and therefore service response time; 3) compressed data transmission and dynamic adoption of a compression method to maximize service efficiency under different computing and network environments. A cyberinfrastructure web portal was developed to implement the proposed technologies. After applying our optimization strategies, substantial performance enhancement was achieved. We expect this work to widen the use of web services providing vector data to support real-time spatial feature sharing, visual analytics, and decision-making.
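Generalization under a distance tolerance, as in item 1 of the strategy above, is classically done with Ramer-Douglas-Peucker simplification: a vertex is dropped when it lies within the tolerance of the chord joining a segment's endpoints. The sketch below is a generic implementation of that idea, not the paper's ADT selection logic.

```python
import numpy as np

def rdp(points, tol):
    """Ramer-Douglas-Peucker polyline simplification with distance tolerance tol."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    seg = end - start
    seg_len = np.hypot(*seg)
    rel = points[1:-1] - start
    if seg_len == 0.0:
        d = np.hypot(rel[:, 0], rel[:, 1])
    else:
        # Perpendicular distance of interior vertices to the start-end chord.
        d = np.abs(seg[0] * rel[:, 1] - seg[1] * rel[:, 0]) / seg_len
    i = np.argmax(d)
    if d[i] <= tol:
        return np.array([start, end])       # all interior vertices dropped
    split = i + 1
    left = rdp(points[: split + 1], tol)    # recurse on both halves
    right = rdp(points[split:], tol)
    return np.vstack([left[:-1], right])

# A noisy line keeps fewer vertices as the tolerance coarsens.
t = np.linspace(0, 1, 200)
line = np.column_stack([t, t + 0.01 * np.sin(40 * t)])
for tol in (0.001, 0.01, 0.05):
    print(f"tol={tol}: {len(rdp(line, tol))} of {len(line)} vertices kept")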
MoCha: Molecular Characterization of Unknown Pathways.
Lobo, Daniel; Hammelman, Jennifer; Levin, Michael
2016-04-01
Automated methods for the reverse-engineering of complex regulatory networks are paving the way for the inference of mechanistic comprehensive models directly from experimental data. These novel methods can infer not only the relations and parameters of the known molecules defined in their input datasets, but also unknown components and pathways identified as necessary by the automated algorithms. Identifying the molecular nature of these unknown components is a crucial step for making testable predictions and experimentally validating the models, yet no specific and efficient tools exist to aid in this process. To this end, we present here MoCha (Molecular Characterization), a tool optimized for the search of unknown proteins and their pathways from a given set of known interacting proteins. MoCha uses the comprehensive dataset of protein-protein interactions provided by the STRING database, which currently includes more than a billion interactions from over 2,000 organisms. MoCha is highly optimized, performing typical searches within seconds. We demonstrate the use of MoCha with the characterization of unknown components from reverse-engineered models from the literature. MoCha is useful for working on network models by hand or as a downstream step of a model inference engine workflow and represents a valuable and efficient tool for the characterization of unknown pathways using known data from thousands of organisms. MoCha and its source code are freely available online under the GPLv3 license.
NASA Astrophysics Data System (ADS)
Kotlarski, Sven; Gutiérrez, José M.; Boberg, Fredrik; Bosshard, Thomas; Cardoso, Rita M.; Herrera, Sixto; Maraun, Douglas; Mezghani, Abdelkader; Pagé, Christian; Räty, Olle; Stepanek, Petr; Soares, Pedro M. M.; Szabo, Peter
2016-04-01
VALUE is an open European network to validate and compare downscaling methods for climate change research (http://www.value-cost.eu). A key deliverable of VALUE is the development of a systematic validation framework to enable the assessment and comparison of downscaling methods. Such assessments can be expected to depend crucially on the existence of accurate and reliable observational reference data. In dynamical downscaling, observational data can influence model development itself and, later on, model evaluation, parameter calibration and added-value assessment. In empirical-statistical downscaling, observations serve as predictand data and directly influence model calibration, with corresponding effects on downscaled climate change projections. Here we present a comprehensive assessment of the influence of uncertainties in observational reference data, and of scale-related issues, on several of the above-mentioned aspects. First, temperature and precipitation characteristics as simulated by a set of reanalysis-driven EURO-CORDEX RCM experiments are validated against three different gridded reference data products, namely (1) the EOBS dataset, (2) the recently developed EURO4M-MESAN regional re-analysis, and (3) several national high-resolution and quality-controlled gridded datasets that recently became available. The analysis reveals a considerable influence of the choice of reference data on the evaluation results, especially for precipitation. It is also illustrated how differences between the reference datasets influence the ranking of RCMs according to a comprehensive set of performance measures.
Interoperable Solar Data and Metadata via LISIRD 3
NASA Astrophysics Data System (ADS)
Wilson, A.; Lindholm, D. M.; Pankratz, C. K.; Snow, M. A.; Woods, T. N.
2015-12-01
LISIRD 3 is a major upgrade of the LASP Interactive Solar Irradiance Data Center (LISIRD), which serves several dozen space-based solar irradiance and related data products to the public. Through interactive plots, LISIRD 3 provides data browsing supported by data subsetting and aggregation. Because LISIRD 3 incorporates a semantically enabled metadata repository, users see current, vetted, consistent information about the datasets offered. Users can now also search for datasets based on metadata fields such as dataset type and/or spectral or temporal range. This semantic database enables metadata browsing, so users can discover the relationships between datasets, instruments, spacecraft, missions and PIs. The database also enables creation and publication of metadata records in a variety of formats, such as SPASE or ISO, making these datasets more discoverable. It further opens the possibility of a public SPARQL endpoint, making the metadata browsable in an automated fashion. LISIRD 3's data access middleware, LaTiS, provides dynamic, on-demand reformatting of data and timestamps, subsetting and aggregation, and other server-side functionality via a RESTful OPeNDAP-compliant API, enabling interoperability between LASP datasets and many common tools. LISIRD 3's templated front-end design, coupled with the uniform data interface offered by LaTiS, allows easy integration of new datasets. Consequently the number and variety of datasets offered by LISIRD has grown to encompass several dozen, with many more to come. This poster will discuss the design and implementation of LISIRD 3, including tools used, capabilities enabled, and issues encountered.
EEG datasets for motor imagery brain-computer interface.
Cho, Hohyun; Ahn, Minkyu; Ahn, Sangtae; Kwon, Moonyoung; Jun, Sung Chan
2017-07-01
Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not by evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in that the user can generate induced activity by imagining motor movements. However, variations in performance over sessions and subjects are too severe to overcome easily; therefore, a basic understanding and investigation of BCI performance variation is necessary to find critical evidence of performance variation. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. We validated our EEG datasets by using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we showed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of datasets (38 subjects) included reasonably discriminative information. Our EEG datasets included the information necessary to determine statistical significance; they consisted of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also achieve subject-to-subject transfer by using metadata, including a questionnaire, EEG coordinates, and EEGs for non-task-related states. © The Authors 2017. Published by Oxford University Press.
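The ERD/ERS validation mentioned above reduces to a band-power contrast: the percentage change of mu-band power in a task window relative to a pre-cue baseline, negative for desynchronization. The sketch below runs on simulated single-channel trials; the sampling rate, band, and window choices are illustrative assumptions rather than the dataset's actual parameters.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def erd_percent(trials, fs, band=(8, 12), base=(0.0, 2.0), task=(4.0, 6.0)):
    """ERD/ERS in percent: negative values mean desynchronization during imagery.

    trials: array (n_trials, n_samples) for one channel, e.g. C3 or C4.
    """
    b, a = butter(4, np.array(band) / (fs / 2), btype="bandpass")
    power = filtfilt(b, a, trials, axis=1) ** 2
    win = lambda w: slice(int(w[0] * fs), int(w[1] * fs))
    p_base = power[:, win(base)].mean()
    p_task = power[:, win(task)].mean()
    return 100.0 * (p_task - p_base) / p_base

# Toy trials: a 10 Hz rhythm whose amplitude drops after a "cue" at t = 3 s.
fs, n_trials, dur = 512, 30, 7.0
t = np.arange(int(dur * fs)) / fs
rng = np.random.default_rng(3)
amp = np.where(t < 3.0, 1.0, 0.5)          # post-cue attenuation mimics ERD
trials = amp * np.sin(2 * np.pi * 10 * t) + 0.2 * rng.standard_normal((n_trials, t.size))
print(f"ERD = {erd_percent(trials, fs):.1f}%   (negative means ERD)")
```

Comparing this quantity between contralateral and ipsilateral channels is what reveals the ERD/ERS lateralization pattern the abstract reports.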
Structural covariance networks across healthy young adults and their consistency.
Guo, Xiaojuan; Wang, Yan; Guo, Taomei; Chen, Kewei; Zhang, Jiacai; Li, Ke; Jin, Zhen; Yao, Li
2015-08-01
To investigate structural covariance networks (SCNs) as measured by regional gray matter volumes with structural magnetic resonance imaging (MRI) from healthy young adults, and to examine their consistency and stability. Two independent cohorts were included in this study: Group 1 (82 healthy subjects aged 18-28 years) and Group 2 (109 healthy subjects aged 20-28 years). Structural MRI data were acquired at 3.0T and 1.5T using a magnetization prepared rapid-acquisition gradient echo sequence for these two groups, respectively. We applied independent component analysis (ICA) to construct SCNs and further applied the spatial overlap ratio and correlation coefficient to evaluate the spatial consistency of the SCNs between these two datasets. Seven and six independent components were identified for Group 1 and Group 2, respectively. Moreover, six SCNs including the posterior default mode network, the visual and auditory networks consistently existed across the two datasets. The overlap ratios and correlation coefficients of the visual network reached the maximums of 72% and 0.71. This study demonstrates the existence of consistent SCNs corresponding to general functional networks. These structural covariance findings may provide insight into the underlying organizational principles of brain anatomy. © 2014 Wiley Periodicals, Inc.
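The consistency evaluation described above can be made concrete: an overlap ratio between the suprathreshold voxels of two component maps plus their spatial correlation. The sketch below uses a toy Jaccard-style overlap on synthetic maps; the study's exact thresholding and overlap definition may differ.

```python
import numpy as np

def spatial_consistency(map_a, map_b, z_thresh=2.0):
    """Overlap ratio and spatial correlation between two component maps.

    map_a, map_b: 1-D arrays of z-scored component loadings over voxels.
    The overlap here is Jaccard-style (intersection over union of masks).
    """
    mask_a = np.abs(map_a) > z_thresh
    mask_b = np.abs(map_b) > z_thresh
    overlap = (mask_a & mask_b).sum() / max((mask_a | mask_b).sum(), 1)
    corr = np.corrcoef(map_a, map_b)[0, 1]
    return overlap, corr

# Synthetic maps: the same underlying network observed in a second cohort.
rng = np.random.default_rng(4)
base = rng.standard_normal(10_000)
noisy = 0.8 * base + 0.6 * rng.standard_normal(10_000)
ratio, corr = spatial_consistency(base, noisy)
print(f"overlap ratio = {ratio:.2f}, spatial correlation = {corr:.2f}")
```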
Picking ChIP-seq peak detectors for analyzing chromatin modification experiments
Micsinai, Mariann; Parisi, Fabio; Strino, Francesco; Asp, Patrik; Dynlacht, Brian D.; Kluger, Yuval
2012-01-01
Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads to identify enriched regions of any length. To objectively assess its performance relative to 14 other ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that typically optimal parameters in one dataset do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development. PMID:22307239
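The core detection step described above, joining neighboring reads and testing the resulting regions against background, can be illustrated with a single-pass toy caller; Qeseq's iterative recalibration is omitted. The uniform Poisson background model and all parameters below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

def enriched_regions(read_starts, genome_len, join_dist=200, alpha=1e-5):
    """Toy single-pass peak caller: neighbor-join reads, then test each
    candidate region against a uniform Poisson background."""
    reads = np.sort(np.asarray(read_starts))
    # Join neighboring reads closer than join_dist into candidate regions.
    breaks = np.where(np.diff(reads) > join_dist)[0]
    lam = len(reads) / genome_len                 # background reads per bp
    regions = []
    for g in np.split(reads, breaks + 1):
        length = max(int(g[-1] - g[0]), 1)
        pval = poisson.sf(len(g) - 1, lam * length)   # P(>= len(g) reads)
        if pval < alpha:
            regions.append((int(g[0]), int(g[-1]), len(g), pval))
    return regions

rng = np.random.default_rng(5)
background = rng.integers(0, 1_000_000, 2000)          # diffuse background reads
peak = rng.normal(500_000, 150, 300).astype(int)       # one enriched site
for start, end, n, p in enriched_regions(np.concatenate([background, peak]), 1_000_000):
    print(f"region {start}-{end}: {n} reads, p = {p:.2e}")
```

Because regions grow by chaining nearby reads rather than by fixed windows, enriched regions of any length can emerge, which is the property the abstract highlights.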
Big Data in HEP: A comprehensive use case study
Gutsche, Oliver; Cremonesi, Matteo; Elmer, Peter; ...
2017-11-23
Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems, collectively called Big Data technologies, have emerged to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and promise a fresh look at analysis of very large datasets, and could potentially reduce the time-to-physics with increased interactivity. In this talk, we present an active LHC Run 2 analysis, searching for dark matter with the CMS detector, as a testbed for Big Data technologies. We directly compare the traditional NTuple-based analysis with an equivalent analysis using Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the analysis with the official experiment data formats and produce publication physics plots. Lastly, we will discuss advantages and disadvantages of each approach and give an outlook on further studies needed.
Big Data in HEP: A comprehensive use case study
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gutsche, Oliver; Cremonesi, Matteo; Elmer, Peter
Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems, collectively called Big Data technologies, have emerged to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and promise a fresh look at analysis of very large datasets, and could potentially reduce the time-to-physics with increased interactivity. In this talk, we present an active LHC Run 2 analysis, searching for dark matter with the CMS detector, as a testbed for Big Data technologies. We directly compare the traditional NTuple-based analysis with an equivalent analysis using Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the analysis with the official experiment data formats and produce publication physics plots. Lastly, we will discuss advantages and disadvantages of each approach and give an outlook on further studies needed.
Testing the Neutral Theory of Biodiversity with Human Microbiome Datasets.
Li, Lianwei; Ma, Zhanshan Sam
2016-08-16
The human microbiome project (HMP) has made it possible to test important ecological theories for arguably the most important ecosystem to human health: the human microbiome. The limited number of existing studies have reported conflicting evidence in the case of the neutral theory; the present study aims to comprehensively test the neutral theory with extensive HMP datasets covering all five major body sites inhabited by the human microbiome. Utilizing 7437 datasets of bacterial community samples, we discovered that only 49 communities (less than 1%) satisfied the neutral theory, and concluded that human microbial communities are not neutral in general. The 49 positive cases, although only a tiny minority, do demonstrate the existence of neutral processes. We find that the traditional doctrine of microbial biogeography, "Everything is everywhere, but the environment selects", first proposed by Baas-Becking, resolves the apparent contradiction. The first part of the Baas-Becking doctrine states that microbes are not dispersal-limited and are therefore prone to neutrality, and the second part reiterates that the freely dispersed microbes must endure selection by the environment. Therefore, in most cases, it is the host environment that ultimately shapes the community assembly and tips the human microbiome toward the niche regime.
Testing the Neutral Theory of Biodiversity with Human Microbiome Datasets
Li, Lianwei; Ma, Zhanshan (Sam)
2016-01-01
The human microbiome project (HMP) has made it possible to test important ecological theories for arguably the most important ecosystem to human health: the human microbiome. The limited number of existing studies have reported conflicting evidence in the case of the neutral theory; the present study aims to comprehensively test the neutral theory with extensive HMP datasets covering all five major body sites inhabited by the human microbiome. Utilizing 7437 datasets of bacterial community samples, we discovered that only 49 communities (less than 1%) satisfied the neutral theory, and concluded that human microbial communities are not neutral in general. The 49 positive cases, although only a tiny minority, do demonstrate the existence of neutral processes. We find that the traditional doctrine of microbial biogeography, “Everything is everywhere, but the environment selects”, first proposed by Baas-Becking, resolves the apparent contradiction. The first part of the Baas-Becking doctrine states that microbes are not dispersal-limited and are therefore prone to neutrality, and the second part reiterates that the freely dispersed microbes must endure selection by the environment. Therefore, in most cases, it is the host environment that ultimately shapes the community assembly and tips the human microbiome toward the niche regime. PMID:27527985
Picking ChIP-seq peak detectors for analyzing chromatin modification experiments.
Micsinai, Mariann; Parisi, Fabio; Strino, Francesco; Asp, Patrik; Dynlacht, Brian D; Kluger, Yuval
2012-05-01
Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads to identify enriched regions of any length. To objectively assess its performance relative to 14 other ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that typically optimal parameters in one dataset do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development.
SPRINT: ultrafast protein-protein interaction prediction of the entire human interactome.
Li, Yiwei; Ilie, Lucian
2017-11-15
Proteins usually perform their functions by interacting with other proteins. Predicting which proteins interact is a fundamental problem. Experimental methods are slow, expensive, and have a high rate of error. Many computational methods have been proposed, among which sequence-based ones are very promising. However, so far no such method has been able to predict the entire human interactome effectively: they require too much time or memory. We present SPRINT (Scoring PRotein INTeractions), a new sequence-based algorithm and tool for predicting protein-protein interactions. We comprehensively compare SPRINT with state-of-the-art programs on the seven most reliable human PPI datasets and show that it is more accurate while running orders of magnitude faster and using very little memory. SPRINT is the only sequence-based program that can effectively predict the entire human interactome: it requires between 15 and 100 min, depending on the dataset. Our goal is to transform the very challenging problem of predicting the entire human interactome into a routine task. The source code of SPRINT is freely available from https://github.com/lucian-ilie/SPRINT/ and the datasets and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/.
NASA Astrophysics Data System (ADS)
Huang, Xiaomeng; Hu, Chenqi; Huang, Xing; Chu, Yang; Tseng, Yu-heng; Zhang, Guang Jun; Lin, Yanluan
2018-01-01
Mesoscale convective systems (MCSs) are important components of tropical weather systems and the climate system. Long-term MCS data are of great significance in weather and climate research. Using long-term (1985-2008) global satellite infrared (IR) data, we developed a novel objective automatic tracking algorithm, which combines a Kalman filter (KF) with the conventional area-overlapping method, to generate a comprehensive MCS dataset. The new algorithm can effectively track small and fast-moving MCSs and thus obtain more realistic and complete tracking results than previous studies. A few examples are provided to illustrate the potential applications of the dataset, with a focus on the diurnal variations of MCSs over land and ocean regions. We find that MCSs occurring over land tend to initiate in the afternoon with greater intensity, whereas oceanic MCSs are more likely to initiate in the early morning with weaker intensity. A double peak in the maximum spatial coverage is noted over the western Pacific, especially over the southwestern Pacific during the austral summer. Oceanic MCSs also persist for approximately 1 h longer than their continental counterparts.
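The tracking idea described above pairs a constant-velocity Kalman filter, which predicts where a convective system's centroid should appear in the next IR frame, with overlap- or distance-based matching. A minimal centroid-only sketch follows; the state model, noise settings, and example track are invented for illustration and are not the paper's implementation.

```python
import numpy as np

class CentroidKF:
    """Constant-velocity Kalman filter for one track's centroid (toy sketch)."""

    def __init__(self, xy, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])       # state [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                          # state covariance
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt                   # position += velocity * dt
        self.H = np.eye(2, 4)                              # we observe position only
        self.Q = np.eye(4) * q                             # process noise
        self.R = np.eye(2) * r                             # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P

# For a small, fast-moving MCS, frame-to-frame area overlap can vanish, but the
# KF prediction can still be matched to the nearest candidate cluster.
kf = CentroidKF((100.0, 50.0))
for z in [(104.1, 52.0), (108.2, 53.9), (112.0, 56.1)]:    # hourly centroids
    pred = kf.predict()
    print(f"predicted {pred.round(1)}, observed {z}")
    kf.update(z)
```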
Re-evaluating the link between brain size and behavioural ecology in primates.
Powell, Lauren E; Isler, Karin; Barton, Robert A
2017-10-25
Comparative studies have identified a wide range of behavioural and ecological correlates of relative brain size, with results differing between taxonomic groups, and even within them. In primates for example, recent studies contradict one another over whether social or ecological factors are critical. A basic assumption of such studies is that with sufficiently large samples and appropriate analysis, robust correlations indicative of selection pressures on cognition will emerge. We carried out a comprehensive re-examination of correlates of primate brain size using two large comparative datasets and phylogenetic comparative methods. We found evidence in both datasets for associations between brain size and ecological variables (home range size, diet and activity period), but little evidence for an effect of social group size, a correlation which has previously formed the empirical basis of the Social Brain Hypothesis. However, reflecting divergent results in the literature, our results exhibited instability across datasets, even when they were matched for species composition and predictor variables. We identify several potential empirical and theoretical difficulties underlying this instability and suggest that these issues raise doubts about inferring cognitive selection pressures from behavioural correlates of brain size. © 2017 The Author(s).
SAR image dataset of military ground targets with multiple poses for ATR
NASA Astrophysics Data System (ADS)
Belloni, Carole; Balleri, Alessio; Aouf, Nabil; Merlet, Thomas; Le Caillec, Jean-Marc
2017-10-01
Automatic Target Recognition (ATR) is the task of automatically detecting and classifying targets. Recognition using Synthetic Aperture Radar (SAR) images is interesting because SAR images can be acquired at night and under any weather conditions, whereas optical sensors operating in the visible band do not have this capability. Existing SAR ATR algorithms have mostly been evaluated using the MSTAR dataset [1]. The problem with MSTAR is that some of the proposed ATR methods have shown good classification performance even when targets were hidden [2], suggesting the presence of a bias in the dataset. Evaluations of SAR ATR techniques are currently challenging due to the lack of publicly available data in the SAR domain. In this paper, we present a high-resolution SAR dataset consisting of images of a set of ground military target models taken at various aspect angles. The dataset can be used for a fair evaluation and comparison of SAR ATR algorithms. We applied the Inverse Synthetic Aperture Radar (ISAR) technique to echoes from targets rotating on a turntable and illuminated with a stepped-frequency waveform. The targets in the database consist of four variants of two 1.7 m-long models of T-64 and T-72 tanks. The gun, the turret position and the depression angle are varied to form 26 different sequences of images. The emitted signal spanned the frequency range from 13 GHz to 18 GHz to achieve a bandwidth of 5 GHz sampled with 4001 frequency points. The resolution obtained with respect to the size of the model targets is comparable to typical values obtained using SAR airborne systems. Single-polarized (Horizontal-Horizontal) images are generated using the backprojection algorithm [3]. A total of 1480 images are produced using a 20° integration angle. The images in the dataset are organized into a suggested training and testing set to facilitate a standard evaluation of SAR ATR algorithms.
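The imaging step named above, backprojection of turntable echoes acquired with a stepped-frequency waveform, has a compact far-field form: for every image pixel, the phase expected at each frequency and angle is matched and the returns are summed coherently. The sketch below simulates point scatterers and images them; the geometry, grid, and scatterer layout are invented for illustration.

```python
import numpy as np

c = 3e8
freqs = np.linspace(13e9, 18e9, 200)                 # stepped-frequency sweep
angles = np.deg2rad(np.linspace(0, 20, 80))          # 20 deg integration angle

# Point scatterers on the turntable standing in for a target model (metres).
scat = np.array([[0.3, 0.1], [-0.4, -0.2], [0.0, 0.35]])

# Simulated far-field echoes: S(f, theta) = sum_k exp(-j 4 pi f r_k / c),
# with r_k each scatterer's projection on the line of sight at angle theta.
los = np.stack([np.cos(angles), np.sin(angles)])     # (2, n_angles)
r = scat @ los                                       # (n_scat, n_angles)
S = np.exp(-1j * 4 * np.pi * freqs[:, None, None] * r / c).sum(axis=1)

# Backprojection: match the expected phase at every pixel and sum coherently.
x = y = np.linspace(-0.6, 0.6, 121)
X, Y = np.meshgrid(x, y)
img = np.zeros_like(X, dtype=complex)
for ai, th in enumerate(angles):
    rp = X * np.cos(th) + Y * np.sin(th)             # pixel range at this angle
    img += np.einsum("f,fxy->xy", S[:, ai],
                     np.exp(1j * 4 * np.pi * freqs[:, None, None] * rp / c))

peak = np.unravel_index(np.abs(img).argmax(), img.shape)
print("brightest pixel near", (x[peak[1]].round(2), y[peak[0]].round(2)))
```

With 5 GHz of bandwidth and a 20° aperture, the point response is a few centimetres wide, so the brightest pixels land on the simulated scatterer positions.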
NASA Astrophysics Data System (ADS)
Tarquini, S.; Nannipieri, L.; Favalli, M.; Fornaciai, A.; Vinci, S.; Doumaz, F.
2012-04-01
Digital elevation models (DEMs) are fundamental to any kind of environmental or morphological study. DEMs are obtained from a variety of sources and generated in several ways. Nowadays, a few global-coverage elevation datasets are available for free (e.g., SRTM, http://www.jpl.nasa.gov/srtm; ASTER, http://asterweb.jpl.nasa.gov/). When the matrix of a DEM is also used for computational purposes, the choice of the elevation dataset that best suits the target of the study is crucial. The increasing use of DEM-based numerical simulation tools (e.g., for gravity-driven mass flows) would benefit greatly from topography of higher resolution and accuracy than is available at planetary scale, yet such elevation datasets are neither easily nor freely available for all countries. Here we introduce a new web resource that makes a 10-m-resolution DEM of the whole Italian territory freely available for research purposes. The creation of this elevation dataset was presented by Tarquini et al. (2007). The DEM was first built in triangular irregular network (TIN) format from heterogeneous vector datasets, consisting mostly of elevation contour lines and elevation points derived from several sources. The input vector database was carefully cleaned to obtain an improved seamless TIN, which was then refined with the DEST algorithm to improve the Delaunay tessellation. The whole TINITALY/01 DEM was converted to grid format (10-m cell size) in a tiled structure of 193 square tiles, each 50 km on a side. The grid database consists of more than 3 billion cells and occupies almost 12 GB of disk space. A web GIS has been created (http://tinitaly.pi.ingv.it/) where a seamless layer of full-resolution (10 m) images derived from the whole DEM (in both color-shaded and anaglyph mode) is open for browsing. Accredited users are allowed to download the elevation dataset.
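The quoted sizes are easy to cross-check. At 10-m cell size, each 50-km tile is a 5000 x 5000 grid; 193 such tiles give an upper bound of roughly 4.8 billion cells, and "more than 3 billion" stored cells at an assumed 4 bytes per cell (float32 elevations; the abstract does not state the storage format) matches the reported size of almost 12 GB. A quick check in Python:

tile_side_m = 50_000                    # 50-km square tiles
cell_size_m = 10                        # 10-m grid resolution
n_tiles = 193

cells_per_tile = (tile_side_m // cell_size_m) ** 2   # 5000 x 5000 = 25,000,000
max_cells = n_tiles * cells_per_tile                 # ~4.8 billion if every tile were full
print(f"upper bound: {max_cells:,} cells")

# Coastal and border tiles are only partly covered by Italian territory,
# so the stored count is lower: "more than 3 billion" per the abstract.
stored_cells = 3.2e9                    # assumed order of magnitude
bytes_per_cell = 4                      # assuming float32 elevations
print(f"approx. size: {stored_cells * bytes_per_cell / 2**30:.1f} GiB")  # ~11.9, i.e. almost 12 GB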
In vitro downregulated hypoxia transcriptome is associated with poor prognosis in breast cancer.
Abu-Jamous, Basel; Buffa, Francesca M; Harris, Adrian L; Nandi, Asoke K
2017-06-15
Hypoxia is a characteristic of breast tumours indicating poor prognosis. Based on the assumption that genes up-regulated under hypoxia in cell lines should predict poor prognosis in clinical data, many signatures of poor prognosis have been identified. However, cell-line data do not always concur with clinical data, and therefore conclusions from cell-line analysis should be treated with caution. As many transcriptomic cell-line datasets from hypoxia-related contexts are available, integrative approaches are required that investigate these datasets collectively while not ignoring clinical data. We analyse sixteen heterogeneous breast cancer cell-line transcriptomic datasets from hypoxia-related conditions collectively by employing UNCLES, a method that integrates clustering results from multiple datasets and can address questions that existing methods cannot; we demonstrate this by comparison with the state-of-the-art iCluster method. From this collection of genome-wide datasets, covering 15,588 genes, UNCLES identified a relatively large number of genes (>1000 overall) that are consistently co-regulated across all of the datasets, some of which are still poorly understood and represent potential new HIF targets, such as RSBN1 and KIAA0195. Two main, anti-correlated clusters were identified; the first is enriched with MYC targets participating in growth and proliferation, while the other is enriched with HIF targets directly participating in the hypoxia response. Surprisingly, in six clinical datasets, some sub-clusters of growth genes are consistently positively correlated with hypoxia-response genes, unlike the observation in cell lines. Moreover, the ability of a combined signature of one sub-cluster of growth genes and one sub-cluster of hypoxia-induced genes to predict poor prognosis appears comparable to, and perhaps greater than, that of known hypoxia signatures. We present a clustering approach suitable for integrating data from diverse experimental set-ups. Its application to breast cancer cell-line datasets reveals new hypoxia-regulated gene signatures that behave differently in vitro (cell-line data) than in vivo (clinical data), and whose prognostic value is comparable to or exceeds that of state-of-the-art hypoxia signatures.
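The core idea of integrating clustering results across datasets can be illustrated with a minimal consensus-clustering sketch: cluster each dataset independently, then retain gene pairs whose co-membership is reproduced in every dataset. This illustrates the general principle only, not the UNCLES implementation (UNCLES additionally handles anti-correlated clusters and external specifications); the dataset shapes, number of clusters, and random data below are hypothetical stand-ins.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_genes, k = 500, 2
# Each dataset: a genes x samples expression matrix (random stand-ins here).
datasets = [rng.normal(size=(n_genes, s)) for s in (8, 12, 6)]

# Co-membership matrix per dataset: entry (i, j) is 1 if genes i and j
# fall in the same cluster; averaging over datasets yields a consensus.
consensus = np.zeros((n_genes, n_genes))
for X in datasets:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    consensus += (labels[:, None] == labels[None, :]).astype(float)
consensus /= len(datasets)

# Gene pairs with consensus 1.0 were co-clustered in every dataset;
# genes consistently co-regulated across all datasets form tight blocks.
always_together = np.argwhere(np.triu(consensus == 1.0, k=1))
print(f"{len(always_together)} gene pairs co-clustered in all datasets")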