Sample records for dataset working group

  1. Reference datasets for bioequivalence trials in a two-group parallel design.

    PubMed

    Fuglsang, Anders; Schütz, Helmut; Labes, Detlew

    2015-03-01

In order to help companies qualify and validate the software used to evaluate bioequivalence trials with two parallel treatment groups, this work aims to define datasets with known results. This paper puts a total of 11 datasets into the public domain along with a proposed consensus obtained via evaluations from six different software packages (R, SAS, WinNonlin, OpenOffice Calc, Kinetica, EquivTest). Insofar as possible, datasets were evaluated with and without the assumption of equal variances for the construction of a 90% confidence interval. Not all software packages provide functionality for the assumption of unequal variances (EquivTest, Kinetica), and not all packages can handle datasets with more than 1000 subjects per group (WinNonlin). Where results could be obtained across all packages, one (Kinetica) showed questionable results when datasets contained unequal group sizes. A proposal is made for the results that should be used as validation targets.
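The evaluation being validated here can be sketched in a few lines. The following is a generic illustration of the parallel-design 90% confidence interval for the ratio of geometric means (computed on log-transformed data), with and without the equal-variance assumption; the function name and example data are ours, not the paper's reference implementation:

```python
import numpy as np
from scipy import stats

def be_ci_90(test, ref, pooled=True):
    """90% CI for the ratio of geometric means in a two-group
    parallel design, computed on log-transformed observations."""
    x, y = np.log(test), np.log(ref)
    n1, n2 = len(x), len(y)
    d = x.mean() - y.mean()
    if pooled:  # equal-variance assumption: pooled variance, df = n1+n2-2
        sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
        se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:       # unequal variances: Welch-Satterthwaite degrees of freedom
        v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2
        se = np.sqrt(v1 + v2)
        df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    t = stats.t.ppf(0.95, df)           # two-sided 90% CI
    return np.exp(d - t * se), np.exp(d + t * se)
```

Differences between packages of the kind the paper reports would show up in the `pooled=False` branch, which some packages do not offer.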

  2. A Research Graph dataset for connecting research data repositories using RD-Switchboard.

    PubMed

    Aryani, Amir; Poblet, Marta; Unsworth, Kathryn; Wang, Jingbo; Evans, Ben; Devaraju, Anusuriya; Hausstein, Brigitte; Klas, Claus-Peter; Zapilko, Benjamin; Kaplun, Samuele

    2018-05-29

    This paper describes the open access graph dataset that shows the connections between Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim to discover and connect the related research datasets based on publication co-authorship or jointly funded grants. The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation. Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.

  3. ClimateNet: A Machine Learning dataset for Climate Science Research

    NASA Astrophysics Data System (ADS)

    Prabhat, M.; Biard, J.; Ganguly, S.; Ames, S.; Kashinath, K.; Kim, S. K.; Kahou, S.; Maharaj, T.; Beckham, C.; O'Brien, T. A.; Wehner, M. F.; Williams, D. N.; Kunkel, K.; Collins, W. D.

    2017-12-01

Deep Learning techniques have revolutionized commercial applications in computer vision, speech recognition and control systems. The key to all of these developments was the creation of a curated, labeled dataset, ImageNet, which enabled multiple research groups around the world to develop methods, benchmark performance and compete with each other. The success of Deep Learning can be largely attributed to the broad availability of this dataset. Our empirical investigations have revealed that Deep Learning is similarly poised to benefit the task of pattern detection in climate science. Unfortunately, labeled datasets, a key pre-requisite for training, are hard to find. Individual research groups are typically interested in specialized weather patterns, making it hard to unify and share datasets across groups and institutions. In this work, we propose ClimateNet: a dataset that provides labeled instances of extreme weather patterns, as well as associated raw fields in model and observational output. We develop a schema in NetCDF to enumerate weather pattern classes/types, store bounding boxes, and pixel-masks. We are also working on a TensorFlow implementation to natively import such NetCDF datasets, and are providing a reference convolutional architecture for binary classification tasks. Our hope is that researchers in climate science, as well as ML/DL, will be able to use (and extend) ClimateNet to make rapid progress in the application of Deep Learning to climate science research.
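A NetCDF schema of the kind described (classes, bounding boxes, pixel masks alongside raw fields) might look like the following CDL sketch; every dimension and variable name here is an illustrative assumption of ours, not the actual ClimateNet schema:

```
netcdf climatenet_sample {
dimensions:
    time = UNLIMITED ;
    lat = 768 ;
    lon = 1152 ;
    event = 32 ;
    corner = 4 ;
variables:
    float TMQ(time, lat, lon) ;         // raw model/observational field
    int   event_class(time, event) ;    // e.g. 1 = tropical cyclone, 2 = atmospheric river
    float bbox(time, event, corner) ;   // ymin, xmin, ymax, xmax in grid coordinates
    byte  pixel_mask(time, lat, lon) ;  // per-pixel class label
}
```

A native TensorFlow importer would then map `TMQ` (and sibling fields) to input tensors and `pixel_mask`/`bbox` to training targets.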

  4. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. 
A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history. By assessing inter-dataset consistency across similar paths, we quantify travel-time measurement errors for both surface and body waves. Finally, we discuss challenges associated with combining high-frequency (1 Hz) and long-period (10-20 s) body-wave measurements into the REM-3D reference dataset.
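Step (3) of the procedures above, homogenization of coverage through summary rays, can be sketched generically: paths whose sources and receivers fall in the same geographic cells are collapsed to a single median residual. The cell size, key scheme and function name below are our illustrative assumptions, not the working group's actual procedure:

```python
import numpy as np
from collections import defaultdict

def summary_rays(src, rec, residuals, cell_deg=5.0):
    """Collapse individual travel-time residuals into summary rays:
    one median residual per (source cell, receiver cell) pair.
    src, rec: (n, 2) arrays of (lat, lon); residuals: (n,) seconds."""
    bins = defaultdict(list)
    for s, r, dt in zip(src, rec, residuals):
        key = (tuple(np.floor(s / cell_deg).astype(int)),   # source cell
               tuple(np.floor(r / cell_deg).astype(int)))   # receiver cell
        bins[key].append(dt)
    # median is robust to the outliers that step (2) aims to identify
    return {k: float(np.median(v)) for k, v in bins.items()}
```

Binning in this way evens out the uneven path coverage that would otherwise bias inversions toward densely sampled corridors.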

  5. Open NASA Earth Exchange (OpenNEX): Strategies for enabling cross organization collaboration in the earth sciences

    NASA Astrophysics Data System (ADS)

    Michaelis, A.; Ganguly, S.; Nemani, R. R.; Votava, P.; Wang, W.; Lee, T. J.; Dungan, J. L.

    2014-12-01

Sharing community-valued codes, intermediary datasets and results from individual efforts with others that are not in a direct funded collaboration can be a challenge. Cross-organization collaboration is often impeded by infrastructure security constraints, rigid financial controls, bureaucracy, workforce nationalities, etc., which can force groups to work in a segmented fashion and/or through awkward and suboptimal web services. We show how a focused community may come together and share modeling and analysis codes, computing configurations, scientific results, knowledge and expertise on a public cloud platform: diverse groups of researchers working together at "arm's length". Through the OpenNEX experimental workshop, users can view short technical "how-to" videos and explore encapsulated working environments. Workshop participants can easily instantiate Amazon Machine Images (AMIs) or launch full cluster and data processing configurations within minutes. Enabling users to instantiate computing environments from configuration templates on large public cloud infrastructures, such as Amazon Web Services, may provide a mechanism for groups to easily use each other's work and collaborate indirectly. Moreover, using the public cloud for this workshop allowed a single group to host a large read-only data archive, making datasets of interest to the community widely available on the public cloud and enabling other groups to connect directly to the data. This reduces the cost of collaborative work by freeing individual groups from redundantly retrieving, integrating or financing the storage of the datasets of interest.

  6. EnviroAtlas - Percentage of Working Age Population Who Are Employed by Block Group for the Conterminous United States

    EPA Pesticide Factsheets

This EnviroAtlas dataset shows the employment rate, or the percent of the population aged 16-64 who have worked in the past 12 months. The employment rate is a measure of the percent of the working-age population who are employed, and an indicator of the prevalence of unemployment, which economists often use to assess labor market conditions. It is a widely used metric for evaluating the sustainable development of communities (NRC, 2011; UNECE, 2009). This dataset is based on the American Community Survey 5-year data for 2008-2012. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  7. Report of the Association of Coloproctology of Great Britain and Ireland/British Society of Gastroenterology Colorectal Polyp Working Group: the development of a complex colorectal polyp minimum dataset.

    PubMed

    Chattree, A; Barbour, J A; Thomas-Gibson, S; Bhandari, P; Saunders, B P; Veitch, A M; Anderson, J; Rembacken, B J; Loughrey, M B; Pullan, R; Garrett, W V; Lewis, G; Dolwani, S; Rutter, M D

    2017-01-01

    The management of large non-pedunculated colorectal polyps (LNPCPs) is complex, with widespread variation in management and outcome, even amongst experienced clinicians. Variations in the assessment and decision-making processes are likely to be a major factor in this variability. The creation of a standardized minimum dataset to aid decision-making may therefore result in improved clinical management. An official working group of 13 multidisciplinary specialists was appointed by the Association of Coloproctology of Great Britain and Ireland (ACPGBI) and the British Society of Gastroenterology (BSG) to develop a minimum dataset on LNPCPs. The literature review used to structure the ACPGBI/BSG guidelines for the management of LNPCPs was used by a steering subcommittee to identify various parameters pertaining to the decision-making processes in the assessment and management of LNPCPs. A modified Delphi consensus process was then used for voting on proposed parameters over multiple voting rounds with at least 80% agreement defined as consensus. The minimum dataset was used in a pilot process to ensure rigidity and usability. A 23-parameter minimum dataset with parameters relating to patient and lesion factors, including six parameters relating to image retrieval, was formulated over four rounds of voting with two pilot processes to test rigidity and usability. This paper describes the development of the first reported evidence-based and expert consensus minimum dataset for the management of LNPCPs. It is anticipated that this dataset will allow comprehensive and standardized lesion assessment to improve decision-making in the assessment and management of LNPCPs. Colorectal Disease © 2016 The Association of Coloproctology of Great Britain and Ireland.

  8. Joint Blind Source Separation by Multi-set Canonical Correlation Analysis

    PubMed Central

    Li, Yi-Ou; Adalı, Tülay; Wang, Wei; Calhoun, Vince D

    2009-01-01

    In this work, we introduce a simple and effective scheme to achieve joint blind source separation (BSS) of multiple datasets using multi-set canonical correlation analysis (M-CCA) [1]. We first propose a generative model of joint BSS based on the correlation of latent sources within and between datasets. We specify source separability conditions, and show that, when the conditions are satisfied, the group of corresponding sources from each dataset can be jointly extracted by M-CCA through maximization of correlation among the extracted sources. We compare source separation performance of the M-CCA scheme with other joint BSS methods and demonstrate the superior performance of the M-CCA scheme in achieving joint BSS for a large number of datasets, group of corresponding sources with heterogeneous correlation values, and complex-valued sources with circular and non-circular distributions. We apply M-CCA to analysis of functional magnetic resonance imaging (fMRI) data from multiple subjects and show its utility in estimating meaningful brain activations from a visuomotor task. PMID:20221319
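The first M-CCA stage can be sketched in a MAXVAR-style form: whiten each dataset, then take the leading eigenvector of the stacked correlation matrix, whose blocks yield one maximally correlated canonical variate per dataset. This is our simplified numpy illustration of the idea, not the authors' implementation (which also handles subsequent components via deflation):

```python
import numpy as np

def mcca_first_component(datasets, rank):
    """MAXVAR-style first M-CCA component: whiten each dataset (top `rank`
    PCA directions), then find the leading eigenvector of the stacked
    correlation matrix. Returns one canonical variate per dataset."""
    Z = []
    for X in datasets:
        Xc = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        Z.append(U[:, :rank])              # whitened coordinates
    Ztot = np.hstack(Z)
    R = Ztot.T @ Ztot                      # block correlation matrix
    w = np.linalg.eigh(R)[1][:, -1]        # leading eigenvector
    out, i = [], 0
    for Zk in Z:                           # split w into per-dataset blocks
        out.append(Zk @ w[i:i + rank])
        i += rank
    return out
```

When the datasets share a common latent source, the returned variates recover it (up to sign and scale) and are highly correlated across sets, which is the separability property the paper formalizes.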

  9. Software tools for interactive instruction in radiologic anatomy.

    PubMed

    Alvarez, Antonio; Gold, Garry E; Tobin, Brian; Desser, Terry S

    2006-04-01

    To promote active learning in an introductory Radiologic Anatomy course through the use of computer-based exercises. DICOM datasets from our hospital PACS system were transferred to a networked cluster of desktop computers in a medical school classroom. Medical students in the Radiologic Anatomy course were divided into four small groups and assigned to work on a clinical case for 45 minutes. The groups used iPACS viewer software, a free DICOM viewer, to view images and annotate anatomic structures. The classroom instructor monitored and displayed each group's work sequentially on the master screen by running SynchronEyes, a software tool for controlling PC desktops remotely. Students were able to execute the assigned tasks using the iPACS software with minimal oversight or instruction. Course instructors displayed each group's work on the main display screen of the classroom as the students presented the rationale for their decisions. The interactive component of the course received high ratings from the students and overall course ratings were higher than in prior years when the course was given solely in lecture format. DICOM viewing software is an excellent tool for enabling students to learn radiologic anatomy from real-life clinical datasets. Interactive exercises performed in groups can be powerful tools for stimulating students to learn radiologic anatomy.

  10. Unbalanced 2 × 2 Factorial Designs and the Interaction Effect: A Troublesome Combination

    PubMed Central

    2015-01-01

In this power study, ANOVAs of unbalanced and balanced 2 × 2 datasets are compared (N = 120). Datasets are created under the assumption that H1 of the effects is true. The effects are constructed in two ways, assuming: 1. contributions to the effects solely in the treatment groups; 2. contrasting contributions in treatment and control groups. The main question is whether the two ANOVA correction methods for imbalance (applying Sums of Squares Type II or III; SS II or SS III) offer satisfactory power in the presence of an interaction. Overall, SS II showed higher power, but results varied strongly. When compared to a balanced dataset, for some unbalanced datasets the rejection rate of H0 of main effects was undesirably higher. SS III showed consistently somewhat lower power. When the effects were constructed with equal contributions from control and treatment groups, the interaction could be re-estimated satisfactorily. When an interaction was present, SS III led consistently to somewhat lower rejection rates of H0 of main effects, compared to the rejection rates found in equivalent balanced datasets, while SS II produced strongly varying results. In data constructed with only effects in the treatment groups and no effects in the control groups, the H0 of moderate and strong interaction effects was often not rejected and SS II seemed applicable. Even then, SS III provided slightly better results when a true interaction was present. ANOVA did not always allow for a satisfactory re-estimation of the unique interaction effect. Yet, SS II worked better only when an interaction effect could be excluded, whereas SS III results were just marginally worse in that case. Overall, SS III provided consistently 1 to 5% lower rejection rates of H0 in comparison with analyses of balanced datasets, while results of SS II varied too widely for general application. PMID:25807514
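The Type II/III distinction the study turns on can be made concrete via model comparisons: under sum-to-zero (effect) coding, the Type II SS for main effect A compares the additive model against the B-only model, while the Type III SS compares the full model against the full model with A removed. A hypothetical numpy sketch (not the study's code) that also exhibits their equality under balance:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def ss_main_effect_A(a, b, y):
    """Type II and Type III sums of squares for main effect A in a 2 x 2
    design with sum-to-zero coding; a and b are +/-1 factor codes."""
    one = np.ones_like(y)
    full   = np.column_stack([one, a, b, a * b])
    no_a   = np.column_stack([one, b, a * b])   # full model minus A (Type III)
    add    = np.column_stack([one, a, b])       # additive model
    b_only = np.column_stack([one, b])
    ss2 = rss(b_only, y) - rss(add, y)          # Type II: A adjusted for B
    ss3 = rss(no_a, y) - rss(full, y)           # Type III: A adjusted for B and AB
    return ss2, ss3
```

In a balanced design with effect coding the design columns are orthogonal, so the two types coincide; imbalance breaks that orthogonality, which is exactly where the study finds their power properties diverge.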

  11. Knowledge-Guided Robust MRI Brain Extraction for Diverse Large-Scale Neuroimaging Studies on Humans and Non-Human Primates

    PubMed Central

    Wang, Yaping; Nie, Jingxin; Yap, Pew-Thian; Li, Gang; Shi, Feng; Geng, Xiujuan; Guo, Lei; Shen, Dinggang

    2014-01-01

    Accurate and robust brain extraction is a critical step in most neuroimaging analysis pipelines. In particular, for the large-scale multi-site neuroimaging studies involving a significant number of subjects with diverse age and diagnostic groups, accurate and robust extraction of the brain automatically and consistently is highly desirable. In this paper, we introduce population-specific probability maps to guide the brain extraction of diverse subject groups, including both healthy and diseased adult human populations, both developing and aging human populations, as well as non-human primates. Specifically, the proposed method combines an atlas-based approach, for coarse skull-stripping, with a deformable-surface-based approach that is guided by local intensity information and population-specific prior information learned from a set of real brain images for more localized refinement. Comprehensive quantitative evaluations were performed on the diverse large-scale populations of ADNI dataset with over 800 subjects (55∼90 years of age, multi-site, various diagnosis groups), OASIS dataset with over 400 subjects (18∼96 years of age, wide age range, various diagnosis groups), and NIH pediatrics dataset with 150 subjects (5∼18 years of age, multi-site, wide age range as a complementary age group to the adult dataset). The results demonstrate that our method consistently yields the best overall results across almost the entire human life span, with only a single set of parameters. To demonstrate its capability to work on non-human primates, the proposed method is further evaluated using a rhesus macaque dataset with 20 subjects. Quantitative comparisons with popularly used state-of-the-art methods, including BET, Two-pass BET, BET-B, BSE, HWA, ROBEX and AFNI, demonstrate that the proposed method performs favorably with superior performance on all testing datasets, indicating its robustness and effectiveness. PMID:24489639

  12. Authorship Identification for Tamil Classical Poem using Subspace Discriminant Algorithm

    NASA Astrophysics Data System (ADS)

    Pandian, A.; Ramalingam, V. V.; Manikandan, K.; Vishnu Preet, R. P.

    2018-04-01

Authorship identification builds on stylometric analysis, which raises a number of interesting problems. Extracting particular kinds of features from a text makes it possible to identify the authors of works of unknown provenance. The focus of this paper is to identify the authors of an unlabelled Tamil dataset on the basis of the known works of candidate authors. Text processing is used to extract statistical features from the dataset, and classification is then performed on those features. The author of an unidentified poem or text can be determined by training a classifier on the candidate authors' previously known works and using it to classify the unknown poem or text, in any language; the procedure can be further extended to every regional language around the globe. Many literary researchers find it difficult to categorize poems whose authors have not been identified. By applying this procedure, the authors of various poems in the Tamil language can be recognized, which is of significant value to scholarship.
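A common, language-independent baseline for this kind of attribution is the character n-gram profile method (Kešelj et al.'s common n-grams), which applies to Tamil as readily as to any script. The stdlib-only sketch below is our illustration of that baseline, not the paper's subspace-discriminant algorithm:

```python
from collections import Counter

def profile(text, n=3, top=300):
    """Normalized frequencies of the `top` most common character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    common = grams.most_common(top)
    total = sum(c for _, c in common)
    return {g: c / total for g, c in common}

def dissimilarity(p, q):
    """Keselj-style relative distance between two n-gram profiles."""
    d = 0.0
    for g in set(p) | set(q):
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        d += (2 * (fp - fq) / (fp + fq)) ** 2
    return d

def attribute(unknown, known_texts):
    """Assign the unknown text to the author with the nearest profile."""
    up = profile(unknown)
    return min(known_texts,
               key=lambda author: dissimilarity(up, profile(known_texts[author])))
```

Because the features are raw character n-grams, no language-specific tokenization or morphology is needed, which is what makes the approach portable across regional languages.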

  13. Using assemblage data in ecological indicators: A comparison and evaluation of commonly available statistical tools

    USGS Publications Warehouse

    Smith, Joseph M.; Mather, Martha E.

    2012-01-01

    Ecological indicators are science-based tools used to assess how human activities have impacted environmental resources. For monitoring and environmental assessment, existing species assemblage data can be used to make these comparisons through time or across sites. An impediment to using assemblage data, however, is that these data are complex and need to be simplified in an ecologically meaningful way. Because multivariate statistics are mathematical relationships, statistical groupings may not make ecological sense and will not have utility as indicators. Our goal was to define a process to select defensible and ecologically interpretable statistical simplifications of assemblage data in which researchers and managers can have confidence. For this, we chose a suite of statistical methods, compared the groupings that resulted from these analyses, identified convergence among groupings, then we interpreted the groupings using species and ecological guilds. When we tested this approach using a statewide stream fish dataset, not all statistical methods worked equally well. For our dataset, logistic regression (Log), detrended correspondence analysis (DCA), cluster analysis (CL), and non-metric multidimensional scaling (NMDS) provided consistent, simplified output. Specifically, the Log, DCA, CL-1, and NMDS-1 groupings were ≥60% similar to each other, overlapped with the fluvial-specialist ecological guild, and contained a common subset of species. Groupings based on number of species (e.g., Log, DCA, CL and NMDS) outperformed groupings based on abundance [e.g., principal components analysis (PCA) and Poisson regression]. Although the specific methods that worked on our test dataset have generality, here we are advocating a process (e.g., identifying convergent groupings with redundant species composition that are ecologically interpretable) rather than the automatic use of any single statistical tool. 
We summarize this process in step-by-step guidance for the future use of these commonly available ecological and statistical methods in preparing assemblage data for use in ecological indicators.
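The convergence criterion above (groupings that are at least 60% similar to each other) can be made concrete with a pairwise Jaccard check over the species sets each method produces. Method labels and species names below are illustrative, not taken from the study's dataset:

```python
def jaccard(a, b):
    """Jaccard similarity of two species sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def convergent_groupings(groupings, threshold=0.6):
    """Return method pairs whose species groups meet the similarity
    threshold (the >=60% convergence criterion described above)."""
    names = list(groupings)
    return [(m1, m2, round(jaccard(groupings[m1], groupings[m2]), 2))
            for i, m1 in enumerate(names) for m2 in names[i + 1:]
            if jaccard(groupings[m1], groupings[m2]) >= threshold]
```

Convergent pairs flagged this way would then be checked against ecological guilds (e.g. fluvial specialists) before being adopted as indicators.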

  14. Comparison of Rehabilitation Outcomes for Long Term Neurological Conditions: A Cohort Analysis of the Australian Rehabilitation Outcomes Centre Dataset for Adults of Working Age

    PubMed Central

    Turner-Stokes, Lynne; Vanderstay, Roxana; Stevermuer, Tara; Simmonds, Frances; Khan, Fary; Eagar, Kathy

    2015-01-01

Objective: To describe and compare outcomes from in-patient rehabilitation (IPR) in working-aged adults across different groups of long-term neurological conditions, as defined by the UK National Service Framework. Design: Analysis of a large Australian prospectively collected dataset for completed IPR episodes (n = 28,596) from 2003-2012. Methods: De-identified data for adults (16–65 years) with specified neurological impairment codes were extracted, cleaned and divided into 'Sudden-onset' conditions (stroke (n = 12527), brain injury (n = 7565), spinal cord injury (SCI) (n = 3753), Guillain-Barré syndrome (GBS) (n = 805)) and 'Progressive/stable' conditions (progressive (n = 3750) and cerebral palsy (n = 196)). Key outcomes included Functional Independence Measure (FIM) scores, length of stay (LOS), and discharge destination. Results: Mean LOS ranged from 21–57 days with significant group differences in gender, source of admission and discharge destination. All six groups showed significant change (p<0.001) between admission and discharge that was likely to be clinically important across a range of items. Significant between-group differences were observed for FIM Motor and Cognitive change scores (Kruskal-Wallis p<0.001), and item-by-item analysis confirmed distinct patterns for each of the six groups. SCI and GBS patients were generally at the ceiling of the cognitive subscale. The 'Progressive/stable' conditions made smaller improvements in FIM score than the 'Sudden-onset' conditions, but also had shorter LOS. Conclusion: All groups made gains in independence during admission, although the pattern of change varied between conditions, and ceiling effects were observed in the FIM-cognitive subscale. Relative cost-efficiency between groups can only be indirectly inferred. Limitations of the current dataset are discussed, together with opportunities for expansion and further development. PMID:26167877

  15. Merging K-means with hierarchical clustering for identifying general-shaped groups.

    PubMed

    Peterson, Anna D; Ghosh, Arka P; Maitra, Ranjan

    2018-01-01

Clustering partitions a dataset such that observations placed together in a group are similar but different from those in other groups. Hierarchical and K-means clustering are two approaches but have different strengths and weaknesses. For instance, hierarchical clustering identifies groups in a tree-like structure but suffers from computational complexity in large datasets, while K-means clustering is efficient but designed to identify homogeneous spherically-shaped clusters. We present a hybrid non-parametric clustering approach that amalgamates the two methods to identify general-shaped clusters and that can be applied to larger datasets. Specifically, we first partition the dataset into spherical groups using K-means. We next merge these groups using hierarchical methods with a data-driven distance measure as a stopping criterion. Our proposal has the potential to reveal groups with general shapes and structure in a dataset. We demonstrate good performance on several simulated and real datasets.
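The general recipe can be sketched with scipy: over-partition with K-means, then merge the spherical sub-clusters hierarchically on their centroids. Note this illustrates only the generic K-means-then-merge idea; the authors' data-driven stopping criterion is replaced here by a fixed distance threshold, which is our simplifying assumption:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

def hybrid_cluster(X, k=6, merge_dist=4.0, seed=0):
    """Over-partition X into k spherical sub-clusters with K-means, then
    merge sub-clusters via single-linkage hierarchical clustering on the
    centroids, cutting the tree at `merge_dist`."""
    centroids, labels = kmeans2(X, k, minit='++', seed=seed)
    Z = linkage(centroids, method='single')            # tree over centroids
    merge = fcluster(Z, t=merge_dist, criterion='distance')  # sub-cluster -> group
    return merge[labels]                               # map points to merged groups
```

Single linkage on the centroids is what lets chains of small spherical pieces join into one general-shaped group, while K-means keeps the per-point cost low on large datasets.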

  16. Developing a provisional, international minimal dataset for Juvenile Dermatomyositis: for use in clinical practice to inform research.

    PubMed

    McCann, Liza J; Arnold, Katie; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Beard, Laura; Beresford, Michael W; Wedderburn, Lucy R

    2014-01-01

    Juvenile dermatomyositis (JDM) is a rare but severe autoimmune inflammatory myositis of childhood. International collaboration is essential in order to undertake clinical trials, understand the disease and improve long-term outcome. The aim of this study was to propose from existing collaborative initiatives a preliminary minimal dataset for JDM. This will form the basis of the future development of an international consensus-approved minimum core dataset to be used both in clinical care and inform research, allowing integration of data between centres. A working group of internationally-representative JDM experts was formed to develop a provisional minimal dataset. Clinical and laboratory variables contained within current national and international collaborative databases of patients with idiopathic inflammatory myopathies were scrutinised. Judgements were informed by published literature and a more detailed analysis of the Juvenile Dermatomyositis Cohort Biomarker Study and Repository, UK and Ireland. A provisional minimal JDM dataset has been produced, with an associated glossary of definitions. The provisional minimal dataset will request information at time of patient diagnosis and during on-going prospective follow up. At time of patient diagnosis, information will be requested on patient demographics, diagnostic criteria and treatments given prior to diagnosis. During on-going prospective follow-up, variables will include the presence of active muscle or skin disease, major organ involvement or constitutional symptoms, investigations, treatment, physician global assessments and patient reported outcome measures. An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients to provide a greater understanding of this disease. 
This preliminary dataset can now be developed into a consensus-approved minimum core dataset and tested in a wider setting with the aim of achieving international agreement.

  17. Developing a provisional, international Minimal Dataset for Juvenile Dermatomyositis: for use in clinical practice to inform research

    PubMed Central

    2014-01-01

Background: Juvenile dermatomyositis (JDM) is a rare but severe autoimmune inflammatory myositis of childhood. International collaboration is essential in order to undertake clinical trials, understand the disease and improve long-term outcome. The aim of this study was to propose from existing collaborative initiatives a preliminary minimal dataset for JDM. This will form the basis of the future development of an international consensus-approved minimum core dataset to be used both in clinical care and inform research, allowing integration of data between centres. Methods: A working group of internationally-representative JDM experts was formed to develop a provisional minimal dataset. Clinical and laboratory variables contained within current national and international collaborative databases of patients with idiopathic inflammatory myopathies were scrutinised. Judgements were informed by published literature and a more detailed analysis of the Juvenile Dermatomyositis Cohort Biomarker Study and Repository, UK and Ireland. Results: A provisional minimal JDM dataset has been produced, with an associated glossary of definitions. The provisional minimal dataset will request information at time of patient diagnosis and during on-going prospective follow up. At time of patient diagnosis, information will be requested on patient demographics, diagnostic criteria and treatments given prior to diagnosis. During on-going prospective follow-up, variables will include the presence of active muscle or skin disease, major organ involvement or constitutional symptoms, investigations, treatment, physician global assessments and patient reported outcome measures. Conclusions: An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients to provide a greater understanding of this disease. 
This preliminary dataset can now be developed into a consensus-approved minimum core dataset and tested in a wider setting with the aim of achieving international agreement. PMID:25075205

  18. Introduction of the ASGARD code (Automated Selection and Grouping of events in AIA Regional Data)

    NASA Astrophysics Data System (ADS)

    Bethge, Christian; Winebarger, Amy; Tiwari, Sanjiv K.; Fayock, Brian

    2017-08-01

We have developed the ASGARD code to automatically detect and group brightenings ("events") in AIA data. The event selection and grouping can be optimized to the respective dataset with a multitude of control parameters. The code was initially written for IRIS data, but has since been optimized for AIA. However, the underlying algorithm is not limited to either and could be used for other data as well. Results from datasets in various AIA channels show that brightenings are reliably detected and that coherent coronal structures can be isolated by using the obtained information about the start, peak, and end times of events. We are presently working on a follow-up algorithm to automatically determine the heating and cooling timescales of coronal structures. This will be done by correlating the information from different AIA channels with different temperature responses. We will present the code and preliminary results.
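The core selection step (finding brightenings and recording their start, peak, and end times) can be sketched generically for a single light curve; the threshold here stands in for ASGARD's control parameters, and this is our illustration, not the ASGARD code itself:

```python
import numpy as np

def detect_events(signal, times, threshold):
    """Find contiguous above-threshold runs ("events") in a light curve
    and return (start, peak, end) times for each."""
    above = signal > threshold
    edges = np.diff(above.astype(int))
    starts = list(np.where(edges == 1)[0] + 1)   # rising crossings
    ends = list(np.where(edges == -1)[0])        # falling crossings
    if above[0]:
        starts.insert(0, 0)                      # event in progress at t[0]
    if above[-1]:
        ends.append(len(signal) - 1)             # event still on at t[-1]
    events = []
    for s, e in zip(starts, ends):
        p = s + int(np.argmax(signal[s:e + 1]))  # peak within the run
        events.append((times[s], times[p], times[e]))
    return events
```

Grouping events across pixels or channels, as ASGARD does, would then match these (start, peak, end) triples in space and time.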

  19. SOURCES OF VARIATION IN BASELINE GENE EXPRESSION LEVELS FROM TOXICOGENOMIC STUDY CONTROL ANIMALS ACROSS MULTIPLE LABORATORIES

    EPA Science Inventory

    Variations in study design are typical for toxicogenomic studies, but their impact on gene expression in control animals has not been well characterized. A dataset of control animal microarray expression data was assembled by a working group of the Health and Environmental Scienc...

  20. Al-Qaeda in Iraq (AQI): An Al-Qaeda Affiliate Case Study

    DTIC Science & Technology

    2017-10-01

    a comparative methodology that included eight case studies on groups affiliated or associated with Al-Qaeda. These case studies were then used as a dataset for cross...Case Study Zack Gold With contributions from Pamela G. Faber October 2017 This work was performed under Federal Government

  1. Automatic Diabetic Macular Edema Detection in Fundus Images Using Publicly Available Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large-scale screening environment, DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing. Our algorithm is robust to segmentation uncertainties, does not need ground truth at the lesion level, and is very fast, generating a diagnosis in an average of 4.4 seconds per image on a 2.6 GHz platform with an unoptimised Matlab implementation.

  2. Dataset of working conditions and thermo-economic performances for hybrid organic Rankine plants fed by solar and low-grade energy sources.

    PubMed

    Scardigno, Domenico; Fanelli, Emanuele; Viggiano, Annarita; Braccio, Giacobbe; Magi, Vinicio

    2016-06-01

    This article provides the dataset of operating conditions of a hybrid organic Rankine plant generated by the optimization procedure employed in the research article "A genetic optimization of a hybrid organic Rankine plant for solar and low-grade energy sources" (Scardigno et al., 2015) [1]. The methodology used to obtain the data is described. The operating conditions are subdivided into two separate groups: feasible and unfeasible solutions. In both groups, the values of the design variables are given. In addition, the subset of feasible solutions is described in detail by providing the thermodynamic and economic performances, the temperatures at some characteristic sections of the thermodynamic cycle, the net power, the absorbed powers and the area of the heat exchange surfaces.

  3. Bayesian model reduction and empirical Bayes for group (DCM) studies

    PubMed Central

    Friston, Karl J.; Litvak, Vladimir; Oswal, Ashwini; Razi, Adeel; Stephan, Klaas E.; van Wijk, Bernadette C.M.; Ziegler, Gabriel; Zeidman, Peter

    2016-01-01

    This technical note describes some Bayesian procedures for the analysis of group studies that use nonlinear models at the first (within-subject) level – e.g., dynamic causal models – and linear models at subsequent (between-subject) levels. Its focus is on using Bayesian model reduction to finesse the inversion of multiple models of a single dataset or a single (hierarchical or empirical Bayes) model of multiple datasets. These applications of Bayesian model reduction allow one to consider parametric random effects and make inferences about group effects very efficiently (in a few seconds). We provide the relatively straightforward theoretical background to these procedures and illustrate their application using a worked example. This example uses a simulated mismatch negativity study of schizophrenia. We illustrate the robustness of Bayesian model reduction to violations of the (commonly used) Laplace assumption in dynamic causal modelling and show how its recursive application can facilitate both classical and Bayesian inference about group differences. Finally, we consider the application of these empirical Bayesian procedures to classification and prediction. PMID:26569570
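
    For orientation, the key identity behind Bayesian model reduction (stated here in its general form as found in the dynamic causal modelling literature; the note itself gives the Gaussian case used in practice) is that the evidence and posterior of a reduced model, one whose priors restrict those of the full model, follow from the full posterior without refitting:

```latex
% Reduced-model evidence from the full model's posterior and priors:
\[
  p(y \mid m_R) \;=\; p(y \mid m_F)
    \int p(\theta \mid y, m_F)\,
    \frac{p(\theta \mid m_R)}{p(\theta \mid m_F)}\, d\theta
\]
% and the corresponding reduced posterior:
\[
  p(\theta \mid y, m_R) \;\propto\;
    p(\theta \mid y, m_F)\,
    \frac{p(\theta \mid m_R)}{p(\theta \mid m_F)}
\]
```

    Under Gaussian (Laplace) assumptions both expressions are available in closed form, which is what makes scoring large families of reduced models take only seconds.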

  4. Data publication activities in the Natural Environment Research Council

    NASA Astrophysics Data System (ADS)

    Leadbetter, A.; Callaghan, S.; Lowry, R.; Moncoiffé, G.; Donnegan, S.; Pepler, S.; Cunningham, N.; Kirsch, P.; Ault, L.; Bell, P.; Bowie, R.; Harrison, K.; Smith-Haddon, B.; Wetherby, A.; Wright, D.; Thorley, M.

    2012-04-01

    The Natural Environment Research Council (NERC) is implementing its Science Information Strategy in order to provide a world-class service to deliver integrated data for earth system science. One project within this strategy is Data Citation and Publication, which aims to put the promotion and recognition stages of the data lifecycle into place alongside the traditional data management activities of NERC's Environmental Data Centres (EDCs). The NERC EDCs have made a distinction between the serving of data and its publication. Data serving is defined in this case as the day-to-day data management tasks of:
    • acquiring data and metadata from the originating scientists;
    • harmonising metadata and formats prior to database ingestion;
    • ensuring the metadata is adequate and accurate and that the data are available in appropriate file formats;
    • making the data available for interested parties.
    Publication, by contrast:
    • requires the assignment of a digital object identifier (DOI) to a dataset, which guarantees that an EDC has assessed the quality of the metadata and the file format and will maintain an unchanged version of the data for the foreseeable future;
    • requires peer-review of the scientific quality of the data by a scientist with knowledge of the scientific domain in which the data were collected, using a framework for peer-review of datasets such as that developed by the CLADDIER project;
    • requires collaboration with journal publishers, who have access to a well-established peer-review system.
    The first of these requirements can be managed in-house by the EDCs, while the remainder require collaboration with the wider scientific and publishing communities. 
    It is anticipated that a scientist may achieve a lower level of academic credit for a dataset which is assigned a DOI but does not follow through to the scientific peer-review stage, similar to publication in a report or other non-peer-reviewed publication normally described as grey literature, or in conference proceedings. At the time of writing, the project has successfully assigned DOIs to more than ten legacy datasets held by EDCs through the British Library, acting on behalf of the DataCite network. The project is in the process of developing guidelines for which datasets are suitable for submission to an EDC by a scientist wishing to receive a DOI for their data. While maintaining a United Kingdom focus, this project is not operating in isolation, as its members are working alongside international groups such as the CODATA-ICSTI Task Group on Data Citations, the DataCite Working Group on Criteria for Datacentres, and the joint Scientific Commission for Oceanography / International Oceanographic Data and Information Exchange / Marine Biological Laboratory, Woods Hole Oceanographic Institution Library working group on data publication.

  5. Discriminating Projections for Estimating Face Age in Wild Images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tokola, Ryan A; Bolme, David S; Ricanek, Karl

    2014-01-01

    We introduce a novel approach to estimating the age of a human from a single uncontrolled image. Current face age estimation algorithms work well in highly controlled images, and some are robust to changes in illumination, but it is usually assumed that images are close to frontal. This bias is clearly seen in the datasets that are commonly used to evaluate age estimation, which either entirely or mostly consist of frontal images. Using pose-specific projections, our algorithm maps image features into a pose-insensitive latent space that is discriminative with respect to age. Age estimation is then performed using a multi-class SVM. We show that our approach outperforms other published results on the Images of Groups dataset, which is the only age-related dataset with a non-trivial number of off-axis face images, and that we are competitive with recent age estimation algorithms on the mostly-frontal FG-NET dataset. We also experimentally demonstrate that our feature projections introduce insensitivity to pose.

  6. HDF Update

    NASA Technical Reports Server (NTRS)

    Pourmal, Elena

    2016-01-01

    The HDF Group maintains and evolves HDF software used by the NASA ESDIS program to manage remote sensing data. In this talk we will discuss new features of HDF (Virtual Datasets, Single-Writer/Multiple-Reader (SWMR) access, community-supported HDF5 compression filters) that address the storage and I/O performance requirements of applications that work with ESDIS data products.

  7. Large-Scale Astrophysical Visualization on Smartphones

    NASA Astrophysics Data System (ADS)

    Becciani, U.; Massimino, P.; Costa, A.; Gheller, C.; Grillo, A.; Krokos, M.; Petta, C.

    2011-07-01

    Nowadays, digital sky surveys and long-duration, high-resolution numerical simulations using high performance computing and grid systems produce multidimensional astrophysical datasets on the order of several petabytes. Sharing visualizations of such datasets within communities and collaborating research groups is of paramount importance for disseminating results and advancing astrophysical research. Moreover, educational and public outreach programs can benefit greatly from novel ways of presenting these datasets by promoting understanding of complex astrophysical processes, e.g., the formation of stars and galaxies. We have previously developed VisIVO Server, a grid-enabled platform for high-performance large-scale astrophysical visualization. This article reviews the latest developments on VisIVO Web, a custom-designed web portal wrapped around VisIVO Server, then introduces VisIVO Smartphone, a gateway connecting VisIVO Web and data repositories for mobile astrophysical visualization. We discuss current work and summarize future developments.

  8. Using PIDs to Support the Full Research Data Publishing Lifecycle

    NASA Astrophysics Data System (ADS)

    Waard, A. D.

    2016-12-01

    Persistent identifiers (PIDs) can help support scientific research, track scientific impact and let researchers achieve recognition for their work. We discuss a number of ways in which Elsevier utilizes PIDs to support the scholarly lifecycle. To improve the process of storing and sharing data, Mendeley Data (http://data.mendeley.com) makes use of persistent identifiers to support the dynamic nature of data and software by tracking and recording the provenance and versioning of datasets. This system now allows the comparison of different versions of a dataset, to see precisely what was changed during a versioning update. To present research data in context for the reader, we include PIDs in research articles as hyperlinks: https://www.elsevier.com/books-and-journals/content-innovation/data-base-linking. In some cases, PIDs fetch data files from repositories, which allows the embedding of visualizations, e.g. with PANGAEA and PubChem: https://www.elsevier.com/books-and-journals/content-innovation/protein-viewer; https://www.elsevier.com/books-and-journals/content-innovation/pubchem. To normalize referenced data elements, the Resource Identification Initiative - which we developed together with members of the Force11 RRID group - introduces a unified standard for resource identifiers (RRIDs) that can easily be interpreted by both humans and text-mining tools (https://www.force11.org/group/resource-identification-initiative/update-resource-identification-initiative), as can be seen in our Antibody Data app: https://www.elsevier.com/books-and-journals/content-innovation/antibody-data. To enable better citation practices and support a robust metrics system for sharing research data, we helped develop, and are early adopters of, the Force11 Data Citation Principles and Implementation groups (https://www.force11.org/group/dcip). Lastly, through our work with the Research Data Alliance Publishing Data Services group, we helped create a set of guidelines (http://www.scholix.org/guidelines) and a demonstrator service (http://dliservice.research-infrastructures.eu/#/) for a linked-data network connecting datasets, articles, and individuals, all of which rely on robust PIDs.

  9. Exudate-based diabetic macular edema detection in fundus images using publicly available datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large-scale screening environment, DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME through the presence of exudation. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing (i.e., the classifier was trained on an independent dataset and tested on MESSIDOR). Our algorithm obtained an AUC between 0.88 and 0.94 depending on the dataset/features used. Additionally, it does not need ground truth at the lesion level to reject false positives and is computationally efficient, as it generates a diagnosis in an average of 4.4 s (9.3 s, including optic nerve localization) per image on a 2.6 GHz platform with an unoptimized Matlab implementation.
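
    AUC figures like those quoted above can be computed directly from classifier scores and ground-truth labels via the Mann-Whitney rank formulation. The sketch below is illustrative only (invented toy data, not the authors' code) and shows the standard calculation, including midranks for tied scores:

```python
# Illustrative sketch: AUC from scores and binary labels using the
# Mann-Whitney rank statistic. Toy data only, not the paper's code.

def auc(scores, labels):
    """AUC = P(score of a random positive > score of a random negative)."""
    pairs = sorted(zip(scores, labels))
    # Assign 1-based midranks so tied scores share an average rank.
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0        # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75: one swapped pair
```

    In practice a library routine (e.g. scikit-learn's roc_auc_score) would be used, but the rank form above makes the probabilistic meaning of the 0.88-0.94 range explicit.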

  10. Global Data Spatially Interrelate System for Scientific Big Data Spatial-Seamless Sharing

    NASA Astrophysics Data System (ADS)

    Yu, J.; Wu, L.; Yang, Y.; Lei, X.; He, W.

    2014-04-01

    A data sharing system with spatially seamless services spares scientists the tedious and time-consuming work of spatial transformation, and hence encourages the use of scientific data and increases scientific innovation. Having been adopted as a framework for Earth datasets by the Group on Earth Observations (GEO), the Earth System Spatial Grid (ESSG) has the potential to serve as the spatial reference for Earth datasets. Based on SDOG-ESSG, an implementation of ESSG, a data sharing system named the Global Data Spatially Interrelate System (GASE) was designed to make data sharing spatially seamless. The architecture of GASE is introduced, and the implementation of its two key components, V-Pools and the interrelating engine, is presented along with a prototype. Each dataset is first resampled into SDOG-ESSG, divided into small blocks, and mapped into the hierarchical structure of the distributed file system in V-Pools, which together allow data to be served at a uniform spatial reference with high efficiency. In addition, datasets from different data centres are interrelated by the interrelating engine at the uniform spatial reference of SDOG-ESSG, which enables the system to share open datasets on the internet in a spatially seamless way.

  11. Cohesion and Coalition Formation in the European Parliament: Roll-Call Votes and Twitter Activities

    PubMed Central

    Cherepnalkoski, Darko; Karpf, Andreas; Mozetič, Igor; Grčar, Miha

    2016-01-01

    We study the cohesion within and the coalitions between political groups in the Eighth European Parliament (2014–2019) by analyzing two entirely different aspects of the behavior of the Members of the European Parliament (MEPs) in the policy-making processes. On one hand, we analyze their co-voting patterns and, on the other, their retweeting behavior. We make use of two diverse datasets in the analysis. The first one is the roll-call vote dataset, where cohesion is regarded as the tendency to co-vote within a group, and a coalition is formed when the members of several groups exhibit a high degree of co-voting agreement on a subject. The second dataset comes from Twitter; it captures the retweeting (i.e., endorsing) behavior of the MEPs and implies cohesion (retweets within the same group) and coalitions (retweets between groups) from a completely different perspective. We employ two different methodologies to analyze the cohesion and coalitions. The first one is based on Krippendorff’s Alpha reliability, used to measure the agreement between raters in data-analysis scenarios, and the second one is based on Exponential Random Graph Models, often used in social-network analysis. We give general insights into the cohesion of political groups in the European Parliament, explore whether coalitions are formed in the same way for different policy areas, and examine to what degree the retweeting behavior of MEPs corresponds to their co-voting patterns. A novel and interesting aspect of our work is the relationship between the co-voting and retweeting patterns. PMID:27835683

  12. Cohesion and Coalition Formation in the European Parliament: Roll-Call Votes and Twitter Activities.

    PubMed

    Cherepnalkoski, Darko; Karpf, Andreas; Mozetič, Igor; Grčar, Miha

    2016-01-01

    We study the cohesion within and the coalitions between political groups in the Eighth European Parliament (2014-2019) by analyzing two entirely different aspects of the behavior of the Members of the European Parliament (MEPs) in the policy-making processes. On one hand, we analyze their co-voting patterns and, on the other, their retweeting behavior. We make use of two diverse datasets in the analysis. The first one is the roll-call vote dataset, where cohesion is regarded as the tendency to co-vote within a group, and a coalition is formed when the members of several groups exhibit a high degree of co-voting agreement on a subject. The second dataset comes from Twitter; it captures the retweeting (i.e., endorsing) behavior of the MEPs and implies cohesion (retweets within the same group) and coalitions (retweets between groups) from a completely different perspective. We employ two different methodologies to analyze the cohesion and coalitions. The first one is based on Krippendorff's Alpha reliability, used to measure the agreement between raters in data-analysis scenarios, and the second one is based on Exponential Random Graph Models, often used in social-network analysis. We give general insights into the cohesion of political groups in the European Parliament, explore whether coalitions are formed in the same way for different policy areas, and examine to what degree the retweeting behavior of MEPs corresponds to their co-voting patterns. A novel and interesting aspect of our work is the relationship between the co-voting and retweeting patterns.
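
    Krippendorff's Alpha, the first agreement measure used in the paper, reduces to a short calculation in the special case of two raters, nominal data and no missing values. The sketch below covers only that simplified case (illustrative, not the authors' implementation, which handles the general coincidence-matrix form for many raters):

```python
# Hedged sketch: nominal Krippendorff's alpha for exactly two raters
# with no missing data. Alpha = 1 - D_obs / D_exp, where D_obs is the
# observed disagreement and D_exp the disagreement expected by chance.
from collections import Counter

def krippendorff_alpha_nominal(rater1, rater2):
    n = len(rater1)
    # Observed disagreement: fraction of units where the raters differ.
    d_obs = sum(a != b for a, b in zip(rater1, rater2)) / n
    # Expected disagreement: probability that two values drawn without
    # replacement from the pooled ratings differ.
    pooled = Counter(rater1) + Counter(rater2)
    total = 2 * n
    d_exp = 1 - sum(c * (c - 1) for c in pooled.values()) / (total * (total - 1))
    return 1 - d_obs / d_exp

print(krippendorff_alpha_nominal(list("aabb"), list("aabb")))  # 1.0
```

    Perfect agreement yields alpha = 1, chance-level agreement yields 0, and systematic disagreement yields negative values, which is what makes the measure suitable for comparing co-voting cohesion across groups of different sizes.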

  13. Generating and Using Examples in the Proving Process

    ERIC Educational Resources Information Center

    Sandefur, J.; Mason, J.; Stylianides, G. J.; Watson, A.

    2013-01-01

    We report on our analysis of data from a dataset of 26 videotapes of university students working in groups of 2 and 3 on different proving problems. Our aim is to understand the role of example generation in the proving process, focusing on deliberate changes in representation and symbol manipulation. We suggest and illustrate four aspects of…

  14. Development of a Core Clinical Dataset to Characterize Serious Illness, Injuries, and Resource Requirements for Acute Medical Responses to Public Health Emergencies.

    PubMed

    Murphy, David J; Rubinson, Lewis; Blum, James; Isakov, Alexander; Bhagwanjee, Statish; Cairns, Charles B; Cobb, J Perren; Sevransky, Jonathan E

    2015-11-01

    In developed countries, public health systems have become adept at rapidly identifying the etiology and impact of public health emergencies. However, within the time course of clinical responses, shortfalls in readily analyzable patient-level data limit capabilities to understand clinical course, predict outcomes, ensure resource availability, and evaluate the effectiveness of diagnostic and therapeutic strategies for seriously ill and injured patients. To be useful in the timeline of a public health emergency, multi-institutional clinical investigation systems must be in place to rapidly collect, analyze, and disseminate detailed clinical information regarding patients across prehospital, emergency department, and acute care hospital settings, including ICUs. As an initial step to near real-time clinical learning during public health emergencies, we sought to develop an "all-hazards" core dataset to characterize serious illness and injuries and the resource requirements for acute medical response across the care continuum. A multidisciplinary panel of clinicians, public health professionals, and researchers with expertise in public health emergencies. Group consensus process. The consensus process included regularly scheduled conference calls, electronic communications, and an in-person meeting to generate candidate variables. Candidate variables were then reviewed by the group to meet the competing criteria of utility and feasibility resulting in the core dataset. The 40-member panel generated 215 candidate variables for potential dataset inclusion. The final dataset includes 140 patient-level variables in the domains of demographics and anthropometrics (7), prehospital (11), emergency department (13), diagnosis (8), severity of illness (54), medications and interventions (38), and outcomes (9). 
The resulting all-hazard core dataset for seriously ill and injured persons provides a foundation to facilitate rapid collection, analyses, and dissemination of information necessary for clinicians, public health officials, and policymakers to optimize public health emergency response. Further work is needed to validate the effectiveness of the dataset in a variety of emergency settings.

  15. Minutes of the CD-ROM Workshop

    NASA Technical Reports Server (NTRS)

    King, Joseph H.; Grayzeck, Edwin J.

    1989-01-01

    The workshop described in this document had two goals: (1) to establish guidelines for the CD-ROM as a tool to distribute datasets; and (2) to evaluate current scientific CD-ROM projects as an archive. Workshop attendees were urged to coordinate with European groups to develop CD-ROM, which is already available at low cost in the U.S., as a distribution medium for astronomical datasets. It was noted that NASA has made the CD Publisher at the National Space Science Data Center (NSSDC) available to the scientific community when the Publisher is not needed for NASA work. NSSDC's goal is to provide the Publisher's user with the hardware and software tools needed to design a user's dataset for distribution. This includes producing a master CD and copies. The prerequisite premastering process is described, as well as guidelines for CD-ROM construction. The production of discs was evaluated. CD-ROM projects, guidelines, and problems of the technology were discussed.

  16. Bayesian model reduction and empirical Bayes for group (DCM) studies.

    PubMed

    Friston, Karl J; Litvak, Vladimir; Oswal, Ashwini; Razi, Adeel; Stephan, Klaas E; van Wijk, Bernadette C M; Ziegler, Gabriel; Zeidman, Peter

    2016-03-01

    This technical note describes some Bayesian procedures for the analysis of group studies that use nonlinear models at the first (within-subject) level - e.g., dynamic causal models - and linear models at subsequent (between-subject) levels. Its focus is on using Bayesian model reduction to finesse the inversion of multiple models of a single dataset or a single (hierarchical or empirical Bayes) model of multiple datasets. These applications of Bayesian model reduction allow one to consider parametric random effects and make inferences about group effects very efficiently (in a few seconds). We provide the relatively straightforward theoretical background to these procedures and illustrate their application using a worked example. This example uses a simulated mismatch negativity study of schizophrenia. We illustrate the robustness of Bayesian model reduction to violations of the (commonly used) Laplace assumption in dynamic causal modelling and show how its recursive application can facilitate both classical and Bayesian inference about group differences. Finally, we consider the application of these empirical Bayesian procedures to classification and prediction. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.

  17. A database on the distribution of butterflies (Lepidoptera) in northern Belgium (Flanders and the Brussels Capital Region)

    PubMed Central

    Maes, Dirk; Vanreusel, Wouter; Herremans, Marc; Vantieghem, Pieter; Brosens, Dimitri; Gielen, Karin; Beck, Olivier; Van Dyck, Hans; Desmet, Peter; Natuurpunt, Vlinderwerkgroep

    2016-01-01

    In this data paper, we describe two datasets derived from two sources, which collectively represent the most complete overview of butterflies in Flanders and the Brussels Capital Region (northern Belgium). The first dataset (further referred to as the INBO dataset – http://doi.org/10.15468/njgbmh) contains 761,660 records of 70 species and is compiled by the Research Institute for Nature and Forest (INBO) in cooperation with the Butterfly working group of Natuurpunt (Vlinderwerkgroep). It is derived from the database Vlinderdatabank at the INBO, which consists of (historical) collection and literature data (1830-2001), for which all butterfly specimens in institutional and available personal collections were digitized and all entomological and other relevant publications were checked for butterfly distribution data. It also contains observations and monitoring data for the period 1991-2014. The latter type were collected by a (small) butterfly monitoring network where butterflies were recorded using a standardized protocol. The second dataset (further referred to as the Natuurpunt dataset – http://doi.org/10.15468/ezfbee) contains 612,934 records of 63 species and is derived from the database http://waarnemingen.be, hosted at the nature conservation NGO Natuurpunt in collaboration with Stichting Natuurinformatie. This dataset contains butterfly observations by volunteers (citizen scientists), mainly since 2008. Together, these datasets currently contain a total of 1,374,594 records, which are georeferenced using the centroid of their respective 5 × 5 km² Universal Transverse Mercator (UTM) grid cell. Both datasets are published as open data and are available through the Global Biodiversity Information Facility (GBIF). PMID:27199606

  18. A database on the distribution of butterflies (Lepidoptera) in northern Belgium (Flanders and the Brussels Capital Region).

    PubMed

    Maes, Dirk; Vanreusel, Wouter; Herremans, Marc; Vantieghem, Pieter; Brosens, Dimitri; Gielen, Karin; Beck, Olivier; Van Dyck, Hans; Desmet, Peter; Natuurpunt, Vlinderwerkgroep

    2016-01-01

    In this data paper, we describe two datasets derived from two sources, which collectively represent the most complete overview of butterflies in Flanders and the Brussels Capital Region (northern Belgium). The first dataset (further referred to as the INBO dataset - http://doi.org/10.15468/njgbmh) contains 761,660 records of 70 species and is compiled by the Research Institute for Nature and Forest (INBO) in cooperation with the Butterfly working group of Natuurpunt (Vlinderwerkgroep). It is derived from the database Vlinderdatabank at the INBO, which consists of (historical) collection and literature data (1830-2001), for which all butterfly specimens in institutional and available personal collections were digitized and all entomological and other relevant publications were checked for butterfly distribution data. It also contains observations and monitoring data for the period 1991-2014. The latter type were collected by a (small) butterfly monitoring network where butterflies were recorded using a standardized protocol. The second dataset (further referred to as the Natuurpunt dataset - http://doi.org/10.15468/ezfbee) contains 612,934 records of 63 species and is derived from the database http://waarnemingen.be, hosted at the nature conservation NGO Natuurpunt in collaboration with Stichting Natuurinformatie. This dataset contains butterfly observations by volunteers (citizen scientists), mainly since 2008. Together, these datasets currently contain a total of 1,374,594 records, which are georeferenced using the centroid of their respective 5 × 5 km² Universal Transverse Mercator (UTM) grid cell. Both datasets are published as open data and are available through the Global Biodiversity Information Facility (GBIF).

  19. Building a better search engine for earth science data

    NASA Astrophysics Data System (ADS)

    Armstrong, E. M.; Yang, C. P.; Moroni, D. F.; McGibbney, L. J.; Jiang, Y.; Huang, T.; Greguska, F. R., III; Li, Y.; Finch, C. J.

    2017-12-01

    Free text data searching of earth science datasets has been implemented with varying degrees of success and completeness across the spectrum of the 12 NASA earth sciences data centers. At the JPL Physical Oceanography Distributed Active Archive Center (PO.DAAC) the search engine has been developed around the Solr/Lucene platform. Others have chosen other popular enterprise search platforms like Elasticsearch. Regardless, the default implementations of these search engines leveraging factors such as dataset popularity, term frequency and inverse document term frequency do not fully meet the needs of precise relevancy and ranking of earth science search results. For the PO.DAAC, this shortcoming has been identified for several years by its external User Working Group that has assigned several recommendations to improve the relevancy and discoverability of datasets related to remotely sensed sea surface temperature, ocean wind, waves, salinity, height and gravity that comprise a total count of over 500 public availability datasets. Recently, the PO.DAAC has teamed with an effort led by George Mason University to improve the improve the search and relevancy ranking of oceanographic data via a simple search interface and powerful backend services called MUDROD (Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to Improve Data Discovery) funded by the NASA AIST program. MUDROD has mined and utilized the combination of PO.DAAC earth science dataset metadata, usage metrics, and user feedback and search history to objectively extract relevance for improved data discovery and access. In addition to improved dataset relevance and ranking, the MUDROD search engine also returns recommendations to related datasets and related user queries. 
This presentation will report on use cases that drove the architecture and development, and the success metrics and improvements on search precision and recall that MUDROD has demonstrated over the existing PO.DAAC search interfaces.
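    The "default" relevance signals named above (term frequency and inverse document frequency) can be sketched as a minimal scoring function. This is an illustrative toy, not the actual Solr/Lucene similarity formula, and the corpus below is invented:

```python
import math

def tfidf_score(query_terms, doc_terms, corpus):
    """Toy tf-idf relevance score: for each query term, multiply its
    frequency in the document by a smoothed inverse document frequency
    computed over the whole corpus, and sum the contributions."""
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term) / max(len(doc_terms), 1)
        df = sum(1 for doc in corpus if term in doc)   # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed idf
        score += tf * idf
    return score

# A dataset description containing the query term outranks one that does not.
corpus = [["sea", "surface", "temperature"], ["ocean", "wind"], ["salinity"]]
print(tfidf_score(["sea"], corpus[0], corpus) > tfidf_score(["sea"], corpus[1], corpus))
```

    Behavioural signals of the kind MUDROD mines (usage metrics, search history) would enter as additional terms layered on top of a purely textual score like this one.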

  20. Overview of the HUPO Plasma Proteome Project: Results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Omenn, Gilbert; States, David J.; Adamski, Marcin

    2005-08-13

    HUPO initiated the Plasma Proteome Project (PPP) in 2002. Its pilot phase has (1) evaluated advantages and limitations of many depletion, fractionation, and MS technology platforms; (2) compared PPP reference specimens of human serum and EDTA, heparin, and citrate-anticoagulated plasma; and (3) created a publicly-available knowledge base (www.bioinformatics.med.umich.edu/hupo/ppp; www.ebi.ac.uk/pride). Thirty-five participating laboratories in 13 countries submitted datasets. Working groups addressed (a) specimen stability and protein concentrations; (b) protein identifications from 18 MS/MS datasets; (c) independent analyses from raw MS/MS spectra; (d) search engine performance, subproteome analyses, and biological insights; (e) antibody arrays; and (f) direct MS/SELDI analyses. MS/MS datasets had 15 710 different International Protein Index (IPI) protein IDs; our integration algorithm applied to multiple matches of peptide sequences yielded 9504 IPI proteins identified with one or more peptides and 3020 proteins identified with two or more peptides (the Core Dataset). These proteins have been characterized with Gene Ontology, InterPro, Novartis Atlas, OMIM, and immunoassay-based concentration determinations. The database permits examination of many other subsets, such as 1274 proteins identified with three or more peptides. Reverse protein to DNA matching identified proteins for 118 previously unidentified ORFs. We recommend use of plasma instead of serum, with EDTA (or citrate) for anticoagulation. To improve resolution, sensitivity and reproducibility of peptide identifications and protein matches, we recommend combinations of depletion, fractionation, and MS/MS technologies, with explicit criteria for evaluation of spectra, use of search algorithms, and integration of homologous protein matches.
    This Special Issue of PROTEOMICS presents papers integral to the collaborative analysis plus many reports of supplementary work on various aspects of the PPP workplan. These PPP results on complexity, dynamic range, incomplete sampling, false-positive matches, and integration of diverse datasets for plasma and serum proteins lay a foundation for development and validation of circulating protein biomarkers in health and disease.

  1. Recent QCD Studies at the Tevatron

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Group, Robert Craig

    2008-04-01

    Since the beginning of Run II at the Fermilab Tevatron the QCD physics groups of the CDF and D0 experiments have worked to reach unprecedented levels of precision for many QCD observables. Thanks to the large dataset--over 3 fb{sup -1} of integrated luminosity recorded by each experiment--important new measurements have recently been made public and will be summarized in this paper.

  2. Research Applications of Data from Arctic Ocean Drifting Platforms: The Arctic Buoy Program and the Environmental Working Group CD's.

    NASA Astrophysics Data System (ADS)

    Moritz, R. E.; Rigor, I.

    2006-12-01

    The Arctic Buoy Program was initiated in 1978 to measure surface air pressure, surface temperature and sea-ice motion in the Arctic Ocean, on the space and time scales of synoptic weather systems, and to make the data available for research, forecasting and operations. The program, subsequently renamed the International Arctic Buoy Programme (IABP), has endured and expanded over the past 28 years. A hallmark of the IABP is the production, dissemination and archival of research-quality datasets and analyses. These datasets have been used by the authors of over 500 papers on meteorology, sea-ice physics, oceanography, air-sea interactions, climate, remote sensing and other topics. Elements of the IABP are described briefly, including measurements, analysis, data dissemination and data archival. Selected highlights of the research applications are reviewed, including ice dynamics, ocean-ice modeling, low-frequency variability of Arctic air-sea-ice circulation, and recent changes in the age, thickness and extent of Arctic sea-ice. The extended temporal coverage of the data disseminated on the Environmental Working Group CD's is important for interpreting results in the context of climate.

  3. The ISLSCP initiative I global datasets: Surface boundary conditions and atmospheric forcings for land-atmosphere studies

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sellers, P.J.; Collatz, J.; Koster, R.

    1996-09-01

    A comprehensive series of global datasets for land-atmosphere models has been collected, formatted to a common grid, and released on a set of CD-ROMs. This paper describes the motivation for and the contents of the dataset. In June of 1992, an interdisciplinary earth science workshop was convened in Columbia, Maryland, to assess progress in land-atmosphere research, specifically in the areas of models, satellite data algorithms, and field experiments. At the workshop, representatives of the land-atmosphere modeling community defined a need for global datasets to prescribe boundary conditions, initialize state variables, and provide near-surface meteorological and radiative forcings for their models. The International Satellite Land Surface Climatology Project (ISLSCP), a part of the Global Energy and Water Cycle Experiment, worked with the Distributed Active Archive Center of the National Aeronautics and Space Administration Goddard Space Flight Center to bring the required datasets together in a usable format. The data have since been released on a collection of CD-ROMs. The datasets on the CD-ROMs are grouped under the following headings: vegetation; hydrology and soils; snow, ice, and oceans; radiation and clouds; and near-surface meteorology. All datasets cover the period 1987-88, and all but a few are spatially continuous over the earth's land surface. All have been mapped to a common 1° x 1° equal-angle grid. The temporal frequency for most of the datasets is monthly. A few of the near-surface meteorological parameters are available both as six-hourly values and as monthly means. 26 refs., 8 figs., 2 tabs.

  4. Does Your Cohort Matter? Measuring Peer Effects in College Achievement. NBER Working Paper No. 14032

    ERIC Educational Resources Information Center

    Carrell, Scott E.; Fullerton, Richard L.; West, James E.

    2008-01-01

    To estimate peer effects in college achievement we exploit a unique dataset in which individuals have been exogenously assigned to peer groups of about 30 students with whom they are required to spend the majority of their time interacting. This feature enables us to estimate peer effects that are more comparable to changing the entire cohort of…

  5. The Topp-Leone generalized Rayleigh cure rate model and its application

    NASA Astrophysics Data System (ADS)

    Nanthaprut, Pimwarat; Bodhisuwan, Winai; Patummasut, Mena

    2017-11-01

    The cure rate model is a survival analysis model that accounts for a proportion of subjects who are cured and therefore censored. In clinical trials, data representing time to recurrence of an event or death of patients are used to assess the efficiency of treatments. Each dataset can be separated into two groups: censored and uncensored data. In this work, a new mixture cure rate model is introduced based on the Topp-Leone generalized Rayleigh distribution. The Bayesian approach is employed to estimate its parameters. In addition, a breast cancer dataset is analyzed for model illustration purposes. According to the deviance information criterion, the Topp-Leone generalized Rayleigh cure rate model shows better results than the Weibull and exponential cure rate models.
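    The mixture cure rate structure can be written as S(t) = pi + (1 - pi) * S_u(t), where pi is the cured fraction and S_u the survival function of the uncured. A minimal sketch, substituting a Weibull latency for the paper's Topp-Leone generalized Rayleigh distribution (which would slot in the same way):

```python
import math

def mixture_cure_survival(t, cure_frac, shape, scale):
    """Population survival S(t) = pi + (1 - pi) * S_u(t) for a mixture
    cure rate model. The uncured latency S_u is illustrated with a
    Weibull survival function, not the paper's Topp-Leone generalized
    Rayleigh; any proper survival function can be substituted."""
    s_uncured = math.exp(-((t / scale) ** shape))
    return cure_frac + (1.0 - cure_frac) * s_uncured

print(mixture_cure_survival(0.0, 0.3, 1.5, 2.0))             # → 1.0 at t = 0
print(round(mixture_cure_survival(50.0, 0.3, 1.5, 2.0), 4))  # → 0.3
```

    As t grows, survival flattens out at the cure fraction rather than decaying to zero, which is what distinguishes a cure rate model from a standard survival model.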

  6. Multi-Task Convolutional Neural Network for Pose-Invariant Face Recognition

    NASA Astrophysics Data System (ADS)

    Yin, Xi; Liu, Xiaoming

    2018-02-01

    This paper explores multi-task learning (MTL) for face recognition. We answer the questions of how and why MTL can improve the face recognition performance. First, we propose a multi-task Convolutional Neural Network (CNN) for face recognition where identity classification is the main task and pose, illumination, and expression estimations are the side tasks. Second, we develop a dynamic-weighting scheme to automatically assign the loss weight to each side task, which is a crucial problem in MTL. Third, we propose a pose-directed multi-task CNN by grouping different poses to learn pose-specific identity features, simultaneously across all poses. Last but not least, we propose an energy-based weight analysis method to explore how CNN-based MTL works. We observe that the side tasks serve as regularizations to disentangle the variations from the learnt identity features. Extensive experiments on the entire Multi-PIE dataset demonstrate the effectiveness of the proposed approach. To the best of our knowledge, this is the first work using all data in Multi-PIE for face recognition. Our approach is also applicable to in-the-wild datasets for pose-invariant face recognition and achieves comparable or better performance than state of the art on LFW, CFP, and IJB-A datasets.

  7. The phylogeny and evolutionary history of tyrannosauroid dinosaurs.

    PubMed

    Brusatte, Stephen L; Carr, Thomas D

    2016-02-02

    Tyrannosauroids--the group of carnivores including Tyrannosaurus rex--are some of the most familiar dinosaurs of all. A surge of recent discoveries has helped clarify some aspects of their evolution, but competing phylogenetic hypotheses raise questions about their relationships, biogeography, and fossil record quality. We present a new phylogenetic dataset, which merges published datasets and incorporates recently discovered taxa. We analyze it with parsimony and, for the first time for a tyrannosauroid dataset, Bayesian techniques. The parsimony and Bayesian results are highly congruent, and provide a framework for interpreting the biogeography and evolutionary history of tyrannosauroids. Our phylogenies illustrate that the body plan of the colossal species evolved piecemeal, imply no clear division between northern and southern species in western North America as had been argued, and suggest that T. rex may have been an Asian migrant to North America. Over-reliance on cranial shape characters may explain why published parsimony studies have diverged, and filling three major gaps in the fossil record holds the most promise for future work.

  8. The phylogeny and evolutionary history of tyrannosauroid dinosaurs

    PubMed Central

    Brusatte, Stephen L.; Carr, Thomas D.

    2016-01-01

    Tyrannosauroids—the group of carnivores including Tyrannosaurus rex—are some of the most familiar dinosaurs of all. A surge of recent discoveries has helped clarify some aspects of their evolution, but competing phylogenetic hypotheses raise questions about their relationships, biogeography, and fossil record quality. We present a new phylogenetic dataset, which merges published datasets and incorporates recently discovered taxa. We analyze it with parsimony and, for the first time for a tyrannosauroid dataset, Bayesian techniques. The parsimony and Bayesian results are highly congruent, and provide a framework for interpreting the biogeography and evolutionary history of tyrannosauroids. Our phylogenies illustrate that the body plan of the colossal species evolved piecemeal, imply no clear division between northern and southern species in western North America as had been argued, and suggest that T. rex may have been an Asian migrant to North America. Over-reliance on cranial shape characters may explain why published parsimony studies have diverged, and filling three major gaps in the fossil record holds the most promise for future work. PMID:26830019

  9. The phylogeny and evolutionary history of tyrannosauroid dinosaurs

    NASA Astrophysics Data System (ADS)

    Brusatte, Stephen L.; Carr, Thomas D.

    2016-02-01

    Tyrannosauroids—the group of carnivores including Tyrannosaurus rex—are some of the most familiar dinosaurs of all. A surge of recent discoveries has helped clarify some aspects of their evolution, but competing phylogenetic hypotheses raise questions about their relationships, biogeography, and fossil record quality. We present a new phylogenetic dataset, which merges published datasets and incorporates recently discovered taxa. We analyze it with parsimony and, for the first time for a tyrannosauroid dataset, Bayesian techniques. The parsimony and Bayesian results are highly congruent, and provide a framework for interpreting the biogeography and evolutionary history of tyrannosauroids. Our phylogenies illustrate that the body plan of the colossal species evolved piecemeal, imply no clear division between northern and southern species in western North America as had been argued, and suggest that T. rex may have been an Asian migrant to North America. Over-reliance on cranial shape characters may explain why published parsimony studies have diverged, and filling three major gaps in the fossil record holds the most promise for future work.

  10. Discriminate the response of Acute Myeloid Leukemia patients to treatment by using proteomics data and Answer Set Programming.

    PubMed

    Chebouba, Lokmane; Miannay, Bertrand; Boughaci, Dalila; Guziolowski, Carito

    2018-03-08

    In recent years, several approaches have been applied to biomedical data to detect disease-specific proteins and genes in order to better target drugs. Statistical and machine-learning-based methods have been shown to rely mainly on clinical data, later improving their results by adding omics data. This work proposes a new method to discriminate the response of Acute Myeloid Leukemia (AML) patients to treatment. The proposed approach uses proteomics data and prior regulatory knowledge in the form of networks to predict cancer treatment outcomes by identifying the Boolean networks specific to each type of response to drugs. To show its effectiveness we evaluate our method on a dataset from the DREAM 9 challenge. The results are encouraging and demonstrate the benefit of our approach in distinguishing patient groups with different responses to treatment. In particular, each treatment response group is characterized by a predictive model in the form of a signaling Boolean network. This model describes regulatory mechanisms which are specific to each response group. The proteins in this model were selected from the complete dataset by imposing optimization constraints that maximize the difference in the logical response of the Boolean network associated with each group of patients given the omic dataset. This mechanistic and predictive model also allows us to classify new patient data into the two different patient response groups. We propose a new method to detect the most relevant proteins for understanding different patient responses upon treatment in order to better target drugs, using a Prior Knowledge Network and proteomics data. The results show the effectiveness of our method.

  11. High Spatial Resolution Forecasting of Long-Term Monthly Precipitation and Mean Temperature Trends in Data Scarce Regions

    NASA Astrophysics Data System (ADS)

    Mosier, T. M.; Hill, D. F.; Sharp, K. V.

    2013-12-01

    High spatial resolution time-series data are critical for many hydrological and earth science studies. Multiple groups have developed historical and forecast datasets of high-resolution monthly time-series for regions of the world such as the United States (e.g. PRISM for hindcast data and MACA for long-term forecasts); however, analogous datasets have not been available for most data-scarce regions. The current work fills this data need by producing and freely distributing hindcast and forecast time-series datasets of monthly precipitation and mean temperature for all global land surfaces, gridded at a 30 arc-second resolution. The hindcast data are constructed through a Delta downscaling method, using as inputs 0.5 degree monthly time-series and 30 arc-second climatology global weather datasets developed by Willmott & Matsuura and WorldClim, respectively. The forecast data are formulated using a similar downscaling method, but with an additional step to remove bias from the climate variable's probability distribution over each region of interest. The downscaling package is designed to be compatible with a number of general circulation models (GCMs) (e.g. with GCMs developed for the IPCC AR4 report and CMIP5), and is presently implemented using time-series data from the NCAR CESM1 model in conjunction with 30 arc-second future decadal climatologies distributed by the Consultative Group on International Agricultural Research. The resulting downscaled datasets are 30 arc-second time-series forecasts of monthly precipitation and mean temperature available for all global land areas. As an example of these data, historical and forecast 30 arc-second monthly time-series from 1950 through 2070 are created and analyzed for the region encompassing Pakistan. For this case study, forecast datasets corresponding to the Representative Concentration Pathway (RCP) 4.5 and 8.5 scenarios developed by the IPCC are presented and compared.
This exercise highlights a range of potential meteorological trends for the Pakistan region and more broadly serves to demonstrate the utility of the presented 30 arc-second monthly precipitation and mean temperature datasets for use in data scarce regions.
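    The Delta downscaling step described above amounts to applying the coarse-grid anomaly (a monthly value minus the coarse climatology) to a high-resolution climatology. A minimal sketch with invented toy grids; a real implementation would first interpolate the coarse anomaly onto the fine grid:

```python
def delta_downscale(coarse_month, coarse_clim, fine_clim, additive=True):
    """Delta-method downscaling sketch: apply the coarse-grid anomaly
    to a high-resolution climatology. Additive deltas are typical for
    temperature; multiplicative ratios are common for precipitation.
    Grids are plain nested lists and are assumed to share a shape here,
    which sidesteps the interpolation a real implementation needs."""
    out = []
    for i, row in enumerate(fine_clim):
        out_row = []
        for j, fine_val in enumerate(row):
            if additive:
                out_row.append(fine_val + (coarse_month[i][j] - coarse_clim[i][j]))
            else:
                out_row.append(fine_val * (coarse_month[i][j] / coarse_clim[i][j]))
        out.append(out_row)
    return out

# A coarse cell 2 degrees above its climatology warms the fine cell by 2.
print(delta_downscale([[12.0]], [[10.0]], [[9.0]]))  # → [[11.0]]
```

    The bias-removal step the abstract mentions for the forecast data would operate on the distribution of these deltas before they are applied.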

  12. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

    PubMed Central

    Wernisch, Lorenz

    2017-01-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190

  13. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

    PubMed

    Gabasova, Evelina; Reid, John; Wernisch, Lorenz

    2017-10-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.
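    The core idea, that global clusters arise from combinations of dataset-specific local cluster assignments, can be illustrated without the Dirichlet machinery: each distinct tuple of local labels defines a candidate global cluster. The real algorithm infers these assignments probabilistically with a hierarchical Dirichlet mixture; this sketch only enumerates the combinations:

```python
def global_clusters(local_assignments):
    """Map each sample's tuple of local (per-dataset) cluster labels to
    a global cluster id: samples sharing the same combination of local
    labels land in the same global cluster. A toy stand-in for the
    hierarchical model, not the Clusternomics inference procedure."""
    combos = {}
    out = []
    for labels in local_assignments:  # one tuple of local labels per sample
        key = tuple(labels)
        if key not in combos:
            combos[key] = len(combos)  # assign ids in order of appearance
        out.append(combos[key])
    return out

# Two datasets: samples 0 and 2 agree in both, sample 1 differs in dataset 2,
# so clusters joined in one dataset can still separate globally.
print(global_clusters([(0, 0), (0, 1), (0, 0)]))  # → [0, 1, 0]
```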

  14. Deep learning based beat event detection in action movie franchises

    NASA Astrophysics Data System (ADS)

    Ejaz, N.; Khan, U. A.; Martínez-del-Amor, M. A.; Sparenberg, H.

    2018-04-01

    Automatic understanding and interpretation of movies can be used in a variety of ways to semantically manage the massive volumes of movie data. The "Action Movie Franchises" dataset is a collection of twenty Hollywood action movies from five famous franchises with ground truth annotations at the shot and beat level of each movie. In this dataset, the annotations are provided for eleven semantic beat categories. In this work, we propose a deep learning based method to classify shots and beat-events on this dataset. The training dataset for each of the eleven beat categories is developed and then a Convolutional Neural Network is trained. After finding the shot boundaries, key frames are extracted for each shot and then three classification labels are assigned to each key frame. The classification labels for each of the key frames in a particular shot are then used to assign a unique label to each shot. A simple sliding window based method is then used to group adjacent shots having the same label in order to find a particular beat event. The results of beat event classification are presented based on criteria of precision, recall, and F-measure. The results are compared with the existing technique and significant improvements are recorded.
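    The shot-grouping step can be sketched as a run-length pass over per-shot labels; the labels and the minimum-run threshold below are invented for illustration, not taken from the paper:

```python
def group_shots_into_beats(shot_labels, min_run=2):
    """Group adjacent shots sharing the same predicted label into
    candidate beat events. Returns (label, start_shot, end_shot) for
    every run of identical labels at least min_run shots long; a toy
    stand-in for the sliding-window grouping the abstract describes."""
    beats = []
    start = 0
    for i in range(1, len(shot_labels) + 1):
        # A run ends at the sequence end or when the label changes.
        if i == len(shot_labels) or shot_labels[i] != shot_labels[start]:
            if i - start >= min_run:
                beats.append((shot_labels[start], start, i - 1))
            start = i
    return beats

labels = ["fight", "fight", "chase", "chase", "chase", "dialog"]
print(group_shots_into_beats(labels))  # → [('fight', 0, 1), ('chase', 2, 4)]
```

    The lone "dialog" shot is dropped because it falls below the run threshold, mirroring how isolated mislabeled shots would be filtered out.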

  15. Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades.

    PubMed

    Orchard, Garrick; Jayawant, Ajinkya; Cohen, Gregory K; Thakor, Nitish

    2015-01-01

    Creating datasets for Neuromorphic Vision is a challenging task. A lack of available recordings from Neuromorphic Vision sensors means that data must typically be recorded specifically for dataset creation rather than collecting and labeling existing data. The task is further complicated by a desire to simultaneously provide traditional frame-based recordings to allow for direct comparison with traditional Computer Vision algorithms. Here we propose a method for converting existing Computer Vision static image datasets into Neuromorphic Vision datasets using an actuated pan-tilt camera platform. Moving the sensor rather than the scene or image is a more biologically realistic approach to sensing and eliminates timing artifacts introduced by monitor updates when simulating motion on a computer monitor. We present conversion of two popular image datasets (MNIST and Caltech101) which have played important roles in the development of Computer Vision, and we provide performance metrics on these datasets using spike-based recognition algorithms. This work contributes datasets for future use in the field, as well as results from spike-based algorithms against which future works can compare. Furthermore, by converting datasets already popular in Computer Vision, we enable more direct comparison with frame-based approaches.

  16. Altered Cortico-Striatal–Thalamic Connectivity in Relation to Spatial Working Memory Capacity in Children with ADHD

    PubMed Central

    Mills, Kathryn L.; Bathula, Deepti; Dias, Taciana G. Costa; Iyer, Swathi P.; Fenesy, Michelle C.; Musser, Erica D.; Stevens, Corinne A.; Thurlow, Bria L.; Carpenter, Samuel D.; Nagel, Bonnie J.; Nigg, Joel T.; Fair, Damien A.

    2012-01-01

    Introduction: Attention deficit hyperactivity disorder (ADHD) captures a heterogeneous group of children, who are characterized by a range of cognitive and behavioral symptoms. Previous resting-state functional connectivity MRI (rs-fcMRI) studies have sought to understand the neural correlates of ADHD by comparing connectivity measurements between those with and without the disorder, focusing primarily on cortical–striatal circuits mediated by the thalamus. To integrate the multiple phenotypic features associated with ADHD and help resolve its heterogeneity, it is helpful to determine how specific circuits relate to unique cognitive domains of the ADHD syndrome. Spatial working memory has been proposed as a key mechanism in the pathophysiology of ADHD. Methods: We correlated the rs-fcMRI of five thalamic regions of interest (ROIs) with spatial span working memory scores in a sample of 67 children aged 7–11 years [ADHD and typically developing children (TDC)]. In an independent dataset, we then examined group differences in thalamo-striatal functional connectivity between 70 ADHD and 89 TDC (7–11 years) from the ADHD-200 dataset. Thalamic ROIs were created based on previous methods that utilize known thalamo-cortical loops and rs-fcMRI to identify functional boundaries in the thalamus. Results/Conclusion: Using these thalamic regions, we found atypical rs-fcMRI between specific thalamic groupings with the basal ganglia. To identify the thalamic connections that relate to spatial working memory in ADHD, only connections identified in both the correlational and comparative analyses were considered. Multiple connections between the thalamus and basal ganglia, particularly between medial and anterior dorsal thalamus and the putamen, were related to spatial working memory and also altered in ADHD. These thalamo-striatal disruptions may be one of multiple atypical neural and cognitive mechanisms that relate to the ADHD clinical phenotype. PMID:22291667

  17. In silico design of novel proton-pump inhibitors with reduced adverse effects.

    PubMed

    Li, Xiaoyi; Kang, Hong; Liu, Wensheng; Singhal, Sarita; Jiao, Na; Wang, Yong; Zhu, Lixin; Zhu, Ruixin

    2018-05-30

    The development of new proton-pump inhibitors (PPIs) with fewer adverse effects by lowering the pKa values of nitrogen atoms in pyrimidine rings has been previously suggested by our group. In this work, we proposed that new PPIs should have the following features: (1) number of ring II = number of ring I + 1; (2) preferably a five-, six-, or seven-membered heteroatomic ring for stability; and (3) 1 < pKa1 < 4. Six molecular scaffolds based on the aforementioned criteria were constructed, and R groups were extracted from compounds in extensive data sources. A virtual molecule dataset was established, and the pKa values of specific atoms on the molecules in the dataset were calculated to select the molecules with the required pKa values. Drug-likeness screening was further conducted to obtain candidates that significantly reduced the adverse effects of long-term PPI use. This study provided insights and tools for designing targeted molecules in silico that are suitable for practical applications.
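    One possible reading of the three scaffold criteria can be expressed as a screening predicate. The interpretation of criterion (1) as a constraint on ring sizes, and the function's inputs, are assumptions for illustration only, not the authors' actual pipeline; real pKa values would come from a prediction tool:

```python
def passes_scaffold_criteria(ring_i_size, ring_ii_size, pka1):
    """Screen a candidate scaffold against one reading of the abstract's
    criteria: ring II one member larger than ring I (assumed meaning of
    criterion 1), rings of five to seven members, and 1 < pKa1 < 4."""
    if ring_ii_size != ring_i_size + 1:
        return False
    if not (5 <= ring_i_size <= 7 and 5 <= ring_ii_size <= 7):
        return False
    return 1.0 < pka1 < 4.0

# A five/six-membered pair with pKa1 = 2.5 passes; a too-basic analog fails.
print(passes_scaffold_criteria(5, 6, 2.5))  # → True
print(passes_scaffold_criteria(5, 6, 4.5))  # → False
```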

  18. MaizeGDB video tutorials, feedback booth and introducing the agBioData working group

    USDA-ARS?s Scientific Manuscript database

    As datasets get larger and more complex, it becomes more difficult for MaizeGDB users to be aware of all the tools and services MaizeGDB provides. In order to help our users, we have made a YouTube channel (https://www.youtube.com/channel/UClV7hOrmTtWjB6fgo_gT_dg) where we are loading all new, short...

  19. The Greenwich Photo-heliographic Results (1874 - 1976): Initial Corrections to the Printed Publications

    NASA Astrophysics Data System (ADS)

    Erwin, E. H.; Coffey, H. E.; Denig, W. F.; Willis, D. M.; Henwood, R.; Wild, M. N.

    2013-11-01

    A new sunspot and faculae digital dataset for the interval 1874 - 1955 has been prepared under the auspices of the NOAA National Geophysical Data Center (NGDC). This digital dataset contains measurements of the positions and areas of both sunspots and faculae published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), under the title Greenwich Photo-heliographic Results (GPR), 1874 - 1976. Quality control (QC) procedures based on logical consistency have been used to identify the more obvious errors in the RGO publications. Typical examples of identifiable errors are North versus South errors in specifying heliographic latitude, errors in specifying heliographic (Carrington) longitude, errors in the dates and times, errors in sunspot group numbers, arithmetic errors in the summation process, and the occasional omission of solar ephemerides. Although the number of errors in the RGO publications is remarkably small, an initial table of necessary corrections is provided for the interval 1874 - 1917. Moreover, as noted in the preceding companion papers, the existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the long programme of solar observations supported first by the Royal Observatory, Greenwich, and then by the Royal Greenwich Observatory.
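    Logical-consistency QC of the kind described can be sketched as range and date validations over each record. The record layout (a dict with these keys) is invented for illustration and is not the NGDC format:

```python
def qc_sunspot_record(rec):
    """Return a list of logical-consistency errors for one sunspot
    record, in the spirit of the GPR quality-control checks: heliographic
    latitude and Carrington longitude in range, and the observation date
    inside the 1874 - 1976 GPR interval. Keys are hypothetical."""
    errors = []
    if not -90.0 <= rec["latitude"] <= 90.0:
        errors.append("heliographic latitude out of range")
    if not 0.0 <= rec["carrington_longitude"] < 360.0:
        errors.append("Carrington longitude out of range")
    if not 1874 <= rec["year"] <= 1976:
        errors.append("date outside GPR interval")
    return errors

# A sign/transcription error in latitude is flagged; a clean record is not.
print(qc_sunspot_record({"latitude": 95.0, "carrington_longitude": 10.0, "year": 1900}))
print(qc_sunspot_record({"latitude": -15.0, "carrington_longitude": 10.0, "year": 1900}))  # → []
```

    North-versus-South errors like those mentioned in the abstract would need cross-checks against neighbouring observations of the same group, not a single-record test.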

  20. [Characteristics of art therapists in rehabilitative therapy].

    PubMed

    Oster, Jörg

    2017-09-01

    Objectives: This study examines the sociodemographic, qualification-related, and activity-related characteristics of art therapists working in the field of rehabilitation. In 2013, an analysis of occupational groups was carried out in Germany with the objective of describing the art therapists working there. A total of 2,303 complete datasets were submitted. From this group, those therapists mainly working in the field of rehabilitation/follow-up care/participation of disabled persons (according to Social Security Code VI and IX, n = 302) were selected and described. Most art therapists are female (average age 45 years) and largely work part-time. Music therapy and art therapy are the most common disciplines. More than 80% have a graduate degree. Methods of quality management are used. More than half of the therapists working in rehabilitation hospitals are employed in the field of psychosomatic medicine. Both individual and group therapy (each patient attending 1-2 times a week) are common. The results provide an overview of art therapy in the field of rehabilitation and show how widespread it is in this setting. Further research is indicated.

  1. Evolutionary History of the Asian Horned Frogs (Megophryinae): Integrative Approaches to Timetree Dating in the Absence of a Fossil Record.

    PubMed

    Mahony, Stephen; Foley, Nicole M; Biju, S D; Teeling, Emma C

    2017-03-01

    Molecular dating studies typically need fossils to calibrate the analyses. Unfortunately, the fossil record is extremely poor or presently nonexistent for many species groups, rendering such dating analysis difficult. One such group is the Asian horned frogs (Megophryinae). Sampling all generic nomina, we combined a novel ∼5 kb dataset composed of four nuclear and three mitochondrial gene fragments to produce a robust phylogeny, with an extensive external morphological study to produce a working taxonomy for the group. Expanding the molecular dataset to include out-groups of fossil-represented ancestral anuran families, we compared the priorless RelTime dating method with the widely used prior-based Bayesian timetree method, MCMCtree, utilizing a novel combination of fossil priors for anuran phylogenetic dating. The phylogeny was then subjected to ancestral phylogeographic analyses, and dating estimates were compared with likely biogeographic vicariant events. Phylogenetic analyses demonstrated that previously proposed systematic hypotheses were incorrect due to the paraphyly of genera. Molecular phylogenetic, morphological, and timetree results support the recognition of Megophryinae as a single genus, Megophrys, with a subgenus level classification. Timetree results using RelTime better corresponded with the known fossil record for the out-group anuran tree. For the priorless in-group, it also outperformed MCMCtree when node date estimates were compared with likely influential historical biogeographic events, providing novel insights into the evolutionary history of this pan-Asian anuran group. Given a relatively small molecular dataset, and limited prior knowledge, this study demonstrates that the computationally rapid RelTime dating tool may outperform more popular and complex prior-reliant timetree methodologies.

  2. Predicting radiotherapy outcomes using statistical learning techniques

    NASA Astrophysics Data System (ADS)

    El Naqa, Issam; Bradley, Jeffrey D.; Lindsay, Patricia E.; Hope, Andrew J.; Deasy, Joseph O.

    2009-09-01

    Radiotherapy outcomes are determined by complex interactions between treatment, anatomical and patient-related variables. A common obstacle to building maximally predictive outcome models for clinical practice is the failure to capture the potential complexity of heterogeneous variable interactions and applicability beyond institutional data. We describe a statistical learning methodology that can automatically screen for nonlinear relations among prognostic variables and generalize to unseen data. In this work, several types of linear and nonlinear kernels for generating interaction terms and approximating the treatment-response function are evaluated. Examples of institutional datasets of esophagitis, pneumonitis and xerostomia endpoints were used. Furthermore, an independent RTOG dataset was used for 'generalizability' validation. We formulated the discrimination between risk groups as a supervised learning problem. The distribution of patient groups was initially analyzed using principal component analysis (PCA) to uncover potential nonlinear behavior. The performance of the different methods was evaluated using bivariate correlations and actuarial analysis. Over-fitting was controlled via cross-validation resampling. Our results suggest that a modified support vector machine (SVM) kernel method provided superior performance on leave-one-out testing compared to logistic regression and neural networks in cases where the data exhibited nonlinear behavior on PCA. For instance, in prediction of esophagitis and pneumonitis endpoints, which exhibited nonlinear behavior on PCA, the method provided 21% and 60% improvements, respectively. Furthermore, evaluation on the independent pneumonitis RTOG dataset demonstrated good generalizability beyond institutional data in contrast with other models. This indicates that the prediction of treatment response can be improved by utilizing nonlinear kernel methods for discovering important nonlinear interactions among model variables. These models have the capacity to predict on unseen data. Part of this work was first presented at the Seventh International Conference on Machine Learning and Applications, San Diego, CA, USA, 11-13 December 2008.
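    The core comparison in this record (a nonlinear-kernel SVM versus logistic regression under leave-one-out testing) can be illustrated with a minimal sketch. This is not the authors' modified SVM kernel or their clinical data; it assumes scikit-learn and uses a synthetic nonlinearly separable dataset.

```python
# Sketch: compare logistic regression with an RBF-kernel SVM under
# leave-one-out cross-validation on synthetic nonlinear data.
# Illustrative only; not the modified SVM kernel of the paper.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Concentric circles: a linear model cannot separate these classes.
X, y = make_circles(n_samples=100, noise=0.15, factor=0.4, random_state=0)

loo = LeaveOneOut()
acc_lr = cross_val_score(LogisticRegression(), X, y, cv=loo).mean()
acc_svm = cross_val_score(SVC(kernel="rbf", C=1.0, gamma="scale"), X, y, cv=loo).mean()

print(f"logistic regression LOO accuracy: {acc_lr:.2f}")
print(f"RBF-kernel SVM LOO accuracy:      {acc_svm:.2f}")
```

    On data with curved class boundaries the kernel method should clearly dominate, mirroring the qualitative finding for the endpoints that exhibited nonlinear behavior on PCA.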

  3. EnviroAtlas - Austin, TX - Land Cover by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset describes the percentage of each block group that is classified as impervious, forest, green space, and agriculture. Forest is defined as Trees & Forest. Green space is defined as Trees & Forest, Grass & Herbaceous, and Agriculture. This dataset also includes the area per capita for each block group for some land cover types. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  4. GNSS climatology: A summary of findings from the COST Action ES1206 GNSS4SWEC

    NASA Astrophysics Data System (ADS)

    Bock, Olivier; Pacione, Rosa

    2017-04-01

    Working Group 3 of COST Action GNSS4SWEC promoted the coordinated development and assessment of GNSS tropospheric products for climate research. More than 50 researchers from 17 institutions participated in the discussions. The activities were organised in five main topics, each of which led to conclusions and recommendations for the proper production and use of GNSS tropospheric products for climate research. 1) GNSS data processing and validation: an inventory was established listing the main existing reprocessed datasets, and one of them (IGS repro1) was more specifically assessed and used as a community dataset to demonstrate the capacity of GNSS to retrieve decadal trends and variability in zenith tropospheric delay (ZTD). Several groups also performed processing sensitivity studies, producing long-term (15 years or more) solutions and testing the impact of various processing parameters (tropospheric models, cutoff angle…) on the accuracy and stability of the retrieved ZTD estimates. 2) Standards and methods for post-processing: (i) elaborate screening methods have been developed and tested for the detection of outliers in ZTD data; (ii) ZTD to IWV conversion methods and auxiliary datasets have been reviewed and assessed; (iii) the homogeneity of long ZTD and IWV time series has been investigated. Standardised procedures were proposed for the first two points. Inhomogeneities have been identified in all reprocessed GNSS datasets, which are due to equipment changes or changes in the measurement conditions. Significant activity is ongoing on the development of statistical homogenisation techniques that match the GNSS data characteristics.
3) IWV validations: new intercomparisons of GNSS IWV estimates to IWV retrieved from other observational techniques (radiosondes, microwave radiometers, VLBI, DORIS…) have been encouraged to extend past results and contribute to a better evaluation of inter-technique biases and the absolute accuracy of the different IWV sensing techniques. 4) GNSS climatology: as a major goal of this working group, applications have been promoted in collaboration with the climate research community, such as the analysis of global and regional trends and variability, the evaluation of global and regional climate model simulations (IPCC, EC-Earth, CORDEX…) and reanalysis products (ERA-Interim, ERA20C, 20CR…). 5) Databases and data formats: cooperation with IGS and EUREF fostered the specification and development of new database structures and an updated SINEX format for a more efficient and enhanced exchange, use, and validation of GNSS tropospheric data.
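    The ZTD-to-IWV conversion reviewed under topic 2 follows a standard recipe: subtract a model zenith hydrostatic delay (ZHD) from the ZTD to get the wet delay, then scale by a factor depending on the weighted mean temperature Tm. A minimal sketch is below, using the widely cited Saastamoinen ZHD model and Bevis-type refractivity constants; the Action's standardized procedure may differ in detail, and the input values are illustrative.

```python
# Sketch of the standard ZTD -> IWV conversion (Saastamoinen ZHD +
# Bevis-type conversion factor). Constants are commonly used values,
# not necessarily those adopted by the COST Action.
import math

def zhd_saastamoinen(p_hpa, lat_deg, h_km):
    """Zenith hydrostatic delay [m] from surface pressure [hPa]."""
    return 0.0022768 * p_hpa / (
        1.0 - 0.00266 * math.cos(2.0 * math.radians(lat_deg)) - 0.00028 * h_km
    )

def ztd_to_iwv(ztd_m, p_hpa, tm_k, lat_deg=45.0, h_km=0.0):
    """Integrated water vapour [kg m^-2] from zenith total delay [m]."""
    rv = 461.5       # specific gas constant of water vapour [J kg^-1 K^-1]
    k2p = 0.221      # k2' refractivity constant [K Pa^-1]
    k3 = 3.739e3     # k3 refractivity constant [K^2 Pa^-1]
    zwd = ztd_m - zhd_saastamoinen(p_hpa, lat_deg, h_km)  # wet delay [m]
    return 1.0e6 * zwd / (rv * (k2p + k3 / tm_k))

iwv = ztd_to_iwv(ztd_m=2.40, p_hpa=1013.25, tm_k=270.0)
print(f"IWV ~ {iwv:.1f} kg/m^2")
```

    The sensitivity of the result to Tm and to the surface pressure used for the ZHD is one reason the working group reviewed auxiliary datasets alongside the conversion method itself.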

  5. ESSG-based global spatial reference frame for datasets interrelation

    NASA Astrophysics Data System (ADS)

    Yu, J. Q.; Wu, L. X.; Jia, Y. J.

    2013-10-01

    To better understand the highly complex earth system, a large volume, as well as a large variety, of datasets on the planet Earth are being obtained, distributed, and shared worldwide every day. However, few existing systems concentrate on the distribution and interrelation of different datasets in a common Global Spatial Reference Frame (GSRF), which poses an invisible obstacle to data sharing and scientific collaboration. The Group on Earth Observations (GEO) has recently established a new GSRF, named the Earth System Spatial Grid (ESSG), for global dataset distribution, sharing and interrelation in its 2012-2015 Work Plan. The ESSG may bridge the gap among different spatial datasets and hence overcome these obstacles. This paper presents the implementation of the ESSG-based GSRF. A reference spheroid, a grid subdivision scheme, and a suitable encoding system are required to implement it. The radius of the ESSG reference spheroid was set to double the approximate Earth radius, so that datasets from different areas of earth system science are covered. The same positioning and orientation parameters as Earth Centred Earth Fixed (ECEF) were adopted for the ESSG reference spheroid, so that any other GSRF can be freely transformed into the ESSG-based GSRF. The spheroid degenerated octree grid with radius refinement (SDOG-R) and its encoding method were taken as the grid subdivision and encoding scheme for their good performance in many aspects. A triple (C, T, A) model is introduced to represent and link different datasets based on the ESSG-based GSRF. Finally, methods of coordinate transformation between the ESSG-based GSRF and other GSRFs are presented to make the ESSG-based GSRF operable and propagable.

  6. Preprocessed Consortium for Neuropsychiatric Phenomics dataset.

    PubMed

    Gorgolewski, Krzysztof J; Durnez, Joke; Poldrack, Russell A

    2017-01-01

    Here we present preprocessed MRI data of 265 participants from the Consortium for Neuropsychiatric Phenomics (CNP) dataset. The preprocessed dataset includes minimally preprocessed data in the native, MNI and surface spaces, accompanied by potential confound regressors, tissue probability masks, brain masks and transformations. In addition, the preprocessed dataset includes unthresholded group-level and single-subject statistical maps from all tasks included in the original dataset. We hope that the availability of this dataset will greatly accelerate research.

  7. Process mining in oncology using the MIMIC-III dataset

    NASA Astrophysics Data System (ADS)

    Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen

    2018-03-01

    Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option, and this paper describes the potential to use MIMIC-III for process mining in oncology. MIMIC-III is a large open-access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them has used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining, and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed, along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
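    The basic object process mining extracts from event tables like MIMIC-III's is an event log (case, activity, timestamp), from which artefacts such as a directly-follows graph can be counted. The sketch below uses a tiny hypothetical event table, not real MIMIC-III columns, which would first need mapping to case/activity/timestamp roles.

```python
# Sketch: derive a directly-follows graph (a basic process-mining
# artefact) from a MIMIC-style event table. The rows here are
# hypothetical; real MIMIC-III tables need column mapping first.
from collections import Counter

events = [  # (case_id, activity, timestamp)
    ("pat1", "admission", 1), ("pat1", "chemotherapy", 2), ("pat1", "discharge", 3),
    ("pat2", "admission", 1), ("pat2", "surgery", 2),
    ("pat2", "chemotherapy", 3), ("pat2", "discharge", 4),
]

# Group events per case, ordered by timestamp.
cases = {}
for case, act, ts in sorted(events, key=lambda e: (e[0], e[2])):
    cases.setdefault(case, []).append(act)

# Count how often activity a is directly followed by activity b.
dfg = Counter()
for trace in cases.values():
    for a, b in zip(trace, trace[1:]):
        dfg[(a, b)] += 1

for (a, b), n in sorted(dfg.items()):
    print(f"{a} -> {b}: {n}")
```

    Discovery algorithms then turn these counts into a process model; dedicated libraries exist for that step, but the log construction above is where most of the data quality issues discussed in the paper surface.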

  8. The French-Canadian data set of Demirjian for dental age estimation: a systematic review and meta-analysis.

    PubMed

    Jayaraman, Jayakumar; Wong, Hai Ming; King, Nigel M; Roberts, Graham J

    2013-07-01

    Estimation of the age of an individual can be performed by evaluating the pattern of dental development. A dataset for age estimation based on the dental maturity of a French-Canadian population was published over 35 years ago and has become the most widely accepted dataset. The applicability of this dataset has been tested on different population groups. The aim was to estimate the observed differences between chronological age (CA) and dental age (DA) when the French-Canadian dataset was used to estimate the age of different population groups. A systematic search of the literature for papers utilizing the French-Canadian dataset for age estimation was performed. All-language articles from the PubMed, Embase and Cochrane databases were electronically searched for the terms 'Demirjian' and 'Dental age' published between January 1973 and December 2011. A hand search of articles was also conducted. A total of 274 studies were identified, from which 34 studies were included for qualitative analysis and 12 studies for quantitative assessment and meta-analysis. When synthesizing the estimation results from different population groups, on average, the Demirjian dataset overestimated the age of females by 0.65 years (-0.10 years to +2.82 years) and of males by 0.60 years (-0.23 years to +3.04 years). The French-Canadian dataset overestimates the age of subjects by more than six months, and hence it should be used only with considerable caution when estimating the age of groups of subjects from any global population.

  9. The IRI/LDEO Climate Data Library: Helping People use Climate Data

    NASA Astrophysics Data System (ADS)

    Blumenthal, M. B.; Grover-Kopec, E.; Bell, M.; del Corral, J.

    2005-12-01

    The IRI Climate Data Library (http://iridl.ldeo.columbia.edu/) is a library of datasets. By library we mean a collection of things, collected from both near and far, designed to make them more accessible for the library's users. Our datasets come from many different sources, many different "data cultures", many different formats. By dataset we mean a collection of data organized as multidimensional dependent variables, independent variables, and sub-datasets, along with the metadata (particularly use-metadata) that makes it possible to interpret the data in a meaningful manner. Ingrid, which provides the infrastructure for the Data Library, is an environment that lets one work with datasets: read, write, request, serve, view, select, calculate, transform, and more. It hides an extraordinary amount of technical detail from the user, letting the user think in terms of manipulations of datasets rather than manipulations of files of numbers. Among other things, this hidden technical detail could be accessing data on servers in other places, doing only the small needed portion of an enormous calculation, or translating to and from a variety of formats and between "data cultures". These operations are presented as a collection of virtual directories and documents on a web server, so that an ordinary web client can instantiate a calculation simply by requesting the resulting document or image. Building on this infrastructure, we (and others) have created collections of dynamically updated images to facilitate monitoring aspects of the climate system, as well as linking these images to the underlying data. We have also created specialized interfaces to address the particular needs of user groups that IRI needs to support.

  10. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    PubMed Central

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the "large d, small n" characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. A simulation study shows that it outperforms existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach. PMID:23938111
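    The MCP mentioned above has a simple closed form: it matches the lasso penalty near zero but flattens out beyond a threshold, which reduces the bias on large coefficients. A minimal sketch of the standard definition (not code from the paper) follows.

```python
# Sketch of the MCP (minimax concave penalty) used for marker selection.
# Standard textbook definition: rho(t) = lam*|t| - t^2/(2*gamma) for
# |t| <= gamma*lam, and the constant gamma*lam^2/2 beyond that.
import numpy as np

def mcp(t, lam, gamma):
    """Elementwise MCP penalty value."""
    t = np.abs(np.asarray(t, dtype=float))
    quad = lam * t - t**2 / (2.0 * gamma)   # concave region |t| <= gamma*lam
    flat = 0.5 * gamma * lam**2             # constant beyond gamma*lam
    return np.where(t <= gamma * lam, quad, flat)

coefs = np.array([0.0, 0.5, 2.0, 10.0])
print(mcp(coefs, lam=1.0, gamma=3.0))
```

    In the sparse *group* MCP, this penalty is applied to group norms rather than to individual coefficients, so entire groups of markers can enter or leave the model together.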

  11. Mapping the spatial distribution of global anthropogenic mercury atmospheric emission inventories

    NASA Astrophysics Data System (ADS)

    Wilson, Simon J.; Steenhuisen, Frits; Pacyna, Jozef M.; Pacyna, Elisabeth G.

    This paper describes the procedures employed to spatially distribute global inventories of anthropogenic emissions of mercury to the atmosphere, prepared by Pacyna, E.G., Pacyna, J.M., Steenhuisen, F., Wilson, S. [2006. Global anthropogenic mercury emission inventory for 2000. Atmospheric Environment, this issue, doi:10.1016/j.atmosenv.2006.03.041], and briefly discusses the results of this work. A new spatially distributed global emission inventory for the (nominal) year 2000, and a revised version of the 1995 inventory, are presented. Emission estimates for total mercury and major species groups are distributed within latitude/longitude-based grids with resolutions of 1°×1° and 0.5°×0.5°. A key component in the spatial distribution procedure is the use of population distribution as a surrogate parameter to distribute emissions from sources that cannot be accurately geographically located. In this connection, new gridded population datasets were prepared, based on the CIESIN GPW3 datasets (CIESIN, 2004. Gridded Population of the World (GPW), Version 3. Center for International Earth Science Information Network (CIESIN), Columbia University and Centro Internacional de Agricultura Tropical (CIAT). GPW3 data are available at http://beta.sedac.ciesin.columbia.edu/gpw/index.jsp). The spatially distributed emission inventories and population datasets prepared in the course of this work are available on the Internet at www.amap.no/Resources/HgEmissions/

  12. Identifying ecological "sweet spots" underlying cyanobacteria functional group dynamics from long-term observations using a statistical machine learning approach

    NASA Astrophysics Data System (ADS)

    Nelson, N.; Munoz-Carpena, R.; Phlips, E. J.

    2017-12-01

    Diversity in the eco-physiological adaptations of cyanobacteria genera creates challenges for water managers who are tasked with developing appropriate actions for controlling not only the intensity and frequency of cyanobacteria blooms, but also reducing the potential for blooms of harmful taxa (e.g., toxin producers, N2 fixers). Compounding these challenges, the efficacy of nutrient management strategies (phosphorus-only versus nitrogen-and-phosphorus) for cyanobacteria bloom abatement is the subject of an ongoing debate, which increases uncertainty associated with bloom mitigation decision-making. In this work, we analyze a unique long-term (17-year) dataset composed of monthly observations of cyanobacteria genera abundances, zooplankton abundances, water quality, and flow from Lake George, a bloom-impacted flow-through lake of the St. Johns River (FL, USA). Using the Random Forests machine learning algorithm, an assumption-free ensemble modeling approach, the dataset was evaluated to quantify and characterize relationships between environmental conditions and seven cyanobacteria groupings: five genera (Anabaena, Cylindrospermopsis, Lyngbya, Microcystis, and Oscillatoria) and two functional groups (N2 fixers and non-fixers). Results highlight the selectivity of nitrogen in describing genera and functional group dynamics, and potential for physical effects to limit the efficacy of nutrient management as a mechanism for cyanobacteria bloom mitigation.
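    The analysis pattern described here (fitting a Random Forest to long-term environmental observations and ranking the drivers of cyanobacteria abundance) can be sketched as follows. This assumes scikit-learn and uses entirely synthetic data with illustrative variable names, not the Lake George dataset.

```python
# Sketch: Random Forest regression linking environmental drivers to a
# cyanobacteria abundance response, then ranking driver importance.
# Synthetic data; variable names are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
nitrogen = rng.uniform(0, 2, n)
phosphorus = rng.uniform(0, 1, n)
temperature = rng.uniform(15, 35, n)
flow = rng.uniform(0, 5, n)

# Simulated response: abundance depends nonlinearly on nitrogen and
# temperature only; phosphorus and flow are noise variables here.
abundance = np.exp(nitrogen) * (temperature > 25) + rng.normal(0, 0.3, n)

X = np.column_stack([nitrogen, phosphorus, temperature, flow])
names = ["nitrogen", "phosphorus", "temperature", "flow"]

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, abundance)
importances = dict(zip(names, rf.feature_importances_))
for name, imp in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {imp:.2f}")
```

    Because the ensemble makes no linearity assumptions, threshold-type responses like the simulated temperature gate are picked up automatically, which is the property that makes this approach suited to identifying ecological "sweet spots".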

  13. National Geospatial Data Asset Lifecycle Baseline Maturity Assessment for the Federal Geographic Data Committee

    NASA Astrophysics Data System (ADS)

    Peltz-Lewis, L. A.; Blake-Coleman, W.; Johnston, J.; DeLoatch, I. B.

    2014-12-01

    The Federal Geographic Data Committee (FGDC) is designing a portfolio management process for 193 geospatial datasets contained within the 16 topical National Spatial Data Infrastructure themes managed under OMB Circular A-16, "Coordination of Geographic Information and Related Spatial Data Activities." The 193 datasets are designated as National Geospatial Data Assets (NGDA) because of their significance to the missions of multiple levels of government, partners and stakeholders. As a starting point, the data managers of these NGDAs will conduct a baseline maturity assessment of the dataset(s) for which they are responsible. Maturity is measured against benchmarks for each of the seven stages of the data lifecycle management framework promulgated within the OMB Circular A-16 Supplemental Guidance issued by OMB in November 2010. This framework was developed by the interagency Lifecycle Management Work Group (LMWG), consisting of 16 Federal agencies, under the 2004 Presidential Initiative, the Geospatial Line of Business, using OMB Circular A-130, "Management of Federal Information Resources," as guidance. The seven lifecycle stages are: Define, Inventory/Evaluate, Obtain, Access, Maintain, Use/Evaluate, and Archive. This paper will focus on the Lifecycle Baseline Maturity Assessment and efforts to integrate the FGDC approach with other data maturity assessments.

  14. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation

    PubMed Central

    Pujar, Shashikant; O’Leary, Nuala A; Farrell, Catherine M; Mudge, Jonathan M; Wallin, Craig; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bult, Carol J; Frankish, Adam; Pruitt, Kim D

    2018-01-01

    Abstract The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. PMID:29126148

  15. A group LASSO-based method for robustly inferring gene regulatory networks from multiple time-course datasets.

    PubMed

    Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun

    2014-01-01

    As an abstract mapping of the gene regulations in the cell, a gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets while taking robustness to large errors or outliers into account. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulated datasets and real experimental datasets. It shows that Huber group LASSO outperforms the group LASSO in terms of both the areas under receiver operating characteristic curves and the areas under precision-recall curves. The convergence analysis of the algorithm theoretically shows that the sequence generated by the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving resistance to large errors or outliers.
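    The "group" behaviour that both methods in this record rely on comes from penalizing a group's coefficient vector through its joint norm, whose proximal operator is group soft-thresholding: the whole group is shrunk together and dropped entirely when its norm falls below the threshold. The sketch below shows that generic textbook operator, not the paper's full Huber group LASSO algorithm.

```python
# Sketch: the group soft-thresholding (proximal) operator behind group
# LASSO penalties. A group's coefficients are shrunk jointly and zeroed
# as a whole when the group norm is below the threshold lam.
import numpy as np

def group_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2 applied to a group vector v."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)        # the whole group is eliminated
    return (1.0 - lam / norm) * v      # joint shrinkage toward zero

weak_group = np.array([0.1, -0.1])     # small joint norm -> dropped
strong_group = np.array([3.0, 4.0])    # norm 5 -> shrunk, kept

print(group_soft_threshold(weak_group, lam=0.5))
print(group_soft_threshold(strong_group, lam=0.5))
```

    In the multi-dataset setting, one "group" collects the coefficients of the same candidate edge across all datasets, so an edge is kept or discarded consistently across datasets, which is exactly how the shared topology is enforced.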

  16. A high level interface to SCOP and ASTRAL implemented in python.

    PubMed

    Casbon, James A; Crooks, Gavin E; Saqi, Mansoor A S

    2006-01-10

    Benchmarking algorithms in structural bioinformatics often involves the construction of datasets of proteins with given sequence and structural properties. The SCOP database is a manually curated structural classification which groups together proteins on the basis of structural similarity. The ASTRAL compendium provides non-redundant subsets of SCOP domains on the basis of sequence similarity, such that no two domains in a given subset share more than a defined degree of sequence similarity. Taken together, these two resources provide a 'ground truth' for assessing structural bioinformatics algorithms. We present a small and easy-to-use API written in python to enable construction of datasets from these resources. We have designed a set of python modules to provide an abstraction of the SCOP and ASTRAL databases. The modules are designed to work as part of the Biopython distribution. Python users can now manipulate and use the SCOP hierarchy from within python programs, and use ASTRAL to return sequences of domains in SCOP, as well as clustered representations of SCOP from ASTRAL. The modules make the analysis and generation of datasets for use in structural genomics easier and more principled.

  17. Benchmarking Deep Learning Models on Large Healthcare Datasets.

    PubMed

    Purushotham, Sanjay; Meng, Chuizheng; Che, Zhengping; Liu, Yan

    2018-06-04

    Deep learning models (aka Deep Neural Networks) have revolutionized many fields including computer vision, natural language processing and speech recognition, and are being increasingly used in clinical healthcare applications. However, few works exist that have benchmarked the performance of deep learning models against state-of-the-art machine learning models and prognostic scoring systems on publicly available healthcare datasets. In this paper, we present benchmarking results for several clinical prediction tasks such as mortality prediction, length-of-stay prediction, and ICD-9 code group prediction using deep learning models, an ensemble of machine learning models (the Super Learner algorithm), and the SAPS II and SOFA scores. We used the Medical Information Mart for Intensive Care III (MIMIC-III) (v1.4) publicly available dataset, which includes all patients admitted to an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012, for the benchmarking tasks. Our results show that deep learning models consistently outperform all the other approaches, especially when the 'raw' clinical time series data are used as input features to the models.

  18. EnviroAtlas - Durham, NC - Land Cover Summaries by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset describes the percentage of each block group that is classified as impervious, forest, green space, wetland, and agriculture. Impervious is a combination of dark and light impervious. Green space is a combination of Trees & Forest and Grass & Herbaceous. This dataset also includes the area per capita for each block group for impervious, forest, and green space land cover. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  19. Bridging the gap between real-life data and simulated data by providing a highly realistic fall dataset for evaluating camera-based fall detection algorithms.

    PubMed

    Baldewijns, Greet; Debard, Glen; Mertes, Gert; Vanrumste, Bart; Croonenborghs, Tom

    2016-03-01

    Fall incidents are an important health hazard for older adults. Automatic fall detection systems can reduce the consequences of a fall incident by assuring that timely aid is given. The development of these systems is therefore receiving a lot of research attention. Real-life data which can help evaluate the results of this research are, however, sparse. Moreover, research groups that have this type of data are not at liberty to share it. Most research groups thus use simulated datasets. These simulation datasets, however, often do not incorporate the challenges the fall detection system will face when implemented in real life. In this Letter, a more realistic simulation dataset is presented to fill this gap between real-life data and currently available datasets. It was recorded while re-enacting real-life falls recorded during previous studies, and it incorporates the challenges faced by fall detection algorithms in real life. A fall detection algorithm from Debard et al. was evaluated on this dataset. This evaluation showed that the dataset poses extra challenges compared with other publicly available datasets. In this Letter, the dataset is discussed, as well as the results of this preliminary evaluation of the fall detection algorithm. The dataset can be downloaded from www.kuleuven.be/advise/datasets.

  20. On the visualization of water-related big data: extracting insights from drought proxies' datasets

    NASA Astrophysics Data System (ADS)

    Diaz, Vitali; Corzo, Gerald; van Lanen, Henny A. J.; Solomatine, Dimitri

    2017-04-01

    Big data is a growing area of science from which hydroinformatics can benefit greatly. There have been a number of important developments in data science aimed at the analysis of large datasets. Water-related datasets of this kind include measurements, simulations, reanalyses, scenario analyses and proxies. By convention, information contained in these databases is referenced to a specific time and location (i.e., longitude/latitude). This work is motivated by the need to extract insights from large water-related datasets, i.e., to transform large amounts of data into useful information that helps to better understand water-related phenomena, particularly drought. In this context, data visualization, a part of data science, involves techniques for encoding data as visual graphical objects, which may help to better understand the data and detect trends. Based on existing methods of data analysis and visualization, this work aims to develop tools for visualizing large water-related datasets. These tools were built on existing data-visualization libraries and produce a group of graphs that includes polar area diagrams (PADs) and radar charts (RDs). In both types of graph, time steps are represented by the polar angles and the percentages of area in drought by the radii. For illustration, three large datasets of drought proxies are used to identify trends, drought-prone areas and the spatio-temporal variability of drought in a set of case studies. The datasets are (1) SPI-TS2p1 (1901-2002, 11.7 GB), (2) SPI-PRECL0p5 (1948-2016, 7.91 GB) and (3) SPEI-baseV2.3 (1901-2013, 15.3 GB). All are on a monthly basis with a spatial resolution of 0.5 degrees. The first two were retrieved from the repository of the International Research Institute for Climate and Society (IRI); they are included in the Analyses Standardized Precipitation Index (SPI) project (iridl.ldeo.columbia.edu/SOURCES/.IRI/.Analyses/.SPI/). The third dataset was retrieved from the Standardized Precipitation Evaporation Index (SPEI) Monitor (digital.csic.es/handle/10261/128892). PADs were found suitable for identifying the spatio-temporal variability and prone areas of drought, while drought trends were visually detected using both PADs and RDs. A similar approach can be followed to include other types of graphs for the analysis of water-related big data. Key words: Big data, data visualization, drought, SPI, SPEI
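    The polar encoding described above (time steps as angles, percentage of area in drought as radii) can be sketched in a few lines. The SPI grid, the drought threshold of -1, and the even angular spacing are illustrative assumptions, not details taken from the datasets themselves.

```python
import numpy as np

def drought_area_percentage(spi, threshold=-1.0):
    """Percentage of grid cells in drought (SPI <= threshold) per time step.

    spi: array of shape (n_months, n_lat, n_lon); NaN marks no-data cells.
    """
    valid = ~np.isnan(spi)
    in_drought = (spi <= threshold) & valid
    return 100.0 * in_drought.sum(axis=(1, 2)) / valid.sum(axis=(1, 2))

def polar_area_coords(percentages):
    """Map each time step to (angle, radius) for a polar area diagram:
    time steps are spread evenly around the circle and the radii are the
    percentages of area in drought."""
    n = len(percentages)
    angles = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    return angles, np.asarray(percentages)

# Toy example: 12 monthly SPI fields on a 4x4 grid.
rng = np.random.default_rng(0)
spi = rng.normal(size=(12, 4, 4))
pct = drought_area_percentage(spi)
angles, radii = polar_area_coords(pct)
```

    The (angle, radius) pairs can then be fed to any polar plotting routine to draw the PAD wedges or the RD outline.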

  1. Deep learning and texture-based semantic label fusion for brain tumor segmentation

    NASA Astrophysics Data System (ADS)

    Vidyaratne, L.; Alam, M.; Shboul, Z.; Iftekharuddin, K. M.

    2018-02-01

    Brain tumor segmentation is a fundamental step in surgical treatment and therapy. Many hand-crafted and learning-based methods have been proposed for automatic brain tumor segmentation from MRI, and studies have shown that these approaches have their own inherent advantages and limitations. This work proposes a semantic label fusion algorithm that combines two representative state-of-the-art segmentation approaches, one texture-based and hand-crafted, the other based on deep learning, to obtain robust tumor segmentation. We evaluate the proposed method using the publicly available BRATS 2017 brain tumor segmentation challenge dataset. The results show that the proposed method offers improved segmentation by alleviating the inherent weaknesses of each: the extensive false positives of the texture-based method and the false tumor tissue classification problem of the deep learning method. Furthermore, we investigate the effect of patient gender on segmentation performance using a subset of the validation dataset. Notably, the substantial improvement in brain tumor segmentation performance achieved in this work recently enabled our group to secure first place in the overall patient survival prediction task at the BRATS 2017 challenge.
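    The abstract does not spell out the fusion rule, so the following is only a hypothetical per-voxel sketch of combining a texture-based binary tumor mask with deep-learning tissue labels; the agreement rule and the toy arrays are assumptions, not the authors' method.

```python
import numpy as np

def fuse_labels(texture_mask, dl_labels):
    """Hypothetical per-voxel fusion: a voxel keeps the deep-learning
    tissue label only where the texture-based method also flags tumor,
    which suppresses detections made by only one of the two methods.

    texture_mask: binary array (1 = tumor) from the texture-based method.
    dl_labels: integer array (0 = background, 1..k = tumor tissue classes).
    """
    return np.where(texture_mask.astype(bool) & (dl_labels > 0), dl_labels, 0)

texture = np.array([[1, 1, 0],
                    [0, 1, 0],
                    [0, 0, 1]])
dl = np.array([[2, 0, 3],
               [0, 2, 0],
               [0, 0, 0]])
print(fuse_labels(texture, dl))   # only voxels both methods agree on survive
```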

  2. Deep Learning and Texture-Based Semantic Label Fusion for Brain Tumor Segmentation.

    PubMed

    Vidyaratne, L; Alam, M; Shboul, Z; Iftekharuddin, K M

    2018-01-01

    Brain tumor segmentation is a fundamental step in surgical treatment and therapy. Many hand-crafted and learning-based methods have been proposed for automatic brain tumor segmentation from MRI, and studies have shown that these approaches have their own inherent advantages and limitations. This work proposes a semantic label fusion algorithm that combines two representative state-of-the-art segmentation approaches, one texture-based and hand-crafted, the other based on deep learning, to obtain robust tumor segmentation. We evaluate the proposed method using the publicly available BRATS 2017 brain tumor segmentation challenge dataset. The results show that the proposed method offers improved segmentation by alleviating the inherent weaknesses of each: the extensive false positives of the texture-based method and the false tumor tissue classification problem of the deep learning method. Furthermore, we investigate the effect of patient gender on segmentation performance using a subset of the validation dataset. Notably, the substantial improvement in brain tumor segmentation performance achieved in this work recently enabled our group to secure first place in the overall patient survival prediction task at the BRATS 2017 challenge.

  3. High precision automated face localization in thermal images: oral cancer dataset as test case

    NASA Astrophysics Data System (ADS)

    Chakraborty, M.; Raman, S. K.; Mukhopadhyay, S.; Patsa, S.; Anjum, N.; Ray, J. G.

    2017-02-01

    Automated face detection is the pivotal step in computer-vision-aided facial medical diagnosis and biometrics. This paper presents an automatic, subject-adaptive framework for accurate face detection in the long-infrared spectrum on our oral cancer detection database, which consists of malignant, precancerous and normal subjects of varied age groups. Previous work on oral cancer detection using Digital Infrared Thermal Imaging (DITI) reveals that patients and normal subjects differ significantly in their facial thermal distribution. It is therefore a challenging task to formulate a completely adaptive framework to accurately localize the face in such a subject-specific modality. Our model first extracts the most probable facial regions by minimum error thresholding, followed by adaptive methods that leverage the horizontal and vertical projections of the segmented thermal image. Additionally, the model incorporates domain knowledge by exploiting the temperature difference between strategic locations on the face. To the best of our knowledge, this is the first work on detecting faces in thermal facial images comprising both patients and normal subjects. Previous work on face detection has not specifically targeted automated medical diagnosis; the face bounding boxes returned by those algorithms are thus loose and not apt for further medical automation. Our algorithm significantly outperforms contemporary face detection algorithms in terms of commonly used metrics for evaluating face detection accuracy. Since our method has been tested on a challenging dataset consisting of both patients and normal subjects of diverse age groups, it can be seamlessly adapted to any DITI-guided facial healthcare or biometric application.
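    A minimal sketch of the threshold-plus-projections idea follows, with a simple percentile threshold standing in for the paper's minimum error thresholding; the synthetic frame and the 90th-percentile cutoff are illustrative assumptions.

```python
import numpy as np

def localize_face(thermal, percentile=90):
    """Segment the warmest pixels, then take row/column projections of the
    binary mask and clip the bounding box where the projections are
    non-zero. (A percentile cutoff stands in for minimum error
    thresholding here.)"""
    mask = thermal >= np.percentile(thermal, percentile)
    rows = mask.sum(axis=1)   # horizontal projection
    cols = mask.sum(axis=0)   # vertical projection
    r = np.flatnonzero(rows)
    c = np.flatnonzero(cols)
    return r[0], r[-1], c[0], c[-1]  # top, bottom, left, right

# Synthetic frame: a warm 4x3 "face" patch on a cool background.
img = np.zeros((10, 10))
img[2:6, 3:6] = 30.0
print(localize_face(img))   # -> (2, 5, 3, 5)
```

    The subject-adaptive parts of the paper (adjusting the threshold and exploiting temperature differences between facial landmarks) would refine this crude box.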

  4. A New Global Group Velocity Dataset for Constraining Crust and Upper Mantle Properties

    NASA Astrophysics Data System (ADS)

    Ma, Z.; Masters, G.; Laske, G.; Pasyanos, M. E.

    2010-12-01

    We are improving our CRUST2.0 model into a new LITHO1.0 model, refining the nominal resolution to 1 degree and including lithospheric structure. The new model is constrained by many datasets, including very large datasets of surface wave group velocity built using a new, efficient measurement technique. This technique starts in a similar fashion to traditional frequency-time analysis, but instead of making measurements at all frequencies for a single source-station pair, we apply cluster analysis to make measurements on all recordings of a single event at a single target frequency. By changing the nominal frequencies of the bandpass filter, we filter each trace until the centroid frequency of the band-passed spectrum matches the target frequency. We have processed all the LH data from IRIS (and some of the BH data from PASSCAL experiments and the POLARIS network) from 1976 to 2007. The Rayleigh wave group velocity dataset is complete from 10 mHz to 40 mHz at increments of 2.5 mHz, with about 330,000 measurements at 10 and 20 mHz, 200,000 at 30 mHz and 110,000 at 40 mHz. We are also building a similar dataset for Love waves, though its size will be about half that of the Rayleigh wave dataset. The SMAD of the group arrival time difference between our global dataset and other, more regional datasets is about 12 seconds at 20 mHz, 9 seconds at 30 mHz, and 7 seconds at 40 mHz. Though these discrepancies are about twice as large as our measurement precision (estimated by examining group arrival time differences between closely spaced stations), they are still much smaller than the signal in the data (e.g., the group arrival time at 20 mHz can differ from the prediction of a 1D Earth by over 250 seconds). The fact that there is no systematic bias between the datasets encourages us to combine them to improve coverage of some key areas. Group velocity maps inverted from the combined datasets show many interesting signals, though the dominant signal is related to variations in crustal thickness. At 20 mHz, group velocity perturbations from the global mean range from -25% to 11%, with a standard deviation of 4%. We adjust the smoothing of lateral structure in the inversion so that the error of the inferred group velocity is nearly uniform globally. At 20 mHz, a 0.1% error in group velocity resolves features of 9 degrees or less everywhere; the resolution is 5.5 degrees in Eurasia and 4.5-5 degrees in North America. At 30 mHz, for the same 0.1% error, we can resolve structure of 10 degrees globally, 6.5 degrees in Eurasia and 5.5 degrees in North America.

  5. PAIR Comparison between Two Within-Group Conditions of Resting-State fMRI Improves Classification Accuracy

    PubMed Central

    Zhou, Zhen; Wang, Jian-Bao; Zang, Yu-Feng; Pan, Gang

    2018-01-01

    Classification approaches have been increasingly applied to differentiate patients from normal controls using resting-state functional magnetic resonance imaging (RS-fMRI) data. Although most previous classification studies have reported promising accuracy within individual datasets, achieving high accuracy across multiple datasets remains challenging for two main reasons: high dimensionality and high variability across subjects. We used two independent RS-fMRI datasets (n = 31 and 46, respectively), each with eyes-closed (EC) and eyes-open (EO) conditions. For each dataset, we first reduced the number of features to a small number of brain regions with paired t-tests, using the amplitude of low-frequency fluctuation (ALFF) as a metric. Second, we employed a new method for feature extraction, named the PAIR method, which examines EC and EO as paired rather than independent conditions. Specifically, for each dataset, we obtained EC minus EO (EC-EO) maps of ALFF from half of the subjects (n = 15 for dataset-1, n = 23 for dataset-2) and EO-EC maps from the other half (n = 16 for dataset-1, n = 23 for dataset-2). A support vector machine (SVM) was used to classify EC and EO RS-fMRI maps. The mean classification accuracy of the PAIR method was 91.40% for dataset-1 and 92.75% for dataset-2 in the conventional frequency band of 0.01-0.08 Hz. For cross-dataset validation, we applied the classifier from dataset-1 directly to dataset-2, and vice versa; the mean accuracy was 94.93% for dataset-1 to dataset-2 and 90.32% for dataset-2 to dataset-1 in the 0.01-0.08 Hz band. For the UNPAIR method, classification accuracy was substantially lower (mean 69.89% for dataset-1 and 82.97% for dataset-2), and much lower still for cross-dataset validation (64.69% for dataset-1 to dataset-2 and 64.98% for dataset-2 to dataset-1) in the 0.01-0.08 Hz band. In conclusion, for within-group design studies (e.g., paired conditions or follow-up studies), we recommend the PAIR method for feature extraction. In addition, dimensionality reduction with strong prior knowledge of specific brain regions should also be considered for feature selection in neuroimaging studies. PMID:29375288
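    The PAIR idea of classifying paired difference maps rather than raw maps can be illustrated on synthetic ALFF values. The toy effect sizes and region counts are invented, and a nearest-mean classifier stands in for the paper's SVM.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sub, n_roi = 20, 6

# Toy ALFF values in a few t-test-selected regions: assume (illustratively)
# that eyes-closed (EC) raises ALFF in the first three regions relative to
# eyes-open (EO) for every subject.
base = rng.normal(1.0, 0.1, size=(n_sub, n_roi))
effect = np.concatenate([np.full(3, 0.3), np.zeros(3)])
ec = base + effect + 0.05 * rng.normal(size=(n_sub, n_roi))
eo = base + 0.05 * rng.normal(size=(n_sub, n_roi))

# PAIR-style features: subtract each subject's own paired condition, so the
# classifier sees EC-EO and EO-EC difference maps rather than raw maps.
train = np.vstack([ec[:10] - eo[:10], eo[:10] - ec[:10]])
labels = np.array([1] * 10 + [0] * 10)          # 1 = EC-EO, 0 = EO-EC

# Nearest-mean classifier (a simple stand-in for the paper's SVM).
mu1 = train[labels == 1].mean(axis=0)
mu0 = train[labels == 0].mean(axis=0)

test = np.vstack([ec[10:] - eo[10:], eo[10:] - ec[10:]])
truth = np.array([1] * 10 + [0] * 10)
pred = (np.linalg.norm(test - mu1, axis=1)
        < np.linalg.norm(test - mu0, axis=1)).astype(int)
accuracy = (pred == truth).mean()
```

    Subtracting the paired condition removes each subject's baseline, which is why the difference features generalize across subjects far better than raw maps.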

  6. On standardization of basic datasets of electronic medical records in traditional Chinese medicine.

    PubMed

    Zhang, Hong; Ni, Wandong; Li, Jing; Jiang, Youlin; Liu, Kunjing; Ma, Zhaohui

    2017-12-24

    Standardization of electronic medical records, so as to enable resource sharing and information exchange among medical institutions, has become inevitable in view of the ever-increasing volume of medical information. The current research is an effort toward the standardization of basic datasets of electronic medical records in traditional Chinese medicine. In this work, an outpatient clinical information model and an inpatient clinical information model are created to adequately depict the diagnosis processes and treatment procedures of traditional Chinese medicine. To be backward compatible with the existing dataset standard created for western medicine, the new standard must be a superset of the existing standard. Thus, the two models are checked against the existing standard in conjunction with 170,000 medical record cases. If a case cannot be covered by the existing standard because of the particularity of Chinese medicine, then either an existing data element is expanded with Chinese medicine content or a new data element is created. Some dataset subsets are also created to group and record Chinese medicine special diagnoses and treatments such as acupuncture. The outcome of this research is a proposal of standardized datasets for traditional Chinese medicine medical records. The proposal has been verified successfully in three medical institutions with hundreds of thousands of medical records. The proposed standard, covering traditional Chinese medicine as well as western medicine, is expected to be approved soon by the authority. A widespread adoption of this proposal will enable traditional Chinese medicine hospitals and institutions to easily exchange information and share resources. Copyright © 2017. Published by Elsevier B.V.

  7. CIFAR10-DVS: An Event-Stream Dataset for Object Classification

    PubMed Central

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is time-consuming, since they must be recorded using neuromorphic cameras, and only limited event-stream datasets are currently available. In this work, using the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 classes, named “CIFAR10-DVS.” The conversion was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversion by moving the camera, this image movement is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification. PMID:28611582
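    The per-pixel quantization of intensity change can be illustrated with a toy event generator. The log-contrast threshold and the moving-dot stimulus are assumptions for illustration, not the recording parameters used for CIFAR10-DVS.

```python
import numpy as np

def frames_to_events(frames, threshold=0.2, eps=1e-6):
    """Simplified event generation in the spirit of a DVS pixel: emit an
    event whenever the log intensity at a pixel changes by more than a
    fixed contrast threshold since that pixel's last event.
    Returns a list of (t, x, y, polarity) tuples."""
    log_ref = np.log(frames[0] + eps)         # per-pixel reference level
    events = []
    for t in range(1, len(frames)):
        log_now = np.log(frames[t] + eps)
        diff = log_now - log_ref
        fired = np.abs(diff) >= threshold
        ys, xs = np.nonzero(fired)
        for x, y in zip(xs, ys):
            events.append((t, x, y, 1 if diff[y, x] > 0 else -1))
        log_ref[fired] = log_now[fired]       # reset only the fired pixels
    return events

# A bright dot moving one pixel per frame mimics the closed-loop image motion.
frames = np.zeros((4, 5, 5)) + 0.1
for t in range(4):
    frames[t, 2, t] = 1.0
events = frames_to_events(frames)             # ON event at the dot's new
                                              # position, OFF at its old one
```

    Each frame of motion yields one ON and one OFF event here, which is the "rich local intensity change" the RCLS movement is designed to produce at scale.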

  8. Knowledge mining from clinical datasets using rough sets and backpropagation neural network.

    PubMed

    Nahato, Kindie Biredagn; Harichandran, Khanna Nehemiah; Arputharaj, Kannan

    2015-01-01

    The availability of clinical datasets and knowledge mining methodologies encourages researchers to pursue research in extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that predicts the presence or absence of a disease by learning from a minimal set of attributes extracted from the clinical dataset. In this work, a rough set indiscernibility relation method combined with a backpropagation neural network (RS-BPNN) is used. The work has two stages. The first stage is the handling of missing values, to obtain a smooth dataset, and the selection of appropriate attributes from the clinical dataset by the indiscernibility relation method. The second stage is classification using a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with the hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.
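    The indiscernibility-based attribute selection in the first stage can be illustrated with a brute-force reduct search on a toy decision table. The table values are invented, and real rough-set implementations use discernibility matrices or heuristics rather than exhaustive search.

```python
from itertools import combinations

def is_consistent(rows, labels, attrs):
    """An attribute subset is a valid reduct candidate when no two rows
    that agree on all chosen attributes carry different class labels
    (i.e., the indiscernibility relation respects the decision)."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, label) != label:
            return False
    return True

def minimal_reduct(rows, labels, n_attrs):
    """Smallest attribute subset preserving discernibility (brute force)."""
    for k in range(1, n_attrs + 1):
        for attrs in combinations(range(n_attrs), k):
            if is_consistent(rows, labels, attrs):
                return attrs
    return tuple(range(n_attrs))

# Toy clinical table: attribute 1 alone decides the class here.
rows = [(0, 1, 0), (1, 1, 1), (0, 0, 1), (1, 0, 0)]
labels = [1, 1, 0, 0]
print(minimal_reduct(rows, labels, 3))   # -> (1,)
```

    A classifier such as the BPNN in the second stage would then train only on the attributes in the reduct.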

  9. CIFAR10-DVS: An Event-Stream Dataset for Object Classification.

    PubMed

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is time-consuming, since they must be recorded using neuromorphic cameras, and only limited event-stream datasets are currently available. In this work, using the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 classes, named "CIFAR10-DVS." The conversion was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversion by moving the camera, this image movement is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification.

  10. EnviroAtlas - Woodbine, IA - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 1 block group in Woodbine, Iowa. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  11. EnviroAtlas - Pittsburgh, PA - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 1,089 block groups in Pittsburgh, Pennsylvania. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  12. EnviroAtlas - Portland, OR - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 1176 block groups in Portland, Oregon. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  13. EnviroAtlas - Fresno, CA - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 405 block groups in Fresno, California. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  14. EnviroAtlas - New Bedford, MA - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 128 block groups in New Bedford, Massachusetts. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  15. EnviroAtlas - Tampa, FL - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 1,833 block groups in Tampa Bay, Florida. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  16. EnviroAtlas - Minneapolis/St. Paul, MN - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 1,772 block groups in Minneapolis/St. Paul, Minnesota. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  17. EnviroAtlas - Cleveland, OH - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 1,442 block groups in Cleveland, Ohio. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  18. EnviroAtlas - Milwaukee, WI - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 1,175 block groups in Milwaukee, Wisconsin. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  19. EnviroAtlas - Portland, ME - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 146 block groups in Portland, Maine. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  20. EnviroAtlas - Memphis, TN - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 703 block groups in Memphis, Tennessee. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  1. EnviroAtlas - Green Bay, WI - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 155 block groups in Green Bay, Wisconsin. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  2. EnviroAtlas - Austin, TX - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset presents environmental benefits of the urban forest in 750 block groups in Austin, Texas. Carbon attributes, temperature reduction, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  3. Innovations in user-defined analysis: dynamic grouping and customized user datasets in VistaPHw.

    PubMed

    Solet, David; Glusker, Ann; Laurent, Amy; Yu, Tianji

    2006-01-01

    Flexible, ready access to community health assessment data is a feature of innovative Web-based data query systems. An example is VistaPHw, which provides access to Washington state data and statistics used in community health assessment. Because of its flexible analysis options, VistaPHw customizes local, population-based results to be relevant to public health decision-making. The advantages of two innovations, dynamic grouping and the Custom Data Module, are described. Dynamic grouping permits the creation of user-defined aggregations of geographic areas, age groups, race categories, and years. Standard VistaPHw measures such as rates, confidence intervals, and other statistics may then be calculated for the new groups. Dynamic grouping has provided data for major, successful grant proposals, building partnerships with local governments and organizations, and informing program planning for community organizations. The Custom Data Module allows users to prepare virtually any dataset so it may be analyzed in VistaPHw. Uses for this module may include datasets too sensitive to be placed on a Web server or datasets that are not standardized across the state. Limitations and other system needs are also discussed.
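Dynamic grouping amounts to pooling event counts and populations across the user's chosen areas, age groups, or years, then recomputing the rate and confidence interval for the combined group. A minimal sketch of that recomputation in Python, with hypothetical field names; VistaPHw's actual statistical methods are not reproduced here:

```python
import math

def pooled_rate(records, per=100_000):
    """Aggregate event counts and populations across user-selected
    areas/years, then recompute the crude rate per `per` population
    and a 95% CI (Poisson-based normal approximation)."""
    events = sum(r["events"] for r in records)
    pop = sum(r["population"] for r in records)
    rate = events / pop * per
    se = math.sqrt(events) / pop * per  # SE of the crude rate
    return rate, (rate - 1.96 * se, rate + 1.96 * se)

# Two hypothetical geographic areas combined into one dynamic group
areas = [
    {"events": 12, "population": 48_000},
    {"events": 30, "population": 95_000},
]
rate, (lo, hi) = pooled_rate(areas)
```

Pooling the raw counts before computing the rate, rather than averaging area-level rates, is what makes the confidence interval for the user-defined group statistically valid.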

  4. EnviroAtlas - Austin, TX - Demographics by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset is a summary of key demographic groups for the EnviroAtlas community. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  5. EnviroAtlas - Portland, ME - Land Cover by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset describes the percentage of each block group that is classified as impervious, forest, green space, wetland, and agriculture. Impervious is a combination of dark and light impervious. Forest is a combination of trees and forest and woody wetlands. Green space is a combination of trees and forest, grass and herbaceous, agriculture, woody wetlands, and emergent wetlands. Wetlands includes both woody and emergent wetlands. This dataset also includes the area per capita for each block group for impervious, forest, and green space land cover. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  6. Primary Datasets for Case Studies of River-Water Quality

    ERIC Educational Resources Information Center

    Goulder, Raymond

    2008-01-01

    Level 6 (final-year BSc) students undertook case studies on between-site and temporal variation in river-water quality. They used professionally-collected datasets supplied by the Environment Agency. The exercise gave students the experience of working with large, real-world datasets and led to their understanding how the quality of river water is…

  7. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research

    PubMed Central

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R

    2018-01-01

    Objectives This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. Methods A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. Results A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Conclusions Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. PMID:29084729

  8. An Adaptive Prediction-Based Approach to Lossless Compression of Floating-Point Volume Data.

    PubMed

    Fout, N; Ma, Kwan-Liu

    2012-12-01

    In this work, we address the problem of lossless compression of scientific and medical floating-point volume data. We propose two prediction-based compression methods that share a common framework, which consists of a switched prediction scheme wherein the best predictor out of a preset group of linear predictors is selected. Such a scheme is able to adapt to different datasets as well as to varying statistics within the data. The first method, called APE (Adaptive Polynomial Encoder), uses a family of structured interpolating polynomials for prediction, while the second method, which we refer to as ACE (Adaptive Combined Encoder), combines predictors from previous work with the polynomial predictors to yield a more flexible, powerful encoder that is able to effectively decorrelate a wide range of data. In addition, in order to facilitate efficient visualization of compressed data, our scheme provides an option to partition floating-point values in such a way as to provide a progressive representation. We compare our two compressors to existing state-of-the-art lossless floating-point compressors for scientific data, with our data suite including both computer simulations and observational measurements. The results demonstrate that our polynomial predictor, APE, is comparable to previous approaches in terms of speed but achieves better compression rates on average. ACE, our combined predictor, while somewhat slower, is able to achieve the best compression rate on all datasets, with significantly better rates on most of the datasets.
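The switched-prediction idea can be illustrated with a toy sketch: for each block, try every predictor in a preset group, keep the one with the smallest residual cost, and store its id alongside the residuals so the decoder can invert the process. This is a simplified stand-in with two elementary predictors and no entropy coding or floating-point partitioning, not the paper's APE/ACE predictor families:

```python
# A preset group of linear predictors; each maps the history of
# already-seen values to a prediction for the next value.
PREDICTORS = [
    lambda h: h[-1],             # order-0: repeat the last value
    lambda h: 2 * h[-1] - h[-2], # order-1: linear extrapolation
]

def encode(block):
    """Select the predictor with the smallest total absolute residual
    for this block; emit (predictor id, seed values, residuals)."""
    best = None
    for pid, pred in enumerate(PREDICTORS):
        hist, res = list(block[:2]), []
        for x in block[2:]:
            res.append(x - pred(hist))
            hist.append(x)
        cost = sum(abs(r) for r in res)
        if best is None or cost < best[0]:
            best = (cost, pid, res)
    _, pid, res = best
    return pid, list(block[:2]), res

def decode(pid, seed, residuals):
    """Invert encode() by re-running the chosen predictor."""
    out = list(seed)
    for r in residuals:
        out.append(PREDICTORS[pid](out) + r)
    return out
```

Because the residuals are stored exactly, the round trip is lossless; adapting the predictor choice per block is what lets such a scheme track varying statistics within a dataset.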

  9. Working with and Visualizing Big Data Efficiently with Python for the DARPA XDATA Program

    DTIC Science & Technology

    2017-08-01

    …same function to be used with scalar inputs, input arrays of the same shape, or even input arrays of differing dimensionality in some cases. … math operations on values; split-apply-combine: similar to group-by operations in databases; join: combine two datasets using common columns. … Numba: continue to increase SIMD performance with support for fast-math flags and improved support for AVX, Intel's large vector …

  10. Web-based Reanalysis Intercomparison Tools (WRIT): Comparing Reanalyses and Observational data.

    NASA Astrophysics Data System (ADS)

    Compo, G. P.; Smith, C. A.; Hooper, D. K.

    2014-12-01

    While atmospheric reanalysis datasets are widely used in climate science, many technical issues hinder comparing them to each other and to observations. The reanalysis fields are stored in diverse file architectures, data formats, and resolutions, with metadata, such as variable name and units, that also differ. Individual users have to download the fields, convert them to a common format, store them locally, change variable names, re-grid if needed, and convert units. Comparing reanalyses with observational datasets is difficult for similar reasons. Even if a dataset can be read via Open-source Project for a Network Data Access Protocol (OPeNDAP) or a similar protocol, most of this work is still needed. All of these tasks take time, effort, and money. To overcome some of the obstacles in reanalysis intercomparison, our group at the Cooperative Institute for Research in the Environmental Sciences (CIRES) at the University of Colorado and affiliated colleagues at National Oceanic and Atmospheric Administration's (NOAA's) Earth System Research Laboratory Physical Sciences Division (ESRL/PSD) have created a set of Web-based Reanalysis Intercomparison Tools (WRIT) at http://www.esrl.noaa.gov/psd/data/writ/. WRIT allows users to easily plot and compare reanalysis and observational datasets, and to test hypotheses. Currently, there are tools to plot monthly mean maps and vertical cross-sections, timeseries, and trajectories for standard pressure level and surface variables. Users can refine dates, statistics, and plotting options. Reanalysis datasets currently available include the NCEP/NCAR R1, NCEP/DOE R2, MERRA, ERA-Interim, NCEP CFSR and the 20CR. Observational datasets include those containing precipitation (e.g. GPCP), temperature (e.g. GHCNCAMS), winds (e.g. WASWinds), precipitable water (e.g. NASA NVAP), SLP (HadSLP2), and SST (NOAA ERSST). WRIT also facilitates the mission of the Reanalyses.org website as a convenient toolkit for studying the reanalysis datasets.

  11. The Transiting Exoplanet Community Early Release Science Program for JWST

    NASA Astrophysics Data System (ADS)

    Batalha, Natalie Marie; Bean, Jacob; Stevenson, Kevin; Sing, David; Crossfield, Ian; Knutson, Heather; Line, Michael; Kreidberg, Laura; Desert, Jean-Michel; Wakeford, Hannah R.; Crouzet, Nicolas; Moses, Julianne; Benneke, Björn; Kempton, Eliza; Berta-Thompson, Zach; Lopez-Morales, Mercedes; Parmentier, Vivien; Gibson, Neale; Schlawin, Everett; Fraine, Jonathan; Kendrew, Sarah; Transiting Exoplanet ERS Team

    2018-01-01

    A community working group was formed in October 2016 to consider early release science with the James Webb Space Telescope that broadly benefits the transiting exoplanet community. Over 100 exoplanet scientists worked collaboratively to identify targets that are observable at the initiation of science operations, yield high SNR with a single event, have substantial scientific merit, and have known spectroscopic features identified by prior observations. The working group developed a program that yields representative datasets for primary transit, secondary eclipse, and phase curve observations using the most promising instrument modes for high-precision spectroscopic timeseries (NIRISS-SOSS, NIRCam, NIRSPec, and MIRI-LRS). The centerpiece of the program is an open data challenge that promotes community engagement and leads to a deeper understanding of the JWST instruments as early as possible in the mission. The program is managed under the premise of open science in order to maximize the value of the early release science observations for the transiting exoplanet community.

  12. Developing a minimum dataset for nursing team leader handover in the intensive care unit: A focus group study.

    PubMed

    Spooner, Amy J; Aitken, Leanne M; Corley, Amanda; Chaboyer, Wendy

    2018-01-01

    Despite increasing demand for structured processes to guide clinical handover, nursing handover tools are limited in the intensive care unit. The study aim was to identify key items to include in a minimum dataset for intensive care nursing team leader shift-to-shift handover. This focus group study was conducted in a 21-bed medical/surgical intensive care unit in Australia. Senior registered nurses involved in team leader handovers were recruited. Focus groups were conducted using a nominal group technique to generate and prioritise minimum dataset items. Nurses were presented with content from previous team leader handovers and asked to select which content items to include in a minimum dataset. Participant responses were summarised as frequencies and percentages. Seventeen senior nurses participated in three focus groups. Participants agreed that ISBAR (Identify-Situation-Background-Assessment-Recommendations) was a useful tool to guide clinical handover. Items recommended to be included in the minimum dataset (≥65% agreement) included Identify (name, age, days in intensive care), Situation (diagnosis, surgical procedure), Background (significant event(s), management of significant event(s)) and Recommendations (patient plan for next shift, tasks to follow up for next shift). Overall, 30 of the 67 (45%) items in the Assessment category were considered important to include in the minimum dataset and focused on relevant observations and treatment within each body system. Other non-ISBAR items considered important to include related to the ICU (admissions to ICU, staffing/skill mix, theatre cases) and patients (infectious status, site of infection, end of life plan). Items were further categorised into those to include in all handovers and those to discuss only when relevant to the patient. The findings suggest a minimum dataset for intensive care nursing team leader shift-to-shift handover should contain items within ISBAR along with unit and patient specific information to maintain continuity of care and patient safety across shift changes. Copyright © 2017 Australian College of Critical Care Nurses Ltd. All rights reserved.
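A minimum dataset of this kind maps naturally onto a structured record keyed by ISBAR category plus the unit- and patient-specific items. The sketch below is illustrative only: the item names are paraphrased from the abstract, not the study's actual instrument, and the completeness check is a hypothetical use of such a template:

```python
# Hypothetical handover template following the ISBAR-based minimum
# dataset described above (item names paraphrased from the abstract).
HANDOVER_TEMPLATE = {
    "Identify":         ["name", "age", "days_in_icu"],
    "Situation":        ["diagnosis", "surgical_procedure"],
    "Background":       ["significant_events", "event_management"],
    "Assessment":       ["observations_by_body_system"],
    "Recommendations":  ["plan_next_shift", "tasks_to_follow_up"],
    "Unit":             ["admissions", "staffing_skill_mix", "theatre_cases"],
    "Patient-specific": ["infectious_status", "infection_site", "end_of_life_plan"],
}

def missing_items(record):
    """Return template items absent from a populated handover record,
    as 'Category.item' strings, to flag incomplete handovers."""
    return [f"{cat}.{item}"
            for cat, items in HANDOVER_TEMPLATE.items()
            for item in items
            if item not in record.get(cat, {})]
```

In practice the study distinguishes items for every handover from items discussed only when relevant, so a real implementation would carry that flag per item rather than treating all items as mandatory.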

  13. A virtual dosimetry audit - Towards transferability of gamma index analysis between clinical trial QA groups.

    PubMed

    Hussein, Mohammad; Clementel, Enrico; Eaton, David J; Greer, Peter B; Haworth, Annette; Ishikura, Satoshi; Kry, Stephen F; Lehmann, Joerg; Lye, Jessica; Monti, Angelo F; Nakamura, Mitsuhiro; Hurkmans, Coen; Clark, Catharine H

    2017-12-01

    Quality assurance (QA) for clinical trials is important. Lack of compliance can affect trial outcome. Clinical trial QA groups have different methods of dose distribution verification and analysis, all with the ultimate aim of ensuring trial compliance. The aim of this study was to gain a better understanding of different processes to inform future dosimetry audit reciprocity. Six clinical trial QA groups participated. Intensity modulated treatment plans were generated for three different cases. A range of 17 virtual 'measurements' were generated by introducing a variety of simulated perturbations (such as MLC position deviations, dose differences, gantry rotation errors, Gaussian noise) to three different treatment plan cases. Participants were blinded to the 'measured' data details. Each group analysed the datasets using their own gamma index (γ) technique and using standardised parameters for passing criteria, lower dose threshold, γ normalisation and global γ. For the same virtual 'measured' datasets, different results were observed using local techniques. For the standardised γ, differences in the percentage of points passing with γ < 1 were also found, however these differences were less pronounced than for each clinical trial QA group's analysis. These variations may be due to different software implementations of γ. This virtual dosimetry audit has been an informative step in understanding differences in the verification of measured dose distributions between different clinical trial QA groups. This work lays the foundations for audit reciprocity between groups, particularly with more clinical trials being open to international recruitment. Copyright © 2017 Elsevier B.V. All rights reserved.
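The gamma index itself combines a dose-difference criterion and a distance-to-agreement criterion into one score per point, with gamma below 1 counted as a pass. A minimal 1D global-gamma sketch follows; it uses a brute-force search over reference points, whereas clinical implementations interpolate and work in 2D/3D, and, as the study shows, parameter conventions and software implementations vary between QA groups:

```python
def gamma_1d(ref, meas, xs, dose_tol=0.03, dist_tol=3.0, threshold=0.10):
    """Global 1D gamma analysis on a shared grid `xs` (mm).
    dose_tol: dose criterion as a fraction of the reference maximum
    (global normalisation); dist_tol: distance criterion in mm;
    threshold: lower dose threshold as a fraction of the maximum."""
    dmax = max(ref)
    dd = dose_tol * dmax  # absolute dose criterion (global gamma)
    gammas = []
    for i, (x_m, d_m) in enumerate(zip(xs, meas)):
        if ref[i] < threshold * dmax:
            continue  # skip points below the lower dose threshold
        # Minimise the combined dose/distance metric over all
        # reference points (brute force; real tools interpolate).
        g2 = min(
            ((x_m - x_r) / dist_tol) ** 2 + ((d_m - d_r) / dd) ** 2
            for x_r, d_r in zip(xs, ref)
        )
        gammas.append(g2 ** 0.5)
    passing = sum(g < 1 for g in gammas) / len(gammas)
    return gammas, passing
```

Even with the parameters standardised as in this audit, choices such as interpolation, search radius, and threshold handling differ between implementations, which is one plausible source of the residual differences the study reports.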

  14. Elsevier’s approach to the bioCADDIE 2016 Dataset Retrieval Challenge

    PubMed Central

    Scerri, Antony; Kuriakose, John; Deshmane, Amit Ajit; Stanger, Mark; Moore, Rebekah; Naik, Raj; de Waard, Anita

    2017-01-01

    Abstract We developed a two-stream, Apache Solr-based information retrieval system in response to the bioCADDIE 2016 Dataset Retrieval Challenge. One stream was based on the principle of word embeddings, the other was rooted in ontology based indexing. Despite encountering several issues in the data, the evaluation procedure and the technologies used, the system performed quite well. We provide some pointers towards future work: in particular, we suggest that more work in query expansion could benefit future biomedical search engines. Database URL: https://data.mendeley.com/datasets/zd9dxpyybg/1 PMID:29220454

  15. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.

    PubMed

    Pujar, Shashikant; O'Leary, Nuala A; Farrell, Catherine M; Loveland, Jane E; Mudge, Jonathan M; Wallin, Craig; Girón, Carlos G; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; Martin, Fergal J; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Suner, Marie-Marthe; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bruford, Elspeth A; Bult, Carol J; Frankish, Adam; Murphy, Terence; Pruitt, Kim D

    2018-01-04

    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.

  16. OperomeDB: A Database of Condition-Specific Transcription Units in Prokaryotic Genomes.

    PubMed

    Chetal, Kashish; Janga, Sarath Chandra

    2015-01-01

    Background. In prokaryotic organisms, a substantial fraction of adjacent genes are organized into operons: codirectionally organized genes sharing a common promoter and terminator. Although several available operon databases provide information with varying levels of reliability, very few resources provide experimentally supported results. Therefore, we believe that the biological community could benefit from having a new operon prediction database with operons predicted using next-generation RNA-seq datasets. Description. We present operomeDB, a database which provides an ensemble of all the predicted operons for bacterial genomes using available RNA-sequencing datasets across a wide range of experimental conditions. Although several studies have recently confirmed that prokaryotic operon structure is dynamic with significant alterations across environmental and experimental conditions, there are no comprehensive databases for studying such variations across prokaryotic transcriptomes. Currently, our database contains nine bacterial organisms and 168 transcriptomes for which we predicted operons. The user interface is simple and easy to use, in terms of visualization, downloading, and querying of data. In addition, because of its ability to load custom datasets, users can also compare their datasets with publicly available transcriptomic data of an organism. Conclusion. OperomeDB as a database should not only aid experimental groups working on transcriptome analysis of specific organisms but also enable studies related to computational and comparative operomics.

  17. Identification of characteristic oligonucleotides in the bacterial 16S ribosomal RNA sequence dataset

    NASA Technical Reports Server (NTRS)

    Zhang, Zhengdong; Willson, Richard C.; Fox, George E.

    2002-01-01

    MOTIVATION: The phylogenetic structure of the bacterial world has been intensively studied by comparing sequences of 16S ribosomal RNA (16S rRNA). This database of sequences is now widely used to design probes for the detection of specific bacteria or groups of bacteria one at a time. The success of such methods reflects the fact that there are local sequence segments that are highly characteristic of particular organisms or groups of organisms. It is not clear, however, the extent to which such signature sequences exist in the 16S rRNA dataset. A better understanding of the numbers and distribution of highly informative oligonucleotide sequences may facilitate the design of hybridization arrays that can characterize the phylogenetic position of an unknown organism or serve as the basis for the development of novel approaches for use in bacterial identification. RESULTS: A computer-based algorithm that characterizes the extent to which any individual oligonucleotide sequence in 16S rRNA is characteristic of any particular bacterial grouping was developed. A measure of signature quality, Q(s), was formulated and subsequently calculated for every individual oligonucleotide sequence in the size range of 5-11 nucleotides and for 15mers with reference to each cluster and subcluster in a 929-organism representative phylogenetic tree. Subsequently, the perfect signature sequences were compared to the full set of 7322 sequences to see how common false positives were. The work completed here establishes beyond any doubt that highly characteristic oligonucleotides exist in the bacterial 16S rRNA sequence dataset in large numbers. Over 16,000 15mers were identified that might be useful as signatures. Signature oligonucleotides are available for over 80% of the nodes in the representative tree.
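The core notion of a signature can be sketched as a presence/absence contrast: a k-mer scores highly when it occurs in every member of a target group and nowhere outside it. The score below is a deliberately crude stand-in for the paper's Q(s) measure, whose exact formulation is not reproduced here:

```python
def kmers(seq, k):
    """All k-mers (substrings of length k) present in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def signature_score(kmer, group_seqs, other_seqs):
    """Fraction of in-group sequences containing the k-mer minus the
    fraction of out-group sequences containing it. A score of 1.0
    means a perfect signature: present in every group member and
    absent from every non-member (crude stand-in for Q(s))."""
    k = len(kmer)
    hit_in = sum(kmer in kmers(s, k) for s in group_seqs) / len(group_seqs)
    hit_out = sum(kmer in kmers(s, k) for s in other_seqs) / len(other_seqs)
    return hit_in - hit_out
```

Checking candidate signatures against the full sequence set, as the paper does with its 7322 sequences, corresponds to recomputing hit_out over a much larger out-group to catch false positives.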

  18. Genome-wide analyses of the bHLH superfamily in crustaceans: reappraisal of higher-order groupings and evidence for lineage-specific duplications

    PubMed Central

    2018-01-01

    The basic helix-loop-helix (bHLH) proteins represent a key group of transcription factors implicated in numerous eukaryotic developmental and signal transduction processes. Characterization of bHLHs from model species such as humans, fruit flies, nematodes and plants have yielded important information on their functions and evolutionary origin. However, relatively little is known about bHLHs in non-model organisms despite the availability of a vast number of high-throughput sequencing datasets, enabling previously intractable genome-wide and cross-species analyses to be now performed. We extensively searched for bHLHs in 126 crustacean species represented across major Crustacea taxa and identified 3777 putative bHLH orthologues. We have also included seven whole-genome datasets representative of major arthropod lineages to obtain a more accurate prediction of the full bHLH gene complement. With focus on important food crop species from Decapoda, we further defined higher-order groupings and have successfully recapitulated previous observations in other animals. Importantly, we also observed evidence for lineage-specific bHLH expansions in two basal crustaceans (branchiopod and copepod), suggesting a mode of evolution through gene duplication as an adaptation to changing environments. In-depth analysis on bHLH-PAS members confirms the phenomenon coined as ‘modular evolution’ (independently evolved domains) typically seen in multidomain proteins. With the amphipod Parhyale hawaiensis as the exception, our analyses have focused on crustacean transcriptome datasets. Hence, there is a clear requirement for future analyses on whole-genome sequences to overcome potential limitations associated with transcriptome mining. Nonetheless, the present work will serve as a key resource for future mechanistic and biochemical studies on bHLHs in economically important crustacean food crop species. PMID:29657824

  19. Trends of pesticides and nitrate in ground water of the Central Columbia Plateau, Washington, 1993-2003

    USGS Publications Warehouse

    Frans, L.

    2008-01-01

    Pesticide and nitrate data for ground water sampled in the Central Columbia Plateau, Washington, between 1993 and 2003 by the U.S. Geological Survey National Water-Quality Assessment Program were evaluated for trends in concentration. A total of 72 wells were sampled in 1993-1995 and again in 2002-2003 in three well networks that targeted row crop and orchard land use settings as well as the regional basalt aquifer. The Regional Kendall trend test indicated that only deethylatrazine (DEA) concentrations showed a significant trend. Deethylatrazine concentrations were found to increase beneath the row crop land use well network, the regional aquifer well network, and for the dataset as a whole. No other pesticides showed a significant trend (nor did nitrate) in the 72-well dataset. Despite the lack of a trend in nitrate concentrations within the National Water-Quality Assessment dataset, previous work has found a statistically significant decrease in nitrate concentrations from 1998-2002 for wells with nitrate concentrations above 10 mg L-1 within the Columbia Basin ground water management area, which is located within the National Water-Quality Assessment study unit boundary. The increasing trend in DEA concentrations was found to negatively correlate with soil hydrologic group using logistic regression and with soil hydrologic group and drainage class using Spearman's correlation. The decreasing trend in high nitrate concentrations was found to positively correlate with the depth to which the well was cased using logistic regression, to positively correlate with nitrate application rates and sand content of the soil, and to negatively correlate with soil hydrologic group using Spearman's correlation. Copyright © 2008 by the American Society of Agronomy, Crop Science Society of America, and Soil Science Society of America. All rights reserved.
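The trend testing rests on Kendall-type statistics: the classic Mann-Kendall S counts upward minus downward time-ordered pairs in a series, and the Regional Kendall variant used in the study aggregates that statistic (and its variance) across wells. A minimal sketch of the per-site statistic, with significance testing and the regional aggregation omitted:

```python
def mann_kendall_s(series):
    """Mann-Kendall S statistic for a time-ordered series: the count
    of increasing pairs minus decreasing pairs. S > 0 suggests an
    upward trend, S < 0 a downward trend; ties contribute zero."""
    s = 0
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the pair difference
    return s
```

Because the statistic depends only on the signs of pairwise differences, it is robust to the skewed, non-normal concentration distributions typical of water-quality data, which is why Kendall-family tests are standard in this setting.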

  20. Pairwise gene GO-based measures for biclustering of high-dimensional expression data.

    PubMed

    Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S

    2018-01-01

    Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of groups of functionally coherent genes. On the other hand, a distance among genes can be defined according to their information stored in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function integrating GO information is studied in this paper. This merit function uses a term that addresses the information through a GO measure. The effect of two possible different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast datasets. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.
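The integration of GO information into the search can be pictured as a merit function with two terms: one measuring expression homogeneity of the bicluster and one rewarding functional coherence among its genes. The sketch below is hypothetical, since the paper's actual merit function and GO measures are not reproduced here; go_sim stands in for any pairwise gene GO semantic similarity measure:

```python
def merit(bicluster_genes, expression_residue, go_sim, w=0.5):
    """Hypothetical merit function in the spirit described above:
    an expression homogeneity term (e.g. a mean squared residue,
    lower is better) minus a weighted mean pairwise GO similarity
    term rewarding functional coherence. Lower merit is better."""
    pairs = [(g1, g2)
             for i, g1 in enumerate(bicluster_genes)
             for g2 in bicluster_genes[i + 1:]]
    coherence = sum(go_sim(a, b) for a, b in pairs) / len(pairs)
    return expression_residue - w * coherence
```

In a scatter search, such a merit function scores each candidate bicluster during combination and improvement steps, so the weight w controls how strongly GO coherence steers the search relative to expression fit.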

  1. A Secure Architecture to Provide a Medical Emergency Dataset for Patients in Germany and Abroad.

    PubMed

    Storck, Michael; Wohlmann, Jan; Krudwig, Sarah; Vogel, Alexander; Born, Judith; Weber, Thomas; Dugas, Martin; Juhra, Christian

    2017-01-01

    The ongoing fragmentation of medical care and the mobility of patients severely restrict the exchange of lifesaving information about a patient's medical history in emergencies. Therefore, the objective of this work is to offer a secure technical solution that supplies medical professionals with emergency-relevant information about the current patient via mobile access. To achieve this goal, the official national emergency dataset was extended by additional features to form a patient summary for emergencies, a software architecture was developed, and data security and data protection issues were taken into account. The patient has sovereignty over his/her data and can therefore decide who may access or change the stored data, but the treating physician composes the validated dataset. Building upon the introduced concept, future activities are the development of user interfaces for the software components of the different user groups as well as functioning prototypes for upcoming field tests.

  2. Progeny Clustering: A Method to Identify Biological Phenotypes

    PubMed Central

    Hu, Chenyue W.; Kornblau, Steven M.; Slater, John H.; Qutub, Amina A.

    2015-01-01

    Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and computationally efficient, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method proved successful and robust when applied to two synthetic datasets (a two-dimensional dataset and a ten-dimensional dataset containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and the Rat CNS dataset) and two further biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods on the ten-dimensional synthetic dataset as well as on the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
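    The co-occurrence probability matrix at the heart of such stability assessments can be sketched as follows; the Progeny Sampling resampling step is omitted, and the example label runs are hypothetical, so this shows only the generic co-occurrence idea, not the full algorithm:

```python
import numpy as np

def cooccurrence(label_runs):
    """Fraction of clustering runs in which each pair of items shares a cluster."""
    runs = np.asarray(label_runs)          # shape: (n_runs, n_items)
    n_runs, n_items = runs.shape
    co = np.zeros((n_items, n_items))
    for labels in runs:
        co += (labels[:, None] == labels[None, :])   # pairwise same-cluster mask
    return co / n_runs

# Items 0 and 1 always cluster together; item 2 joins them in only one of two runs.
P = cooccurrence([[0, 0, 1], [0, 0, 0]])
```

A stable clustering yields a co-occurrence matrix close to 0/1 blocks; intermediate values signal unstable pairings, which stability-based methods use to score candidate numbers of clusters.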

  3. PRISM Climate Group, Oregon State U

    Science.gov Websites

    FAQ PRISM Climate Data The PRISM Climate Group gathers climate observations from a wide range of monitoring networks, applies sophisticated quality control measures, and develops spatial climate datasets to reveal short- and long-term climate patterns. The resulting datasets incorporate a variety of modeling

  4. EnviroAtlas - Phoenix, AZ - Ecosystem Services by Block Group

    EPA Pesticide Factsheets

    This dataset presents environmental benefits of the urban forest in 2,434 block groups in Phoenix, Arizona. Carbon attributes, pollution removal and value, and runoff effects are calculated for each block group using i-Tree models (www.itreetools.org), local weather data, pollution data, EPA provided city boundary and land cover data, and U.S. Census derived block group boundary data. Temperature reduction values for Phoenix will be added when they become available. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  5. Establishing a process for conducting cross-jurisdictional record linkage in Australia.

    PubMed

    Moore, Hannah C; Guiver, Tenniel; Woollacott, Anthony; de Klerk, Nicholas; Gidding, Heather F

    2016-04-01

    To describe the realities of conducting a cross-jurisdictional data linkage project involving state and Australian Government-based data collections to inform future national data linkage programs of work. We outline the processes involved in conducting a Proof of Concept data linkage project including the implementation of national data integration principles, data custodian and ethical approval requirements, and establishment of data flows. The approval process involved nine approval and regulatory bodies and took more than two years. Data will be linked across 12 datasets involving three data linkage centres. A framework was established to allow data to flow between these centres while maintaining the separation principle that serves to protect the privacy of the individual. This will be the first project to link child immunisation records from an Australian Government dataset to other administrative health datasets for a population cohort covering 2 million births in two Australian states. Although the project experienced some delays, positive outcomes were realised, primarily the development of strong collaborations across key stakeholder groups including community engagement. We have identified several recommendations and enhancements to this now established framework to further streamline the process for data linkage studies involving Australian Government data. © 2015 Public Health Association of Australia.

  6. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research.

    PubMed

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R; Beresford, Michael W

    2018-02-01

    This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  7. Study on homogenization of synthetic GNSS-retrieved IWV time series and its impact on trend estimates with autoregressive noise

    NASA Astrophysics Data System (ADS)

    Klos, Anna; Pottiaux, Eric; Van Malderen, Roeland; Bock, Olivier; Bogusz, Janusz

    2017-04-01

    A synthetic benchmark dataset of Integrated Water Vapour (IWV) was created within the "Data homogenisation" activity of sub-working group WG3 of the COST ES1206 Action. The benchmark dataset was based on the analysis of IWV differences retrieved at Global Positioning System (GPS) International GNSS Service (IGS) stations using European Centre for Medium-Range Weather Forecasts (ECMWF) reanalysis data (ERA-Interim). Having analysed a set of 120 series of IWV differences (ERAI-GPS) derived for IGS stations, we characterised the number of gaps and breaks for each station. Moreover, we estimated trends, significant seasonalities and the character of the residuals once the deterministic model was removed. We tested five different noise models and found that a combination of white noise and a first-order autoregressive process describes the stochastic part with good accuracy. Based on this analysis, we performed Monte Carlo simulations of 25-year-long data with two different noise types: white noise alone, and white noise combined with a first-order autoregressive process. We also added a few strictly defined offsets, creating three variants of the synthetic dataset: easy, less complicated and fully complicated. The 'Easy' dataset included seasonal signals (annual, semi-annual, 3- and 4-monthly where present for a particular station), offsets and white noise. The 'Less-complicated' dataset additionally included the combination of white and first-order autoregressive noise (AR(1)+WH). The 'Fully-complicated' dataset further included a trend and gaps. In this research, we show the impact of manual homogenisation on trend estimates and their errors. We also cross-compare the results for the three datasets, as the synthesized noise type may significantly influence manual homogenisation, and hence the estimated trends and their uncertainties when handled inappropriately. In the future, the synthetic dataset we present will be used as a benchmark to test various statistical tools on the homogenisation task.
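    The structure of such a synthetic series can be sketched as below. All parameter values (seasonal amplitude, offset size, AR(1) coefficient, noise variances) are illustrative assumptions, not the benchmark's actual station-derived values:

```python
import numpy as np

rng = np.random.default_rng(42)

def synthetic_iwv(n_days=9131, phi=0.6, sigma_ar=0.3, sigma_wh=0.5):
    """~25 years of daily values: annual cycle + one offset + AR(1) + white noise."""
    t = np.arange(n_days)
    seasonal = 1.5 * np.sin(2 * np.pi * t / 365.25)     # annual signal only, for brevity
    offset = np.where(t > n_days // 2, 0.8, 0.0)        # one strictly defined break
    ar = np.zeros(n_days)
    for i in range(1, n_days):                          # first-order autoregressive part
        ar[i] = phi * ar[i - 1] + rng.normal(0.0, sigma_ar)
    white = rng.normal(0.0, sigma_wh, n_days)
    return seasonal + offset + ar + white

series = synthetic_iwv()
```

A homogenisation tool run on `series` should recover the break near the midpoint; comparing trend estimates before and after correcting the offset illustrates the impact the study quantifies.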

  8. Atlas Toolkit: Fast registration of 3D morphological datasets in the absence of landmarks

    PubMed Central

    Grocott, Timothy; Thomas, Paul; Münsterberg, Andrea E.

    2016-01-01

    Image registration is a gateway technology for Developmental Systems Biology, enabling computational analysis of related datasets within a shared coordinate system. Many registration tools rely on landmarks to ensure that datasets are correctly aligned; yet suitable landmarks are not present in many datasets. Atlas Toolkit is a Fiji/ImageJ plugin collection offering elastic group-wise registration of 3D morphological datasets, guided by segmentation of the interesting morphology. We demonstrate the method by combinatorial mapping of cell signalling events in the developing eyes of chick embryos, and use the integrated datasets to predictively enumerate Gene Regulatory Network states. PMID:26864723

  9. Atlas Toolkit: Fast registration of 3D morphological datasets in the absence of landmarks.

    PubMed

    Grocott, Timothy; Thomas, Paul; Münsterberg, Andrea E

    2016-02-11

    Image registration is a gateway technology for Developmental Systems Biology, enabling computational analysis of related datasets within a shared coordinate system. Many registration tools rely on landmarks to ensure that datasets are correctly aligned; yet suitable landmarks are not present in many datasets. Atlas Toolkit is a Fiji/ImageJ plugin collection offering elastic group-wise registration of 3D morphological datasets, guided by segmentation of the interesting morphology. We demonstrate the method by combinatorial mapping of cell signalling events in the developing eyes of chick embryos, and use the integrated datasets to predictively enumerate Gene Regulatory Network states.

  10. Full-motion video analysis for improved gender classification

    NASA Astrophysics Data System (ADS)

    Flora, Jeffrey B.; Lochtefeld, Darrell F.; Iftekharuddin, Khan M.

    2014-06-01

    The ability of computer systems to perform gender classification using the dynamic motion of a human subject has important applications in medicine, human factors, and human-computer interface systems. Previous works in motion analysis have used data from sensors (including gyroscopes, accelerometers, and force plates), radar signatures, and video. However, full-motion video and motion-capture range data provide a dataset of higher temporal and spatial resolution for the analysis of dynamic motion. Works using motion capture data have been limited by small datasets collected in a controlled environment. In this paper, we apply machine learning techniques to a new dataset with a larger number of subjects. Additionally, these subjects move unrestricted through a capture volume, representing a more realistic, less controlled environment. We conclude that existing linear classification methods are insufficient for gender classification on this larger dataset captured in a relatively uncontrolled environment. A method based on a nonlinear support vector machine classifier is proposed to obtain gender classification for the larger dataset. In experimental testing with a dataset consisting of 98 trials (49 subjects, 2 trials per subject), classification rates using leave-one-out cross-validation improve from 73% using linear discriminant analysis to 88% using the nonlinear support vector machine classifier.
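    As a rough illustration of why a nonlinear kernel classifier succeeds where a linear one cannot, the sketch below runs leave-one-out cross-validation with a simple RBF-kernel nearest-centroid classifier on XOR-style data, which is not linearly separable. The classifier is a dependency-free stand-in for the paper's support vector machine, and the data are synthetic, not the paper's motion dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, gamma=1.0):
    """RBF kernel matrix between two sets of row vectors."""
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def kernel_nearest_centroid(X_train, y_train, x, gamma=1.0):
    """Assign x to the class whose centroid in RBF feature space is nearest."""
    best, best_d = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        # ||phi(x) - centroid_c||^2 = k(x,x) - 2*mean k(x,xi) + mean k(xi,xj)
        d = 1.0 - 2 * rbf(x[None, :], Xc, gamma).mean() + rbf(Xc, Xc, gamma).mean()
        if d < best_d:
            best, best_d = c, d
    return best

# XOR-style data: four noisy corner clusters, not linearly separable.
corners = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], float)
X = np.vstack([c + rng.normal(0, 0.08, (10, 2)) for c in corners])
y = np.repeat([0, 0, 1, 1], 10)

# Leave-one-out cross-validation: train on all but one point, test on it.
loo = np.array([kernel_nearest_centroid(np.delete(X, i, 0), np.delete(y, i), X[i])
                for i in range(len(X))])
accuracy = (loo == y).mean()
```

No linear boundary separates these two classes, yet the kernel classifier recovers them, mirroring the paper's linear-to-nonlinear improvement in spirit.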

  11. Educational and Scientific Applications of Climate Model Diagnostic Analyzer

    NASA Astrophysics Data System (ADS)

    Lee, S.; Pan, L.; Zhai, C.; Tang, B.; Kubar, T. L.; Zhang, J.; Bao, Q.

    2016-12-01

    Climate Model Diagnostic Analyzer (CMDA) is a web-based information system designed for the climate modeling and model analysis community to analyze climate data from models and observations. CMDA provides tools to diagnostically analyze climate data for model validation and improvement, and to systematically manage analysis provenance for sharing results with other investigators. CMDA utilizes cloud computing resources, multi-threaded computing, machine-learning algorithms, web service technologies, and provenance-supporting technologies to address technical challenges that the Earth science modeling and model analysis community faces in evaluating and diagnosing climate models. As CMDA's infrastructure and technology have matured, we have developed educational and scientific applications of CMDA. Educationally, CMDA has supported the summer school of the JPL Center for Climate Sciences for three years, beginning in 2014. In the summer school, the students work on group research projects for which CMDA provides datasets and analysis tools. Each student is assigned a virtual machine with CMDA installed in Amazon Web Services. A provenance management system for CMDA was developed to keep track of students' usage of CMDA and to recommend datasets and analysis tools for their research topics. The provenance system also allows students to revisit their analysis results and share them with their group. Scientifically, we have developed several science use cases of CMDA covering various topics, datasets, and analysis types. Each use case is described and listed in terms of a scientific goal, the datasets used, the analysis tools used, the scientific results discovered, an analysis result such as output plots and data files, and a link to the exact analysis service call with all input arguments filled in. For example, one science use case is the evaluation of the NCAR CAM5 model against MODIS total cloud fraction. The analysis service used is the Difference Plot Service of Two Variables, and the datasets used are NCAR CAM total cloud fraction and MODIS total cloud fraction. The scientific highlight of the use case is that the CAM5 model overall does a fairly decent job of simulating total cloud cover, though it simulates too few clouds, especially near and offshore of the eastern ocean basins where low clouds are dominant.

  12. Topic modeling for cluster analysis of large biological and medical datasets

    PubMed Central

    2014-01-01

    Background The big data moniker is nowhere better deserved than in describing the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and high-dimensional datasets. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering, or for overcoming clustering difficulties, in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy and effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more accurately represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. PMID:25350106

  13. Topic modeling for cluster analysis of large biological and medical datasets.

    PubMed

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than in describing the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and high-dimensional datasets. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering, or for overcoming clustering difficulties, in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy and effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more accurately represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
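    The first of the three methods, highest probable topic assignment, reduces to an argmax over the document-topic distribution produced by any fitted topic model. The matrix `theta` below is hypothetical; in practice it would come from fitting a model such as LDA to the dataset:

```python
import numpy as np

def highest_probable_topic(doc_topic):
    """Cluster label = the topic with the highest probability for each sample."""
    return np.asarray(doc_topic).argmax(axis=1)

# Hypothetical document-topic matrix from a fitted topic model (rows sum to 1).
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
clusters = highest_probable_topic(theta)   # one cluster label per document
```

The other two methods reuse the same matrix differently: feature extraction feeds `theta` to a conventional clustering algorithm as a low-dimensional representation, and feature selection uses the topics to pick informative input features.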

  14. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

    PubMed Central

    Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila

    2018-01-01

    The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, in different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching for datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users' free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374
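    Normalized Discounted Cumulative Gain, the challenge's evaluation metric, can be computed as follows (standard log2 discount; the optional `k` cutoff gives NDCG@k). The "inferred" variant used by the challenge additionally corrects for incomplete relevance judgments, which this plain version omits:

```python
import math

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain of a ranked list of relevance grades."""
    rel = relevances[:k] if k else relevances
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    ideal = sorted(relevances, reverse=True)[:len(rel)]   # best possible ordering
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

A perfectly ordered ranking scores 1.0; placing highly relevant datasets lower in the list reduces the score, which is why the metric rewards retrieval pipelines that rank the best matches first.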

  15. A Dataset of Three Educational Technology Experiments on Differentiation, Formative Testing and Feedback

    ERIC Educational Resources Information Center

    Haelermans, Carla; Ghysels, Joris; Prince, Fernao

    2015-01-01

    This paper describes a dataset with data from three individually randomized educational technology experiments on differentiation, formative testing and feedback during one school year for a group of 8th grade students in the Netherlands, using administrative data and the online motivation questionnaire of Boekaerts. The dataset consists of pre-…

  16. An Agro-Climatological Early Warning Tool Based on the Google Earth Engine to Support Regional Food Security Analysis

    NASA Astrophysics Data System (ADS)

    Landsfeld, M. F.; Daudert, B.; Friedrichs, M.; Morton, C.; Hegewisch, K.; Husak, G. J.; Funk, C. C.; Peterson, P.; Huntington, J. L.; Abatzoglou, J. T.; Verdin, J. P.; Williams, E. L.

    2015-12-01

    The Famine Early Warning Systems Network (FEWS NET) focuses on food insecurity in developing nations and provides objective, evidence-based analysis to help government decision-makers and relief agencies plan for and respond to humanitarian emergencies. The Google Earth Engine (GEE) is a platform provided by Google Inc. to support scientific research and analysis of environmental data in its cloud environment. The intent is to allow scientists and independent researchers to mine massive collections of environmental data and leverage Google's vast computational resources to detect changes and monitor the Earth's surface and climate. GEE hosts an enormous amount of satellite imagery and climate archives, one of which is the Climate Hazards Group InfraRed Precipitation with Stations dataset (CHIRPS). The CHIRPS dataset is land-based, quasi-global (latitude 50N-50S), of 0.05 degree resolution, and has a relatively long period of record (1981-present). CHIRPS is fed continuously into the GEE as new data fields are generated each month. This precipitation dataset is a key input for FEWS NET monitoring and forecasting efforts. FEWS NET intends to leverage the GEE to provide analysts and scientists with flexible, interactive tools to aid their monitoring and research efforts. These scientists often work in bandwidth-limited regions, so lightweight Internet tools and services that bypass the need to download massive datasets for analysis are preferred for their work. The GEE provides just this type of service. We present a tool designed specifically for FEWS NET scientists to be used interactively for investigating and monitoring agro-climatological issues. We are able to utilize the enormous GEE computing power to generate on-the-fly statistics that calculate precipitation anomalies, z-scores, percentiles and band ratios, and to allow the user to interactively select custom areas for statistical time-series comparisons and predictions.
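    A minimal local sketch of the kind of statistic described, per-calendar-month precipitation z-scores against a baseline climatology, is shown below; GEE computes these server-side over user-selected regions, while here the gamma-distributed rainfall and the 30-year baseline are illustrative assumptions:

```python
import numpy as np

def monthly_zscores(precip, baseline_years):
    """Z-score of each monthly value against the same calendar month's climatology."""
    p = np.asarray(precip, float).reshape(-1, 12)    # one row per year
    clim = p[:baseline_years].mean(axis=0)           # per-month climatological mean
    std = p[:baseline_years].std(axis=0, ddof=1)     # per-month variability
    return (p - clim) / std

rng = np.random.default_rng(1)
precip = rng.gamma(2.0, 50.0, size=35 * 12)          # 35 years of monthly totals (mm)
z = monthly_zscores(precip, baseline_years=30)
```

Normalising each month against its own climatology removes the seasonal cycle, so a z-score of, say, -2 flags a month that is unusually dry for that time of year, which is the quantity of interest for food-security monitoring.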

  17. Applying Advances in GPM Radiometer Intercalibration and Algorithm Development to a Long-Term TRMM/GPM Global Precipitation Dataset

    NASA Astrophysics Data System (ADS)

    Berg, W. K.

    2016-12-01

    The Global Precipitation Measurement (GPM) mission's Core Observatory, launched in February 2014, provides a number of advances for satellite monitoring of precipitation, including a dual-frequency radar, high-frequency channels on the GPM Microwave Imager (GMI), and coverage over middle and high latitudes. The GPM concept, however, is about producing unified precipitation retrievals from a constellation of microwave radiometers to provide approximately 3-hourly global sampling. This involves intercalibrating the input brightness temperatures from the constellation radiometers, developing an a priori precipitation database using observations from the state-of-the-art GPM radiometer and radars, and accounting for sensor differences in the retrieval algorithm in a physically consistent way. Efforts by the GPM inter-satellite calibration working group, or XCAL team, and the radiometer algorithm team to create unified precipitation retrievals from the GPM radiometer constellation were fully implemented in the current version 4 GPM precipitation products. These include precipitation estimates from a total of seven conical-scanning and six cross-track-scanning radiometers as well as high spatial and temporal resolution global level 3 gridded products. Work is now underway to extend this unified constellation-based approach to the combined TRMM/GPM data record starting in late 1997. The goal is to create a long-term global precipitation dataset employing these state-of-the-art calibration and retrieval algorithm approaches. This new long-term global precipitation dataset will incorporate the physics provided by the combined GPM GMI and DPR sensors into the a priori database, extend prior TRMM constellation observations to high latitudes, and expand the available TRMM precipitation data to the full constellation of available conical- and cross-track-scanning radiometers. This combined TRMM/GPM precipitation data record will thus provide a high-quality, high-temporal-resolution global dataset for use in a wide variety of weather and climate research applications.

  18. Data and Models as Social Objects in the HydroShare System for Collaboration in the Hydrology Community and Beyond

    NASA Astrophysics Data System (ADS)

    Tarboton, D. G.; Idaszak, R.; Horsburgh, J. S.; Ames, D. P.; Goodall, J. L.; Band, L. E.; Merwade, V.; Couch, A.; Hooper, R. P.; Maidment, D. R.; Dash, P. K.; Stealey, M.; Yi, H.; Gan, T.; Castronova, A. M.; Miles, B.; Li, Z.; Morsy, M. M.; Crawley, S.; Ramirez, M.; Sadler, J.; Xue, Z.; Bandaragoda, C.

    2016-12-01

    How do you share and publish hydrologic data and models for a large collaborative project? HydroShare is a new, web-based system for sharing hydrologic data and models with specific functionality aimed at making collaboration easier. HydroShare has been developed with U.S. National Science Foundation support under the auspices of the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) to support the collaboration and community cyberinfrastructure needs of the hydrology research community. Within HydroShare, we have developed new functionality for creating datasets, describing them with metadata, and sharing them with collaborators. We cast hydrologic datasets and models as "social objects" that can be shared, collaborated around, annotated, published and discovered. In addition to data and model sharing, HydroShare supports web application programs (apps) that can act on data stored in HydroShare, just as software programs on your PC act on your data locally. This can free you from some of the limitations of local computing capacity and the challenges of installing and maintaining software on your own PC. HydroShare's web-based cyberinfrastructure can take work off your desk or laptop computer and onto infrastructure or "cloud" based data and processing servers. This presentation will describe HydroShare's collaboration functionality, which enables both public and private sharing with individual users and collaborative user groups, and makes it easier for collaborators to iterate on shared datasets and models, creating multiple versions along the way, and to publish them with a permanent landing page, metadata description, and citable Digital Object Identifier (DOI) when the work is complete. This presentation will also describe the web app architecture that supports interoperability with third-party servers functioning as application engines for the analysis and processing of big hydrologic datasets. While developed to support the cyberinfrastructure needs of the hydrology community, the informatics infrastructure for programmatic interoperability of web resources has a generality beyond hydrology that will be discussed.

  19. Can Simple Transmission Chains Foster Collective Intelligence in Binary-Choice Tasks?

    PubMed

    Moussaïd, Mehdi; Seyed Yahosseini, Kyanoush

    2016-01-01

    In many social systems, groups of individuals can find remarkably efficient solutions to complex cognitive problems, sometimes even outperforming a single expert. The success of the group, however, crucially depends on how the judgments of the group members are aggregated to produce the collective answer. A large variety of such aggregation methods have been described in the literature, such as averaging the independent judgments, relying on the majority, or setting up a group discussion. In the present work, we introduce a novel approach for aggregating judgments, the transmission chain, which has not yet been consistently evaluated in the context of collective intelligence. In a transmission chain, all group members have access to a unique collective solution and can improve it sequentially. Over repeated improvements, the collective solution that emerges reflects the judgments of every group member. We address the question of whether such a transmission chain can foster collective intelligence for binary-choice problems. In a series of numerical simulations, we explore the impact of various factors on the performance of the transmission chain, such as the group size, the model parameters, and the structure of the population. The performance of this method is compared to those of the majority rule and the confidence-weighted majority. Finally, we rely on two existing datasets of individuals performing series of binary decisions to evaluate the expected performances of the three methods empirically. We find that the parameter space where the transmission chain performs best rarely appears in real datasets. We conclude that the transmission chain is best suited for other types of problems, such as those that have cumulative properties.
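    A toy simulation contrasting a transmission chain with the majority rule can be sketched as follows. The overwrite-if-more-confident revision rule, the uniform confidences, and all parameter values are hypothetical simplifications for illustration, not the authors' actual model:

```python
import random

random.seed(7)

def trial(n=25, p=0.6):
    """One group: n private binary judgments (1 = correct) with random confidences."""
    judgments = [(1 if random.random() < p else 0, random.random())
                 for _ in range(n)]
    # Majority rule: aggregate the independent judgments directly.
    majority = 1 if sum(j for j, _ in judgments) > n / 2 else 0
    # Transmission chain: each member in turn overwrites the collective
    # answer only when more confident than the answer's current holder.
    ans, conf = judgments[0]
    for j, c in judgments[1:]:
        if c > conf:
            ans, conf = j, c
    return majority, ans

results = [trial() for _ in range(2000)]
maj_acc = sum(m for m, _ in results) / len(results)
chain_acc = sum(a for _, a in results) / len(results)
```

With confidence uncorrelated with accuracy, this chain reduces to trusting the single most confident member, so the majority rule dominates, consistent with the paper's finding that the chain pays off only in restricted parameter regimes.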

  20. Can Simple Transmission Chains Foster Collective Intelligence in Binary-Choice Tasks?

    PubMed Central

    Moussaïd, Mehdi; Seyed Yahosseini, Kyanoush

    2016-01-01

    In many social systems, groups of individuals can find remarkably efficient solutions to complex cognitive problems, sometimes even outperforming a single expert. The success of the group, however, crucially depends on how the judgments of the group members are aggregated to produce the collective answer. A large variety of such aggregation methods have been described in the literature, such as averaging the independent judgments, relying on the majority, or setting up a group discussion. In the present work, we introduce a novel approach for aggregating judgments, the transmission chain, which has not yet been consistently evaluated in the context of collective intelligence. In a transmission chain, all group members have access to a unique collective solution and can improve it sequentially. Over repeated improvements, the collective solution that emerges reflects the judgments of every group member. We address the question of whether such a transmission chain can foster collective intelligence for binary-choice problems. In a series of numerical simulations, we explore the impact of various factors on the performance of the transmission chain, such as the group size, the model parameters, and the structure of the population. The performance of this method is compared to those of the majority rule and the confidence-weighted majority. Finally, we rely on two existing datasets of individuals performing a series of binary decisions to evaluate the expected performances of the three methods empirically. We find that the parameter space where the transmission chain has the best performance rarely appears in real datasets. We conclude that the transmission chain is best suited for other types of problems, such as those that have cumulative properties. PMID:27880825
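The aggregation rules compared in these two records can be mimicked in a toy simulation. The sketch below is an illustrative assumption, not the authors' exact model: each member is independently correct with probability `p`, and in the chain each member overwrites the running collective answer with their own judgment with probability `adopt_prob`.

```python
import random

def majority_rule(judgments):
    # Collective answer = option chosen by most members (ties broken at random).
    ones = sum(judgments)
    if 2 * ones == len(judgments):
        return random.choice([0, 1])
    return 1 if 2 * ones > len(judgments) else 0

def transmission_chain(judgments, adopt_prob=0.5):
    # Members inspect the running collective answer in sequence and
    # overwrite it with their own judgment with probability adopt_prob.
    answer = judgments[0]
    for j in judgments[1:]:
        if random.random() < adopt_prob:
            answer = j
    return answer

def accuracy(method, p=0.6, n=9, trials=2000):
    # Monte Carlo estimate of collective accuracy: option 1 is "correct",
    # and each member is independently correct with probability p.
    hits = 0
    for _ in range(trials):
        judgments = [1 if random.random() < p else 0 for _ in range(n)]
        hits += method(judgments) == 1
    return hits / trials
```

Under this simplified dynamic the chain's answer is just one member's judgment, so its accuracy stays near `p`, while the majority rule amplifies individual accuracy with group size, consistent with the abstract's finding that the chain rarely dominates on binary choices.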

  1. Hierarchical Bayesian modelling of mobility metrics for hazard model input calibration

    NASA Astrophysics Data System (ADS)

    Calder, Eliza; Ogburn, Sarah; Spiller, Elaine; Rutarindwa, Regis; Berger, Jim

    2015-04-01

    In this work we present a method to constrain flow mobility input parameters for pyroclastic flow models using hierarchical Bayesian modelling of standard mobility metrics such as H/L and flow volume. The advantage of hierarchical modelling is that it can leverage the information in a global dataset for a particular mobility metric in order to reduce the uncertainty in modelling an individual volcano, which is especially important where individual volcanoes have only sparse datasets. We use compiled pyroclastic flow runout data from Colima, Merapi, Soufriere Hills, Unzen and Semeru volcanoes, presented in an open-source database, FlowDat (https://vhub.org/groups/massflowdatabase). While the exact relationship between flow volume and friction varies somewhat between volcanoes, dome collapse flows originating from the same volcano exhibit similar mobility relationships. Instead of fitting separate regression models for each volcano's dataset, we use a variation of the hierarchical linear model (Kass and Steffey, 1989). The model has a hierarchical structure with two levels: all dome collapse flows, and dome collapse flows at specific volcanoes. The hierarchical model allows us to assume that the flows at specific volcanoes share a common distribution of regression slopes, then solves for that distribution. We present comparisons of the 95% confidence intervals on the individual regression lines for the dataset from each volcano as well as those obtained from the hierarchical model. The results clearly demonstrate the advantage of considering global datasets using this technique. The technique is demonstrated here for mobility metrics, but can be applied to many other global datasets of volcanic parameters. In particular, such methods can provide a means to better constrain parameters for volcanoes for which we have only sparse data, a ubiquitous problem in volcanology.
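The partial-pooling idea behind the hierarchical model can be sketched with a crude empirical-Bayes shrinkage of per-group regression slopes toward the cross-group mean. This is a stand-in for the full Kass and Steffey model, and the per-group variance proxy (`1/n`) is an assumption for illustration only:

```python
import statistics

def ols_slope(xs, ys):
    # Ordinary least-squares slope of y on x.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def shrunk_slopes(groups):
    """Empirical-Bayes partial pooling: shrink each group's OLS slope
    toward the cross-group mean, more strongly for smaller groups.
    groups maps a group name (e.g. a volcano) to (xs, ys)."""
    slopes = {g: ols_slope(xs, ys) for g, (xs, ys) in groups.items()}
    grand = statistics.fmean(slopes.values())
    between = statistics.pvariance(list(slopes.values())) or 1e-9
    pooled = {}
    for g, (xs, ys) in groups.items():
        within = 1.0 / len(xs)      # crude sampling-variance proxy (assumption)
        w = between / (between + within)
        pooled[g] = w * slopes[g] + (1 - w) * grand
    return pooled
```

Groups with few observations get pulled most toward the global mean, which is exactly the "leverage the global dataset for sparse volcanoes" benefit the abstract describes.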

  2. A data skimming service for locally resident analysis data

    NASA Astrophysics Data System (ADS)

    Cranshaw, J.; Gardner, R. W.; Gieraltowski, J.; Malon, D.; Mambelli, M.; May, E.

    2008-07-01

    A Data Skimming Service (DSS) is a site-level service for rapid event filtering and selection from locally resident datasets based on metadata queries to associated 'tag' databases. In US ATLAS, we expect most if not all of the AOD-based datasets to be replicated to each of the five Tier 2 regional facilities in the US Tier 1 'cloud' coordinated by Brookhaven National Laboratory. Entire datasets will consist of on the order of several terabytes of data, and providing easy, quick access to skimmed subsets of these data will be vital to physics working groups. Typically, physicists will be interested in portions of the complete datasets, selected according to event-level attributes (number of jets, missing Et, etc.) and content (specific analysis objects for subsequent processing). In this paper we describe methods used to classify data (metadata tag generation) and to store these results in a local database. Next we discuss a general framework that includes methods for accessing this information, defining skims, specifying event output content, accessing locally available storage through a variety of interfaces (SRM, dCache/dccp, gridftp), accessing remote storage elements as specified, and user job submission tools through local or grid schedulers. The advantages of the DSS are the ability to quickly 'browse' datasets and design skims, for example, pre-adjusting cuts to get to a desired skim level with minimal use of compute resources, and to encode these analysis operations in a database for re-analysis and archival purposes. Additionally the framework has provisions to operate autonomously in the event that external, central resources are not available, and to provide, as a reduced package, a minimal skimming service tailored to the needs of small Tier 3 centres or individual users.
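The tag-based skim definition described above can be sketched with an in-memory database. The table layout and attribute names here are illustrative assumptions, not the actual ATLAS TAG schema:

```python
import sqlite3

# Hypothetical event-level tag table: one row per event with summary attributes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (event_id INTEGER, n_jets INTEGER, missing_et REAL)")
conn.executemany("INSERT INTO tags VALUES (?, ?, ?)",
                 [(1, 2, 35.0), (2, 4, 80.5), (3, 3, 120.0), (4, 1, 10.2)])

def define_skim(conn, min_jets, min_met):
    # A "skim" is just the list of event ids passing the metadata cuts;
    # the DSS would then fetch only those events from local storage.
    cur = conn.execute(
        "SELECT event_id FROM tags WHERE n_jets >= ? AND missing_et >= ?"
        " ORDER BY event_id",
        (min_jets, min_met))
    return [row[0] for row in cur]

print(define_skim(conn, 3, 50.0))  # → [2, 3]
```

Because cuts run against the small tag database rather than the terabyte-scale AOD files, a physicist can iterate on selections cheaply before committing compute resources, which is the "browse and pre-adjust cuts" advantage the abstract highlights.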

  3. Dose calculation for photon-emitting brachytherapy sources with average energy higher than 50 keV: report of the AAPM and ESTRO.

    PubMed

    Perez-Calatayud, Jose; Ballester, Facundo; Das, Rupak K; Dewerd, Larry A; Ibbott, Geoffrey S; Meigooni, Ali S; Ouhib, Zoubir; Rivard, Mark J; Sloboda, Ron S; Williamson, Jeffrey F

    2012-05-01

    Recommendations of the American Association of Physicists in Medicine (AAPM) and the European Society for Radiotherapy and Oncology (ESTRO) on dose calculations for high-energy (average energy higher than 50 keV) photon-emitting brachytherapy sources are presented, including the physical characteristics of specific (192)Ir, (137)Cs, and (60)Co source models. This report has been prepared by the High Energy Brachytherapy Source Dosimetry (HEBD) Working Group. This report includes considerations in the application of the TG-43U1 formalism to high-energy photon-emitting sources with particular attention to phantom size effects, interpolation accuracy dependence on dose calculation grid size, and dosimetry parameter dependence on source active length. Consensus datasets for commercially available high-energy photon sources are provided, along with recommended methods for evaluating these datasets. Recommendations on dosimetry characterization methods, mainly using experimental procedures and Monte Carlo, are established and discussed. Also included are methodological recommendations on detector choice, detector energy response characterization and phantom materials, and measurement specification methodology. Uncertainty analyses are discussed and recommendations for high-energy sources without consensus datasets are given. Recommended consensus datasets for high-energy sources have been derived for sources that were commercially available as of January 2010. Data are presented according to the AAPM TG-43U1 formalism, with modified interpolation and extrapolation techniques of the AAPM TG-43U1S1 report for the 2D anisotropy function and radial dose function.
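As a reminder of the formalism the report applies, the TG-43U1 2D dose-rate equation combines the consensus quantities named above (this is the standard published formalism, not numbers specific to this report):

```latex
\dot{D}(r,\theta) = S_K \,\Lambda\,
  \frac{G_L(r,\theta)}{G_L(r_0,\theta_0)}\, g_L(r)\, F(r,\theta),
\qquad r_0 = 1\,\mathrm{cm},\ \theta_0 = 90^\circ
```

where $S_K$ is the air-kerma strength, $\Lambda$ the dose-rate constant, $G_L$ the line-source geometry function, $g_L(r)$ the radial dose function, and $F(r,\theta)$ the 2D anisotropy function; the consensus datasets in the report supply $\Lambda$, $g_L$, and $F$ for each source model.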

  4. Fostering the exchange of real world data across different countries to answer primary care research questions: an UNLOCK study from the IPCRG.

    PubMed

    Cragg, Liza; Williams, Siân; van der Molen, Thys; Thomas, Mike; Correia de Sousa, Jaime; Chavannes, Niels H

    2018-03-08

    There is growing awareness amongst healthcare planners, providers and researchers of the need to make better use of routinely collected health data by translating it into actionable information that improves efficiency of healthcare and patient outcomes. There is also increased acceptance of the importance of real world research that recruits patients representative of primary care populations and evaluates interventions realistically delivered by primary care professionals. The UNLOCK Group is an international collaboration of primary care researchers and practitioners from 15 countries. It has coordinated and shared datasets of diagnostic and prognostic variables for COPD and asthma to answer research questions meaningful to professionals working in primary care over a 6-year period. Over this time the UNLOCK Group has undertaken several studies using data from unselected primary care populations from diverse contexts to evaluate the burden of disease, multiple morbidities, treatment and follow-up. However, practical and structural constraints have hampered the UNLOCK Group's ability to translate research ideas into studies. This study explored the constraints, challenges and successes experienced by the UNLOCK Group and its participants' learning as researchers and primary care practitioners collaborating to answer primary care research questions. The study identified lessons for future studies and collaborations that require data sharing across borders. It also explored specific challenges to fostering the exchange of primary care data in comparison to other datasets such as public health, prescribing or hospital data and mechanisms that may be used to overcome these.

  5. Water quality studied in areas of unconventional oil and gas development, including areas where hydraulic fracturing techniques are used, in the United States

    USGS Publications Warehouse

    Susong, David D.; Gallegos, Tanya J.; Oelsner, Gretchen P.

    2012-01-01

    The U.S. Geological Survey (USGS) John Wesley Powell Center for Analysis and Synthesis is hosting an interdisciplinary working group of USGS scientists to conduct a temporal and spatial analysis of surface-water and groundwater quality in areas of unconventional oil and gas development. The analysis uses existing national and regional datasets to describe water quality, evaluate water-quality changes over time where there are sufficient data, and evaluate spatial and temporal data gaps.

  6. Building Climate Service Capacities in Eastern Africa with CHIRP and GeoCLIM

    NASA Astrophysics Data System (ADS)

    Pedreros, D. H.; Magadzire, T.; Funk, C. C.; Verdin, J. P.; Peterson, P.; Landsfeld, M.; Husak, G. J.

    2013-12-01

    In developing countries there is a great need for capacity building within national and regional climate agencies to develop and analyze historical and real-time gridded rainfall datasets. These datasets are of key importance for monitoring climate and agricultural food production at decadal and seasonal time scales, and for informing local decision makers. The Famine Early Warning Systems Network (FEWS NET), working together with the U.S. Geological Survey (USGS) and the Climate Hazards Group (CHG) of the University of California, Santa Barbara, has developed an integrated set of data products and tools to support the development of African climate services. The core data product is the Climate Hazards Group Infrared Precipitation (CHIRP) dataset, a new rainfall dataset resulting from the blending of satellite-estimated precipitation with a high-resolution precipitation climatology. The CHIRP depicts rainfall as five-day totals at 5 km spatial resolution from 1981 to present. The CHG is developing and deploying a standalone tool, the GeoCLIM, which will allow national and regional meteorological agencies to blend the CHIRP with station observations, run simple crop water balance models, and conduct climatological, trend, and time series analyses. Blending satellite estimates and gauge data helps overcome limited in situ observing networks. Furthermore, the GeoCLIM combines rainfall, soil, and evapotranspiration data with crop hydrological requirements to calculate the agricultural water balance, presented as the Water Requirement Satisfaction Index (WRSI). The WRSI is a measure of the degree to which a crop's hydrological requirements have been satisfied by rainfall. We present the results of a training session for personnel of the East African Intergovernmental Authority on Development Climate Prediction and Applications Center. The two-week training program included the use of the GeoCLIM to improve the CHIRP using station data, and to calculate and analyze trends in rainfall, WRSI, and drought frequency in the region.

  7. Data Publication: A Partnership between Scientists, Data Managers and Librarians

    NASA Astrophysics Data System (ADS)

    Raymond, L.; Chandler, C.; Lowry, R.; Urban, E.; Moncoiffe, G.; Pissierssens, P.; Norton, C.; Miller, H.

    2012-04-01

    Current literature on the topic of data publication suggests that success is best achieved when there is a partnership between scientists, data managers, and librarians. The Marine Biological Laboratory/Woods Hole Oceanographic Institution (MBLWHOI) Library and the Biological and Chemical Oceanography Data Management Office (BCO-DMO) have developed tools and processes to automate the ingestion of metadata from BCO-DMO for deposit with datasets into the Institutional Repository (IR) Woods Hole Open Access Server (WHOAS). The system also incorporates functionality for BCO-DMO to request a Digital Object Identifier (DOI) from the Library. This partnership allows the Library to work with a trusted data repository to ensure high quality data while the data repository utilizes library services and is assured of a permanent archive of the copy of the data extracted from the repository database. The assignment of persistent identifiers enables accurate data citation. The Library can assign a DOI to appropriate datasets deposited in WHOAS. A primary activity is working with authors to deposit datasets associated with published articles. The DOI would ideally be assigned before submission and be included in the published paper so readers can link directly to the dataset, but DOIs are also being assigned to datasets related to articles after publication. WHOAS metadata records link the article to the datasets and the datasets to the article. The assignment of DOIs has enabled another important collaboration with Elsevier, publisher of educational and professional science journals. Elsevier can now link from articles in the Science Direct database to the datasets available from WHOAS that are related to that article. The data associated with the article are freely available from WHOAS and accompanied by a Dublin Core metadata record. 
In addition, the Library has worked with researchers to deposit datasets in WHOAS that are not appropriate for national, international, or domain-specific data repositories. These datasets currently include audio, text and image files. This research is being conducted by a team of librarians, data managers and scientists who are collaborating with representatives from the Scientific Committee on Oceanic Research (SCOR) and the International Oceanographic Data and Information Exchange (IODE) of the Intergovernmental Oceanographic Commission (IOC). The goal is to identify best practices for tracking data provenance and clearly attributing credit to data collectors/providers.

  8. EnviroAtlas - Austin, TX - Block Groups

    EPA Pesticide Factsheets

    This EnviroAtlas dataset is the base layer for the Austin, TX EnviroAtlas area. The block groups are from the US Census Bureau and are included/excluded based on EnviroAtlas criteria described in the procedure log. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  9. EnviroAtlas - Austin, TX - Demographics by Block Group Web Service

    EPA Pesticide Factsheets

    This EnviroAtlas web service supports research and online mapping activities related to EnviroAtlas (https://enviroatlas.epa.gov/EnviroAtlas). This EnviroAtlas dataset is a summary of key demographic groups for the EnviroAtlas community. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  10. Blood is Thicker Than Water

    PubMed Central

    Carmichael, Sarah; Rijpma, Auke

    2017-01-01

    This article introduces a new dataset of historical family characteristics based on ethnographic literature. The novelty of the dataset lies in the fact that it is constructed at the level of the ethnic group. To test the possibilities of the dataset, we construct a measure of family constraints on women’s agency from it and explore its correlation to a number of geographical factors. PMID:28490859

  11. The allometric exponent for scaling clearance varies with age: a study on seven propofol datasets ranging from preterm neonates to adults.

    PubMed

    Wang, Chenguang; Allegaert, Karel; Peeters, Mariska Y M; Tibboel, Dick; Danhof, Meindert; Knibbe, Catherijne A J

    2014-01-01

    For scaling clearance between adults and children, allometric scaling with a fixed exponent of 0.75 is often applied. In this analysis, we performed a systematic study on the allometric exponent for scaling propofol clearance between two subpopulations selected from neonates, infants, toddlers, children, adolescents and adults. Seven propofol studies were included in the analysis (neonates, infants, toddlers, children, adolescents, adults1 and adults2). In a systematic manner, two out of the six study populations were selected, resulting in 15 combined datasets. In addition, the data of the seven studies were regrouped into five age groups (FDA Guidance 1998), from which four combined datasets were prepared, each consisting of one paediatric age group and the adult group. In each of these 19 combined datasets, the allometric scaling exponent for clearance was estimated using population pharmacokinetic modelling (NONMEM 7.2). The allometric exponent for propofol clearance varied between 1.11 and 2.01 in cases where the neonate dataset was included. When two paediatric datasets were analyzed, the exponent varied between 0.2 and 2.01, while it varied between 0.56 and 0.81 when the adult population and a paediatric dataset other than neonates were selected. Scaling from adults to adolescents, children, infants and neonates resulted in exponents of 0.74, 0.70, 0.60 and 1.11 respectively. For scaling clearance, ¾ allometric scaling may be of value for scaling between adults and adolescents or children, while it can be used neither for neonates nor between two paediatric populations. For scaling to neonates an exponent between 1 and 2 was identified. © 2013 The British Pharmacological Society.
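The fixed-exponent allometric relation evaluated in this study is simple to state. The sketch below uses the standard form with hypothetical reference values (a 70 kg adult, a clearance of 100 L/h), not the paper's fitted model:

```python
def scale_clearance(cl_ref, wt_kg, wt_ref_kg=70.0, exponent=0.75):
    """Allometric scaling of clearance: CL = CL_ref * (WT / WT_ref) ** exponent."""
    return cl_ref * (wt_kg / wt_ref_kg) ** exponent

# With the fixed 0.75 exponent, halving body weight reduces clearance by a
# factor of 0.5 ** 0.75 (about 0.59). Exponents above 1, as estimated here
# for scaling to neonates, shrink predicted clearance much more steeply.
adult_cl = 100.0                                        # hypothetical CL, L/h
child_cl = scale_clearance(adult_cl, 35.0)              # fixed 0.75 exponent
neonate_cl = scale_clearance(adult_cl, 3.5, exponent=1.11)
```

Comparing the two calls shows why the choice of exponent matters clinically: for a 3.5 kg neonate, an exponent of 1.11 predicts roughly a third of the clearance that the fixed 0.75 exponent would.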

  12. GES DISC Datalist Improves Earth Science Data Discoverability

    NASA Astrophysics Data System (ADS)

    Li, A.; Teng, W. L.; Hegde, M.; Petrenko, M.; Shen, S.; Shie, C. L.; Liu, Z.; Hearty, T.; Bryant, K.; Vollmer, B.; Meyer, D. J.

    2017-12-01

    At the American Geophysical Union (AGU) 2016 Fall Meeting, the Goddard Earth Sciences Data and Information Services Center (GES DISC) unveiled a novel way to access data: the Datalist. Currently, a datalist is a collection of predefined data variables from one or more archived datasets, curated by our subject matter experts (SMEs). Our science support team has curated a predefined Hurricane Datalist and received very positive feedback from the user community. A datalist uses the same architecture as our new website, has the same look and feel as other datasets on our web site, and provides one-stop shopping for data, metadata, citation, documentation, visualization and other available services. Since the last AGU Meeting, we have developed a few new datalists corresponding to the Big Earth Data Initiative (BEDI) Societal Benefit Areas and to A-Train data; we now have four datalists: Hurricane, Wind Energy, Greenhouse Gas and A-Train. We have also started working with our User Working Group members to create their favorite datalists, and with other DAACs to explore including their products in our datalists, which may lead to federated (cross-DAAC) datalists in the future. Since our datalist prototype effort was a success, we are planning to make the datalist operational. A common metadata model is essential to support the datalist, and it will also be the foundation of the federated datalist. We mapped our datalist metadata model to the unpublished UMM (Unified Metadata Model)-Var (Variable) model (June version) and found that UMM-Var, together with UMM-C (Collection) and possibly UMM-S (Service), will meet our basic requirements. For example, dataset shortname and version are already specified in UMM-C, while variable name, long name, units and dimensions are all specified in UMM-Var. UMM-Var also provides ScienceKeywords to allow tagging at the variable level and Characteristics for optional variable characteristics; Measurements is useful for grouping variables, and Set is promising for defining a datalist. Finally, the UMM-S model, which specifies the services available for a variable, will be very beneficial. In summary, UMM-Var, UMM-C and UMM-S are the basis of the federated datalist, and the development and deployment of the datalist will contribute to the evolution of the UMM.
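As a rough illustration of the variable-level metadata model described above, a datalist entry might carry UMM-C-style dataset fields alongside UMM-Var-style variable fields. The class names and the sample product below are hypothetical placeholders, not GES DISC's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class VariableRef:
    # Dataset-level fields (UMM-C style) plus variable-level fields
    # (UMM-Var style), as named in the abstract.
    dataset_shortname: str
    dataset_version: str
    name: str
    long_name: str
    units: str
    dimensions: tuple
    science_keywords: list = field(default_factory=list)

@dataclass
class Datalist:
    # An SME-curated collection of variables drawn from several datasets.
    title: str
    curator: str
    variables: list = field(default_factory=list)

hurricane = Datalist("Hurricane", "GES DISC subject matter expert", [
    VariableRef("EXAMPLE_PRECIP", "06", "precipitation",
                "Merged satellite-gauge precipitation estimate", "mm/hr",
                ("time", "lat", "lon"),
                ["EARTH SCIENCE > ATMOSPHERE > PRECIPITATION"])])
```

Because each entry names its parent dataset and version explicitly, a federated (cross-DAAC) datalist reduces to a list of such references resolvable at any archive, which is why the abstract treats a shared UMM mapping as the foundation of federation.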

  13. GES DISC Datalist Improves Earth Science Data Discoverability

    NASA Technical Reports Server (NTRS)

    Li, A.; Teng, W.; Hegde, M.; Petrenko, M.; Shen, S.; Shie, C.; Liu, Z.; Hearty, T.; Bryant, K.; Vollmer, B.; Meyer, D. J.

    2017-01-01

    At the American Geophysical Union (AGU) 2016 Fall Meeting, the Goddard Earth Sciences Data and Information Services Center (GES DISC) unveiled a novel way to access data: the Datalist. Currently, a datalist is a collection of predefined data variables from one or more archived datasets, curated by our subject matter experts (SMEs). Our science support team has curated a predefined Hurricane Datalist and received very positive feedback from the user community. A datalist uses the same architecture as our new website, has the same look and feel as other datasets on our web site, and provides one-stop shopping for data, metadata, citation, documentation, visualization and other available services. Since the last AGU Meeting, we have developed a few new datalists corresponding to the Big Earth Data Initiative (BEDI) Societal Benefit Areas and to A-Train data; we now have four datalists: Hurricane, Wind Energy, Greenhouse Gas and A-Train. We have also started working with our User Working Group members to create their favorite datalists, and with other DAACs to explore including their products in our datalists, which may lead to federated (cross-DAAC) datalists in the future. Since our datalist prototype effort was a success, we are planning to make the datalist operational. A common metadata model is essential to support the datalist, and it will also be the foundation of the federated datalist. We mapped our datalist metadata model to the unpublished UMM (Unified Metadata Model)-Var (Variable) model (June version) and found that UMM-Var, together with UMM-C (Collection) and possibly UMM-S (Service), will meet our basic requirements. For example, dataset shortname and version are already specified in UMM-C, while variable name, long name, units and dimensions are all specified in UMM-Var. UMM-Var also provides ScienceKeywords to allow tagging at the variable level and Characteristics for optional variable characteristics; Measurements is useful for grouping variables, and Set is promising for defining a datalist. Finally, the UMM-S model, which specifies the services available for a variable, will be very beneficial. In summary, UMM-Var, UMM-C and UMM-S are the basis of the federated datalist, and the development and deployment of the datalist will contribute to the evolution of the UMM.

  14. GTN-G, WGI, RGI, DCW, GLIMS, WGMS, GCOS - What's all this about? (Invited)

    NASA Astrophysics Data System (ADS)

    Paul, F.; Raup, B. H.; Zemp, M.

    2013-12-01

    In a large collaborative effort, the glaciological community has compiled a new and spatially complete global dataset of glacier outlines, the so-called Randolph Glacier Inventory or RGI. Despite its regional shortcomings in quality (e.g. in regard to geolocation, generalization, and interpretation), this dataset was heavily used for global-scale modelling applications (e.g. determination of total glacier volume and glacier contribution to sea-level rise) in support of the forthcoming 5th Assessment Report (AR5) of Working Group I of the IPCC. The RGI is a merged dataset that is largely based on the GLIMS database and several new datasets provided by the community (both are mostly derived from satellite data), as well as the Digital Chart of the World (DCW) and glacier attribute information (location, size) from the World Glacier Inventory (WGI). There are now two key tasks to be performed: (1) improving the quality of the RGI in all regions where the outlines do not meet the quality required for local-scale applications, and (2) integrating the RGI into the GLIMS glacier database to improve its spatial completeness. While (1) again requires a huge effort but is already ongoing, (2) is mainly a technical issue that is nearly solved. Apart from this technical dimension, there is also a more political or structural one. While GLIMS is responsible for the remote sensing and glacier inventory part (Tier 5) of the Global Terrestrial Network for Glaciers (GTN-G) within the Global Climate Observing System (GCOS), the World Glacier Monitoring Service (WGMS) collects and disseminates the field observations. Given the new global products derived from satellite data (e.g. elevation changes and velocity fields) and the community's wish to keep a snapshot dataset such as the RGI available, how to make all these datasets available to the community without duplicating efforts, while making best use of the very limited financial resources available, must now be discussed. This overview presentation describes the currently available datasets, clarifies the terminology and the international framework, and suggests a way forward to best serve the community.

  15. Lineages with long durations are old and morphologically average: an analysis using multiple datasets.

    PubMed

    Liow, Lee Hsiang

    2007-04-01

    Lineage persistence is as central to biology as evolutionary change. Important questions regarding persistence include: why do some lineages outlive their relatives, neither becoming extinct nor evolving into separate lineages? Do these long-duration lineages have distinctive ecological or morphological traits that correlate with their geologic durations and potentially aid their survival? In this paper, I test the hypothesis that lineages (species and higher taxa) with longer geologic durations have morphologies that are more average than expected by chance alone. I evaluate this hypothesis for both individual lineages with longer durations and groups of lineages with longer durations, using more than 60 published datasets of animals with adequate fossil records. Analyses presented here show that groups of lineages with longer durations fall empirically into one of three theoretically possible scenarios, namely: (1) the morphology of groups of longer duration lineages is closer to the grand average of their inclusive group, that is, their relative morphological distance is smaller than expected by chance alone, when compared with rarified samples of their shorter duration relatives (a negative group morpho-duration distribution); (2) the relative morphological distance of groups of longer duration lineages is no different from rarified samples of their shorter duration relatives (a null group morpho-duration distribution); and (3) the relative morphological distance of groups of longer duration lineages is greater than expected when compared with rarified samples of their shorter duration relatives (a positive group morpho-duration distribution). Datasets exhibiting negative group morpho-duration distributions predominate. However, lineages with higher ranks in the Linnean hierarchy demonstrate positive morpho-duration distributions more frequently. 
The relative morphological distance of individual longer duration lineages is no different from that of rarified samples of their shorter duration relatives (a null individual morpho-duration distribution) for the majority of datasets studied. Contrary to the common idea that very persistent lineages are special or unique in some significant way, both the results from analyses of long-duration lineages as groups and individuals show that they are morphologically average. Persistent lineages often arise early in a group's history, even though there is no prior expectation for this tendency in datasets of extinct groups. The implications of these results for diversification histories and niche preemption are discussed.

  16. EnviroAtlas - Austin, TX - Residents with Minimal Potential Window Views of Trees by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset shows the total block group population and the percentage of the block group population that has little access to potential window views of trees at home. Having little potential access to window views of trees is defined as having no trees & forest land cover within 50 meters. The window views are considered potential because the procedure does not account for presence or directionality of windows in one's home. Forest is defined as Trees & Forest. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  17. Earthquake Rate Model 2 of the 2007 Working Group for California Earthquake Probabilities, Magnitude-Area Relationships

    USGS Publications Warehouse

    Stein, Ross S.

    2008-01-01

    The Working Group for California Earthquake Probabilities must transform fault lengths and their slip rates into earthquake moment-magnitudes. First, the down-dip coseismic fault dimension, W, must be inferred. We have chosen the Nazareth and Hauksson (2004) method, which uses the depth above which 99% of the background seismicity occurs to assign W. The product of the observed or inferred fault length, L, with the down-dip dimension, W, gives the fault area, A. We must then use a scaling relation to relate A to moment-magnitude, Mw. We assigned equal weight to the Ellsworth B (Working Group on California Earthquake Probabilities, 2003) and Hanks and Bakun (2007) equations. The former uses a single logarithmic relation fitted to the M ≥ 6.5 portion of the data of Wells and Coppersmith (1994); the latter uses a bilinear relation with a slope change at M=6.65 (A=537 km2) and also was tested against a greatly expanded dataset for large continental transform earthquakes. We also present an alternative power law relation, which fits the newly expanded Hanks and Bakun (2007) data best, and captures the change in slope that Hanks and Bakun attribute to a transition from area- to length-scaling of earthquake slip. We opted not to use the alternative relation for the current model. The selections and weights were developed by unanimous consensus of the Executive Committee of the Working Group, following an open meeting of scientists, a solicitation of outside opinions from additional scientists, and presentation of our approach to the Scientific Review Panel. The magnitude-area relations and their assigned weights are unchanged from those used in Working Group (2003).
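
As a rough illustration of the bilinear magnitude-area scaling described above, the sketch below constrains both branches to pass through the slope-change point stated in the abstract (Mw 6.65 at A = 537 km2). The coefficients are derived from that constraint for illustration only; they are not the published Ellsworth B or Hanks and Bakun (2007) coefficients.

```python
import math

A_STAR = 537.0   # km^2, slope break stated in the abstract
M_STAR = 6.65    # Mw at the break, per the abstract

def mw_bilinear(area_km2):
    """Bilinear magnitude-area relation (illustrative coefficients).

    Slope 1 (in log10 A) below the break and 4/3 above it, with both
    branches constrained to pass through (A_STAR, M_STAR) so they
    join continuously.
    """
    if area_km2 <= A_STAR:
        return math.log10(area_km2) + (M_STAR - math.log10(A_STAR))
    return (4.0 / 3.0) * math.log10(area_km2) + (M_STAR - (4.0 / 3.0) * math.log10(A_STAR))

def mw_from_fault(length_km, width_km):
    """Mw from fault length L and inferred down-dip width W via A = L * W."""
    return mw_bilinear(length_km * width_km)
```

A 100 km by 10 km fault (A = 1000 km2) thus scales through the steeper upper branch of the relation.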

  18. ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems

    PubMed Central

    Expósito, Roberto R.

    2018-01-01

    Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search for interesting biclusters in binary datasets, which are very popular in fields such as genetics, marketing and text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proven accurate by several studies, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/. PMID:29608567
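
The pattern-grouping idea behind BiBit can be sketched in a few lines: every pair of rows seeds a candidate pattern (their bitwise AND), and all rows containing that pattern form a bicluster. This toy version is sequential and ignores ParBiBit's thread/MPI parallelisation; the function name and thresholds are illustrative.

```python
from itertools import combinations

def bibit_like(rows, min_cols=2, min_rows=2):
    """Toy binary biclustering in the spirit of BiBit's pattern grouping.

    Each row is an int bitmask over the columns.  Every pair of rows
    seeds a candidate pattern (bitwise AND); all rows containing that
    pattern form a bicluster.
    """
    seen = set()
    biclusters = []
    for i, j in combinations(range(len(rows)), 2):
        pattern = rows[i] & rows[j]
        if bin(pattern).count("1") < min_cols or pattern in seen:
            continue
        seen.add(pattern)
        members = [k for k, r in enumerate(rows) if r & pattern == pattern]
        if len(members) >= min_rows:
            biclusters.append((pattern, members))
    return biclusters

rows = [0b1110, 0b0111, 0b1111, 0b0001]
biclusters = bibit_like(rows)
```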

  20. Molecular blood group typing in Banjar, Jawa, Mandailing and Kelantan Malays in Peninsular Malaysia.

    PubMed

    Abd Gani, Rahayu; Manaf, Siti Mariam; Zafarina, Zainuddin; Panneerchelvam, Sundararajulu; Chambers, Geoffrey Keith; Norazmi, Mohd Noor; Edinur, Hisham Atan

    2015-08-01

    In this study we genotyped ABO, Rhesus, Kell, Kidd and Duffy blood group loci in DNA samples from 120 unrelated individuals representing four Malay subethnic groups living in Peninsular Malaysia (Banjar: n = 30, Jawa: n = 30, Mandailing: n = 30 and Kelantan: n = 30). Analyses were performed using commercial polymerase chain reaction-sequence specific primer (PCR-SSP) typing kits (BAG Health Care GmbH, Lich, Germany). Overall, the present study has successfully compiled blood group datasets for the four Malay subethnic groups and used the datasets for studying ancestry and health. Copyright © 2015 Elsevier Ltd. All rights reserved.

  1. Large-Scale Pattern Discovery in Music

    NASA Astrophysics Data System (ADS)

    Bertin-Mahieux, Thierry

    This work focuses on extracting patterns in musical data from very large collections. The problem is split into two parts. First, we build such a large collection, the Million Song Dataset, to provide researchers access to commercial-size datasets. Second, we use this collection to study cover song recognition, which involves finding harmonic patterns from audio features. Regarding the Million Song Dataset, we detail how we built the original collection from an online API, and how we encouraged other organizations to participate in the project. The result is the largest research dataset with heterogeneous sources of data available to music technology researchers. We demonstrate some of its potential and discuss the impact it already has on the field. On cover song recognition, we must revisit the existing literature since there are no publicly available results on a dataset of more than a few thousand entries. We present two solutions to tackle the problem, one using a hashing method, and one using a higher-level feature computed from the chromagram (dubbed the 2DFTM). We further investigate the 2DFTM since it has potential to be a relevant representation for any task involving audio harmonic content. Finally, we discuss the future of the dataset and the hope of seeing more work making use of the different sources of data that are linked in the Million Song Dataset. Regarding cover songs, we explain how this might be a first step towards defining a harmonic manifold of music, a space where harmonic similarities between songs would be more apparent.
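
The 2DFTM mentioned above can be sketched directly: take the magnitude of the 2D Fourier transform of a beat-synchronous chroma patch. Because the magnitude discards phase, the feature is unchanged by circular shifts of the pitch-class axis, i.e. by key transposition. A minimal sketch; the patch size is illustrative.

```python
import numpy as np

def twod_ftm(chroma_patch):
    """2D Fourier transform magnitude (2DFTM) of a chroma patch.

    Taking the magnitude discards phase, so the feature is invariant
    to circular shifts along both axes -- in particular to key
    transposition (a circular shift of the 12 pitch classes).
    """
    return np.abs(np.fft.fft2(chroma_patch))

rng = np.random.default_rng(0)
patch = rng.random((12, 75))            # 12 pitch classes x 75 beats
transposed = np.roll(patch, 3, axis=0)  # transpose up 3 semitones
```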

  2. NASA Astrophysics Data System (ADS)

    2018-01-01

    The test dataset was also useful to compare visual range estimates carried out by the Koschmieder equation and visibility measured at the Milano-Linate airport. It is worth noting that in this work the test dataset was used primarily for checking the proposed methodology and was not meant to give an assessment of bext and VR in Milan for a wintertime period, as done by Vecchi et al. [in press], who applied the tailored equation to a larger aerosol dataset.
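
For reference, the Koschmieder equation relates visual range VR to the extinction coefficient bext through the conventional 2% contrast threshold of the human eye: VR = ln(1/0.02)/bext ≈ 3.912/bext. A minimal sketch:

```python
import math

CONTRAST_THRESHOLD = 0.02  # conventional 2% threshold of the human eye

def visual_range_km(b_ext_inv_km):
    """Koschmieder visual range (km) from extinction coefficient (km^-1)."""
    return math.log(1.0 / CONTRAST_THRESHOLD) / b_ext_inv_km
```

With bext = 1.0 km^-1 this gives a visual range of about 3.9 km.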

  3. EnviroAtlas - Austin, TX - Park Access by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset shows the block group population that is within and beyond an easy walking distance (500m) of a park entrance. Park entrances were included in this analysis if they were within 5km of the EnviroAtlas community boundary. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
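
The aggregation behind such a metric can be sketched as follows: for each home, find the distance to the nearest park entrance and accumulate population within the 500 m cutoff. This toy uses planar coordinates and straight-line distance, whereas the actual EnviroAtlas analysis works with geographic data and walking distance; all names and values are illustrative.

```python
import math

def share_within_walking(homes, park_entrances, max_dist_m=500.0):
    """Share of population within max_dist_m of the nearest park entrance.

    homes: list of (x, y, population) in metres (planar toy coordinates);
    park_entrances: list of (x, y).
    """
    total = sum(pop for _, _, pop in homes)
    near = 0.0
    for x, y, pop in homes:
        d = min(math.hypot(x - px, y - py) for px, py in park_entrances)
        if d <= max_dist_m:
            near += pop
    return near / total
```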

  4. EnviroAtlas - Austin, TX - Historic Places by Census Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset portrays the total number of historic places located within each Census Block Group (CBG). The historic places data were compiled from the National Register of Historic Places, which provides official federal lists of districts, sites, buildings, structures and objects significant to American history, architecture, archeology, engineering, and culture. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  5. Hierarchical Adaptive Means (HAM) clustering for hardware-efficient, unsupervised and real-time spike sorting.

    PubMed

    Paraskevopoulou, Sivylla E; Wu, Di; Eftekhar, Amir; Constandinou, Timothy G

    2014-09-30

    This work presents a novel unsupervised algorithm for real-time adaptive clustering of neural spike data (spike sorting). The proposed Hierarchical Adaptive Means (HAM) clustering method combines centroid-based clustering with hierarchical cluster connectivity to classify incoming spikes using groups of clusters. We describe how the proposed method adaptively tracks the incoming spike data without requiring any past history, iteration or training, and autonomously determines the number of spike classes. Its performance (classification accuracy) has been tested using multiple datasets (both simulated and recorded), achieving near-identical accuracy compared to k-means (using 10 iterations and provided with the number of spike classes). Its robustness across different feature extraction methods has also been demonstrated, with classification accuracies above 80% across multiple datasets. Finally, and crucially, its low complexity, quantified in terms of both memory and computation requirements, makes this method highly attractive for future hardware implementation. Copyright © 2014 Elsevier B.V. All rights reserved.
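
The core idea, centroid-based clustering that adapts online without history or training, can be sketched as a one-pass rule: a spike joins the nearest centroid if it is within a distance threshold (updating the centroid as a running mean), otherwise it seeds a new cluster. This is a simplified sketch under that assumption, not the published HAM algorithm with its hierarchical cluster connectivity.

```python
def adaptive_cluster(spikes, threshold):
    """One-pass adaptive clustering sketch (in the spirit of HAM).

    Each spike (a feature vector) joins the nearest centroid if it is
    within `threshold` (Euclidean), updating that centroid as a running
    mean; otherwise it seeds a new cluster.  The number of clusters
    emerges from the data.
    """
    centroids, counts, labels = [], [], []
    for s in spikes:
        best, best_d = -1, float("inf")
        for idx, c in enumerate(centroids):
            d = sum((a - b) ** 2 for a, b in zip(s, c)) ** 0.5
            if d < best_d:
                best, best_d = idx, d
        if best >= 0 and best_d <= threshold:
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (a - c) / n for a, c in zip(s, centroids[best])]
            labels.append(best)
        else:
            centroids.append(list(s))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids
```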

  6. An interactive environment for agile analysis and visualization of ChIP-sequencing data.

    PubMed

    Lerdrup, Mads; Johansen, Jens Vilstrup; Agrawal-Singh, Shuchi; Hansen, Klaus

    2016-04-01

    To empower experimentalists with a means for fast and comprehensive chromatin immunoprecipitation sequencing (ChIP-seq) data analyses, we introduce an integrated computational environment, EaSeq. The software combines the exploratory power of genome browsers with an extensive set of interactive and user-friendly tools for genome-wide abstraction and visualization. It enables experimentalists to easily extract information and generate hypotheses from their own data and public genome-wide datasets. For demonstration purposes, we performed meta-analyses of public Polycomb ChIP-seq data and established a new screening approach to analyze more than 900 datasets from mouse embryonic stem cells for factors potentially associated with Polycomb recruitment. EaSeq, which is freely available and works on a standard personal computer, can substantially increase the throughput of many analysis workflows, facilitate transparency and reproducibility by automatically documenting and organizing analyses, and enable a broader group of scientists to gain insights from ChIP-seq data.

  7. Best Practices for International Collaboration and Applications of Interoperability within a NASA Data Center

    NASA Astrophysics Data System (ADS)

    Moroni, D. F.; Armstrong, E. M.; Tauer, E.; Hausman, J.; Huang, T.; Thompson, C. K.; Chung, N.

    2013-12-01

    The Physical Oceanographic Distributed Active Archive Center (PO.DAAC) is one of 12 data centers sponsored by NASA's Earth Science Data and Information System (ESDIS) project. The PO.DAAC is tasked with the archival and distribution of data from NASA Earth science missions specific to physical oceanography, many of which have interdisciplinary applications for weather forecasting/monitoring, ocean biology, ocean modeling, and climate studies. PO.DAAC has a 20-year history of cross-project and international collaborations with partners in Europe, Japan, Australia, and the UK. Domestically, the PO.DAAC has successfully established lasting partnerships with non-NASA institutions and projects including the National Oceanic and Atmospheric Administration (NOAA), United States Navy, Remote Sensing Systems, and Unidata. A key component of these partnerships is PO.DAAC's direct involvement with international working groups and science teams, such as the Group for High Resolution Sea Surface Temperature (GHRSST), International Ocean Vector Winds Science Team (IOVWST), Ocean Surface Topography Science Team (OSTST), and the Committee on Earth Observing Satellites (CEOS). To help bolster new and existing collaborations, the PO.DAAC has established a standardized approach to its internal Data Management and Archiving System (DMAS), utilizing a Data Dictionary to provide the baseline standard for entry and capture of dataset and granule metadata. Furthermore, the PO.DAAC has established an end-to-end Dataset Lifecycle Policy, built upon both internal and external recommendations of best practices toward data stewardship. Together, DMAS, the Data Dictionary, and the Dataset Lifecycle Policy provide the infrastructure to enable standardized data and metadata to be fully ingested and harvested to facilitate interoperability and compatibility across data access protocols, tools, and services.
The Dataset Lifecycle Policy provides the checks and balances to help ensure all incoming HDF and netCDF-based datasets meet minimum compliance requirements with the Lawrence Livermore National Laboratory's actively maintained Climate and Forecast (CF) conventions with additional goals toward metadata standards provided by the Attribute Convention for Dataset Discovery (ACDD), the International Organization for Standardization (ISO) 19100-series, and the Federal Geographic Data Committee (FGDC). By default, DMAS ensures all datasets are compliant with NASA's Global Change Master Directory (GCMD) and NASA's Reverb data discovery clearinghouse (also known as ECHO). For data access, PO.DAAC offers several widely-used technologies, including File Transfer Protocol (FTP), Open-source Project for a Network Data Access Protocol (OPeNDAP), and Thematic Realtime Environmental Distributed Data Services (THREDDS). These access technologies are available directly to users or through PO.DAAC's web interfaces, specifically the High-level Tool for Interactive Data Extraction (HiTIDE), Live Access Server (LAS), and PO.DAAC's set of search, image, and Consolidated Web Services (CWS). Lastly, PO.DAAC's newly introduced, standards-based CWS provide singular endpoints for search, imaging, and extraction capabilities, respectively, across L2/L3/L4 datasets. Altogether, these tools, services and policies serve to provide flexible, interoperable functionality for both users and data providers.
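
A minimum-compliance check of the kind described can be sketched as a lookup over required global attributes. The attribute list below is an illustrative subset inspired by the ACDD/CF conventions mentioned above, not PO.DAAC's actual checklist.

```python
# Illustrative subset of ACDD-style required global attributes;
# not PO.DAAC's actual compliance checklist.
REQUIRED_GLOBAL_ATTRS = (
    "title", "summary", "keywords", "Conventions",
    "time_coverage_start", "time_coverage_end",
)

def check_metadata(global_attrs):
    """Return the list of missing or empty required global attributes."""
    return [name for name in REQUIRED_GLOBAL_ATTRS
            if not str(global_attrs.get(name, "")).strip()]
```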

  8. Handling limited datasets with neural networks in medical applications: A small-data approach.

    PubMed

    Shaikhina, Torgyn; Khovanova, Natalia A

    2017-01-01

    Single-centre studies in the medical domain are often characterised by limited samples due to the complexity and high costs of patient data collection. Machine learning methods for regression modelling of small datasets (fewer than 10 observations per predictor variable) remain scarce. Our work bridges this gap by developing a novel framework for the application of artificial neural networks (NNs) to regression tasks involving small medical datasets. In order to address the sporadic fluctuations and validation issues that appear in regression NNs trained on small datasets, the methods of multiple runs and surrogate data analysis were proposed in this work. The approach was compared to the state-of-the-art ensemble NNs; the effect of dataset size on NN performance was also investigated. The proposed framework was applied to the prediction of compressive strength (CS) of femoral trabecular bone in patients suffering from severe osteoarthritis. The NN model was able to estimate the CS of osteoarthritic trabecular bone from its structural and biological properties with a standard error of 0.85 MPa. When evaluated on independent test samples, the NN achieved accuracy of 98.3%, outperforming an ensemble NN model by 11%. We reproduce this result on CS data of another porous solid (concrete) and demonstrate that the proposed framework allows an NN modelled with as few as 56 samples to generalise to 300 independent test samples with 86.5% accuracy, which is comparable to the performance of an NN developed with an 18-times-larger dataset (1030 samples). The significance of this work is two-fold: the practical application allows for non-destructive prediction of bone fracture risk, while the novel methodology extends beyond the task considered in this study and provides a general framework for the application of regression NNs to medical problems characterised by limited dataset sizes. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.
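
The "multiple runs" and surrogate-data ideas can be sketched with a plain linear least-squares model standing in for the paper's neural networks: average held-out R² over repeated random splits, and compare against the same procedure run on permuted (surrogate) targets, which should carry no signal. All names and data here are illustrative.

```python
import numpy as np

def multiple_runs_r2(X, y, n_runs=20, seed=0):
    """Mean held-out R^2 over repeated random 70/30 splits (the
    'multiple runs' idea); least squares stands in for the NNs."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        n_tr = int(0.7 * len(y))
        tr, te = idx[:n_tr], idx[n_tr:]
        coef, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        pred = X[te] @ coef
        ss_res = np.sum((y[te] - pred) ** 2)
        ss_tot = np.sum((y[te] - y[te].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=(80, 3)), np.ones(80)])  # incl. intercept
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=80)
real = multiple_runs_r2(X, y)
surrogate = multiple_runs_r2(X, rng.permutation(y))  # surrogate targets
```

A model that only beats its surrogate baseline marginally should be treated as overfitting the small sample.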

  9. A curated compendium of monocyte transcriptome datasets of relevance to human monocyte immunobiology research

    PubMed Central

    Rinchai, Darawan; Boughorbel, Sabri; Presnell, Scott; Quinn, Charlie; Chaussabel, Damien

    2016-01-01

    Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp. PMID:27158452

  10. Security Vulnerability Profiles of Mission Critical Software: Empirical Analysis of Security Related Bug Reports

    NASA Technical Reports Server (NTRS)

    Goseva-Popstojanova, Katerina; Tyo, Jacob

    2017-01-01

    While some prior research work exists on the characteristics of software faults (i.e., bugs) and failures, very little work has been published on the analysis of software application vulnerabilities. This paper aims to contribute towards filling that gap by presenting an empirical investigation of application vulnerabilities. The results are based on data extracted from issue tracking systems of two NASA missions. These data were organized in three datasets: Ground mission IVV issues, Flight mission IVV issues, and Flight mission Developers issues. In each dataset, we identified security related software bugs and classified them in specific vulnerability classes. Then, we created the security vulnerability profiles, i.e., determined where and when the security vulnerabilities were introduced and which vulnerability classes dominated. Our main findings include: (1) In the IVV issues datasets the majority of vulnerabilities were code related and were introduced in the Implementation phase. (2) For all datasets, around 90% of the vulnerabilities were located in two to four subsystems. (3) Out of 21 primary classes, five dominated: Exception Management, Memory Access, Other, Risky Values, and Unused Entities. Together, they contributed 80% to 90% of the vulnerabilities in each dataset.
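
Building such a vulnerability profile reduces to tallying classified issues and reporting class percentages in descending order; a minimal sketch over hypothetical classified bug reports:

```python
from collections import Counter

def vulnerability_profile(issues):
    """Percentage of security issues per vulnerability class, descending.

    `issues` is a list of (class_name, phase) tuples, standing in for
    the classified bug reports described above (toy data).
    """
    counts = Counter(cls for cls, _ in issues)
    total = sum(counts.values())
    return [(cls, 100.0 * n / total) for cls, n in counts.most_common()]

issues = ([("Exception Management", "Implementation")] * 4 +
          [("Memory Access", "Implementation")] * 3 +
          [("Risky Values", "Design")] * 2 +
          [("Unused Entities", "Implementation")])
profile = vulnerability_profile(issues)
```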

  11. Development and validation of a melanoma risk score based on pooled data from 16 case-control studies

    PubMed Central

    Davies, John R; Chang, Yu-mei; Bishop, D Timothy; Armstrong, Bruce K; Bataille, Veronique; Bergman, Wilma; Berwick, Marianne; Bracci, Paige M; Elwood, J Mark; Ernstoff, Marc S; Green, Adele; Gruis, Nelleke A; Holly, Elizabeth A; Ingvar, Christian; Kanetsky, Peter A; Karagas, Margaret R; Lee, Tim K; Le Marchand, Loïc; Mackie, Rona M; Olsson, Håkan; Østerlind, Anne; Rebbeck, Timothy R; Reich, Kristian; Sasieni, Peter; Siskind, Victor; Swerdlow, Anthony J; Titus, Linda; Zens, Michael S; Ziegler, Andreas; Gallagher, Richard P.; Barrett, Jennifer H; Newton-Bishop, Julia

    2015-01-01

    Background We report the development of a cutaneous melanoma risk algorithm based upon seven factors (hair colour, skin type, family history, freckling, nevus count, number of large nevi and history of sunburn), intended to form the basis of a self-assessment webtool for the general public. Methods Predicted odds of melanoma were estimated by analysing a pooled dataset from 16 case-control studies using logistic random coefficients models. Risk categories were defined based on the distribution of the predicted odds in the controls from these studies. Imputation was used to estimate missing data in the pooled datasets. The 30th, 60th and 90th centiles were used to distribute individuals into four risk groups for their age, sex and geographic location. Cross-validation was used to test the robustness of the thresholds for each group by leaving out each study one by one. Performance of the model was assessed in an independent UK case-control study dataset. Results Cross-validation confirmed the robustness of the threshold estimates. Cases and controls were well discriminated in the independent dataset (area under the curve 0.75, 95% CI 0.73-0.78). 29% of cases were in the highest risk group compared with 7% of controls, and 43% of controls were in the lowest risk group compared with 13% of cases. Conclusion We have identified a composite score representing an estimate of relative risk and successfully validated this score in an independent dataset. Impact This score may be a useful tool to inform members of the public about their melanoma risk. PMID:25713022
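
The grouping step can be sketched directly: thresholds at the 30th, 60th and 90th centiles of the controls' predicted odds define four risk groups, and individuals are binned against those cuts. The lognormal toy odds below are illustrative, not the study's fitted model.

```python
import numpy as np

def risk_groups(control_odds, odds, centiles=(30, 60, 90)):
    """Assign risk groups 0-3 using centile thresholds from controls.

    Thresholds at the 30th/60th/90th centiles of the controls'
    predicted odds define four groups, as in the score above.
    """
    cuts = np.percentile(control_odds, centiles)
    return np.digitize(odds, cuts)

rng = np.random.default_rng(2)
controls = rng.lognormal(mean=0.0, sigma=0.5, size=1000)  # toy predicted odds
groups = risk_groups(controls, controls)   # controls binned by their own cuts
share_top = float(np.mean(groups == 3))
```

By construction, roughly 10% of controls land in the top group and roughly 30% in the bottom group.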

  12. One tree to link them all: a phylogenetic dataset for the European tetrapoda.

    PubMed

    Roquet, Cristina; Lavergne, Sébastien; Thuiller, Wilfried

    2014-08-08

    With the ever-increasing availability of phylogenetically informative data, the last decade has seen an upsurge of ecological studies incorporating information on evolutionary relationships among species. However, detailed species-level phylogenies are still lacking for many large groups and regions, which are necessary for comprehensive large-scale eco-phylogenetic analyses. Here, we provide a dataset of 100 dated phylogenetic trees for all European tetrapods based on a mixture of supermatrix and supertree approaches. Phylogenetic inference was performed separately for each of the main Tetrapoda groups of Europe except mammals (i.e. amphibians, birds, squamates and turtles) by means of maximum likelihood (ML) analyses of supermatrices, applying a tree constraint at the family (amphibians and squamates) or order (birds and turtles) levels based on consensus knowledge. For each group, we inferred 100 ML trees to be able to provide a phylogenetic dataset that accounts for phylogenetic uncertainty, and assessed node support with bootstrap analyses. Each tree was dated using penalized likelihood and fossil calibration. The trees obtained were well supported by existing knowledge and previous phylogenetic studies. For mammals, we modified the most complete supertree dataset available in the literature to include a recent update of the Carnivora clade. As a final step, we merged the phylogenetic trees of all groups to obtain a set of 100 phylogenetic trees for all European Tetrapoda species for which data were available (91%). We provide this phylogenetic dataset (100 chronograms) for the purpose of comparative analyses, macro-ecological or community ecology studies aiming to incorporate phylogenetic information while accounting for phylogenetic uncertainty.

  13. Improving stability of prediction models based on correlated omics data by using network approaches.

    PubMed

    Tissier, Renaud; Houwing-Duistermaat, Jeanine; Rodríguez-Girondo, Mar

    2018-01-01

    Building prediction models based on complex omics datasets such as transcriptomics, proteomics and metabolomics remains a challenge in bioinformatics and biostatistics. Regularized regression techniques are typically used to deal with the high dimensionality of these datasets. However, due to the presence of correlation in the datasets, it is difficult to select the best model, and application of these methods yields unstable results. We propose a novel strategy for model selection where the obtained models also perform well in terms of overall predictability. Several three-step approaches are considered, where the steps are 1) network construction, 2) clustering to empirically derive modules or pathways, and 3) building a prediction model incorporating the information on the modules. For the first step, we use weighted correlation networks and Gaussian graphical modelling. Identification of groups of features is performed by hierarchical clustering. The grouping information is included in the prediction model by using group-based variable selection or group-specific penalization. We compare the performance of our new approaches with standard regularized regression via simulations. Based on these results we provide recommendations for selecting a strategy for building a prediction model given the specific goal of the analysis and the sizes of the datasets. Finally we illustrate the advantages of our approach by application of the methodology to two problems, namely prediction of body mass index in the DIetary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome study (DILGOM) and prediction of the response of each breast cancer cell line to treatment with specific drugs using a breast cancer cell lines pharmacogenomics dataset.
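
Steps 1 and 2 of the pipeline (network construction and empirical module derivation) can be sketched crudely with a thresholded correlation network whose connected components act as modules, each summarized by its mean for use in step 3. The threshold and data are illustrative; the paper itself uses weighted correlation networks, Gaussian graphical models and hierarchical clustering.

```python
import numpy as np

def correlation_modules(X, threshold=0.7):
    """Thresholded correlation network whose connected components serve
    as empirically derived modules (a drastic simplification of
    steps 1-2 above)."""
    corr = np.corrcoef(X, rowvar=False)
    p = corr.shape[0]
    adj = np.abs(corr) > threshold
    labels = list(range(p))
    changed = True
    while changed:                      # min-label propagation over edges
        changed = False
        for i in range(p):
            for j in range(p):
                if adj[i, j] and labels[j] < labels[i]:
                    labels[i] = labels[j]
                    changed = True
    uniq = {l: k for k, l in enumerate(sorted(set(labels)))}
    return [uniq[l] for l in labels]

def module_scores(X, labels):
    """Step 3 input: summarize each module by the mean of its features."""
    n_mod = max(labels) + 1
    return np.column_stack(
        [X[:, [j for j, l in enumerate(labels) if l == m]].mean(axis=1)
         for m in range(n_mod)])

rng = np.random.default_rng(4)
base1, base2 = rng.normal(size=(2, 300))
noise = 0.1 * rng.normal(size=(4, 300))
X = np.column_stack([base1 + noise[0], base1 + noise[1],
                     base2 + noise[2], base2 + noise[3]])
labels = correlation_modules(X)
scores = module_scores(X, labels)
```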

  14. A Novel Feature-Map Based ICA Model for Identifying the Individual, Intra/Inter-Group Brain Networks across Multiple fMRI Datasets.

    PubMed

    Wang, Nizhuan; Chang, Chunqi; Zeng, Weiming; Shi, Yuhu; Yan, Hongjie

    2017-01-01

    Independent component analysis (ICA) has been widely used in functional magnetic resonance imaging (fMRI) data analysis to evaluate functional connectivity of the brain; however, there are still limitations when ICA simultaneously handles neuroimaging datasets with diverse acquisition parameters, e.g., different repetition times, different scanners, etc. It is therefore difficult for the traditional ICA framework to effectively handle ever-larger neuroimaging datasets. In this research, a novel feature-map based ICA framework (FMICA) was proposed to address the aforementioned deficiencies, which aimed at exploring brain functional networks (BFNs) at different scales, e.g., the first level (individual subject level), second level (intragroup level of subjects within a certain dataset) and third level (intergroup level of subjects across different datasets), based only on the feature maps extracted from the fMRI datasets. The FMICA was presented as a hierarchical framework, which effectively made ICA and constrained ICA as a whole to identify the BFNs from the feature maps. The simulated and real experimental results demonstrated that FMICA is able to identify the intergroup BFNs and to characterize subject-specific and group-specific differences in BFNs from the independent component feature maps, which sharply reduce the size of the fMRI datasets. Compared with traditional ICA, FMICA is a more generalized framework that can efficiently and simultaneously identify the variant BFNs at the subject-specific, intragroup, intragroup-specific and intergroup levels, implying that FMICA is able to handle big neuroimaging datasets in neuroscience research.
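
As a generic stand-in for the ICA machinery underlying such frameworks, the sketch below whitens the observations and runs a minimal symmetric FastICA with a tanh nonlinearity. This illustrates plain ICA only, not FMICA's feature-map hierarchy or constrained ICA; the mixing matrix and sources are toy data.

```python
import numpy as np

def whiten(X):
    """Centre and whiten observations (rows = samples)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    d, E = np.linalg.eigh(cov)
    return Xc @ E @ np.diag(d ** -0.5) @ E.T

def fastica(X, n_iter=200, seed=0):
    """Minimal symmetric FastICA with a tanh nonlinearity -- a generic
    stand-in for the (constrained, hierarchical) ICA used by FMICA."""
    Z = whiten(X)
    n = Z.shape[1]
    rng = np.random.default_rng(seed)
    W = np.linalg.qr(rng.normal(size=(n, n)))[0]
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)
        W_new = (G.T @ Z) / len(Z) - np.diag((1 - G ** 2).mean(axis=0)) @ W
        U, _, Vt = np.linalg.svd(W_new)   # symmetric decorrelation
        W = U @ Vt
    return Z @ W.T

rng = np.random.default_rng(3)
S = np.column_stack([rng.uniform(-1, 1, 2000), rng.laplace(size=2000)])
X = S @ np.array([[1.0, 0.4], [0.3, 1.0]])   # mixed observations
components = fastica(X)
```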

  15. Blood is Thicker Than Water: Geography and the Dispersal of Family Characteristics Across the Globe.

    PubMed

    Carmichael, Sarah; Rijpma, Auke

    2017-04-01

    This article introduces a new dataset of historical family characteristics based on ethnographic literature. The novelty of the dataset lies in the fact that it is constructed at the level of the ethnic group. To test the possibilities of the dataset, we construct a measure of family constraints on women's agency from it and explore its correlation to a number of geographical factors.

  16. The impact of a large-scale quality improvement programme on work engagement: preliminary results from a national cross-sectional-survey of the 'Productive Ward'.

    PubMed

    White, Mark; Wells, John S G; Butterworth, Tony

    2014-12-01

    Quality improvement (QI) Programmes, like the Productive Ward: Releasing-time-to-care initiative, aim to 'engage' and 'empower' ward teams to actively participate, innovate and lead quality improvement at the front line. However, little is known about the relationship and impact that QI work has on the 'engagement' of the clinical teams who participate and vice-versa. This paper explores and examines the impact of a large-scale QI programme, the Productive Ward, on the 'work engagement' of the nurses and ward teams involved. Using the Utrecht Work Engagement Scale (UWES), we surveyed, measured and analysed work engagement in a representative test group of hospital-based ward teams who had recently commenced the latest phase of the national 'Productive Ward' initiative in Ireland and compared them to a control group of similar size and matched (as far as is possible) on variables such as ward size, employment grade and clinical specialty area. 338 individual datasets were recorded, n=180 (53.6%) from the Productive Ward group, and n=158 (46.4%) from the control group; the overall response rate was 67%, and did not differ significantly between the Productive Ward and control groups. The work engagement mean score (±standard deviation) in the Productive group was 4.33(±0.88), and 4.07(±1.06) in the control group, representing a modest but statistically significant between-group difference (p=0.013, independent samples t-test). Similarly modest differences were observed in all three dimensions of the work engagement construct. Employment grade and the clinical specialty area were also significantly related to the work engagement score (p<0.001, general linear model) and (for the most part), to its components, with both clerical and nurse manager grades, and the elderly specialist areas, exhibiting substantially higher scores. 
The findings demonstrate how QI activities, like those integral to the Productive Ward programme, appear to positively impact the work engagement (the vigour, absorption and dedication) of ward-based teams. The use and suitability of the UWES as an appropriate measure of 'engagement' in QI interventions was confirmed. The engagement of nurses and front-line clinical teams is a major component of creating, developing and sustaining a culture of improvement. Copyright © 2014 The Authors. Published by Elsevier Ltd. All rights reserved.
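
    The between-group comparison reported above relies on an independent samples t-test over mean UWES scores. As a minimal sketch (pure standard library; the arrays in any usage are synthetic stand-ins, not the study data), the equal-variances t statistic can be computed as:

```python
import math

def independent_t(a, b):
    """Student's independent-samples t statistic (equal-variances form).

    Returns the t statistic and the degrees of freedom; the p-value
    additionally requires the CDF of the t distribution (e.g. SciPy).
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance, group a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance, group b
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2
```

    For group sizes like those reported (180 and 158, so several hundred degrees of freedom), a |t| above roughly 1.97 corresponds to a two-tailed p below 0.05.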

  17. Generating Southern Africa Precipitation Forecast Using the FEWS Engine, a New Application for the Google Earth Engine

    NASA Astrophysics Data System (ADS)

    Landsfeld, M. F.; Hegewisch, K.; Daudert, B.; Morton, C.; Husak, G. J.; Friedrichs, M.; Funk, C. C.; Huntington, J. L.; Abatzoglou, J. T.; Verdin, J. P.

    2016-12-01

    The Famine Early Warning Systems Network (FEWS NET) focuses on food insecurity in developing nations and provides objective, evidence-based analysis to help government decision-makers and relief agencies plan for and respond to humanitarian emergencies. The network of FEWS NET analysts and scientists require flexible, interactive tools to aid in their monitoring and research efforts. Because they often work in bandwidth-limited regions, lightweight Internet tools and services that bypass the need for downloading massive datasets are preferred for their work. To support food security analysis FEWS NET developed a custom interface for the Google Earth Engine (GEE). GEE is a platform developed by Google to support scientific analysis of environmental data in their cloud computing environment. This platform allows scientists and independent researchers to mine massive collections of environmental data, leveraging Google's vast computational resources for purposes of detecting changes and monitoring the Earth's surface and climate. GEE hosts an enormous amount of satellite imagery and climate archives, one of which is the Climate Hazards Group Infrared Precipitation with Stations dataset (CHIRPS). CHIRPS precipitation dataset is a key input for FEWS NET monitoring and forecasting efforts. In this talk we introduce the FEWS Engine interface. We present an application that highlights the utility of FEWS Engine for forecasting the upcoming seasonal precipitation of southern Africa. Specifically, the current state of ENSO is assessed and used to identify similar historical seasons. The FEWS Engine compositing tool is used to examine rainfall and other environmental data for these analog seasons. The application illustrates the unique benefits of using FEWS Engine for on-the-fly food security scenario development.

  18. Velocity Structure of the Iran Region Using Seismic and Gravity Observations

    NASA Astrophysics Data System (ADS)

    Syracuse, E. M.; Maceira, M.; Phillips, W. S.; Begnaud, M. L.; Nippress, S. E. J.; Bergman, E.; Zhang, H.

    2015-12-01

    We present a 3D Vp and Vs model of Iran generated using a joint inversion of body wave travel times, Rayleigh wave dispersion curves, and high-wavenumber filtered Bouguer gravity observations. Our work has two main goals: 1) To better understand the tectonics of a prominent example of continental collision, and 2) To assess the improvements in earthquake location possible as a result of joint inversion. The body wave dataset is mainly derived from previous work on location calibration and includes the first-arrival P and S phases of 2500 earthquakes whose initial locations qualify as GT25 or better. The surface wave dataset consists of Rayleigh wave group velocity measurements for regional earthquakes, which are inverted for a suite of period-dependent Rayleigh wave velocity maps prior to inclusion in the joint inversion for body wave velocities. We use gravity anomalies derived from the global gravity model EGM2008. To avoid mapping broad, possibly dynamic features in the gravity field into variations in density and body wave velocity, we apply a high-pass wavenumber filter to the gravity measurements. We use a simple, approximate relationship between density and velocity so that the three datasets may be combined in a single inversion. The final optimized 3D Vp and Vs model allows us to explore how multi-parameter tomography addresses crustal heterogeneities in areas of limited coverage and improves travel time predictions. We compare earthquake locations from our models to independent locations obtained from InSAR analysis to assess the improvement in locations derived in a joint-inversion model in comparison to those derived in a more traditional body-wave-only velocity model.

  19. Evaluating re-identification risks with respect to the HIPAA privacy rule

    PubMed Central

    Benitez, Kathleen

    2010-01-01

    Objective Many healthcare organizations follow data protection policies that specify which patient identifiers must be suppressed to share “de-identified” records. Such policies, however, are often applied without knowledge of the risk of “re-identification”. The goals of this work are: (1) to estimate re-identification risk for data sharing policies of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule; and (2) to evaluate the risk of a specific re-identification attack using voter registration lists. Measurements We define several risk metrics: (1) expected number of re-identifications; (2) estimated proportion of a population in a group of size g or less, and (3) monetary cost per re-identification. For each US state, we estimate the risk posed to hypothetical datasets, protected by the HIPAA Safe Harbor and Limited Dataset policies by an attacker with full knowledge of patient identifiers and with limited knowledge in the form of voter registries. Results The percentage of a state's population estimated to be vulnerable to unique re-identification (ie, g=1) when protected via Safe Harbor and Limited Datasets ranges from 0.01% to 0.25% and 10% to 60%, respectively. In the voter attack, this number drops for many states, and for some states is 0%, due to the variable availability of voter registries in the real world. We also find that re-identification cost ranges from $0 to $17 000, further confirming risk variability. Conclusions This work illustrates that blanket protection policies, such as Safe Harbor, leave different organizations vulnerable to re-identification at different rates. It provides justification for locally performed re-identification risk estimates prior to sharing data. PMID:20190059
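
    Two of the metrics above follow directly from equivalence-class sizes over quasi-identifiers. The following is a hypothetical sketch, not the authors' implementation: if an attacker links each record uniformly at random within its matching class of size f, each record is re-identified with probability 1/f, so each class contributes one expected re-identification.

```python
from collections import Counter

def risk_metrics(quasi_ids, g=1):
    """quasi_ids: one hashable quasi-identifier tuple per record.

    Returns (expected number of re-identifications under random linking,
    proportion of records falling in classes of size <= g).
    """
    sizes = Counter(quasi_ids)            # equivalence-class sizes
    expected = float(len(sizes))          # f * (1/f) = 1 per class
    vulnerable = sum(f for f in sizes.values() if f <= g) / len(quasi_ids)
    return expected, vulnerable
```

    With g=1 the second metric is the proportion of uniquely re-identifiable records, the quantity tabulated per state in the abstract.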

  20. Lymphoma diagnosis in histopathology using a multi-stage visual learning approach

    NASA Astrophysics Data System (ADS)

    Codella, Noel; Moradi, Mehdi; Matasar, Matt; Syeda-Mahmood, Tanveer; Smith, John R.

    2016-03-01

    This work evaluates the performance of a multi-stage image enhancement, segmentation, and classification approach for lymphoma recognition in hematoxylin and eosin (H and E) stained histopathology slides of excised human lymph node tissue. In the first stage, the original histology slide undergoes various image enhancement and segmentation operations, creating an additional 5 images for every slide. These new images emphasize unique aspects of the original slide, including dominant staining, staining segmentations, non-cellular groupings, and cellular groupings. For the resulting 6 total images, a collection of visual features is extracted from 3 different spatial configurations. Visual features include the first fully connected layer (4096 dimensions) of the Caffe convolutional neural network trained from ImageNet data. In total, over 200 resultant visual descriptors are extracted for each slide. Non-linear SVMs are trained over each of the over 200 descriptors, which are then input to a forward stepwise ensemble selection that optimizes a late fusion sum of logistically normalized model outputs using local hill climbing. The approach is evaluated on a public NIH dataset containing 374 images representing 3 lymphoma conditions: chronic lymphocytic leukemia (CLL), follicular lymphoma (FL), and mantle cell lymphoma (MCL). Results demonstrate a 38.4% reduction in residual error over the current state of the art on this dataset.
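
    The forward stepwise ensemble selection described above greedily adds the model whose inclusion most improves the late-fusion score, and stops when no addition helps. A rough sketch under assumptions (the `scores` names and the accuracy objective are hypothetical; the paper fuses logistically normalized SVM outputs):

```python
def accuracy(fused, labels):
    """Fraction of binary labels matched by thresholding fused scores at 0.5."""
    return sum((f > 0.5) == bool(y) for f, y in zip(fused, labels)) / len(labels)

def stepwise_ensemble(scores, labels):
    """Greedy forward selection over a late-fusion average of model outputs.

    scores: {model name: per-sample probability list}; labels: 0/1 per sample.
    Returns the chosen model names and the fused accuracy they achieve.
    """
    chosen, best = [], 0.0
    while True:
        gains = []
        for name in scores:
            if name in chosen:
                continue
            sel = chosen + [name]
            fused = [sum(scores[m][i] for m in sel) / len(sel)
                     for i in range(len(labels))]
            gains.append((accuracy(fused, labels), name))
        if not gains:
            break                      # every model already selected
        acc, name = max(gains)
        if acc <= best:
            break                      # local hill climbing: no improvement
        best, chosen = acc, chosen + [name]
    return chosen, best
```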

  1. Analysis of genetic population structure in Acacia caven (Leguminosae, Mimosoideae), comparing one exploratory and two Bayesian-model-based methods.

    PubMed

    Pometti, Carolina L; Bessega, Cecilia F; Saidman, Beatriz O; Vilardi, Juan C

    2014-03-01

    Bayesian clustering as implemented in STRUCTURE or GENELAND software is widely used to form genetic groups of populations or individuals. On the other hand, in order to satisfy the need for less computer-intensive approaches, multivariate analyses are specifically devoted to extracting information from large datasets. In this paper, we report the use of a dataset of AFLP markers belonging to 15 sampling sites of Acacia caven for studying the genetic structure and comparing the consistency of three methods: STRUCTURE, GENELAND and DAPC. Of these methods, DAPC was the fastest and accurately inferred the number of populations K (K = 12 using the find.clusters option and K = 15 with a priori information of populations). GENELAND, in turn, provides information on the area of membership probabilities for individuals or populations in space, when coordinates are specified (K = 12). STRUCTURE also inferred the number of populations K and the membership probabilities of individuals based on ancestry, presenting the result K = 11 without prior information of populations and K = 15 using the LOCPRIOR option. Finally, in this work all three methods showed high consistency in estimating the population structure, inferring similar numbers of populations and similar membership probabilities of individuals to each group, with a high correlation between each other.

  2. Passive Containment DataSet

    EPA Pesticide Factsheets

    This data is for Figures 6 and 7 in the journal article. The data also includes the two EPANET input files used for the analysis described in the paper, one for the looped system and one for the block system. This dataset is associated with the following publication: Grayman, W., R. Murray, and D. Savic. Redesign of Water Distribution Systems for Passive Containment of Contamination. JOURNAL OF THE AMERICAN WATER WORKS ASSOCIATION. American Water Works Association, Denver, CO, USA, 108(7): 381-391, (2016).

  3. Modeling and Databases for Teaching Petrology

    NASA Astrophysics Data System (ADS)

    Asher, P.; Dutrow, B.

    2003-12-01

    With the widespread availability of high-speed computers with massive storage and ready transport capability of large amounts of data, computational and petrologic modeling and the use of databases provide new tools with which to teach petrology. Modeling can be used to gain insights into a system, predict system behavior, describe a system's processes, compare with a natural system or simply to be illustrative. These aspects result from data-driven or empirical, analytical or numerical models, or the concurrent examination of multiple lines of evidence. At the same time, use of models can enhance core foundations of the geosciences by improving critical thinking skills and by reinforcing prior knowledge gained. However, the use of modeling to teach petrology is dictated by the level of expectation we have for students and their facility with modeling approaches. For example, do we expect students to push buttons and navigate a program, to understand the conceptual model, and/or to evaluate the results of a model? Whatever the desired level of sophistication, specific elements of design should be incorporated into a modeling exercise for effective teaching. These include, but are not limited to: use of the scientific method, use of prior knowledge, a clear statement of purpose and goals, attainable goals, a connection to the natural/actual system, a demonstration that complex heterogeneous natural systems are amenable to analysis by these techniques and, ideally, connections to other disciplines and the larger earth system. Databases offer another avenue with which to explore petrology. Large datasets are available that allow integration of multiple lines of evidence to attack a petrologic problem or understand a petrologic process. These are collected into a database that offers a tool for exploring, organizing and analyzing the data. For example, datasets may be geochemical, mineralogic, experimental and/or visual in nature, covering global, regional to local scales. These datasets provide students with access to large amounts of related data through space and time. Goals of the database working group include educating earth scientists about information systems in general, about the importance of metadata, about ways of using databases and datasets as educational tools, and about the availability of existing datasets and databases. The modeling and databases groups hope to create additional petrologic teaching tools using these aspects and invite the community to contribute to the effort.

  4. Squish: Near-Optimal Compression for Archival of Relational Datasets

    PubMed Central

    Gao, Yihan; Parameswaran, Aditya

    2017-01-01

    Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: Squish achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets. PMID:28180028
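
    The gain from exploiting relational structure can be illustrated with a toy example. This is not Squish's Bayesian-network/arithmetic-coding scheme, and the city-to-state table is invented for illustration; the point is that when one attribute functionally determines another, the dependent column need not be stored per row.

```python
import json
import random
import zlib

random.seed(0)
# hypothetical table with a functional dependency: city -> state
city_state = {"Austin": "TX", "Dallas": "TX", "Miami": "FL"}
rows = [(c, city_state[c]) for c in random.choices(list(city_state), k=5000)]

# naive: serialize and compress the full two-column table
naive = zlib.compress("\n".join(f"{c},{s}" for c, s in rows).encode(), 9)

# structure-aware: store the dependency once, then only the city column
payload = json.dumps(city_state) + "\n" + "\n".join(c for c, _ in rows)
aware = zlib.compress(payload.encode(), 9)

# the state column is fully recoverable from the stored dependency
mapping, _, cities = zlib.decompress(aware).decode().partition("\n")
fd = json.loads(mapping)
rebuilt = [(c, fd[c]) for c in cities.split("\n")]
```

    The structure-aware serialization is lossless yet strictly shorter before compression; Squish generalizes this idea to soft, probabilistic dependencies via Bayesian networks and arithmetic coding.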

  5. BMDExpress Data Viewer: A Visualization Tool to Analyze BMDExpress Datasets

    EPA Science Inventory

    Regulatory agencies increasingly apply benchmark dose (BMD) modeling to determine points of departure in human risk assessments. BMDExpress applies BMD modeling to transcriptomics datasets and groups genes to biological processes and pathways for rapid assessment of doses at whic...

  6. The Gulf of Mexico Coastal Ocean Observing System: A Decade of Data Aggregation and Services.

    NASA Astrophysics Data System (ADS)

    Howard, M.; Gayanilo, F.; Kobara, S.; Baum, S. K.; Currier, R. D.; Stoessel, M. M.

    2016-02-01

    The Gulf of Mexico Coastal Ocean Observing System Regional Association (GCOOS-RA) celebrated its 10-year anniversary in 2015. GCOOS-RA is one of 11 RAs organized under the NOAA-led U.S. Integrated Ocean Observing System (IOOS) Program Office to aggregate regional data and make these data publicly available in preferred forms and formats via standards-based web services. Initial development of GCOOS focused on building elements of the IOOS Data Management and Communications Plan, which is a framework for end-to-end interoperability. These elements included: data discovery, catalog, metadata, online-browse, data access and transport. Initial data types aggregated included near real-time physical oceanographic, marine meteorological and satellite data. Our focus in the middle of the past decade was on the production of basic products such as maps of current oceanographic conditions and quasi-static datasets such as bathymetry and climatologies. In the latter part of the decade we incorporated historical physical oceanographic datasets and historical coastal and offshore water quality data into our holdings and added our first biological dataset. We also developed web environments and products to support Citizen Scientists and stakeholder groups such as recreational boaters. Current efforts are directed towards applying data quality assurance (testing and flagging) to non-federal data, data archiving at national repositories, serving and visualizing numerical model output, providing data services for glider operators, and supporting marine biodiversity observing networks. GCOOS Data Management works closely with the Gulf of Mexico Research Initiative Information and Data Cooperative and various groups involved with Gulf Restoration. GCOOS-RA has influenced attitudes and behaviors associated with good data stewardship and data management practices across the Gulf and will continue to do so into the next decade.

  7. Phylogenetic Molecular Species Delimitations Unravel Potential New Species in the Pest Genus Spodoptera Guenée, 1852 (Lepidoptera, Noctuidae)

    PubMed Central

    Dumas, Pascaline; Barbut, Jérôme; Le Ru, Bruno; Silvain, Jean-François; Clamens, Anne-Laure; d’Alençon, Emmanuelle; Kergoat, Gael J.

    2015-01-01

    Nowadays molecular species delimitation methods promote the identification of species boundaries within complex taxonomic groups by adopting innovative species concepts and theories (e.g. branching patterns, coalescence). As some of them can efficiently deal with large single-locus datasets, they could speed up the process of species discovery compared to more time-consuming molecular methods, and benefit from the existence of large public datasets; these methods can also particularly favour scientific research and actions dealing with threatened or economically important taxa. In this study we aim to investigate and clarify the status of economically important moth species belonging to the genus Spodoptera (Lepidoptera, Noctuidae), a complex group in which previous phylogenetic analyses and integrative approaches already suggested the possible occurrence of cryptic species and taxonomic ambiguities. In this work, the effectiveness of innovative (and faster) species delimitation approaches to infer putative species boundaries has been successfully tested in Spodoptera, by processing the most comprehensive dataset (in terms of number of species and specimens) ever achieved; results are congruent and reliable, irrespective of the set of parameters and phylogenetic models applied. Our analyses confirm the existence of three potential new species clusters (for S. exigua (Hübner, 1808), S. frugiperda (J.E. Smith, 1797) and S. mauritia (Boisduval, 1833)) and support the synonymy of S. marima (Schaus, 1904) with S. ornithogalli (Guenée, 1852). They also highlight the ambiguity of the status of S. cosmiodes (Walker, 1858) and S. descoinsi Lalanne-Cassou & Silvain, 1994. This case study highlights the value of molecular species delimitation methods as tools for species discovery and for emphasizing taxonomic ambiguities. PMID:25853412

  8. Exploring the reproducibility of functional connectivity alterations in Parkinson’s disease

    PubMed Central

    Onu, Mihaela; Wu, Tao; Roceanu, Adina; Bajenaru, Ovidiu

    2017-01-01

    Since anatomic MRI is presently not able to directly discern neuronal loss in Parkinson’s Disease (PD), studying the associated functional connectivity (FC) changes seems a promising approach toward developing non-invasive and non-radioactive neuroimaging markers for this disease. While several groups have reported such FC changes in PD, there are also significant discrepancies between studies. Investigating the reproducibility of PD-related FC changes on independent datasets is therefore of crucial importance. We acquired resting-state fMRI scans for 43 subjects (27 patients and 16 normal controls, with 2 replicate scans per subject) and compared the observed FC changes with those obtained in two independent datasets, one made available by the PPMI consortium (91 patients, 18 controls) and a second one by the group of Tao Wu (20 patients, 20 controls). Unfortunately, PD-related functional connectivity changes turned out to be non-reproducible across datasets. This could be due to disease heterogeneity, but also to technical differences. To distinguish between the two, we devised a method to directly check for disease heterogeneity using random splits of a single dataset. Since we still observe non-reproducibility in a large fraction of random splits of the same dataset, we conclude that functional heterogeneity may be a dominating factor behind the lack of reproducibility of FC alterations in different rs-fMRI studies of PD. While global PD-related functional connectivity changes were non-reproducible across datasets, we identified a few individual brain region pairs with marginally consistent FC changes across all three datasets. However, training classifiers on each one of the three datasets to discriminate PD scans from controls produced only low accuracies on the remaining two test datasets. 
Moreover, classifiers trained and tested on random splits of the same dataset (which are technically homogeneous) also had low test accuracies, directly substantiating disease heterogeneity. PMID:29182621
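
    The random-split check described above can be sketched generically: repeatedly partition a single dataset's subjects in half, compute the patient-control difference in each half, and measure agreement between halves. Low agreement within one technically homogeneous dataset points to disease heterogeneity rather than scanner or protocol effects. Everything below is a hypothetical illustration, not the authors' pipeline; each subject is reduced to one scalar standing in for an FC edge.

```python
import random
from statistics import fmean

def split_half_agreement(patients, controls, n_splits=100, seed=0):
    """Fraction of random half-splits in which the patient-control
    difference has the same sign in both halves."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_splits):
        p, c = patients[:], controls[:]
        rng.shuffle(p)
        rng.shuffle(c)
        d1 = fmean(p[:len(p) // 2]) - fmean(c[:len(c) // 2])
        d2 = fmean(p[len(p) // 2:]) - fmean(c[len(c) // 2:])
        agree += (d1 > 0) == (d2 > 0)
    return agree / n_splits
```

    A homogeneous, reproducible effect yields agreement near 1.0; agreement near 0.5 is what one would expect from the heterogeneity the study reports.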

  9. Standardization Process for Space Radiation Models Used for Space System Design

    NASA Technical Reports Server (NTRS)

    Barth, Janet; Daly, Eamonn; Brautigam, Donald

    2005-01-01

    The space system design community has three concerns related to models of the radiation belts and plasma: 1) AP-8 and AE-8 models are not adequate for modern applications; 2) Data that have become available since the creation of AP-8 and AE-8 are not being fully exploited for modeling purposes; 3) When new models are produced, there is no authorizing organization identified to evaluate the models or their datasets for accuracy and robustness. This viewgraph presentation provided an overview of the roadmap adopted by the Working Group Meeting on New Standard Radiation Belt and Space Plasma Models.

  10. Les Houches 2017: Physics at TeV Colliders Standard Model Working Group Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Andersen, J.R.; et al.

    This Report summarizes the proceedings of the 2017 Les Houches workshop on Physics at TeV Colliders. Session 1 dealt with (I) new developments relevant for high precision Standard Model calculations, (II) theoretical uncertainties and dataset dependence of parton distribution functions, (III) new developments in jet substructure techniques, (IV) issues in the theoretical description of the production of Standard Model Higgs bosons and how to relate experimental measurements, (V) phenomenological studies essential for comparing LHC data from Run II with theoretical predictions and projections for future measurements, and (VI) new developments in Monte Carlo event generators.

  11. EnviroAtlas - Austin, TX - Potential Window Views of Water by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset describes the block group population and the percentage of the block group population that has potential views of water bodies. A potential view of water is defined as having a body of water that is greater than 300m2 within 50m of a residential location. The window views are considered potential because the procedure does not account for presence or directionality of windows in one's home. The residential locations are defined using the EnviroAtlas Dasymetric (2011/October 2015 version) map. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
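
    The 'potential view' criterion above (a water body larger than 300 m2 within 50 m of a residential location) is a simple spatial predicate. A hypothetical point-based sketch, using centroid distance in a projected metre grid as a stand-in for the polygon-distance test the actual EnviroAtlas processing would use:

```python
import math

def has_potential_water_view(home, water_bodies, radius_m=50.0, min_area_m2=300.0):
    """home: (x, y) in metres; water_bodies: (centroid_x, centroid_y, area_m2).

    True if any sufficiently large water body lies within the radius.
    """
    return any(
        area >= min_area_m2
        and math.hypot(x - home[0], y - home[1]) <= radius_m
        for x, y, area in water_bodies
    )
```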

  12. EnviroAtlas - Austin, TX - BenMAP Results by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset demonstrates the effect of changes in pollution concentration on local populations in 750 block groups in Austin, Texas. The US EPA's Environmental Benefits Mapping and Analysis Program (BenMAP) was used to estimate the incidence of adverse health effects (i.e., mortality and morbidity) and associated monetary value that result from changes in pollution concentrations for Travis and Williamson Counties, TX. Incidence and value estimates for the block groups are calculated using i-Tree models (www.itreetools.org), local weather data, pollution data, and U.S. Census derived population data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  13. Speckle reduction process based on digital filtering and wavelet compounding in optical coherence tomography for dermatology

    NASA Astrophysics Data System (ADS)

    Gómez Valverde, Juan J.; Ortuño, Juan E.; Guerra, Pedro; Hermann, Boris; Zabihian, Behrooz; Rubio-Guivernau, José L.; Santos, Andrés.; Drexler, Wolfgang; Ledesma-Carbayo, Maria J.

    2015-07-01

    Optical Coherence Tomography (OCT) has shown great potential as a complementary imaging tool in the diagnosis of skin diseases. Speckle noise is the most prominent artifact present in OCT images and could limit the interpretation and detection capabilities. In this work we propose a new speckle reduction process and compare it with various denoising filters with high edge-preserving potential, using several sets of dermatological OCT B-scans. To validate the performance we used a custom-designed spectral domain OCT and two different data set groups. The first group consisted of five datasets of a single B-scan captured N times (with N<20); the second were five 3D volumes of 25 B-scans. As quality metrics we used signal-to-noise (SNR), contrast-to-noise (CNR) and equivalent number of looks (ENL) ratios. Our results show that a process based on a combination of a 2D enhanced sigma digital filter and a wavelet compounding method achieves the best results in terms of the improvement of the quality metrics. In the first group of individual B-scans we achieved improvements in SNR, CNR and ENL of 16.87 dB, 2.19 and 328 respectively; for the 3D volume datasets the improvements were 15.65 dB, 3.44 and 1148. Our results suggest that the proposed enhancement process may significantly reduce speckle, increasing SNR, CNR and ENL and reducing the number of extra acquisitions of the same frame.
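
    The three quality metrics used above have simple per-region definitions. Conventions vary between papers, so these are common forms rather than necessarily the exact ones used in this study:

```python
import math
from statistics import fmean, pstdev

def enl(region):
    """Equivalent number of looks of a homogeneous region: mu^2 / sigma^2."""
    return fmean(region) ** 2 / pstdev(region) ** 2

def cnr(roi, background):
    """Contrast-to-noise ratio between a feature region and a background region."""
    return abs(fmean(roi) - fmean(background)) / math.sqrt(
        pstdev(roi) ** 2 + pstdev(background) ** 2)

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB: peak signal over noise standard deviation."""
    return 20 * math.log10(max(signal) / pstdev(noise))
```

    Higher ENL on a homogeneous region indicates stronger speckle suppression, which is why a wavelet-compounded result shows the large ENL gains reported above.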

  14. Development of an internationally agreed minimal dataset for juvenile dermatomyositis (JDM) for clinical and research use.

    PubMed

    McCann, Liza J; Kirkham, Jamie J; Wedderburn, Lucy R; Pilkington, Clarissa; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Williamson, Paula R; Beresford, Michael W

    2015-06-12

    Juvenile dermatomyositis (JDM) is a rare autoimmune inflammatory disorder associated with significant morbidity and mortality. International collaboration is necessary to better understand the pathogenesis of the disease, response to treatment and long-term outcome. To aid international collaboration, it is essential to have a core set of data that all researchers and clinicians collect in a standardised way for clinical purposes and for research. This should include demographic details, diagnostic data and measures of disease activity, investigations and treatment. Variables in existing clinical registries have been compared to produce a provisional dataset for JDM. We now aim to develop this into a consensus-approved minimum core dataset, tested in a wider setting, with the objective of achieving international agreement. A two-stage bespoke Delphi process will engage the opinion of a large number of key stakeholders through email distribution via established international paediatric rheumatology and myositis organisations. This, together with a formalised patient/parent participation process, will help inform a consensus meeting of international experts that will utilise a nominal group technique (NGT). The resulting proposed minimal dataset will be tested for feasibility within existing database infrastructures. The developed minimal dataset will be sent to all internationally representative collaborators for final comment. The participants of the expert consensus group will be asked to draw together these comments, ratify and 'sign off' the final minimal dataset. An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients to provide a greater understanding of this disease. 
The final approved minimum core dataset could be rapidly incorporated into national and international collaborative efforts, including existing prospective databases, and be available for use in randomised controlled trials and for treatment/protocol comparisons in cohort studies.

  15. Lessons learned in the generation of biomedical research datasets using Semantic Open Data technologies.

    PubMed

    Legaz-García, María del Carmen; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2015-01-01

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources. Such heterogeneity makes difficult not only the generation of research-oriented datasets but also their exploitation. In recent years, the Open Data paradigm has proposed new ways of making data available so that sharing and integration are facilitated. Open Data approaches may pursue the generation of content readable only by humans, or readable by both humans and machines; the latter is the case of interest in our work. The Semantic Web provides a natural technological space for data integration and exploitation and offers a range of technologies for generating not only Open Datasets but also Linked Datasets, that is, open datasets linked to other open datasets. According to Berners-Lee's classification, each open dataset can be given a rating between one and five stars. Over the last few years, we have developed and applied our SWIT tool, which automates the generation of semantic datasets from heterogeneous data sources. SWIT produces four-star datasets; the fifth star is obtained when the dataset is linked to from external ones. In this paper, we describe how we have applied the tool in two projects related to health care records and orthology data, as well as the major lessons learned from such efforts.

  16. Historic AVHRR Processing in the Eumetsat Climate Monitoring Satellite Application Facility (cmsaf) (Invited)

    NASA Astrophysics Data System (ADS)

    Karlsson, K.

    2010-12-01

    The EUMETSAT CMSAF project (www.cmsaf.eu) compiles climatological datasets from various satellite sources with emphasis on the use of EUMETSAT-operated satellites. However, since climate monitoring primarily has a global scope, datasets merging data from various satellites and satellite operators are also prepared. One such dataset is the CMSAF historic GAC (Global Area Coverage) dataset, which is based on AVHRR data from the full historic series of NOAA satellites and the European METOP satellite in mid-morning orbit launched in October 2006. The CMSAF GAC dataset consists of three groups of products: macroscopical cloud products (cloud amount, cloud type and cloud top), cloud physical products (cloud phase, cloud optical thickness and cloud liquid water path) and surface radiation products (including surface albedo). Results will be presented and discussed for all product groups, including some preliminary inter-comparisons with other datasets (e.g., PATMOS-X, MODIS and CloudSat/CALIPSO datasets). A background will also be given describing the basic methodology behind the derivation of all products. This will include a short historical review of AVHRR cloud processing and resulting AVHRR applications at SMHI. Historic GAC processing is one of five pilot projects selected by the SCOPE-CM (Sustained Co-Ordinated Processing of Environmental Satellite data for Climate Monitoring) project organised by the WMO Space Programme. The pilot project is carried out jointly between CMSAF and NOAA with the purpose of finding an optimal GAC processing approach. The initial activity is to inter-compare results of the CMSAF GAC dataset and the NOAA PATMOS-X dataset for the case when both datasets have been derived using the same inter-calibrated AVHRR radiance dataset. The aim is to gain further knowledge of, e.g., the most useful multispectral methods and the impact of ancillary datasets (for example from meteorological reanalysis datasets from NCEP and ECMWF). 
The CMSAF project is currently defining plans for another five years (2012-2017) of operations and development. New GAC reprocessing efforts are planned and new methodologies will be tested. Central questions here will be how to increase the quantitative use of the products through improved error and uncertainty estimates, and how to compile the information so that the data can be used in meaningful and efficient ways, e.g. for the validation of climate model information.

  17. Using Functional or Structural Magnetic Resonance Images and Personal Characteristic Data to Identify ADHD and Autism

    PubMed Central

    Ghiassian, Sina; Greiner, Russell; Jin, Ping; Brown, Matthew R. G.

    2016-01-01

    A clinical tool that can diagnose psychiatric illness using functional or structural magnetic resonance (MR) brain images has the potential to greatly assist physicians and improve treatment efficacy. Working toward the goal of automated diagnosis, we propose an approach for automated classification of ADHD and autism based on histogram of oriented gradients (HOG) features extracted from MR brain images, as well as personal characteristic data features. We describe a learning algorithm that can produce effective classifiers for ADHD and autism when run on two large public datasets. The algorithm is able to distinguish ADHD from control with hold-out accuracy of 69.6% (over baseline 55.0%) using personal characteristics and structural brain scan features when trained on the ADHD-200 dataset (769 participants in training set, 171 in test set). It is able to distinguish autism from control with hold-out accuracy of 65.0% (over baseline 51.6%) using functional images with personal characteristic data when trained on the Autism Brain Imaging Data Exchange (ABIDE) dataset (889 participants in training set, 222 in test set). These results outperform all previously presented methods on both datasets. To our knowledge, this is the first demonstration of a single automated learning process that can produce classifiers for distinguishing patients vs. controls from brain imaging data with above-chance accuracy on large datasets for two different psychiatric illnesses (ADHD and autism). Working toward clinical applications requires robustness against real-world conditions, including the substantial variability that often exists among data collected at different institutions. It is therefore important that our algorithm was successful with the large ADHD-200 and ABIDE datasets, which include data from hundreds of participants collected at multiple institutions. 
While the resulting classifiers are not yet clinically relevant, this work shows that there is a signal in the (f)MRI data that a learning algorithm is able to find. We anticipate this will lead to yet more accurate classifiers, over these and other psychiatric disorders, working toward the goal of a clinical tool for high accuracy differential diagnosis. PMID:28030565
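The accuracy figures above are reported relative to a majority-class baseline. A minimal sketch of that comparison, using the ADHD-200 test-split size from the abstract (171 participants) but with per-class counts and a correct-prediction count that are purely illustrative, not taken from the paper:

```python
# Hold-out accuracy vs. a majority-class baseline. The split size (171)
# matches the abstract; the class counts and n_correct are hypothetical.

def baseline_accuracy(class_counts):
    """Majority-class accuracy: always predict the most common label."""
    return max(class_counts) / sum(class_counts)

def holdout_accuracy(n_correct, n_total):
    return n_correct / n_total

# Hypothetical test split: 94 controls, 77 patients
counts = [94, 77]
print(round(baseline_accuracy(counts), 3))   # 0.55
print(round(holdout_accuracy(119, 171), 3))  # 0.696
```

With these assumed counts the baseline and hold-out numbers reproduce the 55.0% and 69.6% pattern the abstract reports for ADHD-200.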

  18. National Seabed Mapping Programmes Collaborate to Advance Marine Geomorphological Mapping in Adjoining European Seas

    NASA Astrophysics Data System (ADS)

    Monteys, X.; Guinan, J.; Green, S.; Gafeira, J.; Dove, D.; Baeten, N. J.; Thorsnes, T.

    2017-12-01

    Marine geomorphological mapping is an effective means of characterising and understanding the seabed and its features, with direct relevance to offshore infrastructure placement, benthic habitat mapping, conservation & policy, marine spatial planning, fisheries management and pure research. Advances in acoustic survey techniques and data processing methods have made high-resolution marine datasets (e.g. multibeam echosounder bathymetry and shallow seismic) widely available, meaning that geological interpretations can be greatly improved by combining them with geomorphological maps. Since December 2015, representatives from the national seabed mapping programmes of Norway (MAREANO), Ireland (INFOMAR) and the United Kingdom (MAREMAP) have collaborated and established the MIM geomorphology working group, with the common aim of advancing best practice for geological mapping in their adjoining sea areas in north-west Europe. A recently developed two-part classification system for seabed geomorphology (`Morphology' and `Geomorphology') has been established as a result of an initiative led by the British Geological Survey (BGS) with contributions from the MIM group (Dove et al. 2016). To support the scheme, existing BGS GIS tools (SIGMA) have been adapted to apply this two-part classification system, and here we present the tools' effectiveness in mapping geomorphological features, along with progress in harmonising the classification and feature nomenclature. Recognising that manual mapping of seabed features can be time-consuming and subjective, semi-automated approaches for mapping seabed features and improving mapping efficiency are being developed using ArcGIS-based tools. These methods recognise, spatially delineate and morphologically describe seabed features such as pockmarks (Gafeira et al., 2012) and cold-water coral mounds. Such tools utilise multibeam echosounder data or any other bathymetric dataset (e.g. 3D seismic, Geldof et al., 2014) that can produce a digital depth model. The tools have the capability to capture an extensive list of morphological attributes. The MIM geomorphology working group's strategy to develop methods for more efficient marine geomorphological mapping is presented, with data examples and case studies showing the latest results.

  19. SPICE: exploration and analysis of post-cytometric complex multivariate datasets.

    PubMed

    Roederer, Mario; Nozzi, Joshua L; Nason, Martha C

    2011-02-01

    Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.

  20. EnviroAtlas - Tampa, FL - Land Cover by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset describes the percentage of each block group that is classified as impervious, forest, green space, wetland, and agriculture. Impervious is a combination of dark and light impervious. Forest is a combination of trees and forest and woody wetlands. Green space is a combination of trees and forest, grass and herbaceous, agriculture, woody wetlands, and emergent wetlands. Wetlands includes both Woody and Emergent Wetlands. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
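The combination rules in the record above can be sketched directly. The base percentages for a single block group below are hypothetical values for illustration; only the combination logic follows the dataset description:

```python
# Derived EnviroAtlas land-cover classes combined from base classes.
# Base percentages are illustrative, not from the actual dataset.

base = {  # percent of block-group area (hypothetical)
    "dark_impervious": 10.0, "light_impervious": 15.0,
    "trees_forest": 20.0, "grass_herbaceous": 25.0,
    "agriculture": 5.0, "woody_wetlands": 3.0, "emergent_wetlands": 2.0,
}

derived = {
    "impervious": base["dark_impervious"] + base["light_impervious"],
    "forest": base["trees_forest"] + base["woody_wetlands"],
    "green_space": base["trees_forest"] + base["grass_herbaceous"]
                   + base["agriculture"] + base["woody_wetlands"]
                   + base["emergent_wetlands"],
    "wetlands": base["woody_wetlands"] + base["emergent_wetlands"],
}
print(derived["green_space"])  # 55.0
```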

  1. A large-scale dataset of solar event reports from automated feature recognition modules

    NASA Astrophysics Data System (ADS)

    Schuh, Michael A.; Angryk, Rafal A.; Martens, Petrus C.

    2016-05-01

    The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validation of the individual event reports and of the dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through the important prerequisite analyses presented here, the results of KDD from Solar Big Data will be more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.

  2. Data-Oriented Astrophysics at NOAO: The Science Archive & The Data Lab

    NASA Astrophysics Data System (ADS)

    Juneau, Stephanie; NOAO Data Lab, NOAO Science Archive

    2018-06-01

    As we keep progressing into an era of increasingly large astronomy datasets, NOAO’s data-oriented mission is growing in prominence. The NOAO Science Archive, which captures and processes the pixel data from mountaintops in Chile and Arizona, now contains holdings at the Petabyte scale. Working at the intersection of astronomy and data science, the main goal of the NOAO Data Lab is to provide users with a suite of tools to work close to these data, the catalogs derived from them, and externally provided datasets, and thus optimize the scientific productivity of the astronomy community. These tools and services include databases, query tools, virtual storage space, workflows through our Jupyter Notebook server, and scripted analysis. We currently host datasets from NOAO facilities such as the Dark Energy Survey (DES), the DESI imaging Legacy Surveys (LS), the Dark Energy Camera Plane Survey (DECaPS), and the nearly all-sky NOAO Source Catalog (NSC). We are further preparing for large spectroscopy datasets such as DESI. After a brief overview of the Science Archive, the Data Lab and their datasets, I will briefly showcase scientific applications that use our data holdings. Lastly, I will describe our vision for future developments as we tackle the next technical and scientific challenges.

  3. Accuracy of Probabilistic Linkage Using the Enhanced Matching System for Public Health and Epidemiological Studies.

    PubMed

    Aldridge, Robert W; Shaji, Kunju; Hayward, Andrew C; Abubakar, Ibrahim

    2015-01-01

    The Enhanced Matching System (EMS) is a probabilistic record linkage program developed by the tuberculosis section at Public Health England to match data for individuals across two datasets. This paper outlines how EMS works and investigates its accuracy for linkage across public health datasets. EMS is a configurable Microsoft SQL Server database program. To examine the accuracy of EMS, two public health databases were matched using National Health Service (NHS) numbers as a gold standard unique identifier. Probabilistic linkage was then performed on the same two datasets without inclusion of NHS number. Sensitivity analyses were carried out to examine the effect of varying matching process parameters. Exact matching using NHS number between two datasets (containing 5931 and 1759 records) identified 1071 matched pairs. EMS probabilistic linkage identified 1068 record pairs. The sensitivity of probabilistic linkage was calculated as 99.5% (95%CI: 98.9, 99.8), specificity 100.0% (95%CI: 99.9, 100.0), positive predictive value 99.8% (95%CI: 99.3, 100.0), and negative predictive value 99.9% (95%CI: 99.8, 100.0). Probabilistic matching was most accurate when including address variables and using the automatically generated threshold for determining links with manual review. With the establishment of national electronic datasets across health and social care, EMS enables previously unanswerable research questions to be tackled with confidence in the accuracy of the linkage process. In scenarios where a small sample is being matched into a very large database (such as national records of hospital attendance) then, compared to results presented in this analysis, the positive predictive value or sensitivity may drop according to the prevalence of matches between databases. 
Despite this possible limitation, probabilistic linkage has great potential to be used where exact matching using a common identifier is not possible, including in low-income settings and for vulnerable groups such as homeless populations, where the absence of unique identifiers and lower data quality have historically hindered the ability to identify individuals across datasets.
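The evaluation metrics in the record above follow directly from a confusion table against the NHS-number gold standard. A minimal sketch, assuming a true-positive count of 1066 (a hypothetical figure chosen to be consistent with the reported 1071 gold-standard pairs, 1068 EMS pairs, and the published sensitivity and PPV):

```python
# Record-linkage evaluation metrics. TP = 1066 is an assumed count,
# consistent with the abstract's 1071 gold-standard and 1068 EMS pairs.

def sensitivity(tp, fn):
    return tp / (tp + fn)

def ppv(tp, fp):
    return tp / (tp + fp)

tp = 1066
fn = 1071 - tp   # gold-standard pairs EMS missed
fp = 1068 - tp   # EMS pairs not in the gold standard
print(round(100 * sensitivity(tp, fn), 1))  # 99.5
print(round(100 * ppv(tp, fp), 1))          # 99.8
```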

  4. Minimum energy requirements for desalination of brackish groundwater in the United States with comparison to international datasets

    USGS Publications Warehouse

    Ahdab, Yvana D.; Thiel, Gregory P.; Böhlke, John Karl; Stanton, Jennifer S.; Lienhard, John H.

    2018-01-01

    This paper uses chemical and physical data from a large 2017 U.S. Geological Survey groundwater dataset with wells in the U.S. and three smaller international groundwater datasets with wells primarily in Australia and Spain to carry out a comprehensive investigation of brackish groundwater composition in relation to minimum desalination energy costs. First, we compute the site-specific least work required for groundwater desalination. Least work of separation represents a baseline for the specific energy consumption of desalination systems. We develop simplified equations based on the U.S. data for least work as a function of water recovery ratio and a proxy variable for composition, either total dissolved solids, specific conductance, molality or ionic strength. We show that the U.S. correlations for total dissolved solids and molality may be applied to the international datasets. We find that total molality can be used to calculate the least work of dilute solutions with very high accuracy. Then, we examine the effects of groundwater solute composition on minimum energy requirements, showing that separation requirements increase from calcium to sodium for cations and from sulfate to bicarbonate to chloride for anions, for any given TDS concentration. We study the geographic distribution of least work, total dissolved solids, and major ions concentration across the U.S. We determine areas with both low least work and high water stress in order to highlight regions holding potential for desalination to decrease the disparity between high water demand and low water supply. Finally, we discuss the implications of the USGS results for water resource planning, by comparing least work to the specific energy consumption of brackish water reverse osmosis plants and showing the scaling propensity of major electrolytes and silica in the U.S. groundwater samples.
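A minimal sketch of the least-work baseline the record describes: in the limit of vanishing recovery, the least work of separation per unit volume of product water equals the osmotic pressure of the feed, which for an ideal dilute solution can be estimated with the van 't Hoff relation. This is a textbook simplification, not the paper's site-specific method, and the 2 g/L NaCl feed is an illustrative value, not a USGS sample:

```python
# Least-work baseline at vanishing recovery via van 't Hoff osmotic
# pressure, pi = i*c*R*T, for an ideal dilute NaCl solution (illustrative).

R = 8.314          # J/(mol*K), universal gas constant
T = 298.15         # K
M_NACL = 58.44     # g/mol

def least_work_kwh_per_m3(tds_g_per_L, T=T):
    c = tds_g_per_L / M_NACL * 1000   # mol of NaCl per m^3
    i = 2                              # ions per NaCl formula unit
    pi = i * c * R * T                 # Pa, i.e. J per m^3 of feed
    return pi / 3.6e6                  # convert J/m^3 to kWh/m^3

print(round(least_work_kwh_per_m3(2.0), 3))  # 0.047
```

For a 2 g/L brackish feed this gives roughly 0.05 kWh/m³, well below typical seawater figures, which is why brackish groundwater is attractive for desalination.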

  5. Incorporating biological information in sparse principal component analysis with application to genomic data.

    PubMed

    Li, Ziyi; Safo, Sandra E; Long, Qi

    2017-07-11

    Sparse principal component analysis (PCA) is a popular tool for dimensionality reduction, pattern recognition, and visualization of high dimensional data. It has been recognized that complex biological mechanisms occur through concerted relationships of multiple genes working in networks that are often represented by graphs. Recent work has shown that incorporating such biological information improves feature selection and prediction performance in regression analysis, but there has been limited work on extending this approach to PCA. In this article, we propose two new sparse PCA methods called Fused and Grouped sparse PCA that enable incorporation of prior biological information in variable selection. Our simulation studies suggest that, compared to existing sparse PCA methods, the proposed methods achieve higher sensitivity and specificity when the graph structure is correctly specified, and are fairly robust to misspecified graph structures. Application to a glioblastoma gene expression dataset identified pathways that are suggested in the literature to be related to glioblastoma. The proposed sparse PCA methods Fused and Grouped sparse PCA can effectively incorporate prior biological information in variable selection, leading to improved feature selection and more interpretable principal component loadings and potentially providing insights on molecular underpinnings of complex diseases.
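To make the underlying idea concrete, here is a generic sparse-PCA sketch, not the authors' Fused or Grouped variants: power iteration on the sample covariance with soft-thresholding of the loadings at each step, which drives small loadings to exactly zero and yields the sparse, interpretable loadings the abstract refers to. The data and threshold below are synthetic and illustrative:

```python
# Generic sparse PCA via soft-thresholded power iteration (illustrative;
# NOT the Fused/Grouped methods of the paper).
import numpy as np

def sparse_pc1(X, lam=0.1, n_iter=100):
    """Leading sparse loading vector of data X (n samples x p features)."""
    S = np.cov(X, rowvar=False)              # p x p sample covariance
    v = np.linalg.eigh(S)[1][:, -1]          # dense leading eigenvector as start
    for _ in range(n_iter):
        v = S @ v                            # power-iteration step
        v = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)  # soft-threshold
        nrm = np.linalg.norm(v)
        if nrm == 0.0:                       # lam too large: all loadings died
            break
        v /= nrm
    return v

rng = np.random.default_rng(1)
z = rng.standard_normal((200, 1))            # one latent signal
# two signal columns plus three low-variance noise columns
X = np.hstack([z, z, 0.1 * rng.standard_normal((200, 3))])
v = sparse_pc1(X, lam=0.2)
print(np.count_nonzero(np.abs(v) < 1e-12))   # noise loadings driven to zero
```

The Fused and Grouped methods additionally penalize loadings according to a gene-network graph, so that connected genes are selected together; the thresholding step above is the graph-free special case.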

  6. The first MICCAI challenge on PET tumor segmentation.

    PubMed

    Hatt, Mathieu; Laurent, Baptiste; Ouahabi, Anouar; Fayad, Hadi; Tan, Shan; Li, Laquan; Lu, Wei; Jaouen, Vincent; Tauber, Clovis; Czakon, Jakub; Drapejkowski, Filip; Dyrka, Witold; Camarasu-Pop, Sorina; Cervenansky, Frédéric; Girard, Pascal; Glatard, Tristan; Kain, Michael; Yao, Yao; Barillot, Christian; Kirov, Assen; Visvikis, Dimitris

    2018-02-01

    Automatic functional volume segmentation in PET images is a challenge that has been addressed using a large array of methods. A major limitation for the field has been the lack of a benchmark dataset that would allow direct comparison of the results in the various publications. In the present work, we describe a comparison of recent methods on a large dataset following recommendations by the American Association of Physicists in Medicine (AAPM) task group (TG) 211, which was carried out within a MICCAI (Medical Image Computing and Computer Assisted Intervention) challenge. Organization and funding were provided by France Life Imaging (FLI). A dataset of 176 images combining simulated, phantom and clinical images was assembled. A website allowed the participants to register and download training data (n = 19). Challengers then submitted encapsulated pipelines on an online platform that autonomously ran the algorithms on the testing data (n = 157) and evaluated the results. The methods were ranked according to the arithmetic mean of sensitivity and positive predictive value. Sixteen teams registered but only four provided manuscripts and pipeline(s) for a total of 10 methods. In addition, results using two thresholds and the Fuzzy Locally Adaptive Bayesian (FLAB) were generated. All competing methods except one performed with median accuracy above 0.8. The method with the highest score was the convolutional neural network-based segmentation, which significantly outperformed 9 out of 12 of the other methods, but not the improved K-Means, Gaussian Model Mixture and Fuzzy C-Means methods. The most rigorous comparative study of PET segmentation algorithms to date was carried out using a dataset that is the largest used in such studies so far. The hierarchy amongst the methods in terms of accuracy did not depend strongly on the subset of datasets or the metrics (or combination of metrics). 
All the methods submitted by the challengers except one demonstrated good performance with median accuracy scores above 0.8. Copyright © 2017 Elsevier B.V. All rights reserved.
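The challenge's ranking criterion is simple enough to sketch directly: methods are ordered by the arithmetic mean of sensitivity and positive predictive value. The method names and scores below are hypothetical, for illustration only:

```python
# Challenge ranking by arithmetic mean of sensitivity and PPV.
# Method names and (sensitivity, PPV) pairs are hypothetical.

def challenge_score(sensitivity, ppv):
    return (sensitivity + ppv) / 2

methods = {
    "cnn": (0.90, 0.88),
    "kmeans_improved": (0.86, 0.87),
    "fixed_threshold": (0.80, 0.75),
}
ranking = sorted(methods, key=lambda m: challenge_score(*methods[m]),
                 reverse=True)
print(ranking)  # best first
```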

  7. Publishing and Editing of Semantically-Enabled Scientific Metadata Across Multiple Web Platforms: Challenges and Experiences

    NASA Astrophysics Data System (ADS)

    Patton, E. W.; West, P.; Greer, R.; Jin, B.

    2011-12-01

    Following on work presented at the 2010 AGU Fall Meeting, we present a number of real-world collections of semantically-enabled scientific metadata ingested into the Tetherless World RDF2HTML system as structured data and presented and edited using that system. Two separate datasets from two different domains (oceanography and solar sciences) are made available using existing web standards and services, e.g. encoded using ontologies represented with the Web Ontology Language (OWL) and stored in a SPARQL endpoint for querying. These datasets are deployed for use in three different web environments, i.e. Drupal, MediaWiki, and a custom web portal written in Java, to highlight the cross-platform nature of the data presentation. Stylesheets used to transform concepts in each domain as well as shared terms into HTML will be presented to show the power of using common ontologies to publish data and support reuse of existing terminologies. In addition, a single domain dataset is shared between two separate portal instances to demonstrate the ability for this system to offer distributed access and modification of content across the Internet. Lastly, we will highlight challenges that arose in the software engineering process, outline the design choices we made in solving those issues, and discuss how future improvements to this and other systems will enable the evolution of distributed, decentralized collaborations for scientific data sharing across multiple research groups.

  8. Modeling the effects of diagenesis on carbonate clumped-isotope values in deep- and shallow-water settings

    NASA Astrophysics Data System (ADS)

    Stolper, Daniel A.; Eiler, John M.; Higgins, John A.

    2018-04-01

    The measurement of multiply isotopically substituted ('clumped isotope') carbonate groups provides a way to reconstruct past mineral formation temperatures. However, dissolution-reprecipitation (i.e., recrystallization) reactions, which commonly occur during sedimentary burial, can alter a sample's clumped-isotope composition such that it partially or wholly reflects deeper burial temperatures. Here we derive a quantitative model of diagenesis to explore how diagenesis alters carbonate clumped-isotope values. We apply the model to a new dataset from deep-sea sediments taken from Ocean Drilling Project site 807 in the equatorial Pacific. This dataset is used to ground-truth the model. We demonstrate that the use of the model with accompanying carbonate clumped-isotope and carbonate δ18O values provides new constraints on both the diagenetic history of deep-sea settings as well as past equatorial sea-surface temperatures. Specifically, the combination of the diagenetic model and data support previous work that indicates equatorial sea-surface temperatures were warmer in the Paleogene as compared to today. We then explore whether the model is applicable to shallow-water settings commonly preserved in the rock record. Using a previously published dataset from the Bahamas, we demonstrate that the model captures the main trends of the data as a function of burial depth and thus appears applicable to a range of depositional settings.
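The core effect the record describes can be sketched with a simple end-member mixing model: the bulk clumped-isotope value of a partially recrystallized sample is a mass-weighted mixture of carbonate formed at the surface temperature and carbonate reprecipitated at the burial temperature. This is a minimal illustration, not the paper's model, and the calibration constants below are illustrative placeholders:

```python
# Mass-balance sketch of diagenetic alteration of Delta47. The
# Delta47-temperature calibration constants are illustrative only.

def delta47(T_kelvin, a=0.0449e6, b=0.167):
    """Assumed calibration of the form Delta47 = a/T^2 + b (per mil)."""
    return a / T_kelvin**2 + b

def mixed_delta47(f_recrystallized, T_formation, T_burial):
    """Bulk Delta47 after a fraction f recrystallizes at burial temperature."""
    f = f_recrystallized
    return (1 - f) * delta47(T_formation) + f * delta47(T_burial)

# 30% recrystallization of a 300 K (surface-formed) calcite at 330 K burial
print(round(mixed_delta47(0.3, 300.0, 330.0), 4))  # 0.6399
```

Because the altered value lies between the two end-members, a naive temperature inversion of the bulk Δ47 would overestimate the formation temperature, which is exactly the bias the paper's model is designed to correct.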

  9. Empirical Studies on the Network of Social Groups: The Case of Tencent QQ

    PubMed Central

    You, Zhi-Qiang; Han, Xiao-Pu; Lü, Linyuan; Yeung, Chi Ho

    2015-01-01

    Background Participation in social groups is important, but the collective behaviors of humans as a group are difficult to analyze owing to the difficulty of quantifying ordinary social relations and group membership, and of collecting a comprehensive dataset. Such difficulties can be circumvented by analyzing online social networks. Methodology/Principal Findings In this paper, we analyze a comprehensive dataset released from Tencent QQ, an instant messenger with the highest market share in China. Specifically, we analyze three derivative networks involving groups and their members—the hypergraph of groups, the network of groups and the user network—to reveal social interactions at the microscopic and mesoscopic levels. Conclusions/Significance Our results uncover interesting behaviors in the growth of user groups, the interactions between groups, and their relationship with member age and gender. These findings lead to insights which are difficult to obtain in social networks based on personal contacts. PMID:26176850

  10. Empirical Studies on the Network of Social Groups: The Case of Tencent QQ.

    PubMed

    You, Zhi-Qiang; Han, Xiao-Pu; Lü, Linyuan; Yeung, Chi Ho

    2015-01-01

    Participation in social groups is important, but the collective behaviors of humans as a group are difficult to analyze owing to the difficulty of quantifying ordinary social relations and group membership, and of collecting a comprehensive dataset. Such difficulties can be circumvented by analyzing online social networks. In this paper, we analyze a comprehensive dataset released from Tencent QQ, an instant messenger with the highest market share in China. Specifically, we analyze three derivative networks involving groups and their members - the hypergraph of groups, the network of groups and the user network - to reveal social interactions at the microscopic and mesoscopic levels. Our results uncover interesting behaviors in the growth of user groups, the interactions between groups, and their relationship with member age and gender. These findings lead to insights which are difficult to obtain in social networks based on personal contacts.

  11. Integrating Healthcare Ethical Issues into IS Education

    ERIC Educational Resources Information Center

    Cellucci, Leigh W.; Layman, Elizabeth J.; Campbell, Robert; Zeng, Xiaoming

    2011-01-01

    Federal initiatives are encouraging the increase of IS graduates to work in the healthcare environment because they possess knowledge of datasets and dataset management that are key to effective management of electronic health records (EHRs) and health information technology (IT). IS graduates will be members of the healthcare team, and as such,…

  12. Merge of Five Previous Catalogues Into the Ground Truth Catalogue and Registration Based on MOLA Data with THEMIS-DIR, MDIM and MOC Data-Sets

    NASA Astrophysics Data System (ADS)

    Salamuniccar, G.; Loncaric, S.

    2008-03-01

    The Catalogue from our previous work was merged with the data of Barlow, Rodionova, Boyce, and Kuzmin. The resulting ground truth catalogue with 57,633 craters was registered, using MOLA data, with THEMIS-DIR, MDIM, and MOC data-sets.

  13. Data discovery with DATS: exemplar adoptions and lessons learned.

    PubMed

    Gonzalez-Beltran, Alejandra N; Campbell, John; Dunn, Patrick; Guijarro, Diana; Ionescu, Sanda; Kim, Hyeoneui; Lyle, Jared; Wiser, Jeffrey; Sansone, Susanna-Assunta; Rocca-Serra, Philippe

    2018-01-01

    The DAta Tag Suite (DATS) is a model supporting dataset description, indexing, and discovery. It is available as an annotated serialization with schema.org, a vocabulary used by major search engines, thus making the datasets discoverable on the web. DATS underlies DataMed, the National Institutes of Health Big Data to Knowledge Data Discovery Index prototype, which aims to provide a "PubMed for datasets." The experience gained while indexing a heterogeneous range of >60 repositories in DataMed helped in evaluating DATS's entities, attributes, and scope. In this work, 3 additional exemplary and diverse data sources were mapped to DATS by their representatives or experts, offering a deep scan of DATS's fitness against a new set of existing data. The procedure, including feedback from users and implementers, resulted in DATS implementation guidelines and best practices, and identification of a path for evolving and optimizing the model. Finally, the work exposed additional needs when defining datasets for indexing, especially in the context of clinical and observational information. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association.
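The schema.org serialization mentioned above amounts to publishing dataset metadata as a JSON-LD record that search engines can index. A minimal sketch of such a record follows; every field value is an illustrative placeholder, not an actual DataMed entry:

```python
# Minimal schema.org Dataset record of the kind a DATS serialization
# produces for web discoverability. All field values are hypothetical.
import json

record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example clinical imaging dataset",
    "description": "Hypothetical dataset record for illustration.",
    "keywords": ["imaging", "clinical"],
    "identifier": "doi:10.0000/example",
}

serialized = json.dumps(record, indent=2)   # what a crawler would ingest
parsed = json.loads(serialized)
print(parsed["@type"])  # Dataset
```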

  14. A journey to Semantic Web query federation in the life sciences.

    PubMed

    Cheung, Kei-Hoi; Frost, H Robert; Marshall, M Scott; Prud'hommeaux, Eric; Samwald, Matthias; Zhao, Jun; Paschke, Adrian

    2009-10-01

    As interest in adopting the Semantic Web in the biomedical domain continues to grow, Semantic Web technology has been evolving and maturing. A variety of technological approaches including triplestore technologies, SPARQL endpoints, Linked Data, and Vocabulary of Interlinked Datasets have emerged in recent years. In addition to the data warehouse construction, these technological approaches can be used to support dynamic query federation. As a community effort, the BioRDF task force, within the Semantic Web for Health Care and Life Sciences Interest Group, is exploring how these emerging approaches can be utilized to execute distributed queries across different neuroscience data sources. We have created two health care and life science knowledge bases. We have explored a variety of Semantic Web approaches to describe, map, and dynamically query multiple datasets. We have demonstrated several federation approaches that integrate diverse types of information about neurons and receptors that play an important role in basic, clinical, and translational neuroscience research. Particularly, we have created a prototype receptor explorer which uses OWL mappings to provide an integrated list of receptors and executes individual queries against different SPARQL endpoints. We have also employed the AIDA Toolkit, which is directed at groups of knowledge workers who cooperatively search, annotate, interpret, and enrich large collections of heterogeneous documents from diverse locations. We have explored a tool called "FeDeRate", which enables a global SPARQL query to be decomposed into subqueries against the remote databases offering either SPARQL or SQL query interfaces. Finally, we have explored how to use the vocabulary of interlinked Datasets (voiD) to create metadata for describing datasets exposed as Linked Data URIs or SPARQL endpoints. 
We have demonstrated the use of a set of novel and state-of-the-art Semantic Web technologies in support of a neuroscience query federation scenario. We have identified both the strengths and weaknesses of these technologies. While Semantic Web offers a global data model including the use of Uniform Resource Identifiers (URI's), the proliferation of semantically-equivalent URI's hinders large scale data integration. Our work helps direct research and tool development, which will be of benefit to this community.
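The federation pattern the two records above describe, decomposing a global query into subqueries against remote endpoints, is what the SPARQL 1.1 SERVICE keyword expresses. A minimal sketch, with endpoint URLs and predicates that are hypothetical placeholders rather than real neuroscience sources:

```python
# Federated SPARQL query sketch: each SERVICE block is routed to a
# different endpoint. URLs and the ex: vocabulary are hypothetical.

FEDERATED_QUERY = """
PREFIX ex: <http://example.org/schema#>
SELECT ?receptor ?neuron WHERE {
  SERVICE <http://example.org/receptors/sparql> {
    ?receptor a ex:Receptor ; ex:bindsTo ?ligand .
  }
  SERVICE <http://example.org/neurons/sparql> {
    ?neuron ex:expresses ?receptor .
  }
}
"""

# A federating engine (such as the FeDeRate tool described above) would
# decompose this into one subquery per SERVICE block and join locally.
print(FEDERATED_QUERY.count("SERVICE"))  # 2
```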

  15. A journey to Semantic Web query federation in the life sciences

    PubMed Central

    Cheung, Kei-Hoi; Frost, H Robert; Marshall, M Scott; Prud'hommeaux, Eric; Samwald, Matthias; Zhao, Jun; Paschke, Adrian

    2009-01-01

    Background As interest in adopting the Semantic Web in the biomedical domain continues to grow, Semantic Web technology has been evolving and maturing. A variety of technological approaches including triplestore technologies, SPARQL endpoints, Linked Data, and Vocabulary of Interlinked Datasets have emerged in recent years. In addition to the data warehouse construction, these technological approaches can be used to support dynamic query federation. As a community effort, the BioRDF task force, within the Semantic Web for Health Care and Life Sciences Interest Group, is exploring how these emerging approaches can be utilized to execute distributed queries across different neuroscience data sources. Methods and results We have created two health care and life science knowledge bases. We have explored a variety of Semantic Web approaches to describe, map, and dynamically query multiple datasets. We have demonstrated several federation approaches that integrate diverse types of information about neurons and receptors that play an important role in basic, clinical, and translational neuroscience research. Particularly, we have created a prototype receptor explorer which uses OWL mappings to provide an integrated list of receptors and executes individual queries against different SPARQL endpoints. We have also employed the AIDA Toolkit, which is directed at groups of knowledge workers who cooperatively search, annotate, interpret, and enrich large collections of heterogeneous documents from diverse locations. We have explored a tool called "FeDeRate", which enables a global SPARQL query to be decomposed into subqueries against the remote databases offering either SPARQL or SQL query interfaces. Finally, we have explored how to use the vocabulary of interlinked Datasets (voiD) to create metadata for describing datasets exposed as Linked Data URIs or SPARQL endpoints. 
Conclusion We have demonstrated the use of a set of novel and state-of-the-art Semantic Web technologies in support of a neuroscience query federation scenario. We have identified both the strengths and weaknesses of these technologies. While the Semantic Web offers a global data model including the use of Uniform Resource Identifiers (URIs), the proliferation of semantically equivalent URIs hinders large-scale data integration. Our work helps direct research and tool development, which will be of benefit to this community. PMID:19796394
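The FeDeRate tool described in this record decomposes a global query into subqueries against separate endpoints and joins the partial results. A minimal sketch of that decompose-and-join pattern, with in-memory dicts standing in for SPARQL/SQL endpoints; all URIs, receptor names, and neuron types below are invented for illustration:

```python
# Endpoint A: maps receptor URIs to receptor names (stand-in for one SPARQL endpoint).
ENDPOINT_A = {
    "uri:receptor/1": "5-HT1A",
    "uri:receptor/2": "D2",
}

# Endpoint B: maps receptor URIs to neurons expressing them (stand-in for a second endpoint).
ENDPOINT_B = {
    "uri:receptor/1": ["pyramidal", "granule"],
    "uri:receptor/2": ["medium spiny"],
}

def federated_query():
    """Run one subquery per endpoint and join the partial results on the shared URI."""
    results = []
    for uri, name in ENDPOINT_A.items():        # subquery 1: receptor names
        for neuron in ENDPOINT_B.get(uri, []):  # subquery 2: joined on the URI
            results.append((name, neuron))
    return sorted(results)
```

A real FeDeRate deployment would emit SPARQL or SQL over the network; the join-on-shared-identifier structure is the part this sketch preserves.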

  16. Linguistic Extensions of Topic Models

    ERIC Educational Resources Information Center

    Boyd-Graber, Jordan

    2010-01-01

Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling has been fruitfully applied to problems in social science, biology, and computer vision, it has been most widely used to model datasets where documents are modeled as exchangeable…

  17. A photogrammetric technique for generation of an accurate multispectral optical flow dataset

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.

    2017-06-01

The presence of an accurate dataset is a key requirement for the successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets have been developed in recent years and have given rise to many powerful algorithms. However, most of these datasets include only images captured in the visible spectrum. This paper is focused on the creation of a multispectral optical flow dataset with accurate ground truth. Generating accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement has been developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling, or laser scanning. Such techniques either work only with synthetic optical flow or provide only a sparse ground truth. In this paper a new photogrammetric method for generating accurate ground truth optical flow is proposed. The method combines the accuracy and density of synthetic optical flow datasets with the flexibility of laser-scanning-based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.

  18. Combining deep residual neural network features with supervised machine learning algorithms to classify diverse food image datasets.

    PubMed

    McAllister, Patrick; Zheng, Huiru; Bond, Raymond; Moorhead, Anne

    2018-04-01

Obesity is increasing worldwide and can cause many chronic conditions such as type-2 diabetes, heart disease, sleep apnea, and some cancers. Monitoring dietary intake through food logging is a key method to maintain a healthy lifestyle to prevent and manage obesity. Computer vision methods have been applied to food logging to automate image classification for monitoring dietary intake. In this work we applied pretrained ResNet-152 and GoogleNet convolutional neural networks (CNNs), initially trained using the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset with the MatConvNet package, to extract features from food image datasets: Food-5K, Food-11, RawFooT-DB, and Food-101. Deep features were extracted from the CNNs and used to train machine learning classifiers including an artificial neural network (ANN), support vector machine (SVM), Random Forest, and Naive Bayes. Results show that using ResNet-152 deep features with an SVM with RBF kernel can detect food items with 99.4% accuracy on the Food-5K validation dataset, and with 98.8% accuracy on the Food-5K evaluation dataset using the ANN, SVM-RBF, and Random Forest classifiers. Trained with ResNet-152 features, an ANN can achieve 91.34% and 99.28% accuracy on the Food-11 and RawFooT-DB food image datasets, respectively, and an SVM with RBF kernel can achieve 64.98% on the Food-101 image dataset. From this research it is clear that deep CNN features can be used efficiently for diverse food image classification. The work presented in this research shows that pretrained ResNet-152 features provide sufficient generalisation power when applied to a range of food image classification tasks. Copyright © 2018 Elsevier Ltd. All rights reserved.
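The pipeline in this record is two-stage: a pretrained CNN produces fixed feature vectors, and a classical classifier is trained on them. A toy sketch of that structure, with tiny invented vectors in place of 2048-dimensional ResNet-152 features and a nearest-centroid rule standing in for the SVM (the point here is the two-stage design, not the specific classifier):

```python
import math

# Invented "deep feature" vectors for two classes; real features would come
# from a forward pass through a pretrained CNN with the head removed.
TRAIN = {
    "food":     [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "non_food": [[0.1, 0.9, 0.8], [0.0, 0.8, 0.9]],
}

def centroid(vectors):
    """Mean feature vector of a class."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(feature_vector):
    """Assign the label whose training centroid is closest in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    centroids = {label: centroid(vecs) for label, vecs in TRAIN.items()}
    return min(centroids, key=lambda lab: dist(feature_vector, centroids[lab]))
```

In the study itself the frozen features would feed an SVM, ANN, Random Forest, or Naive Bayes model; swapping the classifier while reusing the features is exactly what makes the approach cheap to evaluate.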

  19. Omicseq: a web-based search engine for exploring omics datasets

    PubMed Central

    Sun, Xiaobo; Pittard, William S.; Xu, Tianlei; Chen, Li; Zwick, Michael E.; Jiang, Xiaoqian; Wang, Fusheng

    2017-01-01

Abstract The development and application of high-throughput genomics technologies have resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long-standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on metadata. A text-based query for gene name(s) does not work well on datasets in which the vast majority of the content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve ‘findability’ of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable, elastic NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. PMID:28402462
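The abstract names trackRank but does not specify how it works. As a hedged toy illustration of the general idea — ranking by a dataset's numeric content rather than by its metadata — the sketch below ranks datasets by the signal value they record for a query gene; all dataset names, gene symbols, and values are invented:

```python
# Each "dataset" is reduced to a {gene: signal} map; a content-aware search
# ranks datasets by the numeric signal at the query gene, so a dataset whose
# metadata never mentions the gene can still rank highly.
DATASETS = {
    "chipseq_run1": {"TP53": 8.2, "MYC": 1.1},
    "chipseq_run2": {"TP53": 0.4, "MYC": 9.7},
    "rnaseq_run1":  {"TP53": 5.0, "MYC": 5.0},
}

def rank_datasets(gene):
    """Return dataset names sorted by descending signal at `gene`."""
    return sorted(DATASETS, key=lambda d: DATASETS[d].get(gene, 0.0),
                  reverse=True)
```

The real trackRank algorithm is certainly more sophisticated than a single lookup; this only shows why a purely metadata-based text query would miss what a content-based ranking finds.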

  20. Real-world datasets for portfolio selection and solutions of some stochastic dominance portfolio models.

    PubMed

    Bruni, Renato; Cesarone, Francesco; Scozzari, Andrea; Tardella, Fabio

    2016-09-01

A large number of portfolio selection models have appeared in the literature since the pioneering work of Markowitz. However, even when computational and empirical results are described, they are often hard to replicate and compare due to the unavailability of the datasets used in the experiments. We provide here several datasets for portfolio selection generated using real-world price values from several major stock markets. The datasets contain weekly return values, adjusted for dividends and for stock splits, cleaned of errors as much as possible. The datasets are available in different formats, and can be used as benchmarks for testing the performance of portfolio selection models and for comparing the efficiency of the algorithms used to solve them. We also provide, for these datasets, the portfolios obtained by several selection strategies based on Stochastic Dominance models (see "On Exact and Approximate Stochastic Dominance Strategies for Portfolio Selection" (Bruni et al. [2])). We believe that testing portfolio models on publicly available datasets greatly simplifies the comparison of the different portfolio selection strategies.
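The datasets contain weekly returns rather than raw prices. A minimal sketch of that derivation — the simple return r_t = p_t / p_{t-1} − 1 computed over an (already dividend- and split-adjusted) price series; the prices below are illustrative, not from the datasets:

```python
def weekly_returns(prices):
    """Simple returns from consecutive adjusted closing prices:
    r_t = p_t / p_{t-1} - 1."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]

# Three invented weekly adjusted closes -> two weekly returns.
example = weekly_returns([100.0, 110.0, 99.0])
```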

  1. Prediction of Return-to-original-work after an Industrial Accident Using Machine Learning and Comparison of Techniques

    PubMed Central

    2018-01-01

    Background Many studies have tried to develop predictors for return-to-work (RTW). However, since complex factors have been demonstrated to predict RTW, it is difficult to use them practically. This study investigated whether factors used in previous studies could predict whether an individual had returned to his/her original work by four years after termination of the worker's recovery period. Methods An initial logistic regression analysis of 1,567 participants of the fourth Panel Study of Worker's Compensation Insurance yielded odds ratios. The participants were divided into two subsets, a training dataset and a test dataset. Using the training dataset, logistic regression, decision tree, random forest, and support vector machine models were established, and important variables of each model were identified. The predictive abilities of the different models were compared. Results The analysis showed that only earned income and company-related factors significantly affected return-to-original-work (RTOW). The random forest model showed the best accuracy among the tested machine learning models; however, the difference was not prominent. Conclusion It is possible to predict a worker's probability of RTOW using machine learning techniques with moderate accuracy. PMID:29736160
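One of the models compared in this record is logistic regression. A self-contained sketch of that baseline — gradient descent on the log-loss over a toy two-feature dataset; the features (standing in for income and company-related factors), labels, and learning rate are all invented, not the study's data:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(RTOW)
            g = p - yi                       # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    """Class label from the sign of the linear score."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if z >= 0 else 0

# Toy, linearly separable training data (invented).
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
y = [0, 0, 1, 1]
W, B = train_logistic(X, y)
```

The study's train/test split and the comparison against decision tree, random forest, and SVM models follow the same pattern: fit on the training subset, score on the held-out subset.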

  2. Effects and Safety of Gyejibongnyeong-Hwan on Dysmenorrhea Caused by Blood Stagnation: A Randomized Controlled Trial

    PubMed Central

    Park, Jeong-Su; Park, Sunju; Cheon, Chun-Hoo; Jo, Seong-Cheon; Cho, Han Baek; Lim, Eun-Mee; Lim, Hyung Ho; Shin, Yong-Cheol; Ko, Seong-Gyu

    2013-01-01

Objective. This study was a multicenter, randomized, double-blind, and controlled trial with two parallel arms: the GJBNH group and the placebo group. This trial recruited 100 women aged 18 to 35 years with primary dysmenorrhea caused by blood stagnation. The investigational drugs, GJBNH or placebo, were administered to the participants three times per day for two menstrual periods (8 weeks). The participants were followed up for two menstrual cycles after the administration. Results. The results were analyzed using the intention-to-treat (ITT) dataset and the per-protocol (PP) dataset. In the ITT dataset, the change of the average menstrual pain VAS score in the GJBNH group was statistically significantly lower than that in the control group. Significant difference was not observed in the SF-MPQ score change between the GJBNH group and the placebo group. No significant difference was observed in the PP analyses. In the follow-up phase, the VAS scores of the average menstrual pain and the maximum menstrual pain continually decreased in the placebo group, but they increased in the GJBNH group. Conclusion. GJBNH treatment for eight weeks improved the pain of the dysmenorrhea caused by blood stagnation, but it should be successively administered for more than two menstrual cycles. Trial Registration. This trial is registered with Current Controlled Trials no. ISRCTN30426947. PMID:24191165

  3. Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset | Office of Cancer Genomics

    Cancer.gov

    Identifying genetic alterations that prime a cancer cell to respond to a particular therapeutic agent can facilitate the development of precision cancer medicines. Cancer cell-line (CCL) profiling of small-molecule sensitivity has emerged as an unbiased method to assess the relationships between genetic or cellular features of CCLs and small-molecule response. Here, we developed annotated cluster multidimensional enrichment analysis to explore the associations between groups of small molecules and groups of CCLs in a new, quantitative sensitivity dataset.

  4. The Privacy and Security Implications of Open Data in Healthcare.

    PubMed

    Kobayashi, Shinji; Kane, Thomas B; Paton, Chris

    2018-04-22

The International Medical Informatics Association (IMIA) Open Source Working Group (OSWG) initiated a group discussion of current privacy and security issues in the open data movement in the healthcare domain from the perspective of the OSWG membership. Working group members independently reviewed the recent academic and grey literature and sampled a number of current large-scale open data projects to inform the working group discussion. This paper presents an overview of open data repositories and a series of short case reports to highlight relevant issues present in the recent literature concerning the adoption of open approaches to sharing healthcare datasets. Important themes that emerged included data standardisation, the inter-connected nature of the open source and open data movements, and how publishing open data can impact the ethics, security, and privacy of informatics projects. The open data and open source movements in healthcare share many common philosophies and approaches, including developing international collaborations across multiple organisations and domains of expertise. Both movements aim to reduce the costs of advancing scientific research and improving healthcare provision for people around the world by adopting open intellectual property licence agreements and codes of practice. Implications of the increased adoption of open data in healthcare include the need to balance the security and privacy challenges of opening data sources with the potential benefits of open data for improving research and healthcare delivery. Georg Thieme Verlag KG Stuttgart.

  5. Correlation between adenoma detection rate in colonoscopy- and fecal immunochemical testing-based colorectal cancer screening programs.

    PubMed

    Cubiella, Joaquín; Castells, Antoni; Andreu, Montserrat; Bujanda, Luis; Carballo, Fernando; Jover, Rodrigo; Lanas, Ángel; Morillas, Juan Diego; Salas, Dolores; Quintero, Enrique

    2017-03-01

The adenoma detection rate (ADR) is the main quality indicator of colonoscopy. The ADR recommended in fecal immunochemical testing (FIT)-based colorectal cancer screening programs is unknown. Using the COLONPREV (NCT00906997) study dataset, we performed a post-hoc analysis to determine if there was a correlation between the ADR in primary and work-up colonoscopy, and the equivalent figure to the minimal 20% ADR recommended. Colonoscopy was performed in 5722 individuals: 5059 as the primary strategy and 663 after a positive FIT result (OC-Sensor™; cut-off level 15 µg/g of feces). We developed a predictive model based on a multivariable linear regression analysis including confounding variables. The median ADR was 31% (range, 14%-51%) in the colonoscopy group and 55% (range, 21%-83%) in the FIT group. There was a positive correlation in the ADR between primary and work-up colonoscopy (Pearson's coefficient 0.716; p < 0.001). ADR in the FIT group was independently related to ADR in the colonoscopy group: regression coefficient for colonoscopy ADR, 0.71 (p = 0.009); sex, 0.09 (p = 0.09); age, 0.3 (p = 0.5); and region, 0.00 (p = 0.9). The equivalent figure to the 20% ADR was 45% (95% confidence interval, 35%-56%). ADR in primary and work-up colonoscopy of a FIT-positive result are positively and significantly correlated.

  6. Correlation between adenoma detection rate in colonoscopy- and fecal immunochemical testing-based colorectal cancer screening programs

    PubMed Central

    Castells, Antoni; Andreu, Montserrat; Bujanda, Luis; Carballo, Fernando; Jover, Rodrigo; Lanas, Ángel; Morillas, Juan Diego; Salas, Dolores; Quintero, Enrique

    2016-01-01

Background The adenoma detection rate (ADR) is the main quality indicator of colonoscopy. The ADR recommended in fecal immunochemical testing (FIT)-based colorectal cancer screening programs is unknown. Methods Using the COLONPREV (NCT00906997) study dataset, we performed a post-hoc analysis to determine if there was a correlation between the ADR in primary and work-up colonoscopy, and the equivalent figure to the minimal 20% ADR recommended. Colonoscopy was performed in 5722 individuals: 5059 as the primary strategy and 663 after a positive FIT result (OC-Sensor™; cut-off level 15 µg/g of feces). We developed a predictive model based on a multivariable linear regression analysis including confounding variables. Results The median ADR was 31% (range, 14%–51%) in the colonoscopy group and 55% (range, 21%–83%) in the FIT group. There was a positive correlation in the ADR between primary and work-up colonoscopy (Pearson’s coefficient 0.716; p < 0.001). ADR in the FIT group was independently related to ADR in the colonoscopy group: regression coefficient for colonoscopy ADR, 0.71 (p = 0.009); sex, 0.09 (p = 0.09); age, 0.3 (p = 0.5); and region, 0.00 (p = 0.9). The equivalent figure to the 20% ADR was 45% (95% confidence interval, 35%–56%). Conclusions ADR in primary and work-up colonoscopy of a FIT-positive result are positively and significantly correlated. PMID:28344793
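The headline statistic in these two records is Pearson's correlation coefficient between the ADRs of the two colonoscopy settings (the study reported r = 0.716). A minimal sketch of its computation; the input series below are invented, not the per-endoscopist ADR pairs from COLONPREV:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```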

  7. Combining users' activity survey and simulators to evaluate human activity recognition systems.

    PubMed

    Azkune, Gorka; Almeida, Aitor; López-de-Ipiña, Diego; Chen, Liming

    2015-04-08

Evaluating human activity recognition systems usually implies following expensive and time-consuming methodologies in which experiments with humans are run, with the consequent ethical and legal issues. We propose a novel evaluation methodology to overcome these problems, based on user surveys and a synthetic dataset generator tool. Surveys capture how different users perform activities of daily living, while the synthetic dataset generator is used to create properly labelled activity datasets modelled with the information extracted from the surveys. Important aspects, such as sensor noise, varying time lapses, and erratic user behaviour, can also be simulated using the tool. The proposed methodology offers important advantages that allow researchers to carry out their work more efficiently. To evaluate the approach, a synthetic dataset generated following the proposed methodology is compared to a real dataset by computing the similarity between sensor occurrence frequencies. It is concluded that the similarity between the two datasets is significant.
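The validation step above compares sensor occurrence frequencies between the synthetic and real datasets. The abstract does not name the similarity measure, so the sketch below uses cosine similarity over per-sensor counts as a hedged illustration; sensor names and counts are invented:

```python
import math

def frequency_similarity(real, synthetic):
    """Cosine similarity between two {sensor: count} frequency maps
    (1.0 = identical relative frequencies, 0.0 = no sensors in common)."""
    sensors = set(real) | set(synthetic)
    a = [real.get(s, 0) for s in sensors]
    b = [synthetic.get(s, 0) for s in sensors]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```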

  8. Technique for fast and efficient hierarchical clustering

    DOEpatents

    Stork, Christopher

    2013-10-08

    A fast and efficient technique for hierarchical clustering of samples in a dataset includes compressing the dataset to reduce a number of variables within each of the samples of the dataset. A nearest neighbor matrix is generated to identify nearest neighbor pairs between the samples based on differences between the variables of the samples. The samples are arranged into a hierarchy that groups the samples based on the nearest neighbor matrix. The hierarchy is rendered to a display to graphically illustrate similarities or differences between the samples.
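The patent abstract names three steps: compress each sample, build a nearest-neighbor table, and merge samples into a hierarchy. A toy sketch of those steps on invented data — compression here is just reduction to the sample mean, and rendering to a display is omitted:

```python
def compress(sample):
    """Stand-in for dimensionality reduction: collapse a sample to its mean."""
    return sum(sample) / len(sample)

def hierarchy(samples):
    """Repeatedly merge the nearest-neighbor pair of clusters; return the
    merge order as tuples of original sample indices."""
    values = {i: compress(s) for i, s in enumerate(samples)}
    labels = {i: (i,) for i in values}
    merges = []
    nxt = len(samples)
    while len(values) > 1:
        ids = sorted(values)
        # nearest-neighbor pair = clusters with the closest compressed values
        i, j = min(((a, b) for a in ids for b in ids if a < b),
                   key=lambda ab: abs(values[ab[0]] - values[ab[1]]))
        merges.append((labels[i], labels[j]))
        values[nxt] = (values[i] + values[j]) / 2   # merged cluster value
        labels[nxt] = labels[i] + labels[j]
        for k in (i, j):
            del values[k]
            del labels[k]
        nxt += 1
    return merges

# Two similar samples and one outlier: the similar pair merges first.
MERGES = hierarchy([[1, 1], [1, 1.2], [5, 5]])
```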

  9. Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures

    PubMed Central

    Wang, Ying; Fu, Lei; Ren, Jie; Yu, Zhaoxia; Chen, Ting; Sun, Fengzhu

    2018-01-01

Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying group-specific sequences offers essential information for potential biomarker discovery. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered “group-specific” in our study. Our main purpose is to discover group-specific sequence regions between control and case groups as disease-associated markers. We developed a long k-mer (k ≥ 30 bps)-based computational pipeline to detect group-specific sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide de novo assembly. We called our method MetaGO: Group-specific oligonucleotide analysis for metagenomic samples. An open-source pipeline on Apache Spark was developed with parallel computing. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of identified group-specific markers. In the simulated dataset, 99.11% of group-specific logical 40-mers covered 98.89% of the disease-specific regions from the disease-associated strain. In addition, 97.90% of group-specific numerical 40-mers covered 99.61% and 96.39% of the differentially abundant genomes and regions between the two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 group-specific 40-mer features. Any one of these features can predict the disease status of the training samples with an average of sensitivity and specificity above 0.8. Random forest classification using the top 10 group-specific features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All group-specific 40-mers were present in LC patients, but not in healthy controls. All 11 assembled LC-specific sequences can be mapped to two strains of Veillonella parvula: UTDB1-3 and DSM2008. The experiments on the other two real datasets, related to Inflammatory Bowel Disease and Type 2 Diabetes in Women, consistently demonstrated that MetaGO achieved better prediction accuracy with fewer features compared to previous studies. The experiments showed that MetaGO is a powerful tool for identifying group-specific k-mers, which would be clinically applicable for disease prediction. MetaGO is available at https://github.com/VVsmileyx/MetaGO. PMID:29774017
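The core "logical group-specific k-mer" idea in this record — a k-mer present in every case sample and absent from every control sample — can be sketched with plain set operations. Here k is shortened from the paper's 40 to fit toy reads, and all sequences are invented; the real pipeline runs at scale on Apache Spark:

```python
def kmers(seq, k):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def group_specific_kmers(cases, controls, k):
    """k-mers shared by all case samples and absent from all control samples."""
    shared = set.intersection(*(kmers(s, k) for s in cases))
    seen_in_controls = set.union(*(kmers(s, k) for s in controls))
    return shared - seen_in_controls
```

The paper's "numerical" k-mers extend this boolean criterion to abundance differences; only the logical (presence/absence) case is shown here.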

  10. Trends in ice sheet mass balance, 1992 to 2017

    NASA Astrophysics Data System (ADS)

    Shepherd, A.; Ivins, E. R.; Smith, B.; Velicogna, I.; Whitehouse, P. L.; Rignot, E. J.; van den Broeke, M. R.; Briggs, K.; Hogg, A.; Krinner, G.; Joughin, I. R.; Nowicki, S.; Payne, A. J.; Scambos, T.; Schlegel, N.; Moyano, G.; Konrad, H.

    2017-12-01

The Ice Sheet Mass Balance Inter-Comparison Exercise (IMBIE) is a community effort, jointly supported by ESA and NASA, that aims to provide a consensus estimate of ice sheet mass balance from satellite gravimetry, altimetry and mass budget assessments, on an annual basis. The project has five experiment groups, one for each of the satellite techniques and two others to analyse surface mass balance (SMB) and glacial isostatic adjustment (GIA). The basic premise for the exercise is that individual ice sheet mass balance datasets are generated by project participants using common spatial and temporal domains to allow meaningful inter-comparison, and this controlled comparison in turn supports aggregation of the individual datasets over their full period. Participation is open to the full community, and the quality and consistency of submissions is regulated through a series of data standards and documentation requirements. The second phase of IMBIE commenced in 2015, with participant data submitted in 2016 and a combined estimate due for public release in 2017. Data from 48 participant groups were submitted to one of the three satellite mass balance technique groups or to the ancillary dataset groups. The individual mass balance estimates and ancillary datasets have been compared and combined within the respective groups. Following this, estimates of ice sheet mass balance derived from the individual techniques were then compared and combined. The result is a single estimate of ice sheet mass balance for each of Greenland, East Antarctica, West Antarctica, and the Antarctic Peninsula. The participants, methodology, and results of the exercise will be presented in this paper.

  11. The Harvard organic photovoltaic dataset

    DOE PAGES

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; ...

    2016-09-27

    Presented in this work is the Harvard Organic Photovoltaic Dataset (HOPV15), a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  12. The Harvard organic photovoltaic dataset

    PubMed Central

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-01-01

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312

  13. The Harvard organic photovoltaic dataset.

    PubMed

    Lopez, Steven A; Pyzer-Knapp, Edward O; Simm, Gregor N; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-09-27

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  14. DataUp: A tool to help researchers describe and share tabular data.

    PubMed

    Strasser, Carly; Kunze, John; Abrams, Stephen; Cruse, Patricia

    2014-01-01

Scientific datasets have immeasurable value, but they lose their value over time without proper documentation, long-term storage, and easy discovery and access. Across disciplines as diverse as astronomy, demography, archeology, and ecology, large numbers of small heterogeneous datasets (i.e., the long tail of data) are especially at risk unless they are properly documented, saved, and shared. One unifying factor for many of these at-risk datasets is that they reside in spreadsheets. In response to this need, the California Digital Library (CDL) partnered with Microsoft Research Connections and the Gordon and Betty Moore Foundation to create the DataUp data management tool for Microsoft Excel. Many researchers creating these small, heterogeneous datasets use Excel at some point in their data collection and analysis workflow, so we were interested in developing a data management tool that fits easily into those workflows and minimizes the learning curve for researchers. The DataUp project began in August 2011. We first formally assessed the needs of researchers by conducting surveys and interviews of our target research groups: earth, environmental, and ecological scientists. We found that, on average, researchers had very poor data management practices, were not aware of data centers or metadata standards, and did not understand the benefits of data management or sharing. Based on our survey results, we composed a list of desirable components and requirements and solicited feedback from the community to prioritize potential features of the DataUp tool. These requirements were then relayed to the software developers, and DataUp was successfully launched in October 2012.

  15. DataUp: A tool to help researchers describe and share tabular data

    PubMed Central

    Strasser, Carly; Kunze, John; Abrams, Stephen; Cruse, Patricia

    2014-01-01

Scientific datasets have immeasurable value, but they lose their value over time without proper documentation, long-term storage, and easy discovery and access. Across disciplines as diverse as astronomy, demography, archeology, and ecology, large numbers of small heterogeneous datasets (i.e., the long tail of data) are especially at risk unless they are properly documented, saved, and shared. One unifying factor for many of these at-risk datasets is that they reside in spreadsheets. In response to this need, the California Digital Library (CDL) partnered with Microsoft Research Connections and the Gordon and Betty Moore Foundation to create the DataUp data management tool for Microsoft Excel. Many researchers creating these small, heterogeneous datasets use Excel at some point in their data collection and analysis workflow, so we were interested in developing a data management tool that fits easily into those workflows and minimizes the learning curve for researchers. The DataUp project began in August 2011. We first formally assessed the needs of researchers by conducting surveys and interviews of our target research groups: earth, environmental, and ecological scientists. We found that, on average, researchers had very poor data management practices, were not aware of data centers or metadata standards, and did not understand the benefits of data management or sharing. Based on our survey results, we composed a list of desirable components and requirements and solicited feedback from the community to prioritize potential features of the DataUp tool. These requirements were then relayed to the software developers, and DataUp was successfully launched in October 2012. PMID:25653834

  16. Generating and Visualizing Climate Indices using Google Earth Engine

    NASA Astrophysics Data System (ADS)

    Erickson, T. A.; Guentchev, G.; Rood, R. B.

    2017-12-01

Climate change is expected to have its largest impacts at regional and local scales. Relevant and credible climate information is needed to support planning and adaptation efforts in our communities. The volume of climate projections of temperature and precipitation is steadily increasing, as datasets are generated on finer spatial and temporal grids with an increasing number of ensembles to characterize uncertainty. Despite advancements in tools for querying and retrieving subsets of these large, multi-dimensional datasets, ease of access remains a barrier for many existing and potential users who want to derive useful information from these data, particularly for those outside of the climate modelling research community. Climate indices that can be derived from daily temperature and precipitation data, such as the annual number of frost days or growing season length, can provide useful information to practitioners and stakeholders. For this work the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset was loaded into Google Earth Engine, a cloud-based geospatial processing platform. Algorithms were written that use the Earth Engine API to generate several climate indices, chosen from the set developed by the joint CCl/CLIVAR/JCOMM Expert Team on Climate Change Detection and Indices (ETCCDI). Simple user interfaces were created that allow users to query the indices, produce maps and graphs of them, and download results for additional analyses. These browser-based interfaces could allow users in low-bandwidth environments to access climate information. This research shows that calculating climate indices from global downscaled climate projection datasets and sharing them widely using cloud computing technologies is feasible.
Further development will focus on exposing the climate indices to existing applications via the Earth Engine API, and building custom user interfaces for presenting climate indices to a diverse set of user groups.
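One of the ETCCDI indices this record names, annual frost days, reduces to a simple count over a daily minimum-temperature series. A minimal sketch (the real computation runs over NEX-GDDP grids via the Earth Engine API; the temperatures below are invented):

```python
def frost_days(daily_tmin_celsius):
    """ETCCDI frost days (FD): count of days with daily minimum
    temperature below 0 degrees Celsius."""
    return sum(1 for t in daily_tmin_celsius if t < 0.0)
```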

  17. The Climate Hazards group InfraRed Precipitation with Stations (CHIRPS) dataset and its applications in drought risk management

    NASA Astrophysics Data System (ADS)

    Shukla, Shraddhanand; Funk, Chris; Peterson, Pete; McNally, Amy; Dinku, Tufa; Barbosa, Humberto; Paredes-Trejo, Franklin; Pedreros, Diego; Husak, Greg

    2017-04-01

    A high-quality, long-term, high-resolution precipitation dataset is key to supporting drought-related risk management and food security early warning. Here, we present the Climate Hazards group InfraRed Precipitation with Stations (CHIRPS) v2.0, developed by scientists at the University of California, Santa Barbara and the U.S. Geological Survey Earth Resources Observation and Science Center under the direction of the Famine Early Warning Systems Network (FEWS NET). CHIRPS is a quasi-global precipitation product made available at daily to seasonal time scales with a spatial resolution of 0.05° and a 1981 to near real-time period of record. We begin by describing the three main components of CHIRPS: a high-resolution climatology, time-varying cold cloud duration precipitation estimates, and in situ precipitation estimates, and how they are combined. We then present a validation of this dataset and describe how CHIRPS is being disseminated and used in different applications, such as large-scale hydrologic models and crop water balance models. Validation of CHIRPS has focused on comparisons with precipitation products offering global coverage, long periods of record, and near real-time availability, such as CPC-Unified, CFS Reanalysis, and ECMWF datasets, and with datasets such as GPCC and GPCP that incorporate high-quality in situ data from places such as Uganda, Colombia, and the Sahel. CHIRPS is shown to have low systematic errors (bias) and low mean absolute errors. We find that CHIRPS performance is quite similar to that of research-quality products like GPCC and GPCP, but with higher resolution and lower latency. We also present results from independent validation studies focused on South America and East Africa. CHIRPS is currently being used to drive the FEWS NET Land Data Assimilation System (FLDAS), which incorporates multiple hydrologic models, and the Water Requirement Satisfaction Index (WRSI), a widely used crop water balance model. 
The outputs (such as soil moisture and runoff) from these models are being used for real-time drought monitoring in Africa. With support from USAID FEWS NET, CHG/USGS has developed a two-way strategy to disseminate CHIRPS and related products (e.g., FLDAS, WRSI) and to incorporate contributed station data. For example, we are currently working with partners in Mexico (Conagua), Southern Africa (SASSCAL), Colombia (IDEAM), Nigeria (Kukua), Somalia (SWALIM) and Ethiopia (NMA). These institutions provide in situ observations that enhance CHIRPS, and CHIRPS in turn provides feedback on data quality. CHIRPS is then placed in a web-accessible geospatial database. Partners in these countries can then access CHIRPS and other outputs, and display this information using web-based mapping tools. This win-win collaboration leads to improved, globally accessible precipitation estimates and improved climate services in developing nations.
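The station-enhancement idea can be illustrated with a toy calculation. The sketch below is not the CHIRPS blending algorithm; it shows one generic way to nudge a gridded estimate toward contributed station observations by interpolating station-minus-grid residuals with inverse-distance weights (all names and the weighting scheme are illustrative assumptions).

```python
import numpy as np

def blend_with_stations(grid, grid_xy, stn_xy, stn_obs, power=2.0):
    """Toy station blend (not the CHIRPS algorithm).

    grid    : (n_cells,)   gridded precipitation estimate
    grid_xy : (n_cells, 2) cell coordinates
    stn_xy  : (n_stn, 2)   station coordinates
    stn_obs : (n_stn,)     station-observed precipitation
    """
    # Residual at each station = observation minus the co-located estimate
    # (here simply the nearest grid cell).
    d = np.linalg.norm(grid_xy[None, :, :] - stn_xy[:, None, :], axis=2)
    resid = stn_obs - grid[d.argmin(axis=1)]

    # Spread residuals to every cell with inverse-distance weights,
    # normalized over stations, then clamp precipitation at zero.
    w = 1.0 / np.maximum(d, 1e-6) ** power          # (n_stn, n_cells)
    w /= w.sum(axis=0, keepdims=True)
    return np.maximum(grid + (w * resid[:, None]).sum(axis=0), 0.0)
```

A desirable property of any such scheme, which the sketch preserves, is that the blended field reproduces the station values at the station locations.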

  18. Investigating a method of producing "red and dead" galaxies

    NASA Astrophysics Data System (ADS)

    Skory, Stephen

    2010-08-01

    In optical wavelengths, galaxies are observed to be either red or blue. The overall color of a galaxy is due to the distribution of the ages of its stellar population. Galaxies with currently active star formation appear blue, while those with no recent star formation at all (for longer than about a Gyr) have only old, red stars. This strong bimodality has led to the idea of star formation quenching, and various physical mechanisms for it have been proposed. In this dissertation, I attempt to reproduce with Enzo the results of Naab et al. (2007), in which red and dead galaxies are formed using gravitational quenching, rather than one of the more typical quenching mechanisms. My initial attempts are unsuccessful, and I explore the reasons why I think they failed. Then, using simpler methods better suited to Enzo + AMR, I am successful in producing a galaxy that appears to be similar in color and formation history to those in Naab et al. However, quenching is achieved using unphysically high star formation efficiencies, which is a different mechanism than Naab et al. suggest. Preliminary results of a much higher resolution follow-on simulation show some possible contradiction with the results of Naab et al.: cold gas is streaming into the galaxy to fuel starbursts, while at a similar epoch the galaxies in Naab et al. have largely already ceased forming stars. On the other hand, the results of the high-resolution simulation are qualitatively similar to other works in the literature that show a somewhat different gravitational quenching mechanism than Naab et al. I also discuss my work using halo finders to analyze simulated cosmological data, and my work improving the Enzo/AMR analysis tool "yt". This includes two parallelizations of the halo finder HOP (Eisenstein and Hut, 1998), which allow analysis of very large cosmological datasets on parallel machines. 
The first version is "yt-HOP," which works well for datasets between about 256³ and 512³ particles, but has memory bottlenecks as the datasets get larger. These bottlenecks inspired the second version, "Parallel HOP," which is a fully parallelized method and implementation of HOP that has worked on datasets with more than 2048³ particles on hundreds of processing cores. Both methods are described in detail, as are the various effects of performance-related runtime options. Additionally, both halo finders are subjected to a full suite of performance benchmarks varying both dataset sizes and computational resources used. I conclude with descriptions of four new tools I added to yt. A Parallel Structure Function Generator allows analysis of two-point functions, such as correlation functions, using memory- and workload-parallelism. A Parallel Merger Tree Generator leverages the parallel halo finders in yt, such as Parallel HOP, to build the merger tree of halos in a cosmological simulation, and outputs the result to a SQLite database for simple and powerful data extraction. A Star Particle Analysis toolkit takes a group of star particles and can output the rate of formation as a function of time, and/or a synthetic Spectral Energy Distribution (S.E.D.) using the Bruzual and Charlot (2003) data tables. Finally, a Halo Mass Function toolkit takes as input a list of halo masses and can output the halo mass function for the halos, as well as an analytical fit for those halos using several previously published fits.
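As an illustration of the two-point functions mentioned above, here is a deliberately naive serial correlation-function estimator (the yt Parallel Structure Function Generator is parallel and far more capable; the simple DD/RR natural estimator and all names below are assumptions for illustration).

```python
import numpy as np

def xi_estimate(data, box, bins, n_random=None, seed=0):
    """Toy two-point correlation function xi(r) = DD/RR - 1.

    Serial, O(N^2) pair counting against a uniform random catalog in a
    cube of side `box`; `bins` are separation bin edges."""
    rng = np.random.default_rng(seed)
    n_random = n_random or len(data)
    rand = rng.uniform(0.0, box, size=(n_random, data.shape[1]))

    def pair_fraction(pts):
        # Normalized histogram of all unique pair separations.
        d = np.linalg.norm(pts[None, :, :] - pts[:, None, :], axis=2)
        iu = np.triu_indices(len(pts), k=1)
        counts = np.histogram(d[iu], bins=bins)[0].astype(float)
        return counts / (len(pts) * (len(pts) - 1) / 2)

    dd = pair_fraction(data)
    rr = pair_fraction(rand)
    return dd / np.maximum(rr, 1e-12) - 1.0
```

Tightly clustered points give a large positive xi at small separations, while a uniform catalog gives xi near zero everywhere.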

  19. Online Visualization and Value Added Services of MERRA-2 Data at GES DISC

    NASA Technical Reports Server (NTRS)

    Shen, Suhung; Ostrenga, Dana M.; Vollmer, Bruce E.; Hegde, Mahabaleshwa S.; Wei, Jennifer C.; Bosilovich, Michael G.

    2017-01-01

    NASA climate reanalysis datasets from MERRA-2, distributed at the Goddard Earth Sciences Data and Information Services Center (GES DISC), have been used in broad research areas, such as climate variation, extreme weather, agriculture, renewable energy, and air quality. The datasets contain numerous variables for the atmosphere, land, and ocean, grouped into 95 products. The total archived volume was approximately 337 TB (roughly 562,000 files) at the end of October 2017. Because of the large number of products and files and the large data volumes, finding and downloading the data of interest can be a challenge. The support team at GES DISC, working closely with the MERRA-2 science team, has created and continues to work on value-added data services to best meet the needs of a broad user community. This presentation, using aerosols over the Asian monsoon region as an example, provides an overview of the MERRA-2 data services at GES DISC, including: How to find the data? How many data access methods are provided? Which data access methods are best for me? How to download subsetted (parameter, spatial, temporal) data and save it in a preferred spatial resolution and data format? How to visualize and explore the data online? In addition, we introduce a future online analytic tool designed to support application research, focusing on long-term hourly time-series data access and analysis.

  20. Testing charged current quasi-elastic and multinucleon interaction models in the NEUT neutrino interaction generator with published datasets from the MiniBooNE and MINERνA experiments

    NASA Astrophysics Data System (ADS)

    Wilkinson, C.; Terri, R.; Andreopoulos, C.; Bercellie, A.; Bronner, C.; Cartwright, S.; de Perio, P.; Dobson, J.; Duffy, K.; Furmanski, A. P.; Haegel, L.; Hayato, Y.; Kaboth, A.; Mahn, K.; McFarland, K. S.; Nowak, J.; Redij, A.; Rodrigues, P.; Sánchez, F.; Schwehr, J. D.; Sinclair, P.; Sobczyk, J. T.; Stamoulis, P.; Stowell, P.; Tacik, R.; Thompson, L.; Tobayama, S.; Wascko, M. O.; Żmuda, J.

    2016-04-01

    There has been a great deal of theoretical work on sophisticated charged current quasi-elastic (CCQE) neutrino interaction models in recent years, prompted by a number of experimental results that measured unexpectedly large CCQE cross sections on nuclear targets. As the dominant interaction mode at T2K energies, and the signal process in oscillation analyses, it is important for the T2K experiment to include realistic CCQE cross section uncertainties in T2K analyses. To this end, T2K's Neutrino Interaction Working Group has implemented a number of recent models in NEUT, T2K's primary neutrino interaction event generator. In this paper, we give an overview of the models implemented and present fits to published νμ and ν̄μ CCQE cross section measurements from the MiniBooNE and MINERνA experiments. The results of the fits are used to select a default cross section model for future T2K analyses and to constrain the cross section uncertainties of the model. We find strong tension between datasets for all models investigated. Among the evaluated models, the combination of a modified relativistic Fermi gas with multinucleon CCQE-like interactions gives the most consistent description of the available data.

  1. Filtering NetCDF Files by Using the EverVIEW Slice and Dice Tool

    USGS Publications Warehouse

    Conzelmann, Craig; Romañach, Stephanie S.

    2010-01-01

    Network Common Data Form (NetCDF) is a self-describing, machine-independent file format for storing array-oriented scientific data. It was created to provide a common interface between applications and real-time meteorological and other scientific data. Over the past few years, there has been a growing movement within the community of natural resource managers in The Everglades, Fla., to use NetCDF as the standard data container for datasets based on multidimensional arrays. As a consequence, a need surfaced for additional tools to view and manipulate NetCDF datasets, specifically to filter the files by creating subsets of large NetCDF files. The U.S. Geological Survey (USGS) and the Joint Ecosystem Modeling (JEM) group are working to address these needs with applications like the EverVIEW Slice and Dice Tool, which allows users to filter grid-based NetCDF files, thus targeting those data most important to them. The major functions of this tool are as follows: (1) to create subsets of NetCDF files temporally, spatially, and by data value; (2) to view the NetCDF data in table form; and (3) to export the filtered data to a comma-separated value (CSV) file format. The USGS and JEM will continue to work with scientists and natural resource managers across The Everglades to solve complex restoration problems through technological advances.
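The three filtering functions described above can be sketched in miniature. The code below is not the EverVIEW tool (which operates on grid-based NetCDF files); it applies the same temporal, spatial, and by-value filters to an in-memory (time, lat, lon) array and exports the surviving cells to CSV, with all names invented for illustration.

```python
import csv
import io
import numpy as np

def slice_and_dice(values, times, lats, lons, time_range, lat_range,
                   lon_range, value_range, out):
    """Subset a (time, lat, lon) grid temporally, spatially, and by value,
    then write the surviving cells as CSV rows to the file-like `out`."""
    w = csv.writer(out)
    w.writerow(["time", "lat", "lon", "value"])
    tsel = (times >= time_range[0]) & (times <= time_range[1])
    ysel = (lats >= lat_range[0]) & (lats <= lat_range[1])
    xsel = (lons >= lon_range[0]) & (lons <= lon_range[1])
    sub = values[np.ix_(np.where(tsel)[0], np.where(ysel)[0], np.where(xsel)[0])]
    for (i, j, k), v in np.ndenumerate(sub):
        if value_range[0] <= v <= value_range[1]:
            w.writerow([times[tsel][i], lats[ysel][j], lons[xsel][k], v])
```

With a real NetCDF file the same index-selection logic would be applied to the file's coordinate variables before reading the data slab.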

  2. Standardising trauma monitoring: the development of a minimum dataset for trauma registries in Australia and New Zealand.

    PubMed

    Palmer, Cameron S; Davey, Tamzyn M; Mok, Meng Tuck; McClure, Rod J; Farrow, Nathan C; Gruen, Russell L; Pollard, Cliff W

    2013-06-01

    Trauma registries are central to the implementation of effective trauma systems. However, differences between trauma registry datasets make comparisons between trauma systems difficult. In 2005, the collaborative Australian and New Zealand National Trauma Registry Consortium began a process to develop a bi-national minimum dataset (BMDS) for use in Australasian trauma registries. This study aims to describe the steps taken in the development and preliminary evaluation of the BMDS. A working party comprising sixteen representatives from across Australasia identified and discussed the collectability and utility of potential BMDS fields. This included evaluating existing national and international trauma registry datasets, as well as reviewing all quality indicators and audit filters in use in Australasian trauma centres. After the working party activities concluded, this process was continued by a number of interested individuals, with broader feedback sought from the Australasian trauma community on a number of occasions. Once the BMDS had reached a suitable stage of development, an email survey was conducted across Australasian trauma centres to assess whether BMDS fields met an ideal minimum standard of field collectability. The BMDS was also compared with three prominent international datasets to assess the extent of dataset overlap. Following this, the BMDS was encapsulated in a data dictionary, which was introduced in late 2010. The finalised BMDS contained 67 data fields. Forty-seven of these fields met a previously published criterion of 80% collectability across respondent trauma institutions; the majority of the remaining fields either could be collected without any change in resources, or could be calculated from other data fields in the BMDS. However, comparability with international registry datasets was poor. Only nine BMDS fields had corresponding, directly comparable fields in all the national and international-level registry datasets evaluated. 
A draft BMDS has been developed for use in trauma registries across Australia and New Zealand. The email survey provided strong indications of the utility of the fields contained in the BMDS. The BMDS has been adopted as the dataset to be used by an ongoing Australian Trauma Quality Improvement Program.

  3. Do citations and readership identify seminal publications?

    DOE PAGES

    Herrmannova, Drahomira; Patton, Robert M.; Knoth, Petr; ...

    2018-02-10

    Here, this work presents a new approach for analysing the ability of existing research metrics to identify research which has strongly influenced future developments. More specifically, we focus on the ability of citation counts and Mendeley reader counts to distinguish between publications regarded as seminal and publications regarded as literature reviews by field experts. The main motivation behind our research is to gain a better understanding of whether and how well the existing research metrics relate to research quality. For this experiment we have created a new dataset which we call TrueImpactDataset and which contains two types of publications, seminal papers and literature reviews. Using the dataset, we conduct a set of experiments to study how citation and reader counts perform in distinguishing these publication types, following the intuition that causing a change in a field signifies research quality. Finally, our research shows that citation counts work better than a random baseline (by a margin of 10%) in distinguishing important seminal research papers from literature reviews while Mendeley reader counts do not work better than the baseline.
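The separation experiment can be illustrated on synthetic counts. The sketch below is not the paper's method (the data and function are invented); it scores how well a single citation-count threshold separates two publication groups, to be compared against the 0.5 accuracy of a random baseline.

```python
import numpy as np

def best_threshold_accuracy(counts_a, counts_b):
    """Accuracy of the best single-threshold rule separating two groups
    (e.g. seminal vs. review), trying both orientations of the rule."""
    counts = np.concatenate([counts_a, counts_b])
    labels = np.concatenate([np.ones(len(counts_a)), np.zeros(len(counts_b))])
    best = 0.0
    for t in np.unique(counts):
        pred = (counts >= t).astype(float)
        best = max(best, (pred == labels).mean(), (1 - pred == labels).mean())
    return best
```

Perfectly separated groups score 1.0; groups with fully overlapping count distributions score near the 0.5 random baseline.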

  4. Do citations and readership identify seminal publications?

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Herrmannova, Drahomira; Patton, Robert M.; Knoth, Petr

    Here, this work presents a new approach for analysing the ability of existing research metrics to identify research which has strongly influenced future developments. More specifically, we focus on the ability of citation counts and Mendeley reader counts to distinguish between publications regarded as seminal and publications regarded as literature reviews by field experts. The main motivation behind our research is to gain a better understanding of whether and how well the existing research metrics relate to research quality. For this experiment we have created a new dataset which we call TrueImpactDataset and which contains two types of publications, seminal papers and literature reviews. Using the dataset, we conduct a set of experiments to study how citation and reader counts perform in distinguishing these publication types, following the intuition that causing a change in a field signifies research quality. Finally, our research shows that citation counts work better than a random baseline (by a margin of 10%) in distinguishing important seminal research papers from literature reviews while Mendeley reader counts do not work better than the baseline.

  5. A polymer dataset for accelerated property prediction and design.

    PubMed

    Huan, Tran Doan; Mannodi-Kanakkithodi, Arun; Kim, Chiho; Sharma, Vinit; Pilania, Ghanshyam; Ramprasad, Rampi

    2016-03-01

    Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a sufficiently large dataset of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high dielectric constant polymers, it is initially designed to include the optimized structures, atomization energies, band gaps, and dielectric constants. It will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.

  6. Simulation of Smart Home Activity Datasets

    PubMed Central

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-01-01

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation. PMID:26087371
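A minimal model-based simulator of the kind reviewed above might look as follows; the activities, sensors, and firing probabilities are invented for illustration.

```python
import numpy as np

# Per-activity probability that each (hypothetical) sensor fires in one
# time slot -- the "model" of a model-based simulator.
ACTIVITY_SENSOR_P = {
    "sleeping": {"bed_pressure": 0.9, "kitchen_motion": 0.01},
    "cooking":  {"bed_pressure": 0.01, "kitchen_motion": 0.8},
}

def simulate(schedule, seed=0):
    """schedule: list of (activity, n_slots) pairs describing a day.
    Returns a list of (slot_index, sensor) activation events."""
    rng = np.random.default_rng(seed)
    events, slot = [], 0
    for activity, n_slots in schedule:
        probs = ACTIVITY_SENSOR_P[activity]
        for _ in range(n_slots):
            for sensor, p in probs.items():
                if rng.random() < p:
                    events.append((slot, sensor))
            slot += 1
    return events
```

A labeled dataset for activity-recognition research is then the event stream paired with the ground-truth schedule that generated it.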

  7. Topographic and Hydrographic GIS Datasets for the Afghanistan Geological Survey and U.S. Geological Survey 2014 Mineral Areas of Interest

    USGS Publications Warehouse

    DeWitt, Jessica D.; Chirico, Peter G.; Malpeli, Katherine C.

    2015-11-18

    This work represents the fourth installment of the series, and publishes a dataset of eight new AOIs and one subarea within Afghanistan. These areas include Dasht-e-Nawar, Farah, North Ghazni, South Ghazni, Chakhansur, Godzareh East, Godzareh West, and Namaksar-e-Herat AOIs and the Central Bamyan subarea of the South Bamyan AOI (datasets for South Bamyan were published previously in Casey and Chirico, 2013). For each AOI and subarea, this dataset collection consists of the areal extent boundaries, elevation contours at 25-, 50-, and 100-m intervals, and an enhanced DEM. Hydrographic datasets covering the extent of four AOIs and one subarea are also included in the collection. The resulting raster and vector layers are intended for use by government agencies, developmental organizations, and private companies in Afghanistan to support mineral assessments, monitoring, management, and investment.

  8. Exploring Relationships in Big Data

    NASA Astrophysics Data System (ADS)

    Mahabal, A.; Djorgovski, S. G.; Crichton, D. J.; Cinquini, L.; Kelly, S.; Colbert, M. A.; Kincaid, H.

    2015-12-01

    Big Data are characterized by several different 'V's: Volume, Veracity, Volatility, Value, and so on. For many datasets, Volume inflated by redundant features makes the data noisier and the Value harder to extract. This is especially true when comparing or combining different datasets whose metadata are diverse. We have been exploring ways to exploit such datasets through a variety of statistical machinery, and visualization. We show how we have applied it to time-series from large astronomical sky-surveys. This was done in the Virtual Observatory framework. More recently we have been doing similar work for a completely different domain, viz. biology/cancer. The methodology reuse involves application to diverse datasets gathered through the various centers associated with the Early Detection Research Network (EDRN) for cancer, an initiative of the National Cancer Institute (NCI). Application to Geo datasets is a natural extension.
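One concrete example of trimming redundancy-inflated Volume: drop features that are nearly collinear with features already kept. This is a generic sketch, not the authors' methodology.

```python
import numpy as np

def drop_redundant(X, threshold=0.95):
    """Greedy redundancy filter: keep a feature (column) only if its
    absolute Pearson correlation with every already-kept feature stays
    below `threshold`.  Column order decides which duplicate survives."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep
```

A feature that is a rescaled copy of another correlates perfectly with it and is discarded, shrinking Volume without discarding information.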

  9. Simulation of Smart Home Activity Datasets.

    PubMed

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  10. Aircraft Crew Radiation Exposure in Aviation Altitudes During Quiet and Solar Storm Periods

    NASA Astrophysics Data System (ADS)

    Beck, Peter

    The European Commission Directorate General Transport and Energy published in 2004 a summary report of research on aircrew dosimetry carried out by the EURADOS working group WG5 (European Radiation Dosimetry Group, http://www.eurados.org/). The aim of WG5 was to bring together, in particular from European research groups, the available, preferably published, experimental data and results of calculations, together with detailed descriptions of the methods of measurement and calculation. The purpose is to provide a dataset for all European Union Member States for the assessment of individual doses and/or to assess the validity of different approaches, and to provide an input to technical recommendations by the experts and the European Commission. Furthermore, EURADOS has started to coordinate research activities on model improvements for dose assessment of solar particle events. Preliminary results related to the European research project CONRAD (Coordinated Network for Radiation Dosimetry) on complex mixed radiation fields at workplaces are presented. The major aim of this work is the validation of models for dose assessment of solar particle events, using data from neutron ground-level monitors, in-flight measurement results obtained during a solar particle event, and proton satellite data. The radiation protection quantity of interest is effective dose, E (ISO), but the comparison of measurement results obtained by different methods or groups, and the comparison of measurement results with the results of calculations, is done in terms of the operational quantity ambient dose equivalent, H*(10). This paper gives an overview of aircrew radiation exposure measurements during quiet and solar storm conditions and focuses on dose results using the EURADOS In-Flight Radiation Data Base and published data on solar particle events.

  11. Historical instrumental climate data for Australia - quality and utility for palaeoclimatic studies

    NASA Astrophysics Data System (ADS)

    Nicholls, Neville; Collins, Dean; Trewin, Blair; Hope, Pandora

    2006-10-01

    The quality and availability of climate data suitable for palaeoclimatic calibration and verification for the Australian region are discussed and documented. Details of the various datasets, including problems with the data, are presented. High-quality datasets, where such problems are reduced or even eliminated, are discussed. Many climate datasets are now analysed onto grids, facilitating the preparation of regional-average time series. Work is under way to produce such high-quality, gridded datasets for a variety of hitherto unavailable climate data, including surface humidity, pan evaporation, wind, and cloud. An experiment suggests that only a relatively small number of palaeoclimatic time series could provide a useful estimate of long-term changes in Australian annual average temperature.

  12. DNAism: exploring genomic datasets on the web with Horizon Charts.

    PubMed

    Rio Deiros, David; Gibbs, Richard A; Rogers, Jeffrey

    2016-01-27

    Computational biologists daily face the need to explore massive amounts of genomic data. New visualization techniques can help researchers navigate and understand these big data. Horizon Charts are a relatively new visualization method that, under the right circumstances, maximizes data density without losing graphical perception. Horizon Charts have been successfully applied to understand multi-metric time series data. We have adapted an existing JavaScript library (Cubism) that implements Horizon Charts for the time series domain so that it works effectively with genomic datasets. We call this new library DNAism. Horizon Charts can be an effective visual tool to explore complex and large genomic datasets. Researchers can use our library to leverage these techniques to extract additional insights from their own datasets.
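The core folding step behind a Horizon Chart can be sketched independently of Cubism/DNAism, which are JavaScript rendering libraries; the function below is an illustrative NumPy reimplementation of the banding idea only.

```python
import numpy as np

def horizon_bands(y, n_bands=3):
    """Horizon-chart folding: split a series into `n_bands` stacked layers
    per sign, each clipped to an equal slice of the value range.  A renderer
    would then overlay the layers in deepening color, so the full value
    range fits in one band's height without losing resolution."""
    top = np.max(np.abs(y)) / n_bands                      # one band's height
    pos = [np.clip(np.clip(y, 0, None) - i * top, 0, top)  # positive layers
           for i in range(n_bands)]
    neg = [np.clip(np.clip(-y, 0, None) - i * top, 0, top) # mirrored negatives
           for i in range(n_bands)]
    return np.array(pos), np.array(neg)
```

The transform is lossless in the sense that summing the layers recovers the positive and negative parts of the original series.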

  13. An open source multivariate framework for n-tissue segmentation with evaluation on public data.

    PubMed

    Avants, Brian B; Tustison, Nicholas J; Wu, Jue; Cook, Philip A; Gee, James C

    2011-12-01

    We introduce Atropos, an ITK-based multivariate n-class open source segmentation algorithm distributed with ANTs ( http://www.picsl.upenn.edu/ANTs). The Bayesian formulation of the segmentation problem is solved using the Expectation Maximization (EM) algorithm with the modeling of the class intensities based on either parametric or non-parametric finite mixtures. Atropos is capable of incorporating spatial prior probability maps (sparse), prior label maps and/or Markov Random Field (MRF) modeling. Atropos has also been efficiently implemented to handle large quantities of possible labelings (in the experimental section, we use up to 69 classes) with a minimal memory footprint. This work describes the technical and implementation aspects of Atropos and evaluates its performance on two different ground-truth datasets. First, we use the BrainWeb dataset from Montreal Neurological Institute to evaluate three-tissue segmentation performance via (1) K-means segmentation without use of template data; (2) MRF segmentation with initialization by prior probability maps derived from a group template; (3) Prior-based segmentation with use of spatial prior probability maps derived from a group template. We also evaluate Atropos performance by using spatial priors to drive a 69-class EM segmentation problem derived from the Hammers atlas from University College London. These evaluation studies, combined with illustrative examples that exercise Atropos options, demonstrate both performance and wide applicability of this new platform-independent open source segmentation tool.
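The parametric core of such a segmentation, EM on a finite Gaussian mixture, can be sketched in a few lines. This toy version is not Atropos itself: it omits spatial priors, prior label maps, and MRF smoothing, and works on 1-D intensities only.

```python
import numpy as np

def em_segment(intensities, n_classes=3, n_iter=50):
    """Minimal EM for a 1-D Gaussian mixture; returns the most probable
    class label per voxel (MAP under the mixture, no spatial terms)."""
    x = np.asarray(intensities, float)
    # Deterministic initialization: spread class means across the
    # intensity quantiles.
    mu = np.quantile(x, np.linspace(0.05, 0.95, n_classes))
    sigma = np.full(n_classes, x.std() + 1e-6)
    pi = np.full(n_classes, 1.0 / n_classes)
    for _ in range(n_iter):
        # E-step: class responsibilities from current Gaussian parameters.
        lik = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, and variances.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return resp.argmax(axis=1)
```

Tools like Atropos multiply the class likelihood by spatial prior probability maps and an MRF term before taking the arg-max; the E/M updates above are otherwise the same idea.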

  14. An Open Source Multivariate Framework for n-Tissue Segmentation with Evaluation on Public Data

    PubMed Central

    Tustison, Nicholas J.; Wu, Jue; Cook, Philip A.; Gee, James C.

    2012-01-01

    We introduce Atropos, an ITK-based multivariate n-class open source segmentation algorithm distributed with ANTs (http://www.picsl.upenn.edu/ANTs). The Bayesian formulation of the segmentation problem is solved using the Expectation Maximization (EM) algorithm with the modeling of the class intensities based on either parametric or non-parametric finite mixtures. Atropos is capable of incorporating spatial prior probability maps (sparse), prior label maps and/or Markov Random Field (MRF) modeling. Atropos has also been efficiently implemented to handle large quantities of possible labelings (in the experimental section, we use up to 69 classes) with a minimal memory footprint. This work describes the technical and implementation aspects of Atropos and evaluates its performance on two different ground-truth datasets. First, we use the BrainWeb dataset from Montreal Neurological Institute to evaluate three-tissue segmentation performance via (1) K-means segmentation without use of template data; (2) MRF segmentation with initialization by prior probability maps derived from a group template; (3) Prior-based segmentation with use of spatial prior probability maps derived from a group template. We also evaluate Atropos performance by using spatial priors to drive a 69-class EM segmentation problem derived from the Hammers atlas from University College London. These evaluation studies, combined with illustrative examples that exercise Atropos options, demonstrate both performance and wide applicability of this new platform-independent open source segmentation tool. PMID:21373993

  15. Analysis of genetic population structure in Acacia caven (Leguminosae, Mimosoideae), comparing one exploratory and two Bayesian-model-based methods

    PubMed Central

    Pometti, Carolina L.; Bessega, Cecilia F.; Saidman, Beatriz O.; Vilardi, Juan C.

    2014-01-01

    Bayesian clustering as implemented in STRUCTURE or GENELAND software is widely used to form genetic groups of populations or individuals. On the other hand, in order to satisfy the need for less computer-intensive approaches, multivariate analyses are specifically devoted to extracting information from large datasets. In this paper, we report the use of a dataset of AFLP markers belonging to 15 sampling sites of Acacia caven for studying the genetic structure and comparing the consistency of three methods: STRUCTURE, GENELAND and DAPC. Of these methods, DAPC was the fastest one and showed accuracy in inferring the K number of populations (K = 12 using the find.clusters option and K = 15 with a priori information of populations). GENELAND, in turn, provides information on the membership probabilities for individuals or populations in space, when coordinates are specified (K = 12). STRUCTURE also inferred the number of K populations and the membership probabilities of individuals based on ancestry, presenting the result K = 11 without prior information of populations and K = 15 using the LOCPRIOR option. Finally, in this work all three methods showed high consistency in estimating the population structure, inferring similar numbers of populations and membership probabilities of individuals to each group, with high correlation between each other. PMID:24688293

  16. Phylogenetic reassessment of tribe Anemoneae (Ranunculaceae): Non-monophyly of Anemone s.l. revealed by plastid datasets

    PubMed Central

    Yang, Jun-Bo; Zhang, Shu-Dong; Guan, Kai-Yun; Tan, Yun-Hong

    2017-01-01

    Morphological and molecular evidence strongly supported the monophyly of tribe Anemoneae DC.; however, phylogenetic relationships among genera of this tribe have still not been fully resolved. In this study, we sampled 120 specimens representing 82 taxa of tribe Anemoneae. One nuclear ribosomal internal transcribed spacer (nrITS) and six plastid markers (atpB-rbcL, matK, psbA-trnQ, rpoB-trnC, rbcL and rps16) were amplified and sequenced. Both maximum likelihood and Bayesian inference methods were used to reconstruct phylogenies for this tribe. Individual datasets supported all traditional genera as monophyletic, except Anemone and Clematis, which were polyphyletic and paraphyletic, respectively, and revealed that the seven single-gene datasets can be split into two groups, i.e. nrITS + atpB-rbcL and the remaining five plastid markers. The combined nrITS + atpB-rbcL dataset recovered the monophyly of subtribes Anemoninae (i.e. Anemone s.l.) and Clematidinae (including Anemoclema), respectively. However, the concatenated plastid dataset showed that one group of subtribe Anemoninae (Hepatica and Anemone spp. from subgenus Anemonidium) was close to the clade Clematis s.l. + Anemoclema. Our results strongly supported a close relationship between Anemoclema and Clematis s.l., which included Archiclematis and Naravelia. The non-monophyly of Anemone s.l. in the plastid dataset suggests a revision into two genera: a new Anemone s.l. (including Pulsatilla, Barneoudia, Oreithales and Knowltonia) and Hepatica (corresponding to Anemone subgenus Anemonidium). PMID:28362811

  18. SU-E-T-23: A Developing Australian Network for Datamining and Modelling Routine Radiotherapy Clinical Data and Radiomics Information for Rapid Learning and Clinical Decision Support

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thwaites, D; Holloway, L; Bailey, M

    2015-06-15

    Purpose: Large amounts of routine radiotherapy (RT) data are available, which can potentially add clinical evidence to support better decisions. A developing collaborative Australian network, with a leading European partner, aims to validate, implement and extend European predictive models (PMs) for Australian practice and assess their impact on future patient decisions. Wider objectives include: developing multi-institutional rapid learning, using distributed learning approaches; and assessing and incorporating radiomics information into PMs. Methods: Two initial standalone pilots were conducted, one on NSCLC and the other on larynx patient datasets, in two different centres. Open-source rapid learning systems were installed for data extraction and mining, to collect relevant clinical parameters from the centres' databases. The European DSSs were learned ("training cohort") and validated against local datasets ("clinical cohort"). Further NSCLC studies are underway in three more centres to pilot a wider distributed learning network. Initial radiomics work is underway. Results: For the NSCLC pilot, 159/419 patient datasets were identified as meeting the PM criteria and hence eligible for inclusion in the curative clinical cohort (for the larynx pilot, 109/125). Some missing data were imputed using Bayesian methods. For both, the European PMs successfully predicted prognosis groups, but with some differences in practice reflected. For example, the PM-predicted good prognosis NSCLC group was differentiated from a combined medium/poor prognosis group (2YOS 69% vs. 27%, p<0.001). Stage was less discriminatory in identifying prognostic groups. In the good prognosis group, two-year overall survival was 65% in curatively and 18% in palliatively treated patients.
Conclusion: The technical infrastructure and basic European PMs support prognosis prediction for these Australian patient groups, showing promise for supporting future personalized treatment decisions, improved treatment quality and potential practice changes. The early indications from the distributed learning and radiomics pilots strengthen this. Improved routine patient data quality should strengthen such rapid learning systems.

  19. GSNFS: Gene subnetwork biomarker identification of lung cancer expression data.

    PubMed

    Doungpan, Narumol; Engchuan, Worrawat; Chan, Jonathan H; Meechai, Asawin

    2016-12-05

    Gene expression has been used to identify disease gene biomarkers, but there are ongoing challenges. Single-gene or gene-set biomarkers are inadequate to provide sufficient understanding of complex disease mechanisms and of the relationships among those genes. Network-based methods have thus been considered for inferring the interactions within a group of genes to further study the disease mechanism. Recently, the Gene-Network-based Feature Set (GNFS) method, which is capable of handling case-control and multiclass expression data for gene biomarker identification, has been proposed, partly taking network topology into account. However, its performance relies on a greedy search for building subnetworks and thus requires further improvement. In this work, we establish a new approach named Gene Sub-Network-based Feature Selection (GSNFS) by implementing the GNFS framework with two proposed searching and scoring algorithms, namely gene-set-based (GS) search and parent-node-based (PN) search, to identify subnetworks. An additional dataset is used to validate the results. The two proposed searching algorithms of the GSNFS method for subnetwork expansion are concerned with the degree of connectivity and the scoring scheme used to build subnetworks and their topology. For each iteration of expansion, the neighbour genes of the current subnetwork whose expression data improve the overall subnetwork score are recruited. While the GS search calculates the subnetwork score using the activity score of the current subnetwork and the gene expression values of its neighbours, the PN search uses the expression value of the corresponding parent of each neighbour gene. Four lung cancer expression datasets were used for subnetwork identification. The use of pathway data and protein-protein interactions as network data, so that interactions among significant genes are taken into account, is also discussed.
Classification was performed to compare the performance of the identified gene subnetworks with three subnetwork identification algorithms. The two searching algorithms resulted in better classification and gene/gene-set agreement compared to the original greedy search of the GNFS method. The identified lung cancer subnetwork using the proposed searching algorithm resulted in an improvement of the cross-dataset validation and an increase in the consistency of findings between two independent datasets. The homogeneity measurement of the datasets was conducted to assess dataset compatibility in cross-dataset validation. The lung cancer dataset with higher homogeneity showed a better result when using the GS search while the dataset with low homogeneity showed a better result when using the PN search. The 10-fold cross-dataset validation on the independent lung cancer datasets showed higher classification performance of the proposed algorithms when compared with the greedy search in the original GNFS method. The proposed searching algorithms provide a higher number of genes in the subnetwork expansion step than the greedy algorithm. As a result, the performance of the subnetworks identified from the GSNFS method was improved in terms of classification performance and gene/gene-set level agreement depending on the homogeneity of the datasets used in the analysis. Some common genes obtained from the four datasets using different searching algorithms are genes known to play a role in lung cancer. The improvement of classification performance and the gene/gene-set level agreement, and the biological relevance indicated the effectiveness of the GSNFS method for gene subnetwork identification using expression data.
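
The greedy expansion idea that GS- and PN-style searches refine can be sketched as follows. This is a simplified stand-in, not the published GSNFS code: a subnetwork is grown from a seed gene by repeatedly recruiting the neighbour whose expression most improves a two-class t-like score of the subnetwork's averaged activity. The gene names, adjacency encoding and scoring details here are our own assumptions:

```python
import numpy as np

def subnetwork_score(expr, genes, labels):
    """Two-class t-like statistic of the averaged (activity) profile
    of a gene set. `expr` maps gene -> expression vector over samples,
    `labels` is a boolean class indicator per sample."""
    labels = np.asarray(labels, bool)
    act = np.mean([np.asarray(expr[g], float) for g in genes], axis=0)
    a, b = act[labels], act[~labels]
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / a.size + b.var(ddof=1) / b.size + 1e-9)

def greedy_expand(seed_gene, adjacency, expr, labels, max_size=10):
    """Grow a subnetwork from `seed_gene`, adding at each step the
    neighbour that most increases the absolute subnetwork score."""
    sub = {seed_gene}
    best = subnetwork_score(expr, sub, labels)
    while len(sub) < max_size:
        frontier = sorted({n for g in sub
                           for n in adjacency.get(g, ())} - sub)
        if not frontier:
            break
        gain, cand = max((abs(subnetwork_score(expr, sub | {c}, labels)), c)
                         for c in frontier)
        if gain <= abs(best):
            break  # no neighbour improves the score: stop expanding
        sub.add(cand)
        best = subnetwork_score(expr, sub, labels)
    return sub, best
```

In this toy form, the score of a candidate set is always computed from the pooled activity; the GS/PN distinction in the record above concerns which expression values enter that score.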

  20. Hierarchical Recognition Scheme for Human Facial Expression Recognition Systems

    PubMed Central

    Siddiqi, Muhammad Hameed; Lee, Sungyoung; Lee, Young-Koo; Khan, Adil Mehmood; Truc, Phan Tran Ho

    2013-01-01

    Over the last decade, human facial expression recognition (FER) has emerged as an important research area. Several factors make FER a challenging research problem. These include varying light conditions in training and test images; the need for automatic and accurate face detection before feature extraction; and high similarity among different expressions that makes it difficult to distinguish these expressions with high accuracy. This work implements a hierarchical linear discriminant analysis-based facial expression recognition (HL-FER) system to tackle these problems. Unlike previous systems, the HL-FER uses a pre-processing step to eliminate light effects, incorporates a new automatic face detection scheme, employs methods to extract both global and local features, and utilizes a hierarchical recognition scheme to overcome the problem of high similarity among different expressions. Unlike most previous works that were evaluated using a single dataset, the performance of the HL-FER is assessed using three publicly available datasets under three different experimental settings: n-fold cross-validation based on subjects for each dataset separately; n-fold cross-validation across datasets; and, finally, a last set of experiments to assess the effectiveness of each module of the HL-FER separately. A weighted average recognition accuracy of 98.7% across the three datasets, using three classifiers, indicates the success of employing the HL-FER for human FER. PMID:24316568

  1. Technical note: Space-time analysis of rainfall extremes in Italy: clues from a reconciled dataset

    NASA Astrophysics Data System (ADS)

    Libertino, Andrea; Ganora, Daniele; Claps, Pierluigi

    2018-05-01

    Like other Mediterranean areas, Italy is prone to the development of events with significant rainfall intensity, lasting for several hours. The main triggering mechanisms of these events are quite well known, but the aim of developing rainstorm hazard maps compatible with their actual probability of occurrence is still far from being reached. A systematic frequency analysis of these occasional highly intense events would require a complete countrywide dataset of sub-daily rainfall records, but this kind of information was still lacking for the Italian territory. In this work several sources of data are gathered, for assembling the first comprehensive and updated dataset of extreme rainfall of short duration in Italy. The resulting dataset, referred to as the Italian Rainfall Extreme Dataset (I-RED), includes the annual maximum rainfalls recorded in 1 to 24 consecutive hours from more than 4500 stations across the country, spanning the period between 1916 and 2014. A detailed description of the spatial and temporal coverage of the I-RED is presented, together with an exploratory statistical analysis aimed at providing preliminary information on the climatology of extreme rainfall at the national scale. Due to some legal restrictions, the database can be provided only under certain conditions. Taking into account the potentialities emerging from the analysis, a description of the ongoing and planned future work activities on the database is provided.

  2. Estimation of Recurrence of Colorectal Adenomas with Dependent Censoring Using Weighted Logistic Regression

    PubMed Central

    Hsu, Chiu-Hsieh; Li, Yisheng; Long, Qi; Zhao, Qiuhong; Lance, Peter

    2011-01-01

    In colorectal polyp prevention trials, estimation of the rate of recurrence of adenomas at the end of the trial may be complicated by dependent censoring, that is, time to follow-up colonoscopy and dropout may be dependent on time to recurrence. Assuming that the auxiliary variables capture the dependence between recurrence and censoring times, we propose to fit two working models with the auxiliary variables as covariates to define risk groups and then extend an existing weighted logistic regression method for independent censoring to each risk group to accommodate potential dependent censoring. In a simulation study, we show that the proposed method results in both a gain in efficiency and reduction in bias for estimating the recurrence rate. We illustrate the methodology by analyzing a recurrent adenoma dataset from a colorectal polyp prevention trial. PMID:22065985
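
The core estimator, logistic regression with per-subject weights, can be sketched via iteratively reweighted least squares (IRLS). This is a generic sketch, not the authors' exact procedure; in their setting the weights would come from censoring probabilities estimated within auxiliary-variable risk groups:

```python
import numpy as np

def weighted_logistic(X, y, w, n_iter=25):
    """Weighted logistic regression fit by Newton/IRLS.
    `w` holds per-subject weights, e.g. inverse probabilities of
    complete follow-up estimated within risk groups. Generic sketch."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept
    y = np.asarray(y, float)
    w = np.asarray(w, float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        # weighted score and Fisher information (ridge for stability)
        grad = X.T @ (w * (y - p))
        W = w * p * (1.0 - p)
        H = (X * W[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, grad)
    return beta
```

A useful sanity check on any weighted fit: giving one observation weight 2 must be equivalent to including it twice in the dataset.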

  3. Clustering of Multivariate Geostatistical Data

    NASA Astrophysics Data System (ADS)

    Fouedjio, Francky

    2017-04-01

    Multivariate data indexed by geographical coordinates have become omnipresent in the geosciences and pose substantial analysis challenges. One of them is the grouping of data locations into spatially contiguous clusters, so that data locations belonging to the same cluster have a certain degree of homogeneity while data locations in different clusters are as different as possible. However, groups of data locations created through classical clustering techniques turn out to show poor spatial contiguity, a feature obviously inconvenient for many geoscience applications. In this work, we develop a clustering method that overcomes this problem by accounting for the spatial dependence structure of the data, thus reinforcing the spatial contiguity of the resulting clusters. The capability of the proposed clustering method to provide spatially contiguous and meaningful clusters of data locations is assessed using both synthetic and real datasets. Keywords: clustering, geostatistics, spatial contiguity, spatial dependence.

  4. Nonparametric Combinatorial Sequence Models

    NASA Astrophysics Data System (ADS)

    Wauthier, Fabian L.; Jordan, Michael I.; Jojic, Nebojsa

    This work considers biological sequences that exhibit combinatorial structures in their composition: groups of positions of the aligned sequences are "linked" and covary as one unit across sequences. If multiple such groups exist, complex interactions can emerge between them. Sequences of this kind arise frequently in biology but methodologies for analyzing them are still being developed. This paper presents a nonparametric prior on sequences which allows combinatorial structures to emerge and which induces a posterior distribution over factorized sequence representations. We carry out experiments on three sequence datasets which indicate that combinatorial structures are indeed present and that combinatorial sequence models can more succinctly describe them than simpler mixture models. We conclude with an application to MHC binding prediction which highlights the utility of the posterior distribution induced by the prior. By integrating out the posterior our method compares favorably to leading binding predictors.

  5. ATGC transcriptomics: a web-based application to integrate, explore and analyze de novo transcriptomic data.

    PubMed

    Gonzalez, Sergio; Clavijo, Bernardo; Rivarola, Máximo; Moreno, Patricio; Fernandez, Paula; Dopazo, Joaquín; Paniego, Norma

    2017-02-22

    In recent years, applications based on massively parallelized RNA sequencing (RNA-seq) have become valuable approaches for studying non-model species, e.g., those without a fully sequenced genome. RNA-seq is a useful tool for detecting novel transcripts and genetic variations and for evaluating differential gene expression by digital measurements. The large and complex datasets resulting from functional genomic experiments represent a challenge in data processing, management, and analysis. This problem is especially significant for small research groups working with non-model species. We developed a web-based application, called ATGC transcriptomics, with a flexible and adaptable interface that allows users to work with next-generation sequencing (NGS) transcriptomic analysis results using an ontology-driven database. This new application simplifies data exploration, visualization, and integration for a better comprehension of the results. ATGC transcriptomics provides non-expert computer users and small research groups with a scalable storage option and simple data integration, including database administration and management. The software is freely available under the terms of the GNU public license at http://atgcinta.sourceforge.net .

  6. Omicseq: a web-based search engine for exploring omics datasets.

    PubMed

    Sun, Xiaobo; Pittard, William S; Xu, Tianlei; Chen, Li; Zwick, Michael E; Jiang, Xiaoqian; Wang, Fusheng; Qin, Zhaohui S

    2017-07-03

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  7. Multilevel principal component analysis (mPCA) in shape analysis: A feasibility study in medical and dental imaging.

    PubMed

    Farnell, D J J; Popat, H; Richmond, S

    2016-06-01

    Methods used in image processing should reflect any multilevel structures inherent in the image dataset or they run the risk of functioning inadequately. We wish to test the feasibility of multilevel principal components analysis (PCA) to build active shape models (ASMs) for cases relevant to medical and dental imaging. Multilevel PCA was used to carry out model fitting to sets of landmark points and it was compared to the results of "standard" (single-level) PCA. Proof of principle was tested by applying mPCA to model basic peri-oral expressions (happy, neutral, sad) approximated to the junction between the mouth/lips. Monte Carlo simulations were used to create this data which allowed exploration of practical implementation issues such as the number of landmark points, number of images, and number of groups (i.e., "expressions" for this example). To further test the robustness of the method, mPCA was subsequently applied to a dental imaging dataset utilising landmark points (placed by different clinicians) along the boundary of mandibular cortical bone in panoramic radiographs of the face. Changes of expression that varied between groups were modelled correctly at one level of the model and changes in lip width that varied within groups at another for the Monte Carlo dataset. Extreme cases in the test dataset were modelled adequately by mPCA but not by standard PCA. Similarly, variations in the shape of the cortical bone were modelled by one level of mPCA and variations between the experts at another for the panoramic radiographs dataset. Results for mPCA were found to be comparable to those of standard PCA for point-to-point errors via miss-one-out testing for this dataset. These errors reduce with increasing number of eigenvectors/values retained, as expected. We have shown that mPCA can be used in shape models for dental and medical image processing. mPCA was found to provide more control and flexibility when compared to standard "single-level" PCA. 
Specifically, mPCA is preferable to "standard" PCA when multiple levels occur naturally in the dataset. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
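
A minimal two-level flavour of mPCA can be sketched as follows: level 1 captures between-group shape variation (PCA of the group mean shapes) and level 2 captures within-group variation (PCA of the residuals about each group mean). This is an illustrative simplification of multilevel PCA, not the authors' implementation:

```python
import numpy as np

def two_level_pca(shapes, groups):
    """Two-level PCA sketch for shape vectors.
    Level 1: principal axes of between-group variation (group means).
    Level 2: principal axes of within-group variation (residuals).
    Returns (axes, singular values) per level plus the grand mean."""
    shapes = np.asarray(shapes, float)
    groups = np.asarray(groups)
    grand = shapes.mean(axis=0)
    uniq = np.unique(groups)
    means = {g: shapes[groups == g].mean(axis=0) for g in uniq}
    between = np.array([means[g] - grand for g in uniq])
    within = np.array([s - means[g] for s, g in zip(shapes, groups)])
    # principal axes via SVD of each centred data matrix
    _, sb, vb = np.linalg.svd(between, full_matrices=False)
    _, sw, vw = np.linalg.svd(within, full_matrices=False)
    return (vb, sb), (vw, sw), grand
```

In the Monte Carlo example of the record, "expression" would be the between-group level and "lip width" the within-group level; each shows up on its own set of axes.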

  8. Leveling data in geochemical mapping: scope of application, pros and cons of existing methods

    NASA Astrophysics Data System (ADS)

    Pereira, Benoît; Vandeuren, Aubry; Sonnet, Philippe

    2017-04-01

    Geochemical mapping has successfully met a range of needs, from mineral exploration to environmental management. In Europe and around the world, numerous geochemical datasets already exist. These datasets may originate from geochemical mapping projects or from the collection of sample analyses requested by environmental protection regulatory bodies. Combining datasets can be highly beneficial for establishing geochemical maps with increased resolution and/or coverage area. However, this practice requires assessing the equivalence between datasets and, if needed, applying data leveling to remove possible biases between datasets. In the literature, several procedures for assessing dataset equivalence and leveling data have been proposed. Daneshfar & Cameron (1998) proposed a method for the leveling of two adjacent datasets, while Pereira et al. (2016) proposed two methods for the leveling of datasets that contain records located within the same geographical area. Each discussed method requires its own set of assumptions (underlying populations of data, spatial distribution of data, etc.). Here we discuss the scope of application, pros, cons and practical recommendations for each method. This work is illustrated with several case studies in Wallonia (Southern Belgium) and in Europe involving trace element geochemical datasets. References: Daneshfar, B. & Cameron, E. (1998), Leveling geochemical data between map sheets, Journal of Geochemical Exploration 63(3), 189-201. Pereira, B.; Vandeuren, A.; Govaerts, B. B. & Sonnet, P. (2016), Assessing dataset equivalence and leveling data in geochemical mapping, Journal of Geochemical Exploration 168, 36-48.
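
As a minimal illustration of what leveling does, the sketch below removes an additive and a multiplicative bias from one dataset by matching its median and interquartile range to a reference dataset. This is a deliberately simple stand-in; the methods cited in this record are considerably richer (they address, e.g., spatial structure and map-sheet boundaries):

```python
import numpy as np

def level_datasets(ref, other):
    """Robust linear leveling sketch: rescale and shift `other` so its
    median and interquartile range match those of `ref`, removing a
    multiplicative and an additive bias between the two surveys."""
    ref = np.asarray(ref, float)
    other = np.asarray(other, float)
    iqr = lambda a: np.subtract(*np.percentile(a, [75, 25]))
    scale = iqr(ref) / iqr(other)
    return (other - np.median(other)) * scale + np.median(ref)
```

Because the correction is a monotone linear map, the leveled dataset reproduces the reference median and IQR exactly; whether such a simple correction is adequate is precisely what the equivalence assessment above is meant to decide.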

  9. Continuous Toxicological Dose-Response Relationships Are Pretty Homogeneous (Society for Risk Analysis Annual Meeting)

    EPA Science Inventory

    Dose-response relationships for a wide range of in vivo and in vitro continuous datasets are well-described by a four-parameter exponential or Hill model, based on a recent analysis of multiple historical dose-response datasets, mostly with more than five dose groups (Slob and Se...

  10. Sleep stages identification in patients with sleep disorder using k-means clustering

    NASA Astrophysics Data System (ADS)

    Fadhlullah, M. U.; Resahya, A.; Nugraha, D. F.; Yulita, I. N.

    2018-05-01

    Data mining is a computational intelligence discipline in which a large dataset is processed with a certain method to look for patterns within it. These patterns are then used in real-time applications or to develop new knowledge. It is a valuable tool for solving complex problems, discovering new knowledge, data analysis and decision making. To find the patterns that lie inside a large dataset, clustering is used. Clustering is basically grouping data that look similar, so that a certain pattern can be seen in the large dataset. Clustering itself has several algorithms for grouping the data into the corresponding clusters. This research used data from patients who suffer from sleep disorders and aims to help the medical world reduce the time required to classify sleep stages for such patients. This study used the K-means algorithm and silhouette evaluation, and found that 3 clusters are optimal for this dataset, meaning the data can be divided into 3 sleep stages.
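
The K-means-plus-silhouette procedure described in this record can be sketched with hand-rolled NumPy. The deterministic farthest-point initialisation below is our choice for reproducibility; the study's exact settings are not specified:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):  # next center: point farthest from all chosen
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

def mean_silhouette(X, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per point,
    where a = mean intra-cluster distance, b = nearest other cluster."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    s = np.empty(n)
    for i in range(n):
        same = labels == labels[i]
        if same.sum() == 1:
            s[i] = 0.0  # singleton cluster convention
            continue
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == c].mean()
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```

Scanning candidate cluster counts and keeping the one with the highest mean silhouette is exactly how an "optimal K = 3" conclusion like the one above is reached.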

  11. Use Hierarchical Storage and Analysis to Exploit Intrinsic Parallelism

    NASA Astrophysics Data System (ADS)

    Zender, C. S.; Wang, W.; Vicente, P.

    2013-12-01

    Big Data is an ugly name for the scientific opportunities and challenges created by the growing wealth of geoscience data. How do we weave large, disparate datasets together to best reveal their underlying properties, to exploit their strengths and minimize their weaknesses, and to continually aggregate more information than the world knew yesterday and less than we will learn tomorrow? Data analytics techniques (statistics, data mining, machine learning, etc.) can accelerate pattern recognition and discovery. However, researchers must often organize multiple related datasets into a coherent framework prior to analysis. Hierarchical organization permits entire datasets to be stored in nested groups that reflect their intrinsic relationships and similarities. Hierarchical data can be simpler and faster to analyze by coding operators to automatically parallelize processes over isomorphic storage units, i.e., groups. The newest generation of netCDF Operators (NCO) embodies this hierarchical approach, while still supporting traditional analysis approaches. We will use NCO to demonstrate the trade-offs involved in processing a prototypical Big Data application (analysis of CMIP5 datasets) using hierarchical and traditional analysis approaches.
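
The idea of parallelizing one analysis over isomorphic groups can be illustrated without netCDF at all. The sketch below walks a nested-dict stand-in for a group hierarchy and applies the same reduction to every leaf group in parallel; the store layout and all names are invented for illustration and carry no relation to NCO's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def leaf_groups(node, path=()):
    """Walk a nested-dict hierarchy (a stand-in for netCDF groups)
    and yield (path, variables) for each leaf group."""
    if any(isinstance(v, dict) for v in node.values()):
        for name, child in node.items():
            yield from leaf_groups(child, path + (name,))
    else:
        yield path, node

def group_mean(item):
    """The per-group operator: reduce every variable to its mean."""
    path, variables = item
    return "/".join(path), {k: mean(v) for k, v in variables.items()}

def parallel_group_stats(store, workers=4):
    """Apply the same operator to every leaf group concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return dict(ex.map(group_mean, leaf_groups(store)))
```

Because the groups are isomorphic, the same operator applies to each without per-group special cases, which is what makes the parallelization automatic.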

  12. Viability of Controlling Prosthetic Hand Utilizing Electroencephalograph (EEG) Dataset Signal

    NASA Astrophysics Data System (ADS)

    Miskon, Azizi; A/L Thanakodi, Suresh; Raihan Mazlan, Mohd; Mohd Haziq Azhar, Satria; Nooraya Mohd Tawil, Siti

    2016-11-01

    This project presents the development of an artificial hand controlled by electroencephalograph (EEG) signal datasets for prosthetic applications. The EEG signal datasets were used to improve the way the prosthetic hand is controlled, compared to electromyography (EMG). EMG has disadvantages for a person who has not used the relevant muscles for a long time, and also for persons with degenerative issues due to age. Thus, EEG datasets were found to be an alternative to EMG. The datasets used in this work were taken from a Brain Computer Interface (BCI) project and were already classified for open, close and combined movement operations. They served as input to control the prosthetic hand through an interface between Microsoft Visual Studio and Arduino. The obtained results reveal the prosthetic hand to be more efficient and faster in response to the EEG datasets, with an additional LiPo (lithium polymer) battery attached to the prosthetic. Some limitations were also identified in terms of the hand movements and the weight of the prosthetic, and suggestions for improvement are given in this paper. Overall, the objective of this paper was achieved, as the prosthetic hand was found to be feasible in operation using the EEG datasets.

  13. A century of transitions in New York City's measles dynamics.

    PubMed

    Hempel, Karsten; Earn, David J D

    2015-05-06

    Infectious diseases spreading in a human population occasionally exhibit sudden transitions in their qualitative dynamics. Previous work has successfully predicted such transitions in New York City's historical measles incidence using the seasonally forced susceptible-infectious-recovered (SIR) model. This work relied on a dataset spanning 45 years (1928-1973), which we have extended to 93 years (1891-1984). We identify additional dynamical transitions in the longer dataset and successfully explain them by analysing attractors and transients of the same mechanistic epidemiological model. © 2015 The Author(s) Published by the Royal Society. All rights reserved.

  14. The health care and life sciences community profile for dataset descriptions

    PubMed Central

    Alexiev, Vladimir; Ansell, Peter; Bader, Gary; Baran, Joachim; Bolleman, Jerven T.; Callahan, Alison; Cruz-Toledo, José; Gaudet, Pascale; Gombocz, Erich A.; Gonzalez-Beltran, Alejandra N.; Groth, Paul; Haendel, Melissa; Ito, Maori; Jupp, Simon; Juty, Nick; Katayama, Toshiaki; Kobayashi, Norio; Krishnaswami, Kalpana; Laibe, Camille; Le Novère, Nicolas; Lin, Simon; Malone, James; Miller, Michael; Mungall, Christopher J.; Rietveld, Laurens; Wimalaratne, Sarala M.; Yamaguchi, Atsuko

    2016-01-01

    Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. PMID:27602295

  15. Node, Node-Link, and Node-Link-Group Diagrams: An Evaluation.

    PubMed

    Saket, Bahador; Simonetto, Paolo; Kobourov, Stephen; Börner, Katy

    2014-12-01

    Effectively showing the relationships between objects in a dataset is one of the main tasks in information visualization. Typically there is a well-defined notion of distance between pairs of objects, and traditional approaches such as principal component analysis or multi-dimensional scaling are used to place the objects as points in 2D space, so that similar objects are close to each other. In another typical setting, the dataset is visualized as a network graph, where related nodes are connected by links. More recently, datasets are also visualized as maps, where in addition to nodes and links, there is an explicit representation of groups and clusters. We consider these three techniques, characterized by a progressive increase in the amount of encoded information: node diagrams, node-link diagrams and node-link-group diagrams. We assess these three types of diagrams with a controlled experiment that covers nine different tasks falling broadly into three categories: node-based tasks, network-based tasks and group-based tasks. Our findings indicate that adding links, or links and group representations, does not negatively impact the performance (time and accuracy) of node-based tasks. Similarly, adding group representations does not negatively impact the performance of network-based tasks. Node-link-group diagrams outperform the others on group-based tasks. These conclusions contradict results in other studies, in similar but subtly different settings. Taken together, however, such results can have significant implications for the design of standard and domain-specific visualization tools.

  16. Aerosol climate time series from ESA Aerosol_cci (Invited)

    NASA Astrophysics Data System (ADS)

    Holzer-Popp, T.

    2013-12-01

    Within the ESA Climate Change Initiative (CCI) the Aerosol_cci project (mid 2010 - mid 2013, phase 2 proposed 2014-2016) has conducted intensive work to improve algorithms for the retrieval of aerosol information from the European sensors AATSR (3 algorithms), PARASOL, MERIS (3 algorithms), synergetic AATSR/SCIAMACHY, OMI and GOMOS. Whereas OMI and GOMOS were used to derive absorbing aerosol index and stratospheric extinction profiles, respectively, Aerosol Optical Depth (AOD) and Angstrom coefficient were retrieved from the other sensors. Global datasets for 2008 were produced and validated against independent ground-based data and other satellite datasets (MODIS, MISR). An additional 17-year dataset is currently being generated using ATSR-2/AATSR data. During the three years of the project, intensive collaborative efforts were made to improve the retrieval algorithms, focusing on the most critical modules. The team agreed on the use of a common definition for the aerosol optical properties. Cloud masking was evaluated, but a rigorous analysis with a prescribed cloud mask did not lead to improvement for all algorithms. Better results were obtained using a post-processing step in which sudden transitions, indicative of possible cloud contamination, were removed. Surface parameterization, which is most critical for the nadir-only algorithms (MERIS and synergetic AATSR/SCIAMACHY), was studied to a limited extent. The retrieval results for AOD, Ångström exponent (AE) and uncertainties were evaluated by comparison with data from AERONET (and a limited amount of MAN) sun photometers and with satellite data available from MODIS and MISR. Both Level 2 and Level 3 (gridded daily) datasets were validated. 
Several validation metrics were used (standard statistical quantities such as bias, RMSE, Pearson correlation and linear regression, as well as scoring approaches to quantitatively evaluate the spatial and temporal correlations against AERONET), and in some cases developed further, to evaluate the datasets and their regional and seasonal merits. The validation showed that most datasets have improved significantly and that in particular PARASOL (ocean only) provides excellent results. The metrics for the AATSR (land and ocean) datasets are similar to those of MODIS and MISR, with AATSR better in some land regions and worse in some others (ocean). However, AATSR coverage is smaller than that of MODIS due to swath width. The MERIS dataset provides better coverage than AATSR but has lower quality (especially over land) than the other datasets. The synergetic AATSR/SCIAMACHY dataset also has lower quality. The evaluation of the pixel uncertainties shows promising first results but also reveals that more work needs to be done to provide comprehensive information for data assimilation. Users (MACC/ECMWF, AEROCOM) confirmed the relevance of this additional information and encouraged Aerosol_cci to release the current uncertainties. The paper will summarize and discuss the results of three years of work in Aerosol_cci, extract the lessons learned and conclude with an outlook on the work proposed for the next three years. In this second phase, a cyclic effort of algorithm evolution, dataset generation, validation and assessment will be applied to produce and further improve complete time series from all sensors under investigation, new sensors will be added (e.g. IASI), and preparations for the Sentinel missions will be made.

  17. Publishing datasets with eSciDoc and panMetaDocs

    NASA Astrophysics Data System (ADS)

    Ulbricht, D.; Klump, J.; Bertelmann, R.

    2012-04-01

    Currently several research institutions worldwide undertake considerable efforts to have their scientific datasets published and to syndicate them to data portals as extensively described objects identified by a persistent identifier. This is done to foster the reuse of data, to make scientific work more transparent, and to create a citable entity that can be referenced unambiguously in written publications. GFZ Potsdam established a publishing workflow for file-based research datasets. Key software components are an eSciDoc infrastructure [1] and multiple instances of the data curation tool panMetaDocs [2]. The eSciDoc repository holds data objects and their associated metadata in container objects, called eSciDoc items. A key metadata element in this context is the publication status of the referenced dataset. PanMetaDocs, which is based on PanMetaWorks [3], is a PHP-based web application that allows data to be described with any XML-based metadata schema. The metadata fields can be filled with static or dynamic content, to reduce the number of fields that require manual entry to a minimum and to make use of contextual information in a project setting. Access rights can be applied to set the visibility of datasets to other project members and to allow collaboration on datasets, notification about changes (RSS), and interaction with the internal messaging system inherited from panMetaWorks. When a dataset is to be published, panMetaDocs allows the publication status of the eSciDoc item to be changed from "private" to "submitted" and the dataset to be prepared for verification by an external reviewer. After quality checks, the item publication status can be changed to "published". This makes the data and metadata available worldwide through the internet. PanMetaDocs is developed as an eSciDoc application. It is an easy-to-use graphical user interface to eSciDoc items, their data and metadata. 
It is also an application supporting a DOI publication agent during the process of publishing scientific datasets as electronic data supplements to research papers. Publication of research manuscripts already follows a well-established workflow that shares junctures with other processes and involves several parties in the process of dataset publication. The activities of the author, the reviewer, the print publisher and the data publisher have to be coordinated into a common data publication workflow. The case of data publication at GFZ Potsdam displays some specifics, e.g. the DOIDB webservice. The DOIDB is a proxy service at GFZ for the DataCite [4] DOI registration and its metadata store. DOIDB provides a local summary of the dataset DOIs registered through GFZ as a publication agent. An additional use case for the DOIDB is its ability to enrich the DataCite metadata with additional custom attributes, like a geographic reference in a DIF record. These attributes are not yet available in the DataCite metadata schema but would be valuable elements for the compilation of data catalogues in the earth sciences and for the dissemination of catalogue data via OAI-PMH. [1] http://www.escidoc.org , eSciDoc, FIZ Karlsruhe, Germany [2] http://panmetadocs.sf.net , panMetaDocs, GFZ Potsdam, Germany [3] http://metaworks.pangaea.de , panMetaWorks, Dr. R. Huber, MARUM, Univ. Bremen, Germany [4] http://www.datacite.org

  18. The Role of Datasets on Scientific Influence within Conflict Research

    PubMed Central

    Van Holt, Tracy; Johnson, Jeffery C.; Moates, Shiloh; Carley, Kathleen M.

    2016-01-01

    We inductively tested whether a coherent field of inquiry in human conflict research emerged in an analysis of published research involving “conflict” in the Web of Science (WoS) over a 66-year period (1945–2011). We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA), a specialized social network analysis, on this citation network (~1.5 million works) to highlight the main contributions in conflict research and to test whether research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA, suggesting a coherent field of inquiry, which means that researchers in the field acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed—such as interpersonal conflict or conflict among pharmaceuticals—did not form their own CP. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This is in contrast to a main path analysis of conflict from 1957–1971, where ideas did not persist: multiple paths existed, died, or emerged, reflecting a lack of scientific coherence (Carley, Hummon, and Harty, 1993). The critical path had a number of key features: 1) Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s–1990s and continued on until 2011. More recent intrastate studies that focused on inequalities emerged from earlier interstate studies on democratic peace on the path. 2) Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3) We used keyword analysis to independently show how the CP was topically linked (i.e., through democracy, modeling, resources, and geography). Publicly available conflict datasets developed early on helped shape the operationalization of conflict. In fact, 94% of the works on the CP that analyzed data either relied on publicly available datasets, or they generated a dataset and made it public. These datasets appear to be important in the development of conflict research, allowing for cross-case comparisons, and comparisons to previous works. PMID:27124569
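As a hedged illustration of the main/critical path idea described in this abstract, the sketch below computes search path count (SPC) edge weights on a tiny invented citation graph and greedily extracts the heaviest source-to-sink path. All node names and edges are hypothetical, and the study's actual CPA over ~1.5 million works is far more elaborate; this only shows the core mechanics.

```python
# Toy citation list: (citing, cited). Reversing the edges gives knowledge
# flow (older -> newer), which forms a DAG. All data here are invented.
citations = [("C", "A"), ("C", "B"), ("D", "C"), ("E", "C"), ("F", "D"), ("F", "E")]
succ, pred = {}, {}
for citing, cited in citations:
    succ.setdefault(cited, []).append(citing)   # knowledge flows cited -> citing
    succ.setdefault(citing, [])
    pred.setdefault(citing, []).append(cited)
    pred.setdefault(cited, [])

# Topological order via Kahn's algorithm.
indeg = {n: len(pred[n]) for n in succ}
order, queue = [], [n for n in succ if indeg[n] == 0]
while queue:
    n = queue.pop(0)
    order.append(n)
    for m in succ[n]:
        indeg[m] -= 1
        if indeg[m] == 0:
            queue.append(m)

# SPC weight of edge (u, v): (#paths source->u) * (#paths v->sink).
down = {n: 1 if not pred[n] else 0 for n in order}   # paths from any source
for n in order:
    for m in succ[n]:
        down[m] += down[n]
up = {n: 1 if not succ[n] else 0 for n in order}     # paths to any sink
for n in reversed(order):
    for m in succ[n]:
        up[n] += up[m]
spc = {(u, v): down[u] * up[v] for u in order for v in succ[u]}

# Greedy main path: start at the strongest source edge, then always follow
# the heaviest outgoing edge (ties broken by order encountered).
u, v = max((e for e in spc if not pred[e[0]]), key=spc.get)
path = [u, v]
while succ[path[-1]]:
    path.append(max(succ[path[-1]], key=lambda m: spc[(path[-1], m)]))
print(path)
```

On this toy graph every edge carries equal SPC weight, so the greedy walk simply yields one complete source-to-sink chain; on real citation data the weights differentiate the "backbone" works.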

  20. A Neuroelectrical Brain Imaging Study on the Perception of Figurative Paintings against Only their Color or Shape Contents.

    PubMed

    Maglione, Anton G; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio

    2017-01-01

    In this study, the cortical activity correlated with the perception and appreciation of different sets of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistically educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style datasets, respectively). Results suggest that the verbal appreciation of the Orig dataset, when compared to the Color and Style ones, was mainly correlated with the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Also within the first 10 s of observation: (1) the Orig dataset induced more emotion and was perceived with more appreciation than the Color and Style datasets; (2) the Style dataset was perceived with more attentional effort than the other investigated datasets. During the whole 30 s observation period: (1) the emotion induced by the Color and Style datasets increased across time while that induced by the Orig dataset remained stable; (2) the Color and Style datasets were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is coherent with the notion that active perception of the images, with sustained cognitive attention in parietal and central areas, caused the generation of the judgment about their aesthetic appreciation in frontal areas. PMID:28790907

  2. Santa Margarita Estuary Water Quality Monitoring Data

    DTIC Science & Technology

    2018-02-01

    ADMINISTRATIVE INFORMATION The work described in this report was performed for the Water Quality Section of the Environmental Security Marine Corps Base...water quality model calibration given interest and the necessary resources. The dataset should also inform the stakeholders and Regional Board on...period. Several additional ancillary datasets were collected during the monitoring timeframe that provide key information though they were not collected

  3. Cancer Detection in Microarray Data Using a Modified Cat Swarm Optimization Clustering Approach

    PubMed

    M, Pandi; R, Balamurugan; N, Sadhasivam

    2017-12-29

    Objective: A better understanding of functional genomics can be obtained by extracting patterns hidden in gene expression data. This could have paramount implications for cancer diagnosis, gene treatments and other domains. Clustering may reveal natural structures and identify interesting patterns in underlying data. The main objective of this research was to derive a heuristic approach to detection of highly co-expressed genes related to cancer from gene expression data with minimum Mean Squared Error (MSE). Methods: A modified CSO algorithm using Harmony Search (MCSO-HS) for clustering cancer gene expression data was applied. Experimental results were analyzed using two cancer gene expression benchmark datasets, for leukaemia and for breast cancer. Results: MCSO-HS performed better than HS and CSO, by 13% and 9%, respectively, with the leukaemia dataset; for the breast cancer dataset the improvements were 22% and 17%, respectively, in terms of MSE. Conclusion: The results showed MCSO-HS to outperform HS and CSO with both benchmark datasets. To validate the clustering results, this work was tested with internal and external cluster validation indices. This work also points to biological validation of clusters with gene ontology in terms of function, process and component. Creative Commons Attribution License

  4. Phylogenomic evidence for a recent and rapid radiation of lizards in the Patagonian Liolaemus fitzingerii species group.

    PubMed

    Grummer, Jared A; Morando, Mariana M; Avila, Luciano J; Sites, Jack W; Leaché, Adam D

    2018-08-01

    Rapid evolutionary radiations are difficult to resolve because divergence events are nearly synchronous and gene flow among nascent species can be high, resulting in a phylogenetic "bush". Large datasets composed of sequence loci from across the genome can potentially help resolve some of these difficult phylogenetic problems. A suitable test case is the Liolaemus fitzingerii species group of lizards, which includes twelve species that are broadly distributed in Argentinean Patagonia. The species in the group have had a complex evolutionary history that has led to high morphological variation and unstable taxonomy. We generated a sequence capture dataset of 580 nuclear loci for 28 ingroup individuals, alongside a mitogenomic dataset, to infer phylogenetic relationships among species in this group. Relationships among species were generally weakly supported with the nuclear data and, along with an inferred age of ∼2.6 million years, indicate either rapid evolution, hybridization, incomplete lineage sorting, non-informative data, or a combination thereof. We inferred a signal of mito-nuclear discordance, indicating potential hybridization between L. melanops and L. martorii, and phylogenetic network analyses provided support for 5 reticulation events among species. Phasing the nuclear loci did not provide additional insight into relationships or suspected patterns of hybridization. Only one clade, composed of L. camarones, L. fitzingerii, and L. xanthoviridis, was recovered across all analyses. Genomic datasets provide molecular systematists with new opportunities to resolve difficult phylogenetic problems, yet the lack of phylogenetic resolution in Patagonian Liolaemus is biologically meaningful and indicative of a recent and rapid evolutionary radiation. The phylogenetic relationships of the Liolaemus fitzingerii group may be best modeled as a reticulated network instead of a bifurcating phylogeny. Copyright © 2018 Elsevier Inc. All rights reserved.

  5. Population attributable risks of patient, child and organizational risk factors for perinatal mortality in hospital births.

    PubMed

    Poeran, Jashvant; Borsboom, Gerard J J M; de Graaf, Johanna P; Birnie, Erwin; Steegers, Eric A P; Bonsel, Gouke J

    2015-04-01

    The main objective of this study was to estimate the contributing role of maternal, child, and organizational risk factors in perinatal mortality by calculating their population attributable risks (PAR). The primary dataset comprised 1,020,749 singleton hospital births from ≥22 weeks' gestation (The Netherlands Perinatal Registry 2000-2008). PARs for single and grouped risk factors were estimated in four stages: (1) creating a duplicate dataset for each PAR analysis in which the risk factors of interest were set to the most favorable value (e.g., all women assigned 'Western' for the PAR calculation of ethnicity); (2) fitting an elaborate multilevel logistic regression model in the primary dataset, from which (3) the obtained coefficients were used to predict perinatal mortality in each duplicate dataset; (4) PARs were then estimated as the proportional change of predicted relative to observed perinatal mortality. Additionally, PARs for grouped risk factors were estimated using sequential values in two orders: after PAR estimation of grouped maternal risk factors, the resulting PARs for grouped child and grouped organizational factors were estimated, and vice versa. The combined PAR of maternal, child and organizational factors is 94.4 %, i.e., when all factors are set to the most favorable value, perinatal mortality is expected to be reduced by 94.4 %. Depending on the order of analysis, the PAR of maternal risk factors varies from 1.4 to 13.1 %, and for child and organizational factors from 58.7 to 74.0 % and from 7.3 to 34.3 %, respectively. In conclusion, the PAR of maternal, child and organizational factors combined is 94.4 %. Optimization of organizational factors may achieve a 34.3 % decrease in perinatal mortality.
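The four-stage PAR procedure in this abstract lends itself to a compact counterfactual sketch: fit a logistic model on the observed data, duplicate the dataset with the risk factor set to its most favorable value, predict mortality in both, and take the proportional change. The snippet below illustrates this on a fully synthetic cohort with a plain (not multilevel) logistic model fitted by Newton's method; all variable names and effect sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Synthetic cohort: one binary risk factor of interest plus a covariate.
exposed = rng.integers(0, 2, n)                       # hypothetical risk factor
age = rng.normal(30.0, 5.0, n)
true_logit = -6.0 + 1.2 * exposed + 0.03 * (age - 30.0)
y = rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))  # observed mortality

# Stage 2: fit a logistic regression by Newton's method (IRLS).
X = np.column_stack([np.ones(n), exposed, age - 30.0])
beta = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    H = (X * (p * (1.0 - p))[:, None]).T @ X
    beta += np.linalg.solve(H, grad)

# Stages 1 and 3: duplicate the dataset with the risk factor set to its
# most favorable value, then predict mortality in both datasets.
X_cf = X.copy()
X_cf[:, 1] = 0                                        # everyone "unexposed"
pred_obs = (1.0 / (1.0 + np.exp(-X @ beta))).sum()
pred_cf = (1.0 / (1.0 + np.exp(-X_cf @ beta))).sum()

# Stage 4: PAR as the proportional change of predicted vs. observed deaths.
par = (pred_obs - pred_cf) / pred_obs
print(f"PAR of the synthetic risk factor: {par:.1%}")
```

At the maximum-likelihood fit the predicted death count over the observed data matches the observed count (the score equation for the intercept forces it), so the PAR reduces to a comparison of observed against counterfactual predicted mortality, exactly as in stage (4).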

  6. Pooled solifenacin overactive bladder trial data: Creation, validation and analysis of an integrated database.

    PubMed

    Chapple, Christopher R; Cardozo, Linda; Snijder, Robert; Siddiqui, Emad; Herschorn, Sender

    2016-12-15

    Patient-level data are available for 11 randomized, controlled, Phase III/Phase IV solifenacin clinical trials. Meta-analyses were conducted to interrogate the data, to broaden knowledge about solifenacin and overactive bladder (OAB) in general. Before integrating data, datasets from individual studies were mapped to a single format using methodology developed by the Clinical Data Interchange Standards Consortium (CDISC). Initially, the data structure was harmonized, to ensure identical categorization, using the CDISC Study Data Tabulation Model (SDTM). To allow for patient level meta-analysis, data were integrated and mapped to analysis datasets. Mapping included adding derived and categorical variables and followed standards described as the Analysis Data Model (ADaM). Mapping to both SDTM and ADaM was performed twice by two independent programming teams, results compared, and inconsistencies corrected in the final output. ADaM analysis sets included assignments of patients to the Safety Analysis Set and the Full Analysis Set. There were three analysis groupings: Analysis group 1 (placebo-controlled, monotherapy, fixed-dose studies, n = 3011); Analysis group 2 (placebo-controlled, monotherapy, pooled, fixed- and flexible-dose, n = 5379); Analysis group 3 (all solifenacin monotherapy-treated patients, n = 6539). Treatment groups were: solifenacin 5 mg fixed dose, solifenacin 5/10 mg flexible dose, solifenacin 10 mg fixed dose and overall solifenacin. Patients were similar enough for data pooling to be acceptable. Creating ADaM datasets provided significant information about individual studies and the derivation decisions made in each study; validated ADaM datasets now exist for medical history, efficacy and AEs. Results from these meta-analyses were similar over time.

  7. The next generation of melanocyte data: Genetic, epigenetic, and transcriptional resource datasets and analysis tools.

    PubMed

    Loftus, Stacie K

    2018-05-01

    The number of melanocyte- and melanoma-derived next-generation sequencing genome-scale datasets has rapidly expanded over the past several years. This resource guide provides a summary of publicly available sources of melanocyte-derived whole-genome, exome, mRNA and miRNA transcriptome, chromatin accessibility, and epigenetic datasets. Also highlighted are bioinformatic resources and tools for visualization and data queries that allow researchers a genome-scale view of the melanocyte. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.

  8. A Critical Review of Automated Photogrammetric Processing of Large Datasets

    NASA Astrophysics Data System (ADS)

    Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.

    2017-08-01

    The paper reports some comparisons between commercial software packages able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of images of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a diverse number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of (photogrammetric) terms is also provided, in order to establish rigorous terms of reference for comparisons and critical analyses.

  9. Machine Learning Algorithms for Automatic Classification of Marmoset Vocalizations

    PubMed Central

    Ribeiro, Sidarta; Pereira, Danillo R.; Papa, João P.; de Albuquerque, Victor Hugo C.

    2016-01-01

    Automatic classification of vocalization type could potentially become a useful tool for the acoustic monitoring of captive colonies of highly vocal primates. However, for classification to be useful in practice, a reliable algorithm that can be successfully trained on small datasets is necessary. In this work, we consider seven different classification algorithms with the goal of finding a robust classifier that can be successfully trained on small datasets. We found good classification performance (accuracy > 0.83 and F1-score > 0.84) using the Optimum Path Forest classifier. Dataset and algorithms are made publicly available. PMID:27654941
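The evaluation protocol this abstract describes, comparing classifiers by cross-validation on a deliberately small dataset, can be sketched in a few lines. The example below uses two simple classifiers (nearest centroid and k-NN) on invented "call-like" feature vectors; the study's seven algorithms, including the Optimum-Path Forest classifier it recommends, and its real marmoset features are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for call features: 3 "call types", 30 samples each,
# 8 acoustic-like features. Everything here is invented for illustration.
centers = rng.normal(0, 3, (3, 8))
X = np.vstack([c + rng.normal(0, 1.0, (30, 8)) for c in centers])
y = np.repeat(np.arange(3), 30)

def nearest_centroid(Xtr, ytr, Xte):
    # Classify each test point by the closest class mean.
    cents = np.array([Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)])
    d = ((Xte[:, None, :] - cents[None]) ** 2).sum(-1)
    return d.argmin(1)

def knn(Xtr, ytr, Xte, k=3):
    # Majority vote among the k nearest training points.
    d = ((Xte[:, None, :] - Xtr[None]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(ytr[i]).argmax() for i in nn])

def cv_accuracy(clf, X, y, folds=5):
    # Plain k-fold cross-validation: hold out each fold once.
    idx = rng.permutation(len(y))
    scores = []
    for fold in np.array_split(idx, folds):
        mask = np.ones(len(y), bool)
        mask[fold] = False
        pred = clf(X[mask], y[mask], X[~mask])
        scores.append((pred == y[~mask]).mean())
    return float(np.mean(scores))

acc_nc = cv_accuracy(nearest_centroid, X, y)
acc_knn = cv_accuracy(knn, X, y)
print(f"nearest centroid: {acc_nc:.2f}  3-NN: {acc_knn:.2f}")
```

With only 90 samples, cross-validated accuracy (rather than a single train/test split) is what makes a "trained on small datasets" claim credible, which is the methodological point of the study.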

  10. Building predictive models for MERS-CoV infections using data mining techniques.

    PubMed

    Al-Turaiki, Isra; Alshahrani, Mona; Almutairi, Tahani

    Recently, the outbreak of MERS-CoV infections drew worldwide attention to Saudi Arabia. The novel virus belongs to the coronavirus family, which is responsible for causing mild to moderate colds. The control and command center of the Saudi Ministry of Health issues a daily report on MERS-CoV infection cases. Infection with MERS-CoV can lead to fatal complications; however, little is known about this novel virus. In this paper, we apply two data mining techniques in order to better understand the stability and the possibility of recovery from MERS-CoV infections. The Naive Bayes classifier and the J48 decision tree algorithm were used to build our models. The dataset used consists of 1082 records of cases reported between 2013 and 2015. In order to build our prediction models, we split the dataset into two groups. The first group combined recovery and death records. A new attribute was created to indicate the record type, such that the dataset can be used to predict recovery from MERS-CoV. The second group contained the new case records, to be used to predict the stability of the infection based on the current status attribute. The resulting recovery models indicate that healthcare workers are more likely to survive. This could be due to the vaccinations that healthcare workers are required to get on a regular basis. As for the stability models using J48, two attributes were found to be important for predicting stability: symptomatic and age. Older patients are at high risk of developing MERS-CoV complications. Finally, the performance of all the models was evaluated using three measures: accuracy, precision, and recall. In general, the accuracy of the models is between 53.6% and 71.58%. We believe that the performance of the prediction models can be enhanced with the use of more patient data. As future work, we plan to directly contact hospitals in Riyadh in order to collect more information related to patients with MERS-CoV infections. 
Copyright © 2016 King Saud Bin Abdulaziz University for Health Sciences. Published by Elsevier Ltd. All rights reserved.
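As a hedged sketch of the first modeling step this abstract describes, the snippet below trains a categorical Naive Bayes recovery model with Laplace-style smoothing on a handful of toy records. The attribute names (healthcare worker, age group, symptomatic) echo the abstract, but every record and value here is invented; the study's J48 decision tree (Weka's C4.5) is not reproduced.

```python
from collections import Counter, defaultdict

# Toy records: (healthcare_worker, age_group, symptomatic) -> outcome.
# All values are invented for illustration, not taken from the study data.
records = [
    ("yes", "young", "no",  "recovered"),
    ("yes", "young", "yes", "recovered"),
    ("yes", "old",   "yes", "recovered"),
    ("no",  "old",   "yes", "died"),
    ("no",  "old",   "yes", "died"),
    ("no",  "young", "no",  "recovered"),
    ("no",  "old",   "no",  "died"),
    ("yes", "old",   "no",  "recovered"),
]

def train(records):
    # Class priors plus per-attribute value counts for each class.
    prior = Counter(r[-1] for r in records)
    cond = defaultdict(Counter)          # (attr_index, class) -> value counts
    for *attrs, cls in records:
        for i, v in enumerate(attrs):
            cond[(i, cls)][v] += 1
    return prior, cond

def predict(prior, cond, attrs, alpha=1.0):
    # Naive Bayes score: prior * product of smoothed conditionals.
    total = sum(prior.values())
    scores = {}
    for cls, n_cls in prior.items():
        p = n_cls / total
        for i, v in enumerate(attrs):
            counts = cond[(i, cls)]
            p *= (counts[v] + alpha) / (n_cls + alpha * (len(counts) + 1))
        scores[cls] = p
    return max(scores, key=scores.get)

prior, cond = train(records)
print(predict(prior, cond, ("yes", "young", "no")))   # prints "recovered"
```

On these toy records the model reproduces the abstract's qualitative finding that healthcare workers are more likely to be classified as recovered, purely because the toy data were built that way; real performance would be measured with accuracy, precision, and recall as in the paper.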

  11. An inter-comparison of PM10 source apportionment using PCA and PMF receptor models in three European sites.

    PubMed

    Cesari, Daniela; Amato, F; Pandolfi, M; Alastuey, A; Querol, X; Contini, D

    2016-08-01

    Source apportionment of aerosol is an important approach to investigate aerosol formation and transformation processes, to assess appropriate mitigation strategies, and to investigate causes of non-compliance with air quality standards (Directive 2008/50/CE). Receptor models (RMs) based on the chemical composition of aerosol measured at specific sites are a useful, and widely used, tool to perform source apportionment. However, an analysis of available studies in the scientific literature reveals heterogeneities in the approaches used, in terms of "working variables" such as the number of samples in the dataset and the number of chemical species used, as well as in the modeling tools used. In this work, an inter-comparison of PM10 source apportionment results obtained at three European measurement sites is presented, using two receptor models: principal component analysis coupled with multi-linear regression analysis (PCA-MLRA) and positive matrix factorization (PMF). The inter-comparison focuses on source identification, quantification of source contributions to PM10, robustness of the results, and how these are influenced by the number of chemical species available in the datasets. Results show very similar component/factor profiles identified by PCA and PMF, with some discrepancies in the number of factors. The PMF model appears more suitable than PCA for separating secondary sulfate and secondary nitrate, at least in the datasets analyzed. Further, some difficulties were observed with PCA in separating industrial and heavy oil combustion contributions. At all sites, the crustal contributions found with PCA were larger than those found with PMF, and the secondary inorganic aerosol contributions found by PCA were lower than those found by PMF. Site-dependent differences were also observed for traffic and marine contributions. The inter-comparison of source apportionment performed on complete datasets (using the full range of available chemical species) and incomplete datasets (with a reduced number of chemical species) made it possible to investigate the sensitivity of source apportionment (SA) results to the working variables used in the RMs. Results show that, at all sites, the profiles and the contributions of the different sources calculated with PMF are comparable within the estimated uncertainties, indicating good stability and robustness of the PMF results. In contrast, PCA outputs are more sensitive to the chemical species present in the datasets: with PCA, the crustal contributions are higher and the traffic contributions significantly lower in the incomplete datasets.
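The PCA-MLRA step this abstract refers to can be sketched in miniature: run PCA on the standardized species concentrations, retain the major components, and regress PM10 mass on the component scores. The example below uses a synthetic two-source mixture with invented "traffic" and "crustal" profiles; real receptor modeling (varimax rotation, absolute component scores, uncertainty weighting, PMF) is deliberately omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
# Two invented source strengths and species profiles (EC, Al, NO3, Ca).
traffic = rng.gamma(2.0, 5.0, n)
crustal = rng.gamma(2.0, 3.0, n)
profiles = np.array([[0.80, 0.05, 0.40, 0.00],   # "traffic"
                     [0.05, 0.60, 0.00, 0.50]])  # "crustal"
species = np.outer(traffic, profiles[0]) + np.outer(crustal, profiles[1])
species += rng.normal(0, 0.2, species.shape)      # measurement noise
pm10 = traffic + crustal + rng.normal(0, 0.5, n)  # total mass

# PCA via SVD of the standardized species matrix.
Z = (species - species.mean(0)) / species.std(0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
eig = S**2 / (n - 1)
k = int((eig > 1.0).sum())        # Kaiser criterion for standardized data
scores = U[:, :k] * S[:k]         # retained component scores

# MLRA step: regress PM10 mass on the retained component scores.
A = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(A, pm10, rcond=None)
resid = pm10 - A @ coef
r2 = 1 - (resid**2).sum() / ((pm10 - pm10.mean())**2).sum()
print(f"retained components: {k}, regression R^2: {r2:.3f}")
```

Because the synthetic mixture really does have two sources, the Kaiser rule retains two components and the regression recovers most of the PM10 variance; with fewer measured species, as the abstract notes for PCA, the retained components and apportioned contributions become less stable.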

  12. Group Colocation Behavior in Technological Social Networks

    PubMed Central

    Brown, Chloë; Lathia, Neal; Mascolo, Cecilia; Noulas, Anastasios; Blondel, Vincent

    2014-01-01

    We analyze two large datasets from technological networks with location and social data: user location records from an online location-based social networking service, and anonymized telecommunications data from a European cellphone operator, in order to investigate the differences between individual and group behavior with respect to physical location. We find agreement between the two datasets: individuals are more likely to meet with one friend at a place they have not visited before, but tend to meet at familiar locations when with a larger group. We also find that groups of individuals are more likely to meet at places that their other friends have visited, and that the type of a place strongly affects the propensity for groups to meet there. These differences between group and solo mobility have potential technological applications, for example in venue recommendation in location-based social networks. PMID:25148037
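
    The pair-versus-group novelty effect described above can be measured with a small amount of bookkeeping; the toy check-in log below is invented and stands in for the study's location datasets:

```python
from collections import defaultdict

# Toy check-in log: (user, group_size, venue), assumed in time order.
# We ask, per group size, how often a user meets at a venue they have
# not visited before.
events = [
    ("a", 2, "cafe"), ("a", 2, "park"), ("a", 5, "cafe"),
    ("b", 2, "bar"),  ("b", 5, "bar"),  ("b", 5, "bar"),
]

seen = defaultdict(set)              # venues each user has already visited
novel = defaultdict(lambda: [0, 0])  # group_size -> [novel_meetings, total]

for user, size, venue in events:
    novel[size][1] += 1
    if venue not in seen[user]:
        novel[size][0] += 1
    seen[user].add(venue)

rates = {size: n / total for size, (n, total) in novel.items()}
print(rates)  # {2: 1.0, 5: 0.0} -> pairs meet at novel venues, groups do not
```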

  13. A comparison of the education and work experiences of immigrant and the United States of America-trained nurses.

    PubMed

    Mazurenko, O; Gupte, G; Shan, G

    2014-12-01

    This study examined the education and work experience of immigrant and American-trained registered nurses from 1988 to 2008. The USA increasingly relies on immigrant nurses to fill a significant nursing shortage. These nurses receive their training overseas but can obtain licenses to practice in other countries. Although immigrant nurses have been in the USA workforce for several decades, little is known about how their education and work experience compare with those of USA-trained nurses; yet much is presumed by policy makers and administrators who perpetuate the stereotype that immigrant nurses are not as qualified. We analysed the National Sample Survey of Registered Nurses datasets from 1988 to 2008 using the Cochran-Armitage trend test. Our findings showed similar work experience and upward trends in education among both groups of nurses. However, American-trained nurses were more likely to further advance their education, whereas immigrant nurses were more likely to have more work experience and to practice in a wider range of healthcare settings. Although we discovered differences between nurses trained in the USA and abroad, we theorize that these differences even out, as education and work experience each have their own distinct caregiving advantages. Immigrant nurses are not less qualified than their American-trained counterparts, but healthcare providers should encourage them to further pursue their education and certifications. Even though immigrant nurses' education and work experience are comparable with those of their American counterparts, workforce development policies may be particularly beneficial for this group. © 2014 International Council of Nurses.
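
    The Cochran-Armitage trend test used in the study has a closed form that is easy to compute directly; the counts below are hypothetical, not the NSSRN figures:

```python
import math

def cochran_armitage_z(successes, totals, scores):
    """Cochran-Armitage test for trend in proportions across ordered groups.
    Returns the Z statistic; |Z| > 1.96 ~ significant at the 5% level."""
    N = sum(totals)
    R = sum(successes)
    p = R / N
    T = sum(t * (r - n * p) for t, r, n in zip(scores, successes, totals))
    var = p * (1 - p) * (
        sum(n * t * t for t, n in zip(scores, totals))
        - sum(n * t for t, n in zip(scores, totals)) ** 2 / N
    )
    return T / math.sqrt(var)

# Hypothetical counts: nurses holding a bachelor's degree or higher, out of
# those sampled, across three ordered survey waves (scores 0, 1, 2).
z = cochran_armitage_z([10, 30, 50], [100, 100, 100], [0, 1, 2])
print(round(z, 2))  # 6.17 -> clear upward trend in proportions
```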

  14. Characterization and prediction of residues determining protein functional specificity.

    PubMed

    Capra, John A; Singh, Mona

    2008-07-01

    Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular functional specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally determined SDPs. We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a dataset of SDPs. The resulting large dataset, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolutionary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large dataset of enzyme SDPs. Datasets and GroupSim code are available online at http://compbio.cs.princeton.edu/specificity/. Supplementary data are available at Bioinformatics online.
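
    As a toy illustration of specificity scoring (a much-simplified stand-in, not the published GroupSim method), a column can score high when each subtype is internally conserved but the subtypes disagree with each other; the sequences below are invented:

```python
from collections import Counter

# Toy alignment: sequences grouped into two functional subtypes.
subtype_a = ["ACDK", "ACDK", "ACEK"]
subtype_b = ["ACHK", "ACHK", "ACHK"]

def column_consensus(seqs, i):
    """Most common residue in column i and its frequency within the group."""
    counts = Counter(s[i] for s in seqs)
    aa, n = counts.most_common(1)[0]
    return aa, n / len(seqs)

scores = []
for i in range(4):
    aa_a, f_a = column_consensus(subtype_a, i)
    aa_b, f_b = column_consensus(subtype_b, i)
    # Nonzero only when the subtype consensus residues differ.
    scores.append(f_a * f_b if aa_a != aa_b else 0.0)

print(scores)  # only position 2 (the D-vs-H column) is an SDP candidate
```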

  15. Northwestern University Schizophrenia Data and Software Tool (NUSDAST)

    PubMed Central

    Wang, Lei; Kogan, Alex; Cobia, Derin; Alpert, Kathryn; Kolasny, Anthony; Miller, Michael I.; Marcus, Daniel

    2013-01-01

    The schizophrenia research community has invested substantial resources in collecting, managing and sharing large neuroimaging datasets. As part of this effort, our group has collected high-resolution magnetic resonance (MR) datasets from individuals with schizophrenia, their non-psychotic siblings, healthy controls and their siblings. This effort has resulted in a growing resource, the Northwestern University Schizophrenia Data and Software Tool (NUSDAST), an NIH-funded data-sharing project to stimulate new research. The resource resides on XNAT Central and contains neuroimaging (MR scans, landmarks and surface maps for deep subcortical structures, and FreeSurfer cortical parcellation and measurement data), cognitive (domain scores for crystallized intelligence, working memory, episodic memory, and executive function), clinical (demographics, sibling relationship, SAPS and SANS psychopathology), and genetic (20 polymorphisms) data collected from more than 450 subjects, most with 2-year longitudinal follow-up. A neuroimaging mapping, analysis and visualization software tool, CAWorks, is also part of the resource. Moreover, in making our existing neuroimaging data, associated meta-data and computational tools publicly accessible, we have established a web-based information retrieval portal that allows users to efficiently search the collection. This research-ready dataset meaningfully combines neuroimaging data with other relevant information and can help advance neuroimaging research. It is our hope that this effort will help overcome some of the commonly recognized technical barriers in advancing neuroimaging research, such as the lack of local organization and standard descriptions. PMID:24223551

  16. Northwestern University Schizophrenia Data and Software Tool (NUSDAST).

    PubMed

    Wang, Lei; Kogan, Alex; Cobia, Derin; Alpert, Kathryn; Kolasny, Anthony; Miller, Michael I; Marcus, Daniel

    2013-01-01

    The schizophrenia research community has invested substantial resources in collecting, managing and sharing large neuroimaging datasets. As part of this effort, our group has collected high-resolution magnetic resonance (MR) datasets from individuals with schizophrenia, their non-psychotic siblings, healthy controls and their siblings. This effort has resulted in a growing resource, the Northwestern University Schizophrenia Data and Software Tool (NUSDAST), an NIH-funded data-sharing project to stimulate new research. The resource resides on XNAT Central and contains neuroimaging (MR scans, landmarks and surface maps for deep subcortical structures, and FreeSurfer cortical parcellation and measurement data), cognitive (domain scores for crystallized intelligence, working memory, episodic memory, and executive function), clinical (demographics, sibling relationship, SAPS and SANS psychopathology), and genetic (20 polymorphisms) data collected from more than 450 subjects, most with 2-year longitudinal follow-up. A neuroimaging mapping, analysis and visualization software tool, CAWorks, is also part of the resource. Moreover, in making our existing neuroimaging data, associated meta-data and computational tools publicly accessible, we have established a web-based information retrieval portal that allows users to efficiently search the collection. This research-ready dataset meaningfully combines neuroimaging data with other relevant information and can help advance neuroimaging research. It is our hope that this effort will help overcome some of the commonly recognized technical barriers in advancing neuroimaging research, such as the lack of local organization and standard descriptions.

  17. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 Catchments (Version 2.1) for the Conterminous United States: Dam Density and Storage Volume

    EPA Pesticide Factsheets

    This dataset represents the dam density and storage volumes within individual, local NHDPlusV2 catchments and upstream, contributing watersheds based on National Inventory of Dams (NID) data. Attributes were calculated for every local NHDPlusV2 catchment and accumulated to provide watershed-level metrics. (See Supplementary Info for Glossary of Terms.) The NID database contains information about each dam's location, size, purpose, type, last inspection, regulatory facts, and other technical data. Structures on streams reduce the longitudinal and lateral hydrologic connectivity of the system. For example, impoundments above dams slow stream flow, cause deposition of sediment and reduce peak flows. Dams change both the discharge and sediment supply of streams, causing channel incision and bed coarsening downstream. Downstream areas are often sediment-deprived, resulting in degradation, i.e., erosion of the stream bed and stream banks. The database was improved with dam locations verified by the USGS National Map (Jeff Simley group). Some dams, including major ones that do exist, were absent from the 2009 NID but were represented in the USGS National Map dataset and had been in the 2006 NID. Approximately 1,100 such dams were added, based on the USGS National Map lat/long and the 2006 NID attributes (dam height, storage, etc.). Finally, as clean-up, a) about 600 records with duplicate NIDID were removed, and b) about 300 reco
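
    The catchment-to-watershed accumulation described above amounts to summing local values down a flow topology; the catchment IDs and dam counts below are made up for illustration:

```python
# Sketch of watershed accumulation in the StreamCat spirit: each local
# catchment's value (here, a dam count) is summed over all upstream
# catchments by walking the downstream topology.
downstream = {"c1": "c3", "c2": "c3", "c3": "c4", "c4": None}
local_dams = {"c1": 2, "c2": 0, "c3": 1, "c4": 3}

accumulated = dict(local_dams)
# Process in an order where upstream catchments come before downstream ones.
for cid in ["c1", "c2", "c3", "c4"]:
    dst = downstream[cid]
    if dst is not None:
        accumulated[dst] += accumulated[cid]

print(accumulated)  # {'c1': 2, 'c2': 0, 'c3': 3, 'c4': 6}
```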

  18. Satellite-Based Precipitation Datasets

    NASA Astrophysics Data System (ADS)

    Munchak, S. J.; Huffman, G. J.

    2017-12-01

    Of the possible sources of precipitation data, those based on satellites provide the greatest spatial coverage. There is a wide selection of datasets, algorithms, and versions from which to choose, which can be confusing to non-specialists wishing to use the data. The International Precipitation Working Group (IPWG) maintains tables of the major publicly available, long-term, quasi-global precipitation data sets (http://www.isac.cnr.it/ipwg/data/datasets.html), and this talk briefly reviews the various categories. As examples, NASA provides two quasi-global precipitation data sets: the older Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) and the current Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (GPM) mission (IMERG). Both provide near-real-time and post-real-time products that are uniformly gridded in space and time. The TMPA products are 3-hourly at 0.25°x0.25° on the latitude band 50°N-S for about 16 years, while the IMERG products are half-hourly at 0.1°x0.1° on 60°N-S for over 3 years (with plans to extend to 16+ years in Spring 2018). In addition to the precipitation estimates, each data set provides fields of other variables, such as the satellite sensor providing the estimates and the estimated random error. The discussion concludes with advice about determining suitability for use, the necessity of being clear about product names and versions, and the need for continued support for satellite- and surface-based observation.
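
    The stated resolutions imply the following grid sizes and daily data volumes (my arithmetic from the figures above, not official product documentation):

```python
# Back-of-envelope grid sizes for the two products described above.
def grid_cells(lat_extent_deg, res_deg):
    rows = round(lat_extent_deg / res_deg)   # latitude band only
    cols = round(360 / res_deg)              # full longitude circle
    return rows, cols

tmpa = grid_cells(100, 0.25)    # 50N-50S at 0.25 degrees
imerg = grid_cells(120, 0.1)    # 60N-60S at 0.1 degrees

# Grids per day: TMPA is 3-hourly (8/day), IMERG half-hourly (48/day).
tmpa_vals = tmpa[0] * tmpa[1] * 8
imerg_vals = imerg[0] * imerg[1] * 48
print(tmpa, imerg, imerg_vals // tmpa_vals)  # (400, 1440) (1200, 3600) 45
```

    So moving from TMPA to IMERG multiplies the number of values per day by 45, which is worth knowing before committing to a multi-year download.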

  19. Navigating the "Research-to-Operations" Bridge of Death: Collaborative Transition of Remotely-Sensed Snow Data from Research into Operational Water Resources Forecasting

    NASA Astrophysics Data System (ADS)

    Miller, W. P.; Bender, S.; Painter, T. H.; Bernard, B.

    2016-12-01

    Water and resource management agencies can benefit from hydrologic forecasts during both flood and drought conditions. Improved predictions of seasonal snowmelt-driven runoff volume and timing can assist operational water managers with decision support and efficient resource management within the spring runoff season. Using operational models and forecasting systems, NOAA's Colorado Basin River Forecast Center (CBRFC) produces hydrologic forecasts for stakeholders and water management groups in the western United States. Collaborative incorporation of research-oriented remote sensing data into CBRFC operational models and systems is one route by which CBRFC forecasts can be improved, ultimately for the benefit of water managers. Successful navigation of research-oriented remote sensing products across the "research-to-operations"/R2O gap (also known as the "valley of death") to operational destinations requires dedicated personnel on both the research and operations sides, working in a highly collaborative environment. Since 2012, the operational CBRFC has collaborated with the research-oriented Jet Propulsion Laboratory (JPL) under funding from NASA to transition remotely-sensed snow data into CBRFC's operational models and forecasting systems. Two specific datasets from JPL, the MODIS Dust Radiative Forcing in Snow (MODDRFS) and the MODIS Snow Covered-Area and Grain size (MODSCAG) products, are used in CBRFC operations as of 2016. Over the past several years, JPL and CBRFC have worked together to analyze patterns in JPL's remote sensing snow datasets from the operational perspective of the CBRFC and to develop techniques to bridge the R2O gap. Retrospective and real-time analyses have yielded valuable insight into the remotely-sensed snow datasets themselves, CBRFC's operational systems, and the collaborative R2O process. Examples of research-oriented JPL snow data, as used in CBRFC operations, are described. 
A timeline of the collaboration, challenges encountered in crossing the R2O gap, and solutions to those challenges are also illustrated.

  20. In situ quantitative characterisation of the ocean water column using acoustic multibeam backscatter data

    NASA Astrophysics Data System (ADS)

    Lamarche, G.; Le Gonidec, Y.; Lucieer, V.; Lurton, X.; Greinert, J.; Dupré, S.; Nau, A.; Heffron, E.; Roche, M.; Ladroit, Y.; Urban, P.

    2017-12-01

    Detecting liquid, solid or gaseous features in the ocean is generating considerable interest in the geoscience community because of their potentially high economic value (oil & gas, mining), their significance for environmental management (oil/gas leakage, biodiversity mapping, greenhouse gas monitoring), and their potential cultural and traditional values (food, freshwater). Enhancing people's capability to quantify and manage the natural capital present in ocean water goes hand in hand with the development of marine acoustic technology, as marine echosounders provide the most reliable and technologically advanced means to develop quantitative studies of water column backscatter data. This capability is not developed to its fullest because of (i) the complexity of the physics involved in relation to the constantly changing marine environment, and (ii) the rapid technological evolution of high-resolution multibeam echosounder (MBES) water-column imaging systems. The Water Column Imaging Working Group is working on a series of MBES water column datasets acquired in a variety of environments, using a range of frequencies, and imaging a number of water-column features such as gas seeps, oil leaks, suspended particulate matter, vegetation and freshwater springs. Access to data from different acoustic frequencies and ocean dynamics enables us to discuss and test multifrequency approaches, which are the most promising means to develop a quantitative analysis of the physical properties of acoustic scatterers, providing rigorous cross-calibration of the acoustic devices. In addition, the high redundancy of multibeam data, available for some datasets, will allow us to develop data processing techniques leading to quantitative estimates of water column gas seeps. 
Each of the datasets has supporting ground-truthing data (underwater videos and photos, physical oceanography measurements) which provide information on the origin and chemistry of the seep content. This is of primary importance when assessing the physical properties of water column scatterers from acoustic backscatter measurements.

  1. Artifact removal in the context of group ICA: a comparison of single-subject and group approaches

    PubMed Central

    Du, Yuhui; Allen, Elena A.; He, Hao; Sui, Jing; Wu, Lei; Calhoun, Vince D.

    2018-01-01

    Independent component analysis (ICA) has been widely applied to identify intrinsic brain networks from fMRI data. Group ICA computes group-level components from all data and subsequently estimates individual-level components to recapture inter-subject variability. However, the best approach to handling artifacts, which may vary widely among subjects, is not yet clear. In this work, we study and compare two ICA approaches for artifact removal. One approach, recommended in recent work by the Human Connectome Project, first performs ICA on individual subject data to remove artifacts, and then applies a group ICA on the cleaned data from all subjects. We refer to this approach as Individual ICA based artifacts Removal Plus Group ICA (IRPG). A second proposed approach, called Group Information Guided ICA (GIG-ICA), performs ICA on group data, then removes the group-level artifact components, and finally performs subject-specific ICAs using the group-level non-artifact components as spatial references. We used simulations to evaluate the two approaches with respect to the effects of data quality, data quantity, variable numbers of sources among subjects, and spatially unique artifacts. Resting-state test-retest datasets were also employed to investigate the reliability of functional networks. Results from simulations demonstrate that GIG-ICA outperforms IRPG, even when single-subject artifact removal is perfect and when individual subjects have spatially unique artifacts. Experiments using test-retest data suggest that GIG-ICA provides more reliable functional networks. Based on its high estimation accuracy, ease of implementation, and the high reliability of the resulting functional networks, we find GIG-ICA to be a promising approach. PMID:26859308
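
    A minimal group-ICA flavour of the pipeline (temporal concatenation of subjects, group decomposition, then subject-level back-projection) can be sketched with scikit-learn's FastICA; this is a toy stand-in, not GIG-ICA or the HCP pipeline, and all dimensions are invented:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
n_time, n_voxels, n_comp = 120, 50, 3

# Simulate two "subjects" sharing the same spatial maps A.
S = rng.laplace(size=(n_time * 2, n_comp))     # non-Gaussian latent sources
A = rng.normal(size=(n_comp, n_voxels))        # shared spatial maps
X = S @ A + 0.01 * rng.normal(size=(n_time * 2, n_voxels))

subj1, subj2 = X[:n_time], X[n_time:]
group = np.vstack([subj1, subj2])              # temporal concatenation

ica = FastICA(n_components=n_comp, random_state=0, max_iter=1000)
group_tc = ica.fit_transform(group)            # group-level time courses
maps = ica.mixing_.T                           # estimated spatial maps (3 x 50)

# Subject-level time courses by projecting one subject on the group maps.
subj1_tc = subj1 @ np.linalg.pinv(maps)
print(maps.shape, subj1_tc.shape)  # (3, 50) (120, 3)
```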

  2. Globus Online: Climate Data Management for Small Teams

    NASA Astrophysics Data System (ADS)

    Ananthakrishnan, R.; Foster, I.

    2013-12-01

    Large and highly distributed climate data demand new approaches to data organization and lifecycle management. We need, in particular, catalogs that allow researchers to track the location and properties of large numbers of data files, and management tools that allow researchers to update data properties and organization during their research, move data among different locations, and invoke analysis computations on data, all as easily as if they were working with small numbers of files on their desktop computer. Both catalogs and management tools often need to scale to extremely large quantities of data. When developing solutions to these problems, it is important to distinguish between the needs of (a) large communities, for whom the ability to organize published data is crucial (e.g., by implementing formal data publication processes, assigning DOIs, recording definitive metadata, providing for versioning), and (b) individual researchers and small teams, who are more frequently concerned with tracking the diverse data and computations involved in their highly dynamic and iterative research processes. Key requirements in the latter case include automated data registration and metadata extraction, ease of update, close-to-zero management overheads (e.g., no local software install), and flexible, user-managed sharing support allowing read and write privileges within small groups. 
We describe here how new capabilities provided by the Globus Online system address the needs of the latter group of climate scientists, providing for the rapid creation and establishment of lightweight individual- or team-specific catalogs; the definition of logical groupings of data elements, called datasets; the evolution of catalogs, dataset definitions, and associated metadata over time, to track changes in data properties and organization as a result of research processes; and the manipulation of data referenced by catalog entries (e.g., replication of a dataset to a remote location for analysis, or sharing of a dataset). Its software-as-a-service ('SaaS') architecture means that these capabilities are provided to users over the network, without a need for local software installation. In addition, Globus Online provides well-defined APIs, offering a platform that can be leveraged to integrate these capabilities with other portals and applications. We describe early applications of these new Globus Online capabilities to climate science. We focus in particular on applications that demonstrate how Globus Online capabilities complement those of the Earth System Grid Federation (ESGF), the premier system for publication and discovery of large community datasets. ESGF already uses Globus Online mechanisms for data download. We demonstrate methods by which the two systems can be further integrated and harmonized, so that, for example, data collections produced within a small team can be easily published from Globus Online to ESGF for archival storage and broader access, and a Globus Online catalog can be used to organize an individual view of a subset of data held in ESGF.

  3. Workplace discrimination and cumulative trauma disorders: the national EEOC ADA research project.

    PubMed

    Armstrong, Amy J; McMahon, Brian T; West, Steven L; Lewis, Allen

    2005-01-01

    Employment discrimination against persons with cumulative trauma disorders (CTDs) was explored using the Integrated Mission System dataset of the US Equal Employment Opportunity Commission. Demographic characteristics and merit resolutions of the Charging Parties (persons with CTD) were compared to those of individuals experiencing other physical, sensory and neurological impairments. Factors compared also included industry designation, geographic region, and size of the Respondents against which allegations were filed. Persons with CTD had proportionately more allegations against large Respondents (greater than 500 workers) engaged in manufacturing, utilities, transportation, finance, insurance and real estate. The types of discrimination Issues that were proportionately greater in the CTD group included layoff, failure to reinstate, and failure to provide reasonable accommodation. The CTD group was significantly less likely than the comparison group to be involved in discrimination Issues such as assignment to less desirable duties, shifts or work locations; demotion; termination; or failure to hire or provide training. Persons with CTD had higher proportions of merit Resolutions where allegations were voluntarily withdrawn by the Charging Party with benefits.

  4. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome.

    PubMed

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing, providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we make available a collection of transcriptome datasets related to human follicular cells, from normal individuals or patients with polycystic ovary syndrome, across their development during in vitro fertilization. After excluding RNA-seq datasets and carefully selecting based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in the NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once annotated in GXB, multiple sample groupings were made in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp.

  5. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome

    PubMed Central

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing, providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we make available a collection of transcriptome datasets related to human follicular cells, from normal individuals or patients with polycystic ovary syndrome, across their development during in vitro fertilization. After excluding RNA-seq datasets and carefully selecting based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in the NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once annotated in GXB, multiple sample groupings were made in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616

  6. Atlas-guided cluster analysis of large tractography datasets.

    PubMed

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information from a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach not only facilitates the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic, anatomically correct and reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.
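
    A bare-bones version of hierarchical tract clustering (without the atlas guidance or outlier elimination the framework adds) can be sketched with SciPy; the tract features below are invented stand-ins:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for fiber-tract clustering: represent each "tract" by a
# 6-D feature (e.g., its two endpoints) and group tracts hierarchically.
rng = np.random.default_rng(2)
bundle1 = rng.normal(loc=0.0, scale=0.1, size=(20, 6))
bundle2 = rng.normal(loc=5.0, scale=0.1, size=(20, 6))
tracts = np.vstack([bundle1, bundle2])

Z = linkage(tracts, method="average")            # agglomerative clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # cut tree into 2 clusters

print(sorted(set(labels)))  # [1, 2] -> the two bundles are recovered
```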

  7. A framework for automatic creation of gold-standard rigid 3D-2D registration datasets.

    PubMed

    Madan, Hennadii; Pernuš, Franjo; Likar, Boštjan; Špiclin, Žiga

    2017-02-01

    Advanced image-guided medical procedures incorporate 2D intra-interventional information into the pre-interventional 3D image and procedure plan through 3D/2D image registration (32R). To enter clinical use, and even for publication purposes, novel and existing 32R methods have to be rigorously validated. The performance of a 32R method can be estimated by comparing it to an accurate reference or gold standard method (usually based on fiducial markers) on the same set of images (a gold standard dataset). Objective validation and comparison of methods are possible only if the evaluation methodology is standardized and the gold standard dataset is made publicly available. Currently, very few such datasets exist and only one contains images of multiple patients acquired during a procedure. To encourage the creation of gold standard 32R datasets, we propose an automatic framework. The framework is based on rigid registration of fiducial markers. The main novelty is spatial grouping of fiducial markers on the carrier device, which enables automatic marker localization and identification across the 3D and 2D images. The proposed framework was demonstrated on clinical angiograms of 20 patients. Rigid 32R computed by the framework was more accurate than that obtained manually, with respective target registration errors below 0.027 mm compared to 0.040 mm. The framework is applicable for gold standard setup on any rigid anatomy, provided that the acquired images contain spatially grouped fiducial markers. The gold standard datasets and software will be made publicly available.
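
    Rigid registration from paired fiducial markers, as in the gold-standard setup above, is classically solved with the Kabsch/SVD method; the marker coordinates below are synthetic:

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) mapping point set P onto Q."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    t = cQ - R @ cP
    return R, t

# Synthetic "fiducial markers": rotate and translate a point set, then
# recover the transform from the paired marker coordinates.
rng = np.random.default_rng(3)
P = rng.normal(size=(8, 3))
theta = 0.5
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true

R, t = kabsch(P, Q)
tre = np.linalg.norm(P @ R.T + t - Q, axis=1).mean()  # target registration error
print(round(tre, 6))  # 0.0 (exact up to floating point)
```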

  8. Barriers and facilitators to the implementation of an evidence-based electronic minimum dataset for nursing team leader handover: A descriptive survey.

    PubMed

    Spooner, Amy J; Aitken, Leanne M; Chaboyer, Wendy

    2017-11-15

    There is widespread use of clinical information systems in intensive care units; however, the evidence to support electronic handover is limited. The study aim was to assess the barriers and facilitators to use of an electronic minimum dataset for nursing team leader shift-to-shift handover in the intensive care unit prior to its implementation. The study was conducted in a 21-bed medical/surgical intensive care unit specialising in cardiothoracic surgery at a tertiary referral hospital in Queensland, Australia. An established tool was modified to the intensive care nursing handover context and a survey of all 63 nursing team leaders was undertaken. Survey statements were rated on a 6-point Likert scale ranging from 'strongly disagree' to 'strongly agree', supplemented by open-ended questions. Descriptive statistics were used to summarise results. A total of 39 team leaders responded to the survey (62%). Team leaders used general intensive care work unit guidelines to inform practice, but were less familiar with the intensive care handover work unit guideline. Barriers to minimum dataset uptake included a tool that was not user friendly, was time consuming and contained too much information. Facilitators to minimum dataset adoption included a tool that was user friendly, saved time and contained relevant information. Identifying the complexities of a healthcare setting prior to the implementation of an intervention assists researchers and clinicians to integrate new knowledge into healthcare settings. Barriers and facilitators to knowledge use focused on the usability, content and efficiency of the electronic minimum dataset and can be used to inform tailored strategies to optimise team leaders' adoption of a minimum dataset for handover. Copyright © 2017 Australian College of Critical Care Nurses Ltd. Published by Elsevier Ltd. All rights reserved.

  9. Semi-supervised tracking of extreme weather events in global spatio-temporal climate datasets

    NASA Astrophysics Data System (ADS)

    Kim, S. K.; Prabhat, M.; Williams, D. N.

    2017-12-01

    Deep neural networks have been successfully applied to detecting extreme weather events in large-scale climate datasets, attaining performance that surpasses all previous hand-crafted methods. Recent work has shown that a multichannel spatiotemporal encoder-decoder CNN architecture is able to localize events with semi-supervised bounding boxes. Motivated by this work, we propose a new learning method based on Variational Auto-Encoders (VAE) and Long Short-Term Memory (LSTM) networks to track extreme weather events in spatio-temporal datasets. We treat spatio-temporal object tracking as learning the probabilistic distribution of continuous latent features of an auto-encoder using stochastic variational inference. For this, we assume that our datasets are i.i.d. and that the latent features can be modeled by a Gaussian distribution. In the proposed method, we first train a VAE to generate the approximate posterior given multichannel climate input containing an extreme climate event at a fixed time. Then, we predict the bounding box, location and class of extreme climate events using convolutional layers, given an input concatenating three features: the embedding, the sampled mean and the standard deviation. Lastly, we train an LSTM on the concatenated input to learn the temporal structure of the dataset by recurrently feeding the output back into the next time-step's VAE input. Our contribution is two-fold. First, we present the first semi-supervised end-to-end VAE-based architecture for tracking extreme weather events, applicable to massive unlabeled climate datasets. Second, the temporal movement of events is incorporated into bounding-box prediction using the LSTM, which can improve localization accuracy. To our knowledge, this technique has been explored neither in the climate community nor in the machine-learning community.
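The sampling step at the heart of the VAE described above can be illustrated with the reparameterization trick, where a latent sample is drawn as z = mu + sigma * eps with eps from a standard normal, and the sample, mean and standard deviation are then concatenated as input for the box predictor and LSTM. The sketch below is a hypothetical, minimal pure-Python illustration (function names are ours, not the authors'); a real implementation would use a deep-learning framework:

```python
import random

def reparameterize(mu, sigma, eps=None):
    """VAE reparameterization trick: z = mu + sigma * eps, with eps
    drawn from a standard normal when not supplied."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

def tracking_features(mu, sigma):
    """Concatenate [sampled latent z, mean, std] -- the three features
    the abstract feeds to the bounding-box predictor and the LSTM."""
    z = reparameterize(mu, sigma)
    return z + list(mu) + list(sigma)
```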

  10. The Greenwich Photo-heliographic Results (1874 - 1976): Summary of the Observations, Applications, Datasets, Definitions and Errors

    NASA Astrophysics Data System (ADS)

    Willis, D. M.; Coffey, H. E.; Henwood, R.; Erwin, E. H.; Hoyt, D. V.; Wild, M. N.; Denig, W. F.

    2013-11-01

    The measurements of sunspot positions and areas that were published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), as the Greenwich Photo-heliographic Results (GPR), 1874 - 1976, exist in both printed and digital forms. These printed and digital sunspot datasets have been archived in various libraries and data centres. Unfortunately, however, typographic, systematic and isolated errors can be found in the various datasets. The purpose of the present paper is to begin the task of identifying and correcting these errors. In particular, the intention is to provide in one foundational paper all the necessary background information on the original solar observations, their various applications in scientific research, the format of the different digital datasets, the necessary definitions of the quantities measured, and the initial identification of errors in both the printed publications and the digital datasets. Two companion papers address the question of specific identifiable errors; namely, typographic errors in the printed publications, and both isolated and systematic errors in the digital datasets. The existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the programme of solar observations supported for more than a century by the Royal Observatory, Greenwich, and the Royal Greenwich Observatory. This improved dataset should be of value in many future scientific investigations.

  11. Scalable Earth-observation Analytics for Geoscientists: Spacetime Extensions to the Array Database SciDB

    NASA Astrophysics Data System (ADS)

    Appel, Marius; Lahn, Florian; Pebesma, Edzer; Buytaert, Wouter; Moulds, Simon

    2016-04-01

    Today's amount of freely available data requires scientists to spend large parts of their work on data management. This is especially true in environmental sciences when working with large remote sensing datasets, such as those obtained from earth-observation satellites like the Sentinel fleet. Many frameworks like SpatialHadoop or Apache Spark address the scalability but target programmers rather than data analysts, and are not dedicated to imagery or array data. In this work, we use the open-source data management and analytics system SciDB to bring large earth-observation datasets closer to analysts. Its underlying data representation as multidimensional arrays fits naturally to earth-observation datasets, distributes storage and computational load over multiple instances by multidimensional chunking, and also enables efficient time-series based analyses, which are usually difficult using file- or tile-based approaches. Existing interfaces to R and Python furthermore allow for scalable analytics with relatively little learning effort. However, interfacing SciDB and file-based earth-observation datasets that come as tiled temporal snapshots requires a lot of manual bookkeeping during ingestion, and SciDB natively only supports loading data from CSV-like and custom binary formatted files, which currently limits its practical use in earth-observation analytics. To make it easier to work with large multi-temporal datasets in SciDB, we developed software tools that enrich SciDB with earth-observation metadata and allow working with commonly used file formats: (i) the SciDB extension library scidb4geo simplifies working with spatiotemporal arrays by adding relevant metadata to the database and (ii) the Geospatial Data Abstraction Library (GDAL) driver implementation scidb4gdal allows remote sensing imagery to be ingested from, and exported to, a large number of file formats.
Using added metadata on temporal resolution and coverage, the GDAL driver supports time-based ingestion of imagery into existing multi-temporal SciDB arrays. While our SciDB plugin works directly in the database, the GDAL driver has been specifically developed using a minimum of external dependencies (i.e. CURL). Source code for both tools is available from GitHub [1]. We present these tools in a case study that demonstrates the ingestion of multi-temporal tiled earth-observation data into SciDB, followed by a time-series analysis using R and SciDBR. Through the exclusive use of open-source software, our approach supports reproducibility in scalable large-scale earth-observation analytics. In the future, these tools can be used in an automated way to let scientists work only on ready-to-use SciDB arrays, significantly reducing the data management workload for domain scientists. [1] https://github.com/mappl/scidb4geo and https://github.com/mappl/scidb4gdal

  12. FLUXNET2015 Dataset: Batteries included

    NASA Astrophysics Data System (ADS)

    Pastorello, G.; Papale, D.; Agarwal, D.; Trotta, C.; Chu, H.; Canfora, E.; Torn, M. S.; Baldocchi, D. D.

    2016-12-01

    The synthesis datasets have become one of the signature products of the FLUXNET global network. They are composed from contributions of individual site teams to regional networks and then compiled into uniform data products, now used in a wide variety of research efforts: from plant-scale microbiology to global-scale climate change. The FLUXNET Marconi Dataset in 2000 was the first in the series, followed by the FLUXNET LaThuile Dataset in 2007, with significant additions of data products and coverage, solidifying the adoption of the datasets as a research tool. The FLUXNET2015 Dataset brings another round of substantial improvements, including extended quality control processes and checks, use of downscaled reanalysis data for filling long gaps in micrometeorological variables, multiple methods for USTAR threshold estimation and flux partitioning, and uncertainty estimates - all accompanied by auxiliary flags. This "batteries included" approach provides a wealth of information for anyone who wants to explore the data (and the processing methods) in detail. It inevitably leads to a large number of data variables. Although dealing with all these variables might seem overwhelming at first, especially to someone looking at eddy covariance data for the first time, there is method to our madness. In this work we describe the data products and variables that are part of the FLUXNET2015 Dataset, and the rationale behind the organization of the dataset, covering the simplified version (labeled SUBSET), the complete version (labeled FULLSET), and the auxiliary products in the dataset.

  13. The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping.

    PubMed

    Ashish, Naveen; Dewan, Peehoo; Toga, Arthur W

    2015-01-01

    This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal.
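As a rough illustration of the unsupervised text-mining step described above, element descriptions taken from a data dictionary can be compared with bag-of-words cosine similarity, and the best-scoring candidate suggested as a mapping. This is a hypothetical minimal sketch of the general technique, not GEM's actual code (all names are ours):

```python
from collections import Counter
import math

def tokenize(text):
    """Lowercase whitespace tokenization, keeping alphanumeric tokens."""
    return [t for t in text.lower().split() if t.isalnum()]

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(element_desc, candidates):
    """Suggest the candidate data element whose dictionary description
    is most similar to the query element's description."""
    return max(candidates, key=lambda c: cosine_similarity(element_desc, c))
```

In practice a system like GEM would combine such similarity scores with trained classifiers and active learning; this sketch covers only the similarity component.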

  14. The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping

    PubMed Central

    Ashish, Naveen; Dewan, Peehoo; Toga, Arthur W.

    2016-01-01

    This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal. PMID:26793094

  15. EnviroAtlas - Austin, TX - Greenspace Around Schools by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas data set shows the number of schools in each block group in the EnviroAtlas community boundary as well as the number of schools where less than 25% of the area within 100 meters of the school is classified as greenspace. Green space is defined as Trees & Forest, Grass & Herbaceous, and Agriculture. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  16. Robust semi-automatic segmentation of pulmonary subsolid nodules in chest computed tomography scans

    NASA Astrophysics Data System (ADS)

    Lassen, B. C.; Jacobs, C.; Kuhnigk, J.-M.; van Ginneken, B.; van Rikxoort, E. M.

    2015-02-01

    The malignancy of lung nodules is most often detected by analyzing changes in nodule diameter in follow-up scans. A recent study showed that comparing the volume or the mass of a nodule over time is much more significant than comparing the diameter. Since the survival rate is higher when the disease is still in an early stage, it is important to detect the growth rate as soon as possible. However, manual segmentation of a volume is time-consuming. Whereas there are several well-evaluated methods for the segmentation of solid nodules, less work has been done on subsolid nodules, which actually show a higher malignancy rate than solid nodules. In this work we present a fast, semi-automatic method for segmentation of subsolid nodules. As its only user interaction, the method expects a user-drawn stroke on the largest diameter of the nodule. First, a threshold-based region growing is performed based on intensity analysis of the nodule region and surrounding parenchyma. In the next step the chest wall is removed by a combination of a connected-component analysis and convex hull calculation. Finally, attached vessels are detached by morphological operations. The method was evaluated on all nodules of the publicly available LIDC/IDRI database that were manually segmented and rated as non-solid or part-solid by four radiologists (Dataset 1) and three radiologists (Dataset 2). For these 59 nodules the Jaccard index for the agreement of the proposed method with the manual reference segmentations was 0.52/0.50 (Dataset 1/Dataset 2) compared to an inter-observer agreement of the manual segmentations of 0.54/0.58 (Dataset 1/Dataset 2). Furthermore, the inter-observer agreement using the proposed method (i.e. different input strokes) was analyzed and gave a Jaccard index of 0.74/0.74 (Dataset 1/Dataset 2).
The presented method provides satisfactory segmentation results with minimal observer effort in minimal time and can reduce the inter-observer variability for segmentation of subsolid nodules in clinical routine.
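The agreement measure used in this record, the Jaccard index, is the size of the intersection of two segmentations divided by the size of their union. A minimal sketch over sets of voxel coordinates:

```python
def jaccard_index(seg_a, seg_b):
    """Jaccard index |A intersect B| / |A union B| of two segmentations,
    each given as a set of voxel coordinates. Two empty segmentations
    are treated as perfect agreement (1.0)."""
    a, b = set(seg_a), set(seg_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

For example, two segmentations sharing two of four distinct voxels score 0.5, on the same scale as the inter-observer values reported in the abstract.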

  17. EG-09 EPIGENETIC PROFILING REVEALS A CpG HYPERMETHYLATION PHENOTYPE (CIMP) ASSOCIATED WITH WORSE PROGRESSION-FREE SURVIVAL IN MENINGIOMA

    PubMed Central

    Olar, Adriana; Wani, Khalida; Mansouri, Alireza; Zadeh, Gelareh; Wilson, Charmaine; DeMonte, Franco; Fuller, Gregory; Jones, David; Pfister, Stefan; von Deimling, Andreas; Sulman, Erik; Aldape, Kenneth

    2014-01-01

    BACKGROUND: Methylation profiling of solid tumors has revealed biologic subtypes, often with clinical implications. Methylation profiles of meningioma and their clinical implications are not well understood. METHODS: Ninety-two meningioma samples (n = 44 test set and n = 48 validation set) were profiled using the Illumina HumanMethylation450 BeadChip. Unsupervised clustering and analyses for recurrence-free survival (RFS) were performed. RESULTS: Unsupervised clustering of the test set using approximately 900 highly variable markers identified two clearly defined methylation subgroups. One of the groups (n = 19) showed global hypermethylation of a set of markers, analogous to CpG island methylator phenotype (CIMP). These findings were reproducible in the validation set, with 18/48 samples showing the CIMP-positive phenotype. Importantly, of 347 highly variable markers common to both the test and validation set analyses, 107 defined CIMP in the test set and 94 defined CIMP in the validation set, with an overlap of 83 markers between the two datasets. This number is much greater than expected by chance, indicating reproducibility of the hypermethylated markers that define CIMP in meningioma. With respect to clinical correlation, the 37 CIMP-positive cases displayed significantly shorter RFS compared to the 55 non-CIMP cases (hazard ratio 2.9, p = 0.013). In an effort to develop a preliminary outcome predictor, a 155-marker subset correlated with RFS was identified in the test dataset. When interrogated in the validation dataset, this 155-marker subset showed a statistical trend (p < 0.1) towards distinguishing survival groups. CONCLUSIONS: This study defines the existence of a CIMP phenotype in meningioma, which involves a substantial proportion (37/92, 40%) of samples with clinical implications.
Ongoing work will expand this cohort and examine identification of additional biologic differences (mutational and DNA copy number analysis) to further characterize the aberrant methylation subtype in meningioma. CIMP-positivity with aberrant methylation in recurrent/malignant meningioma suggests a potential therapeutic target for clinically aggressive cases.

  18. Work-Related Burn Injuries Hospitalized in US Burn Centers: 2002 to 2011.

    PubMed

    Huang, Zhenna; Friedman, Lee S

    2017-03-01

    The aim was to develop a comprehensive definition to identify work-related burns in the National Burn Repository (NBR) based on multiple fields, and to describe injuries by occupation. The NBR, an inpatient dataset, was used to compare the type and severity of burn injuries by occupation. Using the definition developed for this analysis, 22,969 burn injuries were identified as work-related. In contrast, the single work-related field intended to capture occupational injuries captured only 4696 cases. The highest numbers of burns were observed in construction/extraction, food preparation, and durable goods production occupations. Occupations with a mean total body surface area (TBSA) burned greater than 10% include transportation and material-moving, architecture and engineering, and arts/design/entertainment/sports/media occupations. The NBR dataset should be further utilized for occupational burn injury investigations, and multiple fields should be considered for case ascertainment.

  19. Evolving Deep Networks Using HPC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Young, Steven R.; Rose, Derek C.; Johnston, Travis

    While a large number of deep learning networks have been studied and published that produce outstanding results on natural image datasets, these datasets only make up a fraction of those to which deep learning can be applied. These datasets include text data, audio data, and arrays of sensors that have very different characteristics than natural images. As these “best” networks for natural images have been largely discovered through experimentation and cannot be proven optimal on some theoretical basis, there is no reason to believe that they are the optimal network for these drastically different datasets. Hyperparameter search is thus often a very important process when applying deep learning to a new problem. In this work we present an evolutionary approach to searching the possible space of network hyperparameters and construction that can scale to 18,000 nodes. This approach is applied to datasets of varying types and characteristics where we demonstrate the ability to rapidly find best hyperparameters in order to enable practitioners to quickly iterate between idea and result.
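The evolutionary hyperparameter search described above can be sketched in a few lines: sample a population of configurations, keep the fittest half each generation, and refill the population with mutated copies. This is a hypothetical toy version (the names, selection scheme and mutation rule are ours), not the authors' HPC implementation:

```python
import random

def evolve(fitness, space, pop_size=8, generations=10, seed=0):
    """Minimal evolutionary search over a discrete hyperparameter space:
    keep the best half each generation, refill with mutated copies."""
    rng = random.Random(seed)
    def sample():
        return {k: rng.choice(v) for k, v in space.items()}
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for p in parents:
            child = dict(p)
            k = rng.choice(list(space))       # mutate one hyperparameter
            child[k] = rng.choice(space[k])
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

In the real setting, `fitness` would be the validation accuracy of a network trained with the candidate hyperparameters, evaluated in parallel across nodes.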

  20. A polymer dataset for accelerated property prediction and design

    DOE PAGES

    Huan, Tran Doan; Mannodi-Kanakkithodi, Arun; Kim, Chiho; ...

    2016-03-01

    Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a sufficiently large dataset of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations, with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high dielectric constant polymers, the dataset initially includes the optimized structures, atomization energies, band gaps, and dielectric constants. Going forward, it will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.

  1. Multi-Dimensional Scaling based grouping of known complexes and intelligent protein complex detection.

    PubMed

    Rehman, Zia Ur; Idris, Adnan; Khan, Asifullah

    2018-06-01

    Protein-Protein Interactions (PPI) play a vital role in cellular processes and are formed through thousands of interactions among proteins. Advancements in proteomics technologies have resulted in huge PPI datasets that need to be systematically analyzed. Protein complexes are the locally dense regions in PPI networks, which play an important role in metabolic pathways and gene regulation. In this work, a novel two-phase protein complex detection and grouping mechanism is proposed. In the first phase, topological and biological features are extracted for each complex, and prediction performance is investigated using a Bagging-based Ensemble classifier (PCD-BEns). Performance evaluation through cross-validation shows improvement in comparison to the CDIP, MCode, CFinder and PLSMC methods. The second phase employs Multi-Dimensional Scaling (MDS) for the grouping of known complexes by exploring inter-complex relations. It is experimentally observed that the combination of topological and biological features in the proposed approach greatly enhances prediction performance for protein complex detection, which may help to understand various biological processes, whereas the application of MDS-based exploration may assist in grouping potentially similar complexes. Copyright © 2018 Elsevier Ltd. All rights reserved.
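The bagging-based ensemble idea in the first phase can be sketched generically: train each base classifier on a bootstrap resample of the labelled data, then majority-vote the predictions. This is a hypothetical minimal sketch (not the PCD-BEns code; the toy 1-nearest-neighbour base learner is ours):

```python
import random
from collections import Counter

def bagging_predict(train, make_classifier, x, n_estimators=15, seed=0):
    """Bagging: fit each base classifier on a bootstrap resample of the
    labelled training data, then majority-vote their predictions on x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(train) for _ in train]  # sample with replacement
        votes.append(make_classifier(bootstrap)(x))
    return Counter(votes).most_common(1)[0][0]

def one_nn(data):
    """Toy base learner: 1-nearest-neighbour on a single numeric feature,
    over (feature, label) pairs."""
    return lambda x: min(data, key=lambda pair: abs(pair[0] - x))[1]
```

In the paper's setting, the features would be the topological and biological descriptors of each candidate complex rather than a single number.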

  2. Transition to Operations Plans for GPM Datasets

    NASA Technical Reports Server (NTRS)

    Zavodsky, Bradley; Jedlovec, Gary; Case, Jonathan; Leroy, Anita; Molthan, Andrew; Bell, Jordan; Fuell, Kevin; Stano, Geoffrey

    2013-01-01

    Founded in 2002 at the National Space Science and Technology Center at Marshall Space Flight Center in Huntsville, AL, SPoRT is focused on transitioning unique NASA and NOAA observations and research capabilities to the operational weather community to improve short-term weather forecasts on a regional and local scale. It receives NASA-directed funding as well as NOAA funding from the Proving Grounds (PG). SPoRT demonstrates the capabilities and societal benefit of experimental products for weather applications, to prepare forecasters for the use of data from the next generation of operational satellites. The objective of this poster is to highlight SPoRT's research-to-operations (R2O) paradigm and provide examples of work done by the team with legacy instruments relevant to GPM, in order to promote collaborations with groups developing GPM products.

  3. 3D Imaging of Microbial Biofilms: Integration of Synchrotron Imaging and an Interactive Visualization Interface

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thomas, Mathew; Marshall, Matthew J.; Miller, Erin A.

    2014-08-26

    Understanding the interactions of structured communities known as “biofilms” and other complex matrices is possible through X-ray micro-tomography imaging of the biofilms. Feature detection and image processing for this type of data focus on efficiently identifying and segmenting biofilms and bacteria in the datasets. The datasets are very large and often require manual intervention due to low contrast between objects and high noise levels. Thus new software is required for the effective interpretation and analysis of the data. This work describes the development and application of software to analyze and visualize high-resolution X-ray micro-tomography datasets.

  4. Characterization of Emergent Data Networks Among Long-Tail Data

    NASA Astrophysics Data System (ADS)

    Elag, Mostafa; Kumar, Praveen; Hedstrom, Margaret; Myers, James; Plale, Beth; Marini, Luigi; McDonald, Robert

    2014-05-01

    Data curation underpins data-driven scientific advancements. It manages the information flux across multiple users throughout the data life cycle and increases data sustainability and reusability. The exponential growth in data production across the Earth Sciences by individuals and small research groups, termed long-tail data, increases the data-knowledge latency among related domains. It has become clear that advanced framework-agnostic metadata and ontologies for long-tail data are required to increase their visibility to each other and to provide concise and meaningful descriptions that reveal their connectivity. Despite the advancement achieved by various sophisticated data management models in different Earth Science disciplines, it is not always straightforward to derive relationships among long-tail data. Semantic data clustering algorithms and pre-defined logic rules oriented toward the prediction of possible data relationships are one method to address these challenges. Our work advances the connectivity of related long-tail data by introducing the design for an ontology-based knowledge management system. In this work, we present the system architecture and its components, and illustrate how it can be used to scrutinize the connectivity among datasets. To demonstrate the capabilities of this "data network" prototype, we implemented this approach within the Sustainable Environment Actionable Data (SEAD) environment, an open-source semantic content repository that provides an RDF database for long-tail data, and show how emergent relationships among datasets can be identified.

  5. An overview of the general practice nurse workforce in Australia, 2012-15.

    PubMed

    Heywood, Troy; Laurence, Caroline

    2018-05-08

    Several surveys of the general practice nurse (GPN) workforce have been undertaken in Australia over the last decade, but they have limitations, which mean that the workforce is not well understood. The aim of this study is to describe the profile of the GPN workforce using the dataset available through the Australian Health Practitioner Regulation Agency, to explore how it differs from the non-GPN nursing workforce, and to assess whether this workforce is changing over time. Data from labour force surveys conducted from 2012 to 2015 were used. Variables examined were age group, gender, remoteness area, hours worked, nurse type (enrolled (EN) or registered (RN)), years in the workforce, and intended years of work before exiting the workforce. When compared with the broader nursing workforce, a greater proportion of GPNs in 2015 were older (60 v. 51%), worked part-time (65 v. 48%) and worked in regional areas (35 v. 26%). Additionally, the characteristics of GPNs have changed between 2012 and 2015, with an increased proportion of younger nurses, more registered nurses and fewer working in remote areas. To ensure a sustainable workforce, particularly in rural and remote areas, strategies to recruit and retain this workforce will be needed.

  6. Near-real-time cheatgrass percent cover in the Northern Great Basin, USA, 2015

    USGS Publications Warehouse

    Boyte, Stephen; Wylie, Bruce K.

    2016-01-01

    Cheatgrass (Bromus tectorum L.) dramatically changes shrub steppe ecosystems in the Northern Great Basin, United States. Current-season cheatgrass location and percent cover are difficult to estimate rapidly. We explain the development of a near-real-time cheatgrass percent cover dataset and map in the Northern Great Basin for the current year (2015), display the current year's map, provide analysis of the map, and provide a website link to download the map (as a PDF) and the associated dataset. The near-real-time cheatgrass percent cover dataset and map were consistent with non-expedited, historical cheatgrass percent cover datasets and maps. Having cheatgrass maps available mid-summer can help land managers, policy makers, and Geographic Information Systems personnel as they work to protect socially relevant areas such as critical wildlife habitats.

  7. EnviroAtlas - Austin, TX - Atlas Area Boundary

    EPA Pesticide Factsheets

    This EnviroAtlas dataset shows the boundary of the Austin, TX Atlas Area. It represents the outside edge of all the block groups included in the EnviroAtlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  8. EnviroAtlas - Fresno, CA - Riparian Buffer Land Cover by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset describes the percentage of different land cover types within 15- and 50-meters of hydrologically connected streams, rivers, and other water bodies within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  9. Building Bridges Between Geoscience and Data Science through Benchmark Data Sets

    NASA Astrophysics Data System (ADS)

    Thompson, D. R.; Ebert-Uphoff, I.; Demir, I.; Gel, Y.; Hill, M. C.; Karpatne, A.; Güereque, M.; Kumar, V.; Cabral, E.; Smyth, P.

    2017-12-01

    The changing nature of observational field data demands richer and more meaningful collaboration between data scientists and geoscientists. Thus, among other efforts, the Working Group on Case Studies of the NSF-funded RCN on Intelligent Systems Research To Support Geosciences (IS-GEO) is developing a framework to strengthen such collaborations through the creation of benchmark datasets. Benchmark datasets provide an interface between disciplines without requiring extensive background knowledge. The goals are to create (1) a means for two-way communication between geoscience and data science researchers; (2) new collaborations, which may lead to new approaches for data analysis in the geosciences; and (3) a public, permanent repository of complex datasets, representative of geoscience problems, useful to coordinate efforts in research and education. The group identified 10 key elements and characteristics of ideal benchmarks: 1. High impact: a problem with high potential impact. 2. Active research area: a group of geoscientists should be eager to continue working on the topic. 3. Challenge: the problem should be challenging for data scientists. 4. Data science generality and versatility: it should stimulate the development of new, general, and versatile data science methods. 5. Rich information content: ideally, the dataset provides stimulus for analysis at many different levels. 6. Hierarchical problem statement: a hierarchy of suggested analysis tasks, from relatively straightforward to open-ended. 7. Means for evaluating success: data scientists and geoscientists need a way to evaluate whether the algorithms are successful and achieve their intended purpose. 8. Quick start guide: an introduction for data scientists on how to easily read the data, enabling rapid initial exploration. 9. Geoscience context: a summary for data scientists of the specific data collection process, the instruments used, any pre-processing, and the science questions to be answered. 10. Citability: a suitable identifier to facilitate tracking later use of the benchmark, e.g. allowing search engines to find all research papers using it. A first sample benchmark, developed in collaboration with the Jet Propulsion Laboratory (JPL), deals with the automatic analysis of imaging spectrometer data to detect significant methane sources in the atmosphere.

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bagher-Ebadian, H; Chetty, I; Liu, C

    Purpose: To examine the impact of image smoothing and noise on the robustness of textural information extracted from CBCT images for prediction of radiotherapy response for patients with head/neck (H/N) cancers. Methods: CBCT image datasets for 14 patients with H/N cancer treated with radiation (70 Gy in 35 fractions) were investigated. A deformable registration algorithm was used to fuse planning CTs to CBCTs. Tumor volume was automatically segmented on each CBCT image dataset. Local control at 1 year was used to classify 8 patients as responders (R) and 6 as non-responders (NR). A smoothing filter [2D Adaptive Wiener (2DAW) with 3 different windows (ψ=3, 5, and 7)] and two noise models (Poisson and Gaussian, SNR=25) were implemented and independently applied to the CBCT images. Twenty-two textural features, describing the spatial arrangement of voxel intensities calculated from gray-level co-occurrence matrices, were extracted for all tumor volumes. Results: Relative to CBCT images without smoothing, none of the 22 extracted textural features showed significant differences when smoothing was applied (using the 2DAW with filtering parameters of ψ=3 and 5), in either the responder or non-responder group. When smoothing with the 2DAW at ψ=7 was applied, one textural feature, Information Measure of Correlation, was significantly different relative to no smoothing. Only 4 features (Energy, Entropy, Homogeneity, and Maximum-Probability) were found to be statistically different between the R and NR groups (Table 1). These features remained statistically significant discriminators between the R and NR groups in the presence of noise and smoothing. Conclusion: This preliminary work suggests that textural classifiers for response prediction, extracted from H/N CBCT images, are robust to low-power noise and low-pass filtering. While other types of filters will alter the spatial frequencies differently, these results are promising. The current study is subject to Type II errors; a much larger cohort of patients is needed to confirm these results. This work was supported in part by a grant from Varian Medical Systems (Palo Alto, CA).
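
    The gray-level co-occurrence matrix (GLCM) features named above (Energy, Entropy, Homogeneity, Maximum-Probability) are derived from pairwise voxel-intensity counts. The following is a minimal pure-Python sketch for a single 2D slice and a single (0, 1) offset, intended only to illustrate the definitions, not to reproduce the authors' implementation:

```python
from collections import Counter
from math import log2

def glcm_features(img):
    """Symmetric gray-level co-occurrence counts for the (0, 1) offset,
    plus four of the Haralick-style features used for texture analysis."""
    pairs = Counter()
    for row in img:
        for a, b in zip(row, row[1:]):
            pairs[(a, b)] += 1
            pairs[(b, a)] += 1        # symmetric GLCM
    total = sum(pairs.values())
    p = {k: v / total for k, v in pairs.items()}   # normalized probabilities
    return {
        "energy": sum(v * v for v in p.values()),
        "entropy": -sum(v * log2(v) for v in p.values()),
        "homogeneity": sum(v / (1 + abs(i - j)) for (i, j), v in p.items()),
        "max_probability": max(p.values()),
    }

# Toy 4x4 slice with four gray levels (values invented for illustration).
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
feats = glcm_features(img)
```

    A perfectly uniform region yields energy 1 and entropy 0; more heterogeneous textures spread probability mass across the matrix, lowering energy and raising entropy.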

  11. Evaluation of experimental design and computational parameter choices affecting analyses of ChIP-seq and RNA-seq data in undomesticated poplar trees.

    Treesearch

    Lijun Liu; V. Missirian; Matthew S. Zinkgraf; Andrew Groover; V. Filkov

    2014-01-01

    Background: One of the great advantages of next generation sequencing is the ability to generate large genomic datasets for virtually all species, including non-model organisms. It should be possible, in turn, to apply advanced computational approaches to these datasets to develop models of biological processes. In a practical sense, working with non-model organisms...

  12. Architecture of the local spatial data infrastructure for regional climate change research

    NASA Astrophysics Data System (ADS)

    Titov, Alexander; Gordov, Evgeny

    2013-04-01

    Georeferenced datasets (meteorological databases, modeling and reanalysis results, etc.) are actively used in modeling and analysis of climate change on various spatial and temporal scales. Owing to the inherent heterogeneity of environmental datasets, as well as their size, which may reach tens of terabytes for a single dataset, studies of climate and environmental change require special software support based on the SDI approach. A dedicated architecture of a local spatial data infrastructure aimed at regional climate change analysis using modern web mapping technologies is presented. A geoportal is a key element of any SDI, allowing searches for geoinformation resources (datasets and services) using metadata catalogs, producing geospatial data selections by their parameters (data access functionality), and managing services and applications for cartographical visualization. It should be noted that, for objective reasons such as large dataset volume, the complexity of the data models used, and syntactic and semantic differences between datasets, the development of environmental geodata access, processing, and visualization services is quite a complex task. These circumstances were taken into account while designing the architecture of the local spatial data infrastructure as a universal framework providing geodata services. Accordingly, the architecture presented includes: 1. A model for storing big sets of regional georeferenced data that is effective in terms of search, access, retrieval, and subsequent statistical processing, allowing in particular frequently used values (such as monthly and annual climate change indices) to be stored, thus providing different temporal views of the datasets. 2. A general architecture of the corresponding software components handling geospatial datasets within the storage model. 3. A metadata catalog describing, in detail and using the ISO 19115 and CF-convention standards, the datasets used in climate research, as a basic element of the spatial data infrastructure, together with its publication according to the OGC CSW (Catalog Service for the Web) specification. 4. Computational and mapping web services for working with geospatial datasets based on the OWS (OGC Web Services) standards: WMS, WFS, WPS. 5. A geoportal as the key element of the thematic regional spatial data infrastructure, also providing a software framework for the development of dedicated web applications. To realize the web mapping services, GeoServer is used, since it provides a native WPS implementation as a separate software module. To provide geospatial metadata services, GeoNetwork opensource (http://geonetwork-opensource.org) is planned to be used, as it supports the ISO 19115/ISO 19119/ISO 19139 metadata standards as well as the ISO CSW 2.0 profile for both client and server. To implement thematic applications based on geospatial web services within the framework of the local SDI geoportal, the following open-source software has been selected: 1. the OpenLayers JavaScript library, providing basic web mapping functionality for a thin client such as a web browser; 2. the GeoExt/ExtJS JavaScript libraries for building client-side web applications working with geodata services. The web interface developed will be similar to that of popular desktop GIS applications such as uDig, QuantumGIS, etc. The work is partially supported by RF Ministry of Education and Science grant 8345, SB RAS Program VIII.80.2.1, and IP 131.
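
    As a concrete illustration of the OWS-based access layer described above, a client requests a rendered map from a WMS endpoint with a standard GetMap query. The endpoint URL and layer name below are hypothetical placeholders; this is a sketch of the shape of an OGC WMS 1.1.1 request, not code from the geoportal itself:

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, size=(800, 600), srs="EPSG:4326"):
    """Compose an OGC WMS 1.1.1 GetMap request URL.

    bbox is (min_x, min_y, max_x, max_y) in the given SRS.
    """
    params = {
        "service": "WMS",
        "version": "1.1.1",
        "request": "GetMap",
        "layers": layer,
        "bbox": ",".join(str(v) for v in bbox),
        "width": size[0],
        "height": size[1],
        "srs": srs,
        "format": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical GeoServer endpoint and climate layer name.
url = wms_getmap_url("http://example.org/geoserver/wms",
                     "climate:monthly_mean_temperature",
                     (60.0, 50.0, 120.0, 80.0))
```

    A thin client such as OpenLayers issues exactly this kind of request under the hood when it renders a WMS layer in the browser.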

  13. The tragedy of the biodiversity data commons: a data impediment creeping nigher?

    PubMed Central

    Galicia, David; Ariño, Arturo H

    2018-01-01

    Abstract Researchers are embracing the open access movement to facilitate unrestricted availability of scientific results. One sign of this willingness is the steady increase in data freely shared online, which has prompted a corresponding increase in the number of papers using such data. Publishing datasets is a time-consuming process that is often seen as a courtesy, rather than a necessary step in the research process. Making data accessible allows further research, provides basic information for decision-making and contributes to transparency in science. Nevertheless, the ease of access to heaps of data carries a perception of ‘free lunch for all’, and the work of data publishers is largely going unnoticed. Acknowledging the significant effort involved in the creation, management and publication of a dataset remains a flimsy, poorly established practice in the scientific community. In a meta-analysis of published literature, we have observed various dataset citation practices, but mostly (92%) consisting of merely citing the data repository rather than the data publisher. Failing to recognize the work of data publishers might lead to a decrease in the number of quality datasets shared online, compromising potential research that is dependent on the availability of such data. We make an urgent appeal to raise awareness about this issue. PMID:29688384

  14. DCS-SVM: a novel semi-automated method for human brain MR image segmentation.

    PubMed

    Ahmadvand, Ali; Daliri, Mohammad Reza; Hajiali, Mohammadtaghi

    2017-11-27

    In this paper, a novel method is proposed that appropriately segments magnetic resonance (MR) brain images into three main tissues. This paper proposes an extension of our previous work, in which we suggested a combination of multiple classifiers (CMC)-based method named dynamic classifier selection-dynamic local training local Tanimoto index (DCS-DLTLTI) for segmenting MR brain images into the three main cerebral tissues. This idea is used here to develop a novel method that incorporates more complex and accurate classifiers, such as the support vector machine (SVM), into the ensemble. This is challenging because CMC-based methods are time-consuming, especially on huge datasets such as three-dimensional (3D) brain MR images. Moreover, although SVM is a powerful method for modeling datasets with complex feature spaces, it carries a large computational cost for big datasets, especially those with strong interclass variability and more than two classes, such as 3D brain images; SVM therefore cannot be used directly in DCS-DLTLTI. We thus propose a novel approach, named "DCS-SVM", that incorporates SVM into DCS-DLTLTI to improve the accuracy of the segmentation results. The proposed method is applied to the well-known datasets of the Internet Brain Segmentation Repository (IBSR), and promising results are obtained.
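
    The core idea of dynamic classifier selection, choosing whichever ensemble member performs best in the local neighborhood of a test sample, can be sketched in a few lines. The one-dimensional features and toy classifiers below are invented for illustration only and do not reproduce the DCS-DLTLTI or DCS-SVM pipelines:

```python
def dcs_predict(x, classifiers, validation, k=3):
    """Dynamic classifier selection (DCS): score each ensemble member on the
    k validation samples nearest to x, then let the local winner predict x."""
    # validation is a list of (feature, label) pairs.
    neighbors = sorted(validation, key=lambda item: abs(item[0] - x))[:k]
    best = max(classifiers,
               key=lambda c: sum(c(v) == y for v, y in neighbors))
    return best(x)

# Two toy classifiers: one thresholds at 0.5, one always predicts class 1.
clf_a = lambda v: int(v > 0.5)
clf_b = lambda v: 1
validation = [(0.1, 0), (0.2, 0), (0.9, 1), (0.8, 1)]
label = dcs_predict(0.15, [clf_a, clf_b], validation)
```

    Near 0.15 the thresholding classifier is locally perfect, so it is selected and predicts class 0; the always-1 classifier would be preferred only in neighborhoods where it happens to do better.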

  15. What Does Big Data Mean for Wearable Sensor Systems?

    PubMed Central

    Lovell, N. H.; Yang, G. Z.; Horsch, A.; Lukowicz, P.; Murrugarra, L.; Marschollek, M.

    2014-01-01

    Summary Objectives The aim of this paper is to discuss how recent developments in the field of big data may potentially impact the future use of wearable sensor systems in healthcare. Methods The article draws on the scientific literature to support the opinions presented by the IMIA Wearable Sensors in Healthcare Working Group. Results The following is discussed: the potential for wearable sensors to generate big data; how complementary technologies, such as a smartphone, will augment the concept of a wearable sensor and alter the nature of the monitoring data created; how standards would enable sharing of data and advance scientific progress. Importantly, attention is drawn to statistical inference problems for which big datasets provide little assistance, or may hinder the identification of a useful solution. Finally, a discussion is presented on risks to privacy and possible negative consequences arising from intensive wearable sensor monitoring. Conclusions Wearable sensors systems have the potential to generate datasets which are currently beyond our capabilities to easily organize and interpret. In order to successfully utilize wearable sensor data to infer wellbeing, and enable proactive health management, standards and ontologies must be developed which allow for data to be shared between research groups and between commercial systems, promoting the integration of these data into health information systems. However, policy and regulation will be required to ensure that the detailed nature of wearable sensor data is not misused to invade privacies or prejudice against individuals. PMID:25123733

  16. Merging Disparate Data Sources Into a Paleoanthropological Geodatabase for Research, Education, and Conservation in the Greater Hadar Region (Afar, Ethiopia)

    NASA Astrophysics Data System (ADS)

    Campisano, C. J.; Dimaggio, E. N.; Arrowsmith, J. R.; Kimbel, W. H.; Reed, K. E.; Robinson, S. E.; Schoville, B. J.

    2008-12-01

    Understanding the geographic, temporal, and environmental contexts of human evolution requires the ability to compare wide-ranging datasets collected from multiple research disciplines. Paleoanthropological field-research projects are notoriously independent administratively even in regions of high transdisciplinary importance. As a result, valuable opportunities for the integration of new and archival datasets spanning diverse archaeological assemblages, paleontological localities, and stratigraphic sequences are often neglected, which limits the range of research questions that can be addressed. Using geoinformatic tools we integrate spatial, temporal, and semantically disparate paleoanthropological and geological datasets from the Hadar sedimentary basin of the Afar Rift, Ethiopia. Applying newly integrated data to investigations of fossil-rich sediments will provide the geospatial framework critical for addressing fundamental questions concerning hominins and their paleoenvironmental context. We present a preliminary cyberinfrastructure for data management that will allow scientists, students, and interested citizens to interact with, integrate, and visualize data from the Afar region. Examples of our initial integration efforts include generating a regional high-resolution satellite imagery base layer for georeferencing, standardizing and compiling multiple project datasets and digitizing paper maps. We also demonstrate how the robust datasets generated from our work are being incorporated into a new, digital module for Arizona State University's Hadar Paleoanthropology Field School - modernizing field data collection methods, on-the-fly data visualization and query, and subsequent analysis and interpretation.
Armed with a fully fused database tethered to high-resolution satellite imagery, we can more accurately reconstruct spatial and temporal paleoenvironmental conditions and efficiently address key scientific questions, such as those regarding the relative importance of internal and external ecological, climatological, and tectonic forcings on evolutionary change in the fossil record. In close association with colleagues working in neighboring project areas, this work advances multidisciplinary and collaborative research, training, and long-range antiquities conservation in the Hadar region.

  17. Discriminating response groups in metabolic and regulatory pathway networks.

    PubMed

    Van Hemert, John L; Dickerson, Julie A

    2012-04-01

    Analysis of omics experiments generates lists of entities (genes, metabolites, etc.) selected on the basis of specific behavior, such as changes in response to stress or other signals. Functional interpretation of these lists often uses category enrichment tests based on functional annotations such as Gene Ontology terms and pathway membership. This approach does not consider the connected structure of biochemical pathways or the causal directionality of events. The Omics Response Group (ORG) method, described in this work, interprets omics lists in the context of metabolic pathway and regulatory networks using a statistical model for flow within the networks. Statistical results for all response groups are visualized in a novel Pathway Flow plot. The statistical tests are based on the Erlang distribution model, under the assumption of independent and identically Exponential-distributed random-walk flows through pathways. As a proof of concept, we applied our method to an Escherichia coli transcriptomics dataset, where we confirmed common knowledge of the E. coli transcriptional response to Lipid A deprivation. The main response is related to osmotic stress, and we were also able to detect novel responses that are supported by the literature. We also applied our method to an Arabidopsis thaliana expression dataset from an abscisic acid study. In both cases, conventional pathway enrichment tests detected nothing, while our approach discovered biological processes beyond the original studies. We created a prototype for an interactive ORG web tool at http://ecoserver.vrac.iastate.edu/pathwayflow (source code is available from https://subversion.vrac.iastate.edu/Subversion/jlv/public/jlv/pathwayflow). The prototype is described, along with additional figures and tables, in the Supplementary Material. Contact: julied@iastate.edu. Supplementary data are available at Bioinformatics online.
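
    Under the stated assumption of independent, identically Exponential-distributed step flows, the total flow along a k-edge pathway is Erlang-distributed, and its tail probability gives a natural test statistic. A minimal sketch of the Erlang survival function follows; the parameter names are ours for illustration, not taken from the ORG tool:

```python
from math import exp, factorial

def erlang_sf(k, lam, x):
    """Survival function P(X > x) for X ~ Erlang(shape=k, rate=lam),
    i.e. the sum of k independent Exponential(lam) step variables
    along a k-edge pathway. Uses the closed-form Poisson partial sum."""
    return exp(-lam * x) * sum((lam * x) ** n / factorial(n) for n in range(k))

# With k=1 this reduces to the plain exponential tail exp(-lam * x).
p = erlang_sf(3, 1.0, 2.0)
```

    A small survival probability for an observed pathway flow then flags that pathway as carrying more response than the random-walk null model would predict.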

  18. Comparison of Cortical and Subcortical Measurements in Normal Older Adults across Databases and Software Packages

    PubMed Central

    Rane, Swati; Plassard, Andrew; Landman, Bennett A.; Claassen, Daniel O.; Donahue, Manus J.

    2017-01-01

    This work explores the feasibility of combining anatomical MRI data across two public repositories namely, the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Progressive Parkinson’s Markers Initiative (PPMI). We compared cortical thickness and subcortical volumes in cognitively normal older adults between datasets with distinct imaging parameters to assess if they would provide equivalent information. Three distinct datasets were identified. Major differences in data were scanner manufacturer and the use of magnetization inversion to enhance tissue contrast. Equivalent datasets, i.e., those providing similar volumetric measurements in cognitively normal controls, were identified in ADNI and PPMI. These were datasets obtained on the Siemens scanner with TI = 900 ms. Our secondary goal was to assess the agreement between subcortical volumes that are obtained with different software packages. Three subcortical measurement applications (FSL, FreeSurfer, and a recent multi-atlas approach) were compared. Our results show significant agreement in the measurements of caudate, putamen, pallidum, and hippocampus across the packages and poor agreement between measurements of accumbens and amygdala. This is likely due to their smaller size and lack of gray matter-white matter tissue contrast for accurate segmentation. This work provides a segue to combine imaging data from ADNI and PPMI to increase statistical power as well as to interrogate common mechanisms in disparate pathologies such as Alzheimer’s and Parkinson’s diseases. It lays the foundation for comparison of anatomical data acquired with disparate imaging parameters and analyzed with disparate software tools. Furthermore, our work partly explains the variability in the results of studies using different software packages. PMID:29756095

  19. Comparison of Cortical and Subcortical Measurements in Normal Older Adults across Databases and Software Packages.

    PubMed

    Rane, Swati; Plassard, Andrew; Landman, Bennett A; Claassen, Daniel O; Donahue, Manus J

    2017-01-01

    This work explores the feasibility of combining anatomical MRI data across two public repositories namely, the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Progressive Parkinson's Markers Initiative (PPMI). We compared cortical thickness and subcortical volumes in cognitively normal older adults between datasets with distinct imaging parameters to assess if they would provide equivalent information. Three distinct datasets were identified. Major differences in data were scanner manufacturer and the use of magnetization inversion to enhance tissue contrast. Equivalent datasets, i.e., those providing similar volumetric measurements in cognitively normal controls, were identified in ADNI and PPMI. These were datasets obtained on the Siemens scanner with TI = 900 ms. Our secondary goal was to assess the agreement between subcortical volumes that are obtained with different software packages. Three subcortical measurement applications (FSL, FreeSurfer, and a recent multi-atlas approach) were compared. Our results show significant agreement in the measurements of caudate, putamen, pallidum, and hippocampus across the packages and poor agreement between measurements of accumbens and amygdala. This is likely due to their smaller size and lack of gray matter-white matter tissue contrast for accurate segmentation. This work provides a segue to combine imaging data from ADNI and PPMI to increase statistical power as well as to interrogate common mechanisms in disparate pathologies such as Alzheimer's and Parkinson's diseases. It lays the foundation for comparison of anatomical data acquired with disparate imaging parameters and analyzed with disparate software tools. Furthermore, our work partly explains the variability in the results of studies using different software packages.

  20. Dental age assessment of southern Chinese using the United Kingdom Caucasian reference dataset.

    PubMed

    Jayaraman, Jayakumar; Roberts, Graham J; King, Nigel M; Wong, Hai Ming

    2012-03-10

    Dental age assessment is one of the most accurate methods for estimating the age of an unknown person. Demirjian's dataset on a French-Canadian population has been widely tested for its applicability to various ethnic groups, including the southern Chinese. Following inaccurate results from these studies, investigators are now confronted with using alternative datasets for comparison. Testing the applicability of other reliable datasets that yield accurate findings might limit the need to develop population-specific standards. Recently, a Reference Data Set (RDS) similar to Demirjian's was prepared in the United Kingdom (UK) and has subsequently been validated. The advantages of the UK Caucasian RDS include its versatility, covering both the maxillary and mandibular dentitions, its evaluation of subjects across a wide age range, and the possibility of precise age estimation with the mathematical technique of meta-analysis. The aim of this study was to evaluate the applicability of the UK Caucasian RDS to southern Chinese subjects. Dental panoramic tomographs (DPT) of 266 subjects (133 males and 133 females) aged 2-21 years, previously taken for clinical diagnostic purposes, were selected and scored by a single calibrated examiner based on Demirjian's classification of tooth developmental stages (A-H). The ages corresponding to each tooth developmental stage were obtained from the UK dataset. Intra-examiner reproducibility was tested, and the Cohen kappa (0.88) showed that the level of agreement was 'almost perfect'. The estimated dental age was then compared with the chronological age using a paired t-test, with statistical significance set at p<0.01. The results showed that the UK dataset underestimated the age of southern Chinese subjects by 0.24 years, although the difference was not statistically significant. In conclusion, the UK Caucasian RDS may not be suitable for estimating the age of southern Chinese subjects, and there is a need for an ethnicity-specific reference dataset for the southern Chinese. Copyright © 2011. Published by Elsevier Ireland Ltd.
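
    The comparison described above, estimated dental age versus chronological age within the same subjects, is a paired t-test on per-subject differences. A small stdlib sketch follows; the sample ages are made up for illustration and are not the study's data:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(estimated, chronological):
    """Paired t statistic and mean bias for per-subject
    (estimated - chronological) age differences."""
    d = [e - c for e, c in zip(estimated, chronological)]
    t = mean(d) / (stdev(d) / sqrt(len(d)))   # stdev is the sample (n-1) SD
    return t, mean(d)

# Hypothetical ages (years) for five subjects.
t_stat, bias = paired_t([8.1, 10.4, 12.0, 14.2, 16.3],
                        [8.5, 10.6, 12.1, 14.9, 16.5])
```

    A negative mean difference, as in the study's 0.24-year result, indicates systematic underestimation; the t statistic is then compared against the chosen significance threshold.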

  1. Filling in the gaps: estimating numbers of chlamydia tests and diagnoses by age group and sex before and during the implementation of the English National Screening Programme, 2000 to 2012.

    PubMed

    Chandra, Nastassya L; Soldan, Kate; Dangerfield, Ciara; Sile, Bersabeh; Duffell, Stephen; Talebi, Alireza; Choi, Yoon H; Hughes, Gwenda; Woodhall, Sarah C

    2017-02-02

    To inform mathematical modelling of the impact of chlamydia screening in England since 2000, a complete picture of chlamydia testing is needed. Monitoring and surveillance systems evolved between 2000 and 2012. Since 2012, data on publicly funded chlamydia tests and diagnoses have been collected nationally. However, gaps exist for earlier years. We collated available data on chlamydia testing and diagnosis rates among 15-44-year-olds by sex and age group for 2000-2012. Where data were unavailable, we applied data- and evidence-based assumptions to construct plausible minimum and maximum estimates and set bounds on uncertainty. There was a large range between estimates in years when datasets were less comprehensive (2000-2008); smaller ranges were seen thereafter. In 15-19-year-old women in 2000, the estimated diagnosis rate ranged between 891 and 2,489 diagnoses per 100,000 persons. Testing and diagnosis rates increased between 2000 and 2012 in women and men across all age groups using minimum or maximum estimates, with greatest increases seen among 15-24-year-olds. Our dataset can be used to parameterise and validate mathematical models and serve as a reference dataset to which trends in chlamydia-related complications can be compared. Our analysis highlights the complexities of combining monitoring and surveillance datasets. This article is copyright of The Authors, 2017.

  2. Benchmarking protein classification algorithms via supervised cross-validation.

    PubMed

    Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor

    2008-04-24

    Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
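
    The supervised selection of train and test sets, withholding entire known subtypes so that evaluation measures generalization to unseen subgroups, can be sketched as a leave-one-subtype-out split. The names and toy data below are illustrative and are not the benchmark collection's API:

```python
def leave_one_subtype_out(samples):
    """Supervised cross-validation split: each fold withholds one known
    subtype (e.g. a SCOP family), so test proteins come only from a
    subgroup never seen in training.

    `samples` maps sample id -> (class_label, subtype_label)."""
    subtypes = sorted({sub for _, sub in samples.values()})
    for held_out in subtypes:
        train = [s for s, (_, sub) in samples.items() if sub != held_out]
        test = [s for s, (_, sub) in samples.items() if sub == held_out]
        yield held_out, train, test

# Toy data: two classes, three subtypes.
samples = {
    "p1": ("kinase", "fam_a"), "p2": ("kinase", "fam_a"),
    "p3": ("kinase", "fam_b"), "p4": ("protease", "fam_c"),
}
folds = list(leave_one_subtype_out(samples))
```

    Because every test sample belongs to a subtype absent from training, accuracy on these folds is a lower, more realistic estimate of performance on novel, distantly related proteins than random k-fold splits give.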

  3. Atlas-Guided Cluster Analysis of Large Tractography Datasets

    PubMed Central

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment. PMID:24386292

  4. Toward Computational Cumulative Biology by Combining Models of Biological Datasets

    PubMed Central

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176

  6. Successful Design Patterns in the Day-to-Day Work with Planetary Mission Data

    NASA Astrophysics Data System (ADS)

    Aye, K.-M.

    2018-04-01

    I will describe successful data storage, data access, and data processing techniques, such as embarrassingly parallel processing, that I have established over years of working with large datasets in the planetary science domain, using Jupyter notebooks.
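
    The essence of the embarrassingly parallel pattern named above is a side-effect-free function mapped over independent inputs. A minimal sketch (a thread pool is used here for portability; for CPU-bound crunching one would typically swap in `multiprocessing.Pool`, and `process_granule` is a hypothetical stand-in for per-file work):

```python
from multiprocessing.pool import ThreadPool

def process_granule(granule_id):
    # Hypothetical stand-in for independent per-file work
    # (e.g. reading and reducing one observation file).
    return granule_id * granule_id

def run_all(granules, workers=4):
    # Embarrassingly parallel: no shared state between tasks,
    # so a plain pool.map distributes the work with no coordination.
    with ThreadPool(workers) as pool:
        return pool.map(process_granule, list(granules))

results = run_all(range(8))
```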

  7. A collection of non-human primate computed tomography scans housed in MorphoSource, a repository for 3D data

    PubMed Central

    Copes, Lynn E.; Lucas, Lynn M.; Thostenson, James O.; Hoekstra, Hopi E.; Boyer, Doug M.

    2016-01-01

    A dataset of high-resolution microCT scans of primate skulls (crania and mandibles) and certain postcranial elements was collected to address questions about primate skull morphology. The sample consists of 489 scans taken from 431 specimens, representing 59 species from most primate families. These data have transformative reuse potential, as such datasets are necessary for conducting high-power research into primate evolution, but require significant time and funding to collect. Similar datasets were previously only available to select research groups across the world. The physical specimens are vouchered at Harvard’s Museum of Comparative Zoology. The data collection took place at the Center for Nanoscale Systems at Harvard. The dataset is archived on MorphoSource.org. Though this is the largest high-fidelity comparative dataset yet available, its provisioning on a web archive that allows unlimited researcher contributions promises a future with vastly increased digital collections available at researchers’ fingertips. PMID:26836025

  8. An extensive dataset of eye movements during viewing of complex images.

    PubMed

    Wilming, Niklas; Onat, Selim; Ossandón, José P; Açık, Alper; Kietzmann, Tim C; Kaspar, Kai; Gameiro, Ricardo R; Vormberg, Alexandra; König, Peter

    2017-01-31

    We present a dataset of free-viewing eye-movement recordings that contains more than 2.7 million fixation locations from 949 observers on more than 1000 images from different categories. This dataset aggregates and harmonizes data from 23 different studies conducted at the Institute of Cognitive Science at Osnabrück University and the University Medical Center in Hamburg-Eppendorf. Trained personnel recorded all studies under standard conditions with homogeneous equipment and parameter settings. All studies allowed for free eye-movements, and differed in the age range of participants (~7-80 years), stimulus sizes, stimulus modifications (phase scrambled, spatial filtering, mirrored), and stimulus categories (natural and urban scenes, web sites, fractals, pink noise, and ambiguous artistic figures). The size and variability of viewing behavior within this dataset presents a strong opportunity for evaluating and comparing computational models of overt attention, and furthermore, for thoroughly quantifying strategies of viewing behavior. This also makes the dataset a good starting point for investigating whether viewing strategies change in patient groups.

  9. Pride and Prejudice: Racial Contacts Mediating the Change of In-Group and Out-Group Racial Perceptions

    ERIC Educational Resources Information Center

    Zhou, Ji

    2012-01-01

    Using the National Longitudinal Survey of Freshmen dataset, this study examined how students' within- and between-group racial contacts mediated the change of in-group and out-group racial perceptions across White, Black, Latino, and Asian students. This study was grounded in intergroup contact theory and employed multi-trait multi-method…

  10. Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets

    NASA Astrophysics Data System (ADS)

    Day-Lewis, F. D.; Slater, L. D.; Johnson, T.

    2012-12-01

    Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
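
    As a concrete instance of the time-series tools listed in this abstract, normalized cross-correlation at integer lags can reveal the delay between a hydrologic forcing and a geophysical response. A stdlib-only sketch on synthetic signals (illustrative; real analyses would detrend the data and handle non-stationarity, e.g. with the wavelet or Stockwell transforms mentioned above):

```python
def cross_correlation(x, y, max_lag):
    # Normalized cross-correlation of two equal-length series at
    # integer lags in [-max_lag, max_lag]; the lag with the largest
    # coefficient estimates the delay of y relative to x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        s = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                s += (x[i] - mx) * (y[j] - my)
        out[lag] = s / (sx * sy)
    return out

# A signal and a copy delayed by 2 samples correlate best at lag 2.
x = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
y = [0.0] * 2 + x[:-2]
cc = cross_correlation(x, y, max_lag=3)
best = max(cc, key=cc.get)
```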

  11. Open source platform for collaborative construction of wearable sensor datasets for human motion analysis and an application for gait analysis.

    PubMed

    Llamas, César; González, Manuel A; Hernández, Carmen; Vegas, Jesús

    2016-10-01

    Nearly every practical improvement in modeling human motion is well founded in a properly designed collection of data or datasets. These datasets must be made publicly available so that the community can validate and accept them. It is reasonable to concede that a collective, guided enterprise could serve to devise solid and substantial datasets, as a result of a collaborative effort, in the same sense as the open software community does. In this way datasets could be complemented, extended and expanded in size with, for example, more individuals, samples and human actions. For this to be possible, the collaborators must make some commitments, one of them being to share the same data acquisition platform. In this paper, we offer an affordable open source hardware and software platform based on inertial wearable sensors so that several groups can cooperate in the construction of datasets through common software suitable for collaboration. Some experimental results on the throughput of the overall system are reported, showing the feasibility of acquiring data from up to 6 sensors with a sampling frequency of no less than 118 Hz. Also, a proof-of-concept dataset is provided comprising sampled data from 12 subjects suitable for gait analysis. Copyright © 2016 Elsevier Inc. All rights reserved.

  12. Sea Ice Mass Balance Buoys (IMBs): First Results from a Data Processing Intercomparison Study

    NASA Astrophysics Data System (ADS)

    Hoppmann, Mario; Tiemann, Louisa; Itkin, Polona

    2017-04-01

    IMBs are autonomous instruments able to continuously monitor the growth and melt of sea ice and its snow cover at a single point on an ice floe. Complementing field expeditions, remote sensing observations and modelling studies, these in-situ data are crucial to assess the mass balance and seasonal evolution of sea ice and snow in the polar oceans. Established subtypes of IMBs combine coarse-resolution temperature profiles through air, snow, ice and ocean with ultrasonic pingers to detect snow accumulation and ice thermodynamic growth. Recent technological advancements enable the use of high-resolution temperature chains, which are also able to identify the surrounding medium through a "heating cycle". The temperature change during this heating cycle provides additional information on the internal properties and processes of the ice. However, a unified data processing technique to reliably and accurately determine sea ice thickness and snow depth from this kind of data is still missing, and an unambiguous interpretation remains a challenge. Following the need to improve techniques for remotely measuring sea ice mass balance, an international IMB working group has recently been established. The main goals are 1) to coordinate IMB deployments, 2) to enhance current IMB data processing and interpretation techniques, and 3) to provide standardized IMB data products to a broader community. Here we present first results from two different data processing algorithms, applied to selected IMB datasets from the Arctic and Antarctic. Their performance with regard to sea ice thickness and snow depth retrieval is evaluated, and an uncertainty is determined. Although several challenges and caveats in IMB data processing and interpretation are found, such datasets bear great potential and yield plenty of useful information about sea ice properties and processes. It is planned to include many more algorithms from contributors within the working group, and we explicitly invite other interested scientists to join this promising effort.

  13. Pattern Genes Suggest Functional Connectivity of Organs

    NASA Astrophysics Data System (ADS)

    Qin, Yangmei; Pan, Jianbo; Cai, Meichun; Yao, Lixia; Ji, Zhiliang

    2016-05-01

    The human organ, as the basic structural and functional unit of the human body, is made of a large community of different cell types that are organically bound together. Each organ usually exerts a highly specialized physiological function, while several related organs work together to perform complicated body functions. In this study, we present a computational effort to understand the roles of genes in building functional connections between organs. More specifically, we mined multiple transcriptome datasets sampled from 36 human organs and tissues, and quantitatively identified 3,149 genes whose expression showed consensus modular patterns: specific to one organ/tissue, selectively expressed in several functionally related tissues, or ubiquitously expressed. These pattern genes imply intrinsic connections between organs. According to the expression abundance of the 766 selective genes, we consistently cluster the 36 human organs/tissues into seven functional groups: adipose & gland, brain, muscle, immune, metabolism, mucoid and nerve conduction. The organs and tissues in each group either work together to form organ systems or coordinate to perform particular body functions. The particular roles of specific genes and selective genes suggest that they could not only be used to mechanistically explore organ functions, but also be designed for selective biomarkers and therapeutic targets.

  14. In-the-wild facial expression recognition in extreme poses

    NASA Astrophysics Data System (ADS)

    Yang, Fei; Zhang, Qian; Zheng, Chi; Qiu, Guoping

    2018-04-01

    Facial expression recognition is an active research problem in computer vision. In recent years, the research focus has moved from lab environments to in-the-wild circumstances, which are challenging, especially under extreme poses. Current expression detection systems typically try to avoid pose effects in order to remain generally applicable. In this work, we take the opposite approach: we consider head poses explicitly and detect expressions within specific head poses. Our work includes two parts: detecting the head pose and grouping it into one pre-defined head pose class, and performing facial expression recognition within each pose class. Our experiments show that the recognition results with pose class grouping are much better than those of direct recognition without considering poses. We combine hand-crafted features (SIFT, LBP and geometric features) with deep learning features as the representation of the expressions. The hand-crafted features are added into the deep learning framework along with the high-level deep learning features. As a comparison, we implement SVM and random forest as the prediction models. To train and test our methodology, we labeled the face dataset with 6 basic expressions.

  15. Cloud-Based Mobile Application Development Tools and NASA Science Datasets

    NASA Astrophysics Data System (ADS)

    Oostra, D.; Lewis, P. M.; Chambers, L. H.; Moore, S. W.

    2011-12-01

    A number of cloud-based visual development tools have emerged that provide methods for developing mobile applications quickly and without previous programming experience. This paper will explore how our new and current data users can best combine these cloud-based mobile application tools and available NASA climate science datasets. Our vision is that users will create their own mobile applications for visualizing our data and will develop tools for their own needs. The approach we are documenting is based on two main ideas. The first is to provide training and information. Through examples, sharing experiences, and providing workshops, users can be shown how to use free online tools to easily create mobile applications that interact with NASA datasets. The second approach is to provide application programming interfaces (APIs), databases, and web applications to access data in a way that educators, students and scientists can quickly integrate it into their own mobile application development. This framework allows us to foster development activities and boost interaction with NASA's data while saving resources that would be required for a large internal application development staff. The findings of this work will include data gathered through meetings with local data providers, educators, libraries and individuals. From the very first queries into this topic, a high level of interest has been identified from our groups of users. This overt interest, combined with the marked popularity of mobile applications, has created a new channel for outreach and communications between the science and education communities. As a result, we would like to offer educators and other stakeholders some insight into the mobile application development arena, and provide some next steps and new approaches. Our hope is that, through our efforts, we will broaden the scope and usage of NASA's climate science data by providing new ways to access environmentally relevant datasets.

  16. Monitoring and long-term assessment of the Mediterranean Sea physical state

    NASA Astrophysics Data System (ADS)

    Simoncelli, Simona; Fratianni, Claudia; Clementi, Emanuela; Drudi, Massimiliano; Pistoia, Jenny; Grandi, Alessandro; Del Rosso, Damiano

    2017-04-01

    The near real time monitoring and long-term assessment of the physical state of the ocean are crucial for the wide CMEMS user community, providing a continuous and up-to-date overview of key indicators computed from operational analysis and reanalysis datasets. This constitutes an operational warning system for particular events, stimulating research towards a deeper understanding of them and consequently increasing the uptake of CMEMS products. Ocean Monitoring Indicators (OMIs) of some Essential Ocean Variables have been identified and developed by the Mediterranean Monitoring and Forecasting Centre (MED-MFC) under the umbrella of the CMEMS MYP WG (Multi Year Products Working Group). These OMIs have been operationally implemented starting from the physical reanalysis products and then applied to the operational analyses product. Sea surface temperature, salinity and height, as well as heat, water and momentum fluxes at the air-sea interface, have been operationally implemented since the development of the reanalysis system as real-time monitoring of the data production. Their consistency analysis against available observational products or budget values recognized in the literature guarantees the high quality of the numerical dataset. The results of the reanalysis validation procedures have been published yearly since 2014 in the QUality Information Document available through the CMEMS catalogue (http://marine.copernicus.eu), together with the yearly dataset extension. New OMIs of the winter mixed layer depth, the eddy kinetic energy and the heat content will be presented; in particular, we will analyze their time evolution and trends starting from 1987, and then focus on the recent period 2013-2016, when the reanalysis and analyses datasets overlap, to show their consistency despite their different system implementations (i.e. atmospheric forcing, wave coupling, nesting). Finally, the focus will be on the 2016 sea state and circulation of the Mediterranean Sea and its anomaly with respect to the climatological fields, in order to detect early the peculiarities of 2016.

  17. Extended Kd distributions for freshwater environment.

    PubMed

    Boyer, Patrick; Wells, Claire; Howard, Brenda

    2018-06-18

    Many of the freshwater Kd values required for quantifying radionuclide transfer in the environment (e.g. ERICA Tool, Symbiose modelling platform) are either poorly reported in the literature or not available. To partially address this deficiency, Working Group 4 of the IAEA program MODARIA (2012-2015) has completed an update of the freshwater Kd databases and Kd distributions given in TRS 472 (IAEA, 2010). Over 2300 new values for 27 new elements were added to the dataset, and 270 new Kd values were added for the 25 elements already included in TRS 472 (IAEA, 2010). For 49 chemical elements, the Kd values have been classified according to three solid-liquid exchange conditions (adsorption, desorption and field), as was previously carried out in TRS 472. Additionally, the Kd values were classified into two environmental components (suspended and deposited sediments). Each combination (radionuclide × component × condition) was associated with a log-normal distribution when there were at least ten Kd values in the dataset, and with a geometric mean when there were fewer than ten values. The enhanced Kd dataset shows that Kd values for suspended sediments are significantly higher than for deposited sediments, and that the variability of Kd distributions is higher for deposited than for suspended sediments. For suspended sediments in field conditions, the variability of Kd distributions can be significantly reduced as a function of the suspended load, which explains more than 50% of the variability of the Kd datasets of U, Si, Mo, Pb, S, Se, Cd, Ca, B, K, Ra and Po. The distinction between adsorption and desorption conditions is justified for deterministic calculations because the geometric means are systematically greater in desorption conditions. Conversely, this distinction is less relevant for probabilistic calculations due to the systematic overlapping between the Kd distributions of these two conditions. Copyright © 2018. Published by Elsevier Ltd.
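
    The summary rule described in this abstract (a log-normal distribution when at least ten Kd values are available for a combination, otherwise a geometric mean only) can be sketched as follows; the function name and return format are illustrative, not taken from the MODARIA deliverable:

```python
import math
import statistics

def summarize_kd(values):
    # Rule from the abstract: fit a log-normal distribution (report
    # the mean and sd of log-Kd) when at least ten values exist for a
    # (radionuclide, component, condition) combination; otherwise
    # report only the geometric mean.
    logs = [math.log(v) for v in values]
    gm = math.exp(statistics.fmean(logs))
    if len(values) >= 10:
        return {"dist": "log-normal",
                "geometric_mean": gm,
                "log_mean": statistics.fmean(logs),
                "log_sd": statistics.stdev(logs)}
    return {"dist": "geometric-mean-only", "geometric_mean": gm}

small = summarize_kd([10.0, 100.0])                     # too few values
large = summarize_kd([float(i) for i in range(1, 13)])  # 12 values
```

    The geometric mean is the natural central value here because Kd values commonly span orders of magnitude.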

  18. Check your biosignals here: a new dataset for off-the-person ECG biometrics.

    PubMed

    da Silva, Hugo Plácido; Lourenço, André; Fred, Ana; Raposo, Nuno; Aires-de-Sousa, Marta

    2014-02-01

    The Check Your Biosignals Here initiative (CYBHi) was developed as a way of creating a dataset and consistently repeatable acquisition framework, to further extend research in electrocardiographic (ECG) biometrics. In particular, our work targets the novel trend towards off-the-person data acquisition, which opens a broad new set of challenges and opportunities both for research and industry. While datasets with ECG signals collected using medical grade equipment at the chest can be easily found, for off-the-person ECG data the solution is generally for each team to collect their own corpus at considerable expense of resources. In this paper we describe the context, experimental considerations, methods, and preliminary findings of two public datasets created by our team, one for short-term and another for long-term assessment, with ECG data collected at the hand palms and fingers. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  19. An Ensemble Multilabel Classification for Disease Risk Prediction

    PubMed Central

    Liu, Wei; Zhao, Hongling; Zhang, Chaoyang

    2017-01-01

    It is important to identify and prevent disease risk as early as possible through regular physical examinations. We formulate disease risk prediction as a multilabel classification problem. A novel Ensemble Label Power-set Pruned datasets Joint Decomposition (ELPPJD) method is proposed in this work. First, we transform the multilabel classification into a multiclass classification. Then, we propose the pruned datasets and joint decomposition methods to deal with the imbalanced learning problem. Two strategies, size balanced (SB) and label similarity (LS), are designed to decompose the training dataset. In the experiments, the dataset comes from real physical examination records. We contrast the performance of the ELPPJD method with the two different decomposition strategies. Moreover, a comparison between ELPPJD and the classic multilabel classification methods RAkEL and HOMER is carried out. The experimental results show that the ELPPJD method with the label similarity strategy has outstanding performance. PMID:29065647
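
    The first step of ELPPJD, turning a multilabel problem into a multiclass one via the label power-set, can be sketched as follows (the disease labels are invented for illustration; the pruning and SB/LS decomposition steps are not shown):

```python
def label_powerset(label_sets):
    # Label power-set transformation: each distinct combination of
    # labels becomes one class of an ordinary multiclass problem.
    # Returns the class id per example and the combination->id mapping.
    mapping = {}
    classes = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in mapping:
            mapping[key] = len(mapping)
        classes.append(mapping[key])
    return classes, mapping

# Hypothetical per-patient label sets from examination records.
y = [{"diabetes"}, {"diabetes", "hypertension"}, {"diabetes"}, set()]
classes, mapping = label_powerset(y)
```

    The price of this transformation is class imbalance and rare combinations, which is exactly what the pruning and joint decomposition steps of ELPPJD are designed to address.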

  20. Detection of drug active ingredients by chemometric processing of solid-state NMR spectrometry data -- the case of acetaminophen.

    PubMed

    Paradowska, Katarzyna; Jamróz, Marta Katarzyna; Kobyłka, Mariola; Gowin, Ewelina; Maczka, Paulina; Skibiński, Robert; Komsta, Łukasz

    2012-01-01

    This paper presents a preliminary study in building discriminant models from solid-state NMR spectrometry data to detect the presence of acetaminophen in over-the-counter pharmaceutical formulations. The dataset, containing 11 spectra of pure substances and 21 spectra of various formulations, was processed by partial least squares discriminant analysis (PLS-DA). The resulting model coped well with the discrimination task, and its quality parameters were acceptable. It was found that standard normal variate preprocessing had almost no influence on unsupervised investigation of the dataset. The influence of variable selection with the uninformative variable elimination by PLS method was studied, reducing the dataset from 7601 variables to around 300 informative variables, but not improving the model performance. The results showed the possibility of constructing well-performing PLS-DA models from such small datasets without a full experimental design.
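
    As a sketch of the PLS-DA idea (not the authors' actual model, which worked on 7601-variable NMR spectra), a one-component PLS1 discriminant fit can be written from scratch: project the centered data onto the covariance-maximizing weight vector, regress the class label on the resulting score, and classify by thresholding the prediction at 0.5. The toy one-feature data below are invented for illustration:

```python
def pls_da_fit(X, y):
    # One-component PLS1 for a two-class problem.
    # X: list of feature rows; y: 0/1 class labels.
    n, p = len(X), len(X[0])
    xm = [sum(row[j] for row in X) / n for j in range(p)]
    ym = sum(y) / n
    Xc = [[row[j] - xm[j] for j in range(p)] for row in X]
    yc = [v - ym for v in y]
    # Weight vector w proportional to X^T y, then score t = X w.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regression coefficient of y on the score.
    b = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return {"w": w, "b": b, "xm": xm, "ym": ym}

def pls_da_predict(model, row):
    score = sum((row[j] - model["xm"][j]) * model["w"][j]
                for j in range(len(row)))
    return 1 if score * model["b"] + model["ym"] >= 0.5 else 0

model = pls_da_fit([[0.0], [1.0], [4.0], [5.0]], [0, 0, 1, 1])
```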

  1. Persistent Identifiers Implementation in EOSDIS

    NASA Technical Reports Server (NTRS)

    Ramapriyan, H. K. "Rama"

    2016-01-01

    This presentation provides the motivation for and status of implementation of persistent identifiers in NASA's Earth Observation System Data and Information System (EOSDIS). The motivation is provided from the point of view of long-term preservation of datasets such that a number of questions raised by current and future users can be answered easily and precisely. A number of artifacts need to be preserved along with datasets to make this possible, especially when the authors of datasets are no longer available to address users' questions. The artifacts and datasets need to be uniquely and persistently identified and linked with each other for full traceability, understandability and scientific reproducibility. Current work in the Earth Science Data and Information System (ESDIS) Project and the Distributed Active Archive Centers (DAACs) in assigning Digital Object Identifiers (DOI) is discussed, as well as challenges that remain to be addressed in the future.

  2. Part-time work among older workers with disabilities in Europe.

    PubMed

    Pagán, R

    2009-05-01

    To analyse the use of part-time work among older workers with disabilities compared with their non-disabled counterparts within a European context. The study design was cross-sectional. Data were drawn from the 2004 Survey of Health, Ageing and Retirement in Europe. The key advantage of this dataset is that it provides a harmonized cross-national dimension, and contains information for European individuals aged 50 years or over on a wide range of health indicators, disability, socio-economic situation, social relations, etc. Older people with disabilities (aged 50-64 years) are more likely to have a part-time job compared with their non-disabled counterparts. Although there is an important employment gap between the two groups, many older workers with disabilities use part-time work to achieve a better balance between their health status and working life. The econometric analysis corroborated that being disabled has a positive effect on the probability of working on a part-time basis, although this effect varies by country. Policy makers must encourage part-time employment as a means of increasing employment opportunities for older workers with disabilities, and support gradual retirement opportunities with flexible and reduced working hours. It is crucial to change attitudes towards older people with disabilities in order to increase their labour participation and reduce their levels of poverty and marginalization.

  3. [Volunteer work and potential volunteer work among 55 to 70-year-olds in Germany].

    PubMed

    Micheel, Frank

    2017-02-01

    The aim of this article is to describe the potential with respect to volunteer work among 55 to 70-year-old persons along with a two-dimensional typology (actual volunteer work and intention of volunteering or expanding actual volunteer work) and to identify the influencing factors. Based on the dataset from the Transitions and Old Age Potential (TOP) study, a total of 4421 men and women born between 1942 and 1958 were included. A multinomial regression model showed the predictors for group affiliation along with an engagement-related typology (internal, utilized and external volunteer potential as well as definite non-volunteers). More than half of the persons in the study sample could be classified as internal or external volunteer potential. Volunteers and potential volunteers revealed more similarities regarding resources and social factors than potential volunteers and definite non-volunteers. Potential volunteers were more active in other informal fields of activity (e.g. nursing or child care) than definite non-volunteers. With respect to volunteer work, definite non-volunteers showed various social disadvantages (in particular with respect to education and health) compared to (potential) volunteers. Other informal activities, e.g. nursing or child care, did not seem to be in major conflict with volunteer activities as long as they were carried out with moderate or low intensity.

  4. Differences in stroke and ischemic heart disease mortality by occupation and industry among Japanese working-aged men.

    PubMed

    Wada, Koji; Eguchi, Hisashi; Prieto-Merino, David

    2016-12-01

    Occupation- and industry-based risks for stroke and ischemic heart disease may vary among Japanese working-aged men. We examined the differences in mortality rates between stroke and ischemic heart disease by occupation and industry among employed Japanese men aged 25-59 years. In 2010, we obtained occupation- and industry-specific vital statistics data from the Japanese Ministry of Health, Labour, and Welfare dataset. We analyzed data for Japanese men who were aged 25-59 years in 2010, grouped in 5-year age intervals. We estimated the mortality rates of stroke and ischemic heart disease in each age group for occupation and industry categories as defined in the national census. We did not have detailed individual-level variables. We used the number of employees in 2010 as the denominator and the number of events as the numerator, assuming a Poisson distribution. We conducted separate regression models to estimate the incidence relative risk for stroke and ischemic heart disease for each category compared with the reference categories "sales" (occupation) and "wholesale and retail" (industry). When compared with the reference groups, we found that occupations and industries with a relatively higher risk of stroke and ischemic heart disease were: service, administrative and managerial, agriculture and fisheries, construction and mining, electricity and gas, transport, and professional and engineering. This suggests there are occupation- and industry-based mortality risk differences of stroke and ischemic heart disease for Japanese working-aged men. These differences in risk might be explained by factors associated with specific occupations or industries, such as lifestyles or work styles, which should be explored in further research. The mortality risk differences of stroke and ischemic heart disease shown in the present study may reflect an excessive risk of Karoshi (death from overwork).
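
    The core quantity in this design, a category's event rate relative to the reference category's, reduces to a ratio of crude rates. A minimal sketch with invented counts (the paper's Poisson regression with a log-employee offset would additionally give confidence intervals and age stratification, which this omits):

```python
def rate_per_100k(deaths, employed):
    # Crude mortality rate: events over the employed population.
    return 1e5 * deaths / employed

def relative_risk(group, reference):
    # Incidence rate ratio of a category versus the reference
    # category (e.g. "sales" / "wholesale and retail" in the paper).
    # Each argument is a (deaths, employed) pair; counts here are
    # hypothetical, not the vital-statistics data.
    return (group[0] / group[1]) / (reference[0] / reference[1])

rr = relative_risk((30, 50_000), (20, 100_000))
```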

  5. Large Scale Survey Data in Career Development Research

    ERIC Educational Resources Information Center

    Diemer, Matthew A.

    2008-01-01

    Large scale survey datasets have been underutilized but offer numerous advantages for career development scholars, as they contain numerous career development constructs with large and diverse samples that are followed longitudinally. Constructs such as work salience, vocational expectations, educational expectations, work satisfaction, and…

  6. Advanced Neuropsychological Diagnostics Infrastructure (ANDI): A Normative Database Created from Control Datasets

    PubMed Central

    de Vent, Nathalie R.; Agelink van Rentergem, Joost A.; Schmand, Ben A.; Murre, Jaap M. J.; Huizenga, Hilde M.

    2016-01-01

    In the Advanced Neuropsychological Diagnostics Infrastructure (ANDI), datasets of several research groups are combined into a single database, containing scores on neuropsychological tests from healthy participants. For most popular neuropsychological tests, the quantity and range of these data surpass those of traditional normative data, thereby enabling more accurate neuropsychological assessment. Because of the unique structure of the database, it facilitates normative comparison methods that were not feasible before, in particular those in which entire profiles of scores are evaluated. In this article, we describe the steps that were necessary to combine the separate datasets into a single database. These steps involve matching variables from multiple datasets, removing outlying values, determining the influence of demographic variables, and finding appropriate transformations to normality. Also, a brief description of the current contents of the ANDI database is given. PMID:27812340
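
    Of the pipeline steps listed (variable matching, outlier removal, demographic adjustment, transformation to normality), outlier removal is the easiest to sketch. One common choice is a modified z-score based on the median and MAD, which is robust to the outliers themselves; the article does not specify ANDI's exact criterion, so this is illustrative:

```python
import statistics

def remove_outliers(scores, cutoff=3.5):
    # Drop values whose modified z-score exceeds the cutoff.
    # Median/MAD-based, so a single extreme value cannot inflate
    # the scale estimate and mask itself (unlike mean/sd rules).
    med = statistics.median(scores)
    mad = statistics.median([abs(v - med) for v in scores])
    return [v for v in scores
            if mad == 0 or abs(0.6745 * (v - med) / mad) <= cutoff]

clean = remove_outliers([10, 11, 9, 10, 12, 95])
```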

  7. Advanced Neuropsychological Diagnostics Infrastructure (ANDI): A Normative Database Created from Control Datasets.

    PubMed

    de Vent, Nathalie R; Agelink van Rentergem, Joost A; Schmand, Ben A; Murre, Jaap M J; Huizenga, Hilde M

    2016-01-01

    In the Advanced Neuropsychological Diagnostics Infrastructure (ANDI), datasets of several research groups are combined into a single database, containing scores on neuropsychological tests from healthy participants. For most popular neuropsychological tests, the quantity and range of these data surpass those of traditional normative data, thereby enabling more accurate neuropsychological assessment. Because of the unique structure of the database, it facilitates normative comparison methods that were not feasible before, in particular those in which entire profiles of scores are evaluated. In this article, we describe the steps that were necessary to combine the separate datasets into a single database. These steps involve matching variables from multiple datasets, removing outlying values, determining the influence of demographic variables, and finding appropriate transformations to normality. Also, a brief description of the current contents of the ANDI database is given.
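
    As a minimal sketch of the demographics-adjusted normative comparison this record describes, the snippet below expresses a raw test score as a z-score against a regression-based expectation. The regression coefficients and residual SD are hypothetical stand-ins for values that would be estimated from pooled control data.

```python
def normative_z(raw_score, age, education, coeffs=(30.0, -0.10, 0.50), sd=3.0):
    """Compare a patient's test score with a demographics-adjusted
    normative prediction; returns a z-score."""
    intercept, b_age, b_edu = coeffs
    expected = intercept + b_age * age + b_edu * education  # regression prediction
    return (raw_score - expected) / sd

z = normative_z(raw_score=22, age=70, education=12)  # expected = 30 - 7 + 6 = 29
print(f"z = {z:.2f}")  # (22 - 29) / 3
```

    Profile-based methods would evaluate a vector of such z-scores jointly rather than one test at a time.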

  8. Unified Ecoregions of Alaska: 2001

    USGS Publications Warehouse

    Nowacki, Gregory J.; Spencer, Page; Fleming, Michael; Brock, Terry; Jorgenson, Torre

    2003-01-01

    Major ecosystems have been mapped and described for the State of Alaska and nearby areas. Ecoregion units are based on newly available datasets and field experience of ecologists, biologists, geologists and regional experts. Recently derived datasets for Alaska included climate parameters, vegetation, surficial geology and topography. Additional datasets incorporated in the mapping process were lithology, soils, permafrost, hydrography, fire regime and glaciation. Thirty-two units are mapped using a combination of the approaches of Bailey (hierarchical) and Omernik (integrated). The ecoregions are grouped into two higher levels using a 'tri-archy' based on climate parameters, vegetation response and disturbance processes. The ecoregions are described with text, photos and tables on the published map.

  9. Data Discovery of Big and Diverse Climate Change Datasets - Options, Practices and Challenges

    NASA Astrophysics Data System (ADS)

    Palanisamy, G.; Boden, T.; McCord, R. A.; Frame, M. T.

    2013-12-01

    Developing data search tools is a very common, but often confusing, task for most data-intensive scientific projects. These search interfaces need to be continually improved to handle the ever-increasing diversity and volume of data collections. There are many aspects which determine the type of search tool a project needs to provide to its user community. These include: number of datasets, amount and consistency of discovery metadata, ancillary information such as availability of quality information and provenance, and availability of similar datasets from other distributed sources. The Environmental Data Science and Systems (EDSS) group within the Environmental Science Division at the Oak Ridge National Laboratory has a long history of successfully managing diverse and big observational datasets for various scientific programs via various data centers such as DOE's Atmospheric Radiation Measurement Program (ARM), DOE's Carbon Dioxide Information and Analysis Center (CDIAC), USGS's Core Science Analytics and Synthesis (CSAS) metadata Clearinghouse and NASA's Distributed Active Archive Center (ORNL DAAC). This talk will showcase some of the recent developments for improving data discovery within these centers. The DOE ARM program recently developed a data discovery tool which allows users to search and discover over 4000 observational datasets. These datasets are key to research efforts related to global climate change. The ARM discovery tool features many new functions such as filtered and faceted search logic, multi-pass data selection, filtering data based on data quality, graphical views of data quality and availability, direct access to data quality reports, and data plots. The ARM Archive also provides discovery metadata to other broader metadata clearinghouses such as ESGF, IASOA, and GOS. In addition to the new interface, ARM is also currently working on providing DOI metadata records to publishers such as Thomson Reuters and Elsevier.
The ARM program also provides a standards-based online metadata editor (OME) for PIs to submit their data to the ARM Data Archive. The USGS CSAS metadata Clearinghouse aggregates metadata records from several USGS projects and other partner organizations. The Clearinghouse allows users to search and discover over 100,000 biological and ecological datasets from a single web portal. The Clearinghouse also enabled some new data discovery functions such as enhanced geo-spatial searches based on land and ocean classifications, metadata completeness rankings, data linkage via digital object identifiers (DOIs), and semantically enhanced keyword searches. The Clearinghouse is also currently working on a dashboard which allows data providers to view various statistics such as the number of their records accessed via the Clearinghouse, the most popular keywords, metadata quality reports, and the DOI creation service. The Clearinghouse also publishes metadata records to broader portals such as NSF DataONE and Data.gov. The author will also present how these capabilities are currently reused by recent and upcoming data centers such as DOE's NGEE-Arctic project. References: [1] Devarakonda, R., Palanisamy, G., Wilson, B. E., & Green, J. M. (2010). Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics, 3(1-2), 87-94. [2] Devarakonda, R., Shrestha, B., Palanisamy, G., Hook, L., Killeffer, T., Krassovski, M., ... & Frame, M. (2014, October). OME: Tool for generating and managing metadata to handle BigData. In BigData Conference (pp. 8-10).

  10. TH-A-9A-01: Active Optical Flow Model: Predicting Voxel-Level Dose Prediction in Spine SBRT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, J; Wu, Q.J.; Yin, F

    2014-06-15

    Purpose: To predict voxel-level dose distribution and enable effective evaluation of cord dose sparing in spine SBRT. Methods: We present an active optical flow model (AOFM) to statistically describe cord dose variations and train a predictive model to represent correlations between AOFM and PTV contours. Thirty clinically accepted spine SBRT plans are evenly divided into training and testing datasets. The development of the predictive model consists of 1) collecting a sequence of dose maps including PTV and OAR (spinal cord) as well as a set of associated PTV contours adjacent to OAR from the training dataset, 2) classifying data into five groups based on the PTV's location relative to OAR: two "Top"s, "Left", "Right", and "Bottom", 3) randomly selecting a dose map as the reference in each group and applying rigid registration and optical flow deformation to match all other maps to the reference, 4) building the AOFM by importing optical flow vectors and dose values into principal component analysis (PCA), 5) applying another PCA to features of PTV and OAR contours to generate an active shape model (ASM), and 6) computing a linear regression model of correlations between AOFM and ASM. When predicting the dose distribution of a new case in the testing dataset, the PTV is first assigned to a group based on its contour characteristics. Contour features are then transformed into the ASM's principal coordinates of the selected group. Finally, voxel-level dose distribution is determined by mapping from the ASM space to the AOFM space using the predictive model. Results: The DVHs predicted by the AOFM-based model and those in clinical plans are comparable in training and testing datasets. At 2% volume the dose difference between predicted and clinical plans is 4.2±4.4% and 3.3±3.5% in the training and testing datasets, respectively. Conclusion: The AOFM is effective in predicting voxel-level dose distribution for spine SBRT.
Partially supported by NIH/NCI under grant #R21CA161389 and a master research grant by Varian Medical System.

  11. Impact of automatization in temperature series in Spain and comparison with the POST-AWS dataset

    NASA Astrophysics Data System (ADS)

    Aguilar, Enric; López-Díaz, José Antonio; Prohom Duran, Marc; Gilabert, Alba; Luna Rico, Yolanda; Venema, Victor; Auchmann, Renate; Stepanek, Petr; Brandsma, Theo

    2016-04-01

    Climate data records are most of the time affected by inhomogeneities. In particular, inhomogeneities introducing network-wide biases are sometimes related to changes happening almost simultaneously in an entire network. Relative homogenization is difficult in these cases, especially at the daily scale. A good example of this is the substitution of manual observations (MAN) by automatic weather stations (AWS). Parallel measurements (i.e., records taken at the same time with the old (MAN) and new (AWS) sensors) can provide an idea of the bias introduced and help to evaluate the suitability of different correction approaches. We present here a quality-controlled dataset compiled under the DAAMEC Project, comprising 46 stations across Spain and over 85,000 parallel measurements (AWS-MAN) of daily maximum and minimum temperature. We study the differences between both sensors and compare them with the available metadata to account for internal inhomogeneities. The differences between both systems vary considerably across stations, with patterns more related to their particular settings than to climatic or geographical reasons. The typical median biases (AWS-MAN) by station (within the interquartile range) oscillate between -0.2°C and 0.4°C in daily maximum temperature and between -0.4°C and 0.2°C in daily minimum temperature. These and other results are compared with a larger network, the dataset of the Parallel Observations Scientific Team (POST), a working group of the International Surface Temperature Initiative (ISTI-POST), which comprises our stations as well as others from different countries in America, Asia and Europe.
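
    The per-station summaries reported in this record (median bias of AWS-MAN differences, with the interquartile range) can be computed with the standard library; the daily differences below are toy values, not data from the DAAMEC dataset.

```python
import statistics

# Toy daily Tmax differences (AWS - MAN, in °C) for one hypothetical station;
# the real dataset holds ~85,000 such parallel pairs across 46 stations.
diffs = [0.1, 0.3, -0.2, 0.4, 0.0, 0.2, 0.1, -0.1, 0.3, 0.2]

median_bias = statistics.median(diffs)
q1, _, q3 = statistics.quantiles(diffs, n=4)   # quartiles (exclusive method)
print(f"median bias = {median_bias:+.2f} °C, IQR = [{q1:.3f}, {q3:.3f}]")
```

    Repeating this per station, and separately for maximum and minimum temperature, yields the distributions of station-level biases summarized above.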

  12. Dataset of Fourier transform-infrared coupled with chemometric analysis used to distinguish accessions of Garcinia mangostana L. in Peninsular Malaysia.

    PubMed

    Samsir, Sri A'jilah; Bunawan, Hamidun; Yen, Choong Chee; Noor, Normah Mohd

    2016-09-01

    In this dataset, we distinguish 15 accessions of Garcinia mangostana from Peninsular Malaysia using Fourier transform-infrared spectroscopy coupled with chemometric analysis. We found that the position and intensity of characteristic peaks at 3600-3100 cm(-1) in the IR spectra allowed discrimination of G. mangostana from different locations. Further principal component analysis (PCA) of all the accessions suggests that two main clusters were formed: samples from Johor, Melaka, and Negeri Sembilan (South) clustered together in one group, while samples from Perak, Kedah, Penang, Selangor, Kelantan, and Terengganu (North and East Coast) formed another group.

  13. Reconstruction of Complex Directional Networks with Group Lasso Nonlinear Conditional Granger Causality.

    PubMed

    Yang, Guanxue; Wang, Lin; Wang, Xiaofan

    2017-06-07

    Reconstruction of networks underlying complex systems is one of the most crucial problems in many areas of engineering and science. In this paper, rather than identifying parameters of complex systems governed by pre-defined models or taking some polynomial and rational functions as prior information for subsequent model selection, we put forward a general framework for nonlinear causal network reconstruction from time series with limited observations. Obtaining multi-source datasets through a data-fusion strategy, we propose a novel method to handle the nonlinearity and directionality of complex networked systems, namely group lasso nonlinear conditional Granger causality. Specifically, our method can exploit different sets of radial basis functions to approximate the nonlinear interactions between each pair of nodes and integrate sparsity into grouped variable selection. The performance of our approach is first assessed with two types of simulated datasets from nonlinear vector autoregressive models and nonlinear dynamic models, and then verified on the benchmark datasets from DREAM3 Challenge 4. Effects of data size and noise intensity are also discussed. All of the results demonstrate that the proposed method performs better in terms of a higher area under the precision-recall curve.

  14. Using Third Party Data to Update a Reference Dataset in a Quality Evaluation Service

    NASA Astrophysics Data System (ADS)

    Xavier, E. M. A.; Ariza-López, F. J.; Ureña-Cámara, M. A.

    2016-06-01

    Nowadays it is easy to find many data sources for various regions around the globe. In this 'data overload' scenario there is little, if any, information available about the quality of these data sources. In order to provide this data quality information easily, we presented the architecture of a web service for the automation of quality control of spatial datasets running over a Web Processing Service (WPS). For quality procedures that require an external reference dataset, such as positional accuracy or completeness, the architecture permits using a reference dataset. However, this reference dataset is not ageless, since it suffers the natural time degradation inherent to geospatial features. In order to mitigate this problem we propose the Time Degradation & Updating Module, which applies assessed data as a tool to keep the reference database updated. The main idea is to utilize datasets sent to the quality evaluation service as a source of 'candidate data elements' for updating the reference database. After the evaluation, if some elements of a candidate dataset reach a determined quality level, they can be used as input data to improve the current reference database. In this work we present the first design of the Time Degradation & Updating Module. We believe that the outcomes can be applied in the search for a fully automatic online quality evaluation platform.
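
    The quality-gated promotion of candidate elements into the reference database can be sketched as below. The element identifiers, geometries, and the scalar quality score are hypothetical simplifications of what a WPS evaluation would produce.

```python
def update_reference(reference, candidates, min_quality=0.9):
    """Promote candidate elements into the reference dataset when their
    assessed quality score reaches the required level."""
    for elem_id, (geom, quality) in candidates.items():
        if quality >= min_quality:
            reference[elem_id] = geom   # accept as new/updated reference element
    return reference

reference = {"road_1": "LINESTRING(0 0, 1 1)"}
candidates = {
    "road_1": ("LINESTRING(0 0, 1 1.01)", 0.95),  # passes: updates reference
    "road_2": ("LINESTRING(2 2, 3 3)", 0.70),     # fails the quality gate
}
updated = update_reference(reference, candidates)
print(sorted(updated))  # road_2 was rejected
```

    A real module would also version the replaced elements so the reference database's own degradation can be tracked over time.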

  15. Pedestrian detection in video surveillance using fully convolutional YOLO neural network

    NASA Astrophysics Data System (ADS)

    Molchanov, V. V.; Vishnyakov, B. V.; Vizilter, Y. V.; Vishnyakova, O. V.; Knyaz, V. A.

    2017-06-01

    More than 80% of video surveillance systems are used for monitoring people. Older human detection algorithms, based on background and foreground modelling, could not even deal with a group of people, to say nothing of a crowd. Recent robust and highly effective pedestrian detection algorithms are a new milestone for video surveillance systems. Based on modern approaches in deep learning, these algorithms produce very discriminative features that can be used for robust inference in real visual scenes. They deal with such tasks as distinguishing different persons in a group, overcoming substantial occlusion of human bodies by the foreground, and detecting various poses of people. In our work we use a new approach which enables combining detection and classification into a single task using convolutional neural networks. As a starting point we chose the YOLO CNN, whose authors propose a very efficient way of combining the above tasks by learning a single neural network. This approach showed results competitive with state-of-the-art models such as Fast R-CNN, significantly surpassing them in speed, which allows us to apply it in real-time video surveillance and other video monitoring systems. Despite all its advantages, it suffers from some known drawbacks related to the fully connected layers, which prevent applying the CNN to images with different resolutions. It also limits the ability to distinguish small, close human figures in groups, which is crucial for our tasks, since we work with rather low-quality images that often include dense small groups of people. In this work we gradually change the network architecture to overcome the problems mentioned above, train it on a complex pedestrian dataset, and finally obtain a CNN that detects small pedestrians in real scenes.

  16. Comparison of present global reanalysis datasets in the context of a statistical downscaling method for precipitation prediction

    NASA Astrophysics Data System (ADS)

    Horton, Pascal; Weingartner, Rolf; Brönnimann, Stefan

    2017-04-01

    The analogue method is a statistical downscaling method for precipitation prediction. It uses similarity in terms of synoptic-scale predictors with situations in the past in order to provide a probabilistic prediction for the day of interest. It has been used for decades in a context of weather or flood forecasting, and is more recently also applied to climate studies, whether for reconstruction of past weather conditions or future climate impact studies. In order to evaluate the relationship between synoptic-scale predictors and the local weather variable of interest, e.g. precipitation, reanalysis datasets are necessary. Nowadays, the number of available reanalysis datasets is increasing. These are generated by different atmospheric models with different assimilation techniques and offer various spatial and temporal resolutions. A major difference between these datasets is also the length of the archive they provide. While some datasets start at the beginning of the satellite era (1980) and assimilate these data, others aim at homogeneity over a longer period (e.g. the 20th century) and only assimilate conventional observations. The context of the application of analogue methods might drive the choice of an appropriate dataset, for example when the archive length is a leading criterion. However, in many studies, a reanalysis dataset is subjectively chosen, according to the user's preferences or the ease of access. The impact of this choice on the results of the downscaling procedure is rarely considered and no comprehensive comparison has been undertaken so far. In order to fill this gap and to advise on the choice of appropriate datasets, nine different global reanalysis datasets were compared in seven distinct versions of analogue methods, over 300 precipitation stations in Switzerland. Significant differences in terms of prediction performance were identified.
Although the impact of the reanalysis dataset on the skill score varies according to the chosen predictor, be it atmospheric circulation or thermodynamic variables, some hierarchy between the datasets is often preserved. This work can thus help choosing an appropriate dataset for the analogue method, or raise awareness of the consequences of using a certain dataset.
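
    The core of the analogue method, ranking past situations by similarity of synoptic-scale predictors and using the precipitation observed on the closest analogues as a probabilistic prediction, can be sketched as below. The predictor vectors (here a pressure and a humidity-like value) and precipitation amounts are toy numbers; real applications compare gridded reanalysis fields.

```python
import math

# Toy archive: (predictor vector, observed precipitation in mm) pairs
archive = [
    ((1010.0, 5.0), 0.0),
    ((1002.0, 12.0), 8.5),
    ((1000.0, 13.0), 12.0),
    ((1015.0, 3.0), 0.2),
    ((1003.0, 11.0), 6.0),
]

def analogue_forecast(target, archive, k=3):
    """Rank past situations by Euclidean distance in predictor space and
    return the precipitation values of the k closest analogues."""
    ranked = sorted(archive, key=lambda rec: math.dist(target, rec[0]))
    return [precip for _, precip in ranked[:k]]

analogues = analogue_forecast((1001.0, 12.5), archive)
print(analogues)  # empirical distribution for the target day
```

    Swapping the reanalysis dataset changes the predictor vectors, and hence which past days are selected, which is exactly the sensitivity the study quantifies.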

  17. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    NASA Astrophysics Data System (ADS)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

    In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information were linked (e.g. location, date, and instrument). Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.
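
    A minimal relational sketch of such a geodatabase, using Python's built-in sqlite3 in place of a full SQL Server/ArcGIS stack. The table names, columns, and sensor name are illustrative, not the schema from the study.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE flight (
    flight_id INTEGER PRIMARY KEY,
    flown_on  TEXT,
    sensor    TEXT
);
CREATE TABLE image (
    image_id   INTEGER PRIMARY KEY,
    flight_id  INTEGER REFERENCES flight(flight_id),
    level      TEXT,     -- e.g. 'DN', 'radiance', 'geocorrected reflectance'
    footprint  TEXT      -- placeholder for a geometry column
);
""")
con.execute("INSERT INTO flight VALUES (1, '2015-06-21', 'hyperspectral')")
con.execute("INSERT INTO image VALUES (10, 1, 'geocorrected reflectance', 'POLYGON')")

# Logical association: images link back to the acquisition (location, date, instrument)
row = con.execute("""
    SELECT f.flown_on, f.sensor, i.level
    FROM image i JOIN flight f ON f.flight_id = i.flight_id
""").fetchone()
print(row)
```

    The foreign key is what lets processing levels of the same acquisition be linked spatially and temporally, as the record describes.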

  18. Big Data Approaches for the Analysis of Large-Scale fMRI Data Using Apache Spark and GPU Processing: A Demonstration on Resting-State fMRI Data from the Human Connectome Project

    PubMed Central

    Boubela, Roland N.; Kalcher, Klaudius; Huf, Wolfgang; Našel, Christian; Moser, Ewald

    2016-01-01

    Technologies for scalable analysis of very large datasets have emerged in the domain of internet computing, but are still rarely used in neuroimaging despite the existence of data and research questions in need of efficient computation tools especially in fMRI. In this work, we present software tools for the application of Apache Spark and Graphics Processing Units (GPUs) to neuroimaging datasets, in particular providing distributed file input for 4D NIfTI fMRI datasets in Scala for use in an Apache Spark environment. Examples for using this Big Data platform in graph analysis of fMRI datasets are shown to illustrate how processing pipelines employing it can be developed. With more tools for the convenient integration of neuroimaging file formats and typical processing steps, big data technologies could find wider endorsement in the community, leading to a range of potentially useful applications especially in view of the current collaborative creation of a wealth of large data repositories including thousands of individual fMRI datasets. PMID:26778951

  19. MicroRNA array normalization: an evaluation using a randomized dataset as the benchmark.

    PubMed

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays.
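
    The benchmark comparison described in this record reduces to set arithmetic on marker calls: the true positive rate is measured against the randomized dataset's calls, and the false discovery rate against the non-randomized dataset's own calls. The microRNA names and calls below are hypothetical.

```python
# Markers called differentially expressed in each dataset (hypothetical)
benchmark_positives = {"miR-21", "miR-155", "miR-10b", "miR-31"}              # randomized
test_positives      = {"miR-21", "miR-155", "miR-200c", "miR-31", "miR-7"}    # normalized

true_pos  = test_positives & benchmark_positives
false_pos = test_positives - benchmark_positives

tpr = len(true_pos) / len(benchmark_positives)   # fraction of benchmark markers recovered
fdr = len(false_pos) / len(test_positives)       # fraction of calls that are spurious
print(f"TPR = {tpr:.2f}, FDR = {fdr:.2f}")
```

    The paper's finding is that normalization raises the TPR while the FDR can remain high (32%-50%) unless batch effects are adjusted first.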

  20. MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark

    PubMed Central

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays. PMID:24905456

  1. Analysis of energy-based algorithms for RNA secondary structure prediction

    PubMed Central

    2012-01-01

    Background RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. Results We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms. 
Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). Conclusions Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large data sets. PMID:22296803
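
    The bootstrap percentile method used above to judge whether average F-measure is a reliable population estimate can be sketched as follows. The per-RNA F-measures are simulated, not taken from the paper's datasets.

```python
import random
import statistics

random.seed(0)
# Hypothetical per-RNA F-measures; the paper's largest sets have >2000 RNAs
f_measures = [random.betavariate(7, 3) for _ in range(2000)]  # mean near 0.7

def bootstrap_percentile_ci(data, n_boot=1000, alpha=0.05):
    """Percentile-method bootstrap CI for the mean accuracy."""
    means = []
    for _ in range(n_boot):
        sample = random.choices(data, k=len(data))   # resample with replacement
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_percentile_ci(f_measures)
print(f"95% CI for mean F-measure: [{lo:.3f}, {hi:.3f}]")
```

    With 2000 RNAs the interval is narrow (on the order of ±1%), matching the paper's "within a 2% range" claim; repeating this on a class of only 89 structures yields a much wider, less conclusive interval.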

  2. Analysis of energy-based algorithms for RNA secondary structure prediction.

    PubMed

    Hajiaghayi, Monir; Condon, Anne; Hoos, Holger H

    2012-02-01

    RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms. 
Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large data sets.

  3. A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis

    PubMed Central

    Lee, Kyubum; Kim, Byounggun; Jeon, Minji; Kim, Jihye; Tan, Aik Choon

    2018-01-01

    Background With the development of artificial intelligence (AI) technology centered on deep learning, computers have evolved to a point where they can read a given text and answer a question based on the context of the text. This specific task is known as machine comprehension. Existing machine comprehension tasks mostly use datasets of general texts, such as news articles or elementary school-level storybooks. However, no attempt has been made to determine whether an up-to-date deep learning-based machine comprehension model can also process scientific literature containing expert-level knowledge, especially in the biomedical domain. Objective This study aims to investigate whether a machine comprehension model can process biomedical articles as well as general texts. Since there is no dataset for the biomedical literature comprehension task, our work includes generating a large-scale question-answering dataset using PubMed and manually evaluating the generated dataset. Methods We present an attention-based deep neural model tailored to the biomedical domain. To further enhance the performance of our model, we used a pretrained word vector and biomedical entity type embedding. We also developed an ensemble method of combining the results of several independent models to reduce the variance of the answers from the models. Results The experimental results showed that our proposed deep neural network model outperformed the baseline model by more than 7% on the new dataset. We also evaluated human performance on the new dataset. The human evaluation result showed that our deep neural model outperformed humans in comprehension by 22% on average. Conclusions In this work, we introduced a new task of machine comprehension in the biomedical domain using a deep neural model.
Since there was no large-scale dataset for training deep neural models in the biomedical domain, we created the new cloze-style datasets Biomedical Knowledge Comprehension Title (BMKC_T) and Biomedical Knowledge Comprehension Last Sentence (BMKC_LS) (together referred to as BioMedical Knowledge Comprehension) using the PubMed corpus. The experimental results showed that the performance of our model is much higher than that of humans. We observed that our model performed consistently better regardless of the degree of difficulty of a text, whereas humans have difficulty when performing biomedical literature comprehension tasks that require expert level knowledge. PMID:29305341
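
    A minimal sketch of how a cloze-style example might be constructed from a PubMed record; the abstract does not specify the actual BMKC generation rules, so the entity-masking scheme and placeholder token below are assumptions for illustration:

```python
import re

def make_cloze(context: str, answer_sentence: str, entity: str,
               placeholder: str = "@placeholder"):
    """Build one cloze-style QA example by masking a named entity
    in a held-out sentence (e.g. the title or last sentence)."""
    question = re.sub(re.escape(entity), placeholder, answer_sentence)
    return {"context": context, "question": question, "answer": entity}

ex = make_cloze(
    "Aspirin irreversibly inhibits cyclooxygenase, reducing prostaglandin synthesis.",
    "Aspirin reduces platelet aggregation by inhibiting cyclooxygenase.",
    "cyclooxygenase",
)
```

    In a BMKC_T-style setup the masked sentence would be the article title; in BMKC_LS, the last sentence of the abstract.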

  4. PCA-based approach for subtracting thermal background emission in high-contrast imaging data

    NASA Astrophysics Data System (ADS)

    Hunziker, S.; Quanz, S. P.; Amara, A.; Meyer, M. R.

    2018-03-01

    Aims: Ground-based observations at thermal infrared wavelengths suffer from large background radiation due to the sky, telescope and warm surfaces in the instrument. This significantly limits the sensitivity of ground-based observations at wavelengths longer than 3 μm. The main purpose of this work is to analyse this background emission in infrared high-contrast imaging data as illustrative of the problem, show how it can be modelled and subtracted and demonstrate that it can improve the detection of faint sources, such as exoplanets. Methods: We used principal component analysis (PCA) to model and subtract the thermal background emission in three archival high-contrast angular differential imaging datasets in the M' and L' filters. We used an M' dataset of β Pic to describe in detail how the algorithm works and explain how it can be applied. The results of the background subtraction are compared to the results from a conventional mean background subtraction scheme applied to the same dataset. Finally, both methods for background subtraction are compared by performing complete data reductions. We analysed the results from the M' dataset of HD 100546 only qualitatively. For the M' band dataset of β Pic and the L' band dataset of HD 169142, which was obtained with an angular groove phase mask vortex vector coronagraph, we also calculated and analysed the achieved signal-to-noise ratio (S/N). Results: We show that applying PCA is an effective way to remove spatially and temporally varying thermal background emission down to close to the background limit. The procedure also proves to be very successful at reconstructing the background that is hidden behind the point spread function. In the complete data reductions, we find at least qualitative improvements for HD 100546 and HD 169142; however, we fail to find a significant increase in S/N of β Pic b.
We discuss these findings and argue that in particular datasets with strongly varying observing conditions or infrequently sampled sky background will benefit from the new approach.
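
    A minimal sketch of the PCA background-modelling idea, assuming a stack of background-dominated frames is available; this is a simplified illustration, not the authors' pipeline:

```python
import numpy as np

def pca_background_subtract(science, background_stack, n_components=5):
    """Model the thermal background of a science frame as a linear
    combination of the first n principal components of a stack of
    background frames, then subtract the fitted model.

    science: (ny, nx) frame; background_stack: (nframes, ny, nx).
    """
    nf, ny, nx = background_stack.shape
    X = background_stack.reshape(nf, ny * nx)
    mean = X.mean(axis=0)
    # Principal components via SVD of the mean-subtracted stack
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    comps = Vt[:n_components]                 # (k, npix), orthonormal rows
    s = science.ravel() - mean
    model = mean + comps.T @ (comps @ s)      # least-squares background fit
    return science - model.reshape(ny, nx)
```

    In practice the region around the star would be masked before fitting so the point spread function does not contaminate the background model.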

  5. A Common Methodology: Using Cluster Analysis to Identify Organizational Culture across Two Workforce Datasets

    ERIC Educational Resources Information Center

    Munn, Sunny L.

    2016-01-01

    Organizational structures are comprised of an organizational culture created by the beliefs, values, traditions, policies and processes carried out by the organization. The work-life system in which individuals use work-life initiatives to achieve a work-life balance can be influenced by the type of organizational culture within one's workplace,…

  6. Research support of the WETNET Program

    NASA Technical Reports Server (NTRS)

    Estes, John E.; Mcgwire, Kenneth C.; Scepan, Joseph; Henderson, SY; Lawless, Michael

    1995-01-01

    This study examines various aspects of the Microwave Vegetation Index (MVI). MVI is a derived signal created by differencing the spectral response of the 37 GHz horizontally and vertically polarized passive microwave signals. The microwave signal employed to derive this index is thought to be primarily influenced by vegetation structure, vegetation growth, standing water, and precipitation. The state of California is the study site for this research. Imagery from the Special Sensor Microwave/Imager (SSM/I) is used for the creation of MVI datasets analyzed in this research. The object of this research is to determine whether MVI corresponds with some quantifiable vegetation parameter (such as vegetation density) or whether the index is more affected by known biogeophysical parameters such as antecedent precipitation. A secondary question associated with the above is whether the vegetation attributes that MVI is employed to determine can be more easily and accurately evaluated by other remote sensing means. An important associated question to be addressed in the study is the effect of different multi-temporal compositing techniques on the derived MVI dataset. This work advances our understanding of the fundamental nature of MVI by studying vegetation as a mixture of structural types, such as forest and grassland. The study further advances our understanding by creating multitemporal precipitation datasets to compare the effects of precipitation upon MVI. This work will help to lay the groundwork for the use of passive microwave spectral information either as an adjunct to visible and near infrared imagery in areas where that is feasible or for the use of passive microwave alone in areas of moderate cloud coverage. In this research, an MVI dataset, spanning the period February 15, 1989 through April 25, 1990, has been created using National Aeronautics and Space Administration (NASA) supplied brightness temperature data.
Information from the DMSP satellite 37 GHz wavelength SSM/I sensor in both horizontal and vertical polarization has been processed using the MVI algorithm. In conjunction with the MVI algorithm a multitemporal compositing technique was used to create datasets that correspond to 14 day periods. In this technical report, Section Two contains background information on the State of California and the three MVI study sites. Section Three describes the methods used to create the MVI and independent variables datasets. Section Four presents the results of the experiment. Section Five summarizes and concludes the work.
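
    A toy sketch of the polarization-difference index and the 14-day compositing step; the sign convention and the use of maximum-value compositing are assumptions for illustration, not taken from the report:

```python
import numpy as np

def microwave_vegetation_index(tb_37v, tb_37h):
    """MVI as the difference of vertically and horizontally polarized
    37 GHz brightness temperatures (sign convention assumed here)."""
    return np.asarray(tb_37v) - np.asarray(tb_37h)

def composite_14day(daily_mvi):
    """Collapse a (days, ny, nx) daily series into 14-day composites,
    using maximum-value compositing as an assumed example."""
    d = np.asarray(daily_mvi)
    n = d.shape[0] // 14
    return d[: n * 14].reshape(n, 14, *d.shape[1:]).max(axis=1)
```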

  7. Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure.

    PubMed

    Kilborn, Joshua P; Jones, David L; Peebles, Ernst B; Naar, David F

    2017-04-01

    Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
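
    The DISPROF hypothesis test itself is not reproduced here, but the UPGMA step it is paired with can be sketched with SciPy's "average" linkage on Bray-Curtis dissimilarities; the site-by-species abundance matrix below is synthetic:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Two simulated groups of sites with a strong abundance difference
rng = np.random.default_rng(0)
group_a = rng.poisson(5, size=(10, 8))
group_b = rng.poisson(5, size=(10, 8)) + np.array([20] * 4 + [0] * 4)
sites = np.vstack([group_a, group_b])

# Bray-Curtis dissimilarities, then UPGMA (SciPy's 'average' linkage)
d = pdist(sites, metric="braycurtis")
tree = linkage(d, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
```

    In the DISPROF workflow, the decision of where to cut the dendrogram would be made by the resemblance-profile test rather than by fixing the number of clusters in advance as done here.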

  8. Learning to Work with Databases in Astronomy: Quantitative Analysis of Science Educators' and Students' Pre-/Post-Tests

    NASA Astrophysics Data System (ADS)

    Schwortz, Andria C.; Burrows, Andrea C.; Myers, Adam D.

    2015-01-01

    Astronomy is increasingly moving towards working with large databases, from the state-of-the-art Sloan Digital Sky Survey Data Release 10, to the historical Digital Access to a Sky Century at Harvard. Fields outside astronomy likewise work with large datasets, be it in the form of warehouse inventories, health trends, or the stock market. However, very few fields explicitly teach students the necessary skills to analyze such data. The authors studied a matched set of 37 participants working with 200-entry databases in astronomy using Google Spreadsheets, with limited information about a random set of quasars drawn from SDSS DR5. Here the authors present the quantitative results from an eight-question pre-/post-test, with questions designed to span Bloom's taxonomy, on both the skills of using spreadsheets and the content of quasars. Participants included both Astro 101 summer students and professionals, including in-service K-12 teachers and science communicators. All groups showed statistically significant gains (as per Hake, 1998), with the greatest difference between women's gains of 0.196 and men's of 0.480.
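
    The gains quoted above are normalized gains in the sense of Hake (1998), g = (post − pre) / (100 − pre) for scores expressed as percentages; a one-line sketch with hypothetical scores:

```python
def hake_gain(pre_pct: float, post_pct: float) -> float:
    """Hake (1998) normalized gain: fraction of the possible
    improvement actually achieved, g = (post - pre) / (100 - pre)."""
    return (post_pct - pre_pct) / (100.0 - pre_pct)

# Hypothetical group averaging 40% pre and 70% post: g = 30/60 = 0.5
g = hake_gain(40.0, 70.0)
```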

  9. The DSM diagnostic criteria for transvestic fetishism.

    PubMed

    Blanchard, Ray

    2010-04-01

    This paper contains the author's report on transvestism, submitted on July 31, 2008, to the work group charged with revising the diagnoses concerning sexual and gender identity disorders for the fifth edition of the American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders (DSM). In the first part of this report, the author reviews differences among previous editions of the DSM as a convenient way to illustrate problems with the nomenclature and uncertainties in the descriptive pathology of transvestism. He concludes this part by proposing a revised set of diagnostic criteria, including a new set of specifiers. In the second part, he presents a secondary analysis of a pre-existing dataset in order to investigate the utility of the proposed specifiers.

  10. Simultaneous Co-Clustering and Classification in Customers Insight

    NASA Astrophysics Data System (ADS)

    Anggistia, M.; Saefuddin, A.; Sartono, B.

    2017-04-01

    Building a predictive model on a heterogeneous dataset may cause many problems, such as imprecise parameter estimates and reduced prediction accuracy. Such problems can be solved by segmenting the data into relatively homogeneous groups and then building a predictive model for each cluster. This strategy usually yields models that are simpler, more interpretable, and more actionable, without any loss of accuracy or reliability. This work concerns a marketing dataset that records customer behaviour across products, with several variables describing customer and product attributes. The basic idea of this approach is to combine co-clustering and classification simultaneously. The objective of this research is to analyse customer characteristics across products so that the marketing strategy can be implemented precisely.

  11. Identification of families among highly inclined asteroids

    NASA Astrophysics Data System (ADS)

    Gil-Hutton, R.

    2006-07-01

    A dataset of 3652 high-inclination numbered asteroids was analyzed to search for dynamical families. A fully automated multivariate data analysis technique was applied to identify the groupings. Thirteen dynamical families and twenty-two clumps were found. When taxonomic information is available, the families show cosmochemical consistency and support an interpretation based on a common origin from a single parent body. Four families and three clumps found in this work show a size distribution compatible with formation by a cratering event on the largest member of the family, and three families have members of B- or related taxonomic types, representing 14% of the B-types classified by Bus and Binzel [2002. Icarus 158, 146-177].

  12. cellVIEW: a Tool for Illustrative and Multi-Scale Rendering of Large Biomolecular Datasets

    PubMed Central

    Le Muzic, Mathieu; Autin, Ludovic; Parulek, Julius; Viola, Ivan

    2017-01-01

    In this article we introduce cellVIEW, a new system to interactively visualize large biomolecular datasets on the atomic level. Our tool is unique and has been specifically designed to match the ambitions of our domain experts to model and interactively visualize structures comprised of several billion atoms. The cellVIEW system integrates acceleration techniques to allow for real-time graphics performance at a 60 Hz display rate on datasets representing large viruses and bacterial organisms. Inspired by the work of scientific illustrators, we propose a level-of-detail scheme whose purpose is twofold: accelerating the rendering and reducing visual clutter. The main part of our datasets is made out of macromolecules, but it also comprises nucleic acid strands which are stored as sets of control points. For that specific case, we extend our rendering method to support the dynamic generation of DNA strands directly on the GPU. It is noteworthy that our tool has been directly implemented inside a game engine. We chose to rely on a third-party engine to reduce software development workload and to make bleeding-edge graphics techniques more accessible to the end-users. To our knowledge cellVIEW is the only suitable solution for interactive visualization of large biomolecular landscapes on the atomic level and is freely available to use and extend. PMID:29291131

  13. Simultaneous fingerprint and high-wavenumber fiber-optic Raman spectroscopy improves in vivo diagnosis of esophageal squamous cell carcinoma at endoscopy

    NASA Astrophysics Data System (ADS)

    Wang, Jianfeng; Lin, Kan; Zheng, Wei; Yu Ho, Khek; Teh, Ming; Guan Yeoh, Khay; Huang, Zhiwei

    2015-08-01

    This work aims to evaluate the clinical value of a fiber-optic Raman spectroscopy technique developed for in vivo diagnosis of esophageal squamous cell carcinoma (ESCC) during clinical endoscopy. We have developed a rapid fiber-optic Raman endoscopic system capable of simultaneously acquiring both fingerprint (FP, 800-1800 cm-1) and high-wavenumber (HW, 2800-3600 cm-1) Raman spectra from esophageal tissue in vivo. A total of 1172 in vivo FP/HW Raman spectra were acquired from 48 esophageal patients undergoing endoscopic examination. The total Raman dataset was split into two parts: 80% for training and 20% for testing. Partial least squares-discriminant analysis (PLS-DA) and leave-one-patient-out cross-validation (LOPCV) were implemented on the training dataset to develop diagnostic algorithms for tissue classification. PLS-DA-LOPCV shows that simultaneous FP/HW Raman spectroscopy on the training dataset provides a diagnostic sensitivity of 97.0% and specificity of 97.4% for ESCC classification. Further, the diagnostic algorithm applied to the independent testing dataset based on the simultaneous FP/HW Raman technique gives a predictive diagnostic sensitivity of 92.7% and specificity of 93.6% for ESCC identification, which is superior to either the FP or HW Raman technique alone. This work demonstrates that the simultaneous FP/HW fiber-optic Raman spectroscopy technique improves real-time in vivo diagnosis of esophageal neoplasia at endoscopy.

  14. A hierarchical network-based algorithm for multi-scale watershed delineation

    NASA Astrophysics Data System (ADS)

    Castronova, Anthony M.; Goodall, Jonathan L.

    2014-11-01

    Watershed delineation is a process for defining a land area that contributes surface water flow to a single outlet point. It is commonly used in water resources analysis to define the domain in which hydrologic process calculations are applied. There has been a growing effort over the past decade to improve surface elevation measurements in the U.S., which has had a significant impact on the accuracy of hydrologic calculations. Traditional watershed processing on these elevation rasters, however, becomes more burdensome as data resolution increases. As a result, processing of these datasets can be troublesome on standard desktop computers. This challenge has resulted in numerous works that aim to provide high performance computing solutions to large data, high resolution data, or both. This work proposes an efficient watershed delineation algorithm for use in desktop computing environments that leverages existing data, the U.S. Geological Survey (USGS) National Hydrography Dataset Plus (NHD+), and open source software tools to construct watershed boundaries. This approach makes use of U.S. national-level hydrography data that has been precomputed using raster processing algorithms coupled with quality control routines. Our approach uses carefully arranged data and mathematical graph theory to traverse river networks and identify catchment boundaries. We demonstrate this new watershed delineation technique, compare its accuracy with traditional algorithms that derive watersheds solely from digital elevation models, and then extend our approach to address subwatershed delineation. Our findings suggest that the open-source hierarchical network-based delineation procedure presented in the work is a promising approach to watershed delineation that can be used to summarize publicly available datasets for hydrologic model input pre-processing.
Through our analysis, we explore the benefits of reusing the NHD+ datasets for watershed delineation, and find that our technique offers greater flexibility and extensibility than traditional raster algorithms.
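
    The graph-traversal idea can be sketched with networkx on a toy flow network; the reach IDs below are hypothetical stand-ins for NHD+ flowlines, and this is an illustration of upstream traversal, not the paper's implementation:

```python
import networkx as nx

# Toy flow network: edges point downstream (hypothetical reach IDs)
G = nx.DiGraph()
G.add_edges_from([
    ("r1", "r3"), ("r2", "r3"),   # two headwater reaches join at r3
    ("r3", "r5"), ("r4", "r5"),   # r5 is the outlet reach
])

def upstream_catchments(graph, outlet):
    """All reaches draining to the outlet: the outlet itself plus
    every node with a directed path to it."""
    return nx.ancestors(graph, outlet) | {outlet}

watershed = upstream_catchments(G, "r5")
```

    The watershed boundary is then the union of the precomputed catchment polygons for the returned reach set, which is what lets the method avoid reprocessing the elevation raster.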

  15. How Accurately Can Your Wrist Device Recognize Daily Activities and Detect Falls?

    PubMed Central

    Gjoreski, Martin; Gjoreski, Hristijan; Luštrek, Mitja; Gams, Matjaž

    2016-01-01

    Although wearable accelerometers can successfully recognize activities and detect falls, their adoption in real life is low because users do not want to wear additional devices. A possible solution is an accelerometer inside a wrist device/smartwatch. However, wrist placement might perform poorly in terms of accuracy due to frequent random movements of the hand. In this paper we perform a thorough, large-scale evaluation of methods for activity recognition and fall detection on four datasets. On the first two we showed that the left wrist performs better compared to the dominant right one, and also better compared to the elbow and the chest, but worse compared to the ankle, knee and belt. On the third (Opportunity) dataset, our method outperformed the related work, indicating that our feature-preprocessing creates better input data. And finally, on a real-life unlabeled dataset the recognized activities captured the subject’s daily rhythm and activities. Our fall-detection method detected all of the fast falls and minimized the false positives, achieving 85% accuracy on the first dataset. Because the other datasets did not contain fall events, only false positives were evaluated, resulting in 9 for the second, 1 for the third, and 15 for the real-life dataset (57 days of data). PMID:27258282

  16. Multivendor Spectral-Domain Optical Coherence Tomography Dataset, Observer Annotation Performance Evaluation, and Standardized Evaluation Framework for Intraretinal Cystoid Fluid Segmentation.

    PubMed

    Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S; Langs, Georg; Simader, Christian; Waldstein, Sebastian M; Schmidt-Erfurth, Ursula M

    2016-01-01

    Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited by the scarcity of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies, making them difficult to compare. Thus we present and evaluate a multiple expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used, which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprising 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows a good correlation, thus making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge.

  17. Multivendor Spectral-Domain Optical Coherence Tomography Dataset, Observer Annotation Performance Evaluation, and Standardized Evaluation Framework for Intraretinal Cystoid Fluid Segmentation

    PubMed Central

    Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S.; Langs, Georg; Simader, Christian

    2016-01-01

    Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited by the scarcity of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies, making them difficult to compare. Thus we present and evaluate a multiple expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used, which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprising 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows a good correlation, thus making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge. PMID:27579177

  18. Association of Exercise and Metabolic Equivalent of Task (MET) Score with Survival Outcomes after Out-of-Hospital Cardiac Arrest of Young and Middle Age.

    PubMed

    Ro, Young Sun; Shin, Sang Do; Song, Kyoung Jun; Hong, Ki Jeong; Ahn, Ki Ok

    2017-06-01

    Regular physical activity is recommended to prevent cardiovascular disease including out-of-hospital cardiac arrest (OHCA). However, it is uncertain whether the intensity during physical activity is associated with better outcomes. We studied the effect of exercise at the time of arrest and the association between metabolic equivalent of task (MET) score and survival of OHCA patients of young and middle age. All OHCAs of presumed cardiac etiology who were 18-65 years of age and were witnessed by a layperson between 2013 and 2015 were analyzed. The main exposure of interest was physical activity at the time of, or immediately prior to, the arrest and the MET score groups (0-3 for light, 3-6 for moderate, and ≥6 for vigorous). The endpoint was survival with good neurological recovery. For the sensitivity analysis, we created a matched dataset by matching for age, gender, residential area, and comorbidities (diabetes, hypertension, heart disease, and stroke). Multivariable logistic regression analysis was performed, adjusting for patient and arrest-environmental factors. A total of 6,273 patients in the original dataset were included, and 762 (12.1%) patients had a cardiac arrest during exercise. The exercise-related OHCAs were more likely to have a good neurological recovery rate (25.9%) than the non-exercise-related OHCA (12.9%) in the original dataset (AOR (95% CI): 1.36 (1.08-1.70)) but not in the matched dataset (1.37 (0.92-1.97)). Using MET score groups, the moderate-intensity group compared with the non-exercise group was associated with better neurological outcome (1.70 (1.11-2.63)), but neither light-intensity (0.77 (0.40-1.49)) nor vigorous-intensity (1.44 (0.91-2.28)) groups were associated with better outcomes. 
Patients who had an OHCA during exercise were more likely to have neurologically intact survival compared to patients who had an OHCA during periods of non-exercise; however, only the moderate-intensity group was associated with a better neurological outcome. Copyright © 2017 Elsevier B.V. All rights reserved.

  19. Sparse modeling of spatial environmental variables associated with asthma

    PubMed Central

    Chang, Timothy S.; Gangnon, Ronald E.; Page, C. David; Buckingham, William R.; Tandias, Aman; Cowan, Kelly J.; Tomasallo, Carrie D.; Arndt, Brian G.; Hanrahan, Lawrence P.; Guilbert, Theresa W.

    2014-01-01

    Geographically distributed environmental factors influence the burden of diseases such as asthma. Our objective was to identify sparse environmental variables associated with asthma diagnosis gathered from a large electronic health record (EHR) dataset while controlling for spatial variation. An EHR dataset from the University of Wisconsin’s Family Medicine, Internal Medicine and Pediatrics Departments was obtained for 199,220 patients aged 5–50 years over a three-year period. Each patient’s home address was geocoded to one of 3,456 geographic census block groups. Over one thousand block group variables were obtained from a commercial database. We developed a Sparse Spatial Environmental Analysis (SASEA). Using this method, the environmental variables were first dimensionally reduced with sparse principal component analysis. Logistic thin plate regression spline modeling was then used to identify block group variables associated with asthma from sparse principal components. The addresses of patients from the EHR dataset were distributed throughout the majority of Wisconsin’s geography. Logistic thin plate regression spline modeling captured spatial variation of asthma. Four sparse principal components identified via model selection consisted of food at home, dog ownership, household size, and disposable income variables. In rural areas, dog ownership and renter occupied housing units from significant sparse principal components were associated with asthma. Our main contribution is the incorporation of sparsity in spatial modeling. SASEA sequentially added sparse principal components to Logistic thin plate regression spline modeling. This method allowed association of geographically distributed environmental factors with asthma using EHR and environmental datasets. SASEA can be applied to other diseases with environmental risk factors. PMID:25533437

  20. Sparse modeling of spatial environmental variables associated with asthma.

    PubMed

    Chang, Timothy S; Gangnon, Ronald E; David Page, C; Buckingham, William R; Tandias, Aman; Cowan, Kelly J; Tomasallo, Carrie D; Arndt, Brian G; Hanrahan, Lawrence P; Guilbert, Theresa W

    2015-02-01

    Geographically distributed environmental factors influence the burden of diseases such as asthma. Our objective was to identify sparse environmental variables associated with asthma diagnosis gathered from a large electronic health record (EHR) dataset while controlling for spatial variation. An EHR dataset from the University of Wisconsin's Family Medicine, Internal Medicine and Pediatrics Departments was obtained for 199,220 patients aged 5-50 years over a three-year period. Each patient's home address was geocoded to one of 3456 geographic census block groups. Over one thousand block group variables were obtained from a commercial database. We developed a Sparse Spatial Environmental Analysis (SASEA). Using this method, the environmental variables were first dimensionally reduced with sparse principal component analysis. Logistic thin plate regression spline modeling was then used to identify block group variables associated with asthma from sparse principal components. The addresses of patients from the EHR dataset were distributed throughout the majority of Wisconsin's geography. Logistic thin plate regression spline modeling captured spatial variation of asthma. Four sparse principal components identified via model selection consisted of food at home, dog ownership, household size, and disposable income variables. In rural areas, dog ownership and renter occupied housing units from significant sparse principal components were associated with asthma. Our main contribution is the incorporation of sparsity in spatial modeling. SASEA sequentially added sparse principal components to logistic thin plate regression spline modeling. This method allowed association of geographically distributed environmental factors with asthma using EHR and environmental datasets. SASEA can be applied to other diseases with environmental risk factors. Copyright © 2014 Elsevier Inc. All rights reserved.
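
    A schematic of the two-stage idea on synthetic data: sparse PCA for interpretable dimension reduction, then a logistic model on the components. The spatial thin-plate-spline term of SASEA is omitted here, so this is a plain logistic regression on sparse components, not the authors' full method, and the data are simulated:

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for block-group environmental variables
rng = np.random.default_rng(0)
n_block_groups, n_vars = 300, 40
X = rng.normal(size=(n_block_groups, n_vars))
X[:, :3] += 2.0 * rng.normal(size=(n_block_groups, 1))  # correlated block

# Stage 1: sparse PCA, so each component loads on few variables
spca = SparsePCA(n_components=4, alpha=1.0, random_state=0)
Z = spca.fit_transform(X)

# Stage 2: logistic model on the components (simulated outcome driven
# by the highest-variance sparse component plus noise)
sig = Z[:, np.argmax(Z.std(axis=0))]
y = (sig + rng.normal(scale=0.2 * sig.std(), size=n_block_groups) > 0).astype(int)
model = LogisticRegression().fit(Z, y)
```

    The sparsity of `spca.components_` is what makes each component nameable (e.g. "dog ownership" variables) in a way dense PCA loadings are not.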

  1. Progressive decline of cognition during the conversion from prodrome to psychosis with a characteristic pattern of the theory of mind compensated by neurocognition.

    PubMed

    Zhang, TianHong; Cui, HuiRu; Wei, YanYan; Tang, YingYing; Xu, LiHua; Tang, XiaoChen; Zhu, YiKang; Jiang, LiJuan; Zhang, Bin; Qian, ZhenYing; Chow, Annabelle; Liu, XiaoHua; Li, ChunBo; Xiao, ZePing; Wang, JiJun

    2018-05-01

    The association between neurocognition and theory of mind (ToM) abilities during the progression of psychosis is unclear. This study included 83 individuals with attenuated psychosis syndrome (APS), of whom 26 converted to psychosis (converters) after a follow-up period of 18 months. Comprehensive cognitive tests (including the MATRICS Consensus Cognitive Battery, Faux-Pas Task, and Reading-Mind-in-Eyes Tasks) were administered at baseline. A structural equation modeling (SEM) analysis was conducted to estimate the effects of neurocognition on ToM functioning in both the APS and healthy control (HC) datasets. At baseline, the converters and non-converters groups differed significantly on several domains of cognitive performance. The SEM analysis demonstrated that the path from neurocognition to ToM was statistically significant in the APS dataset (p < 0.001). However, in the HC dataset, the same analysis was not significant (p = 0.117). Positive correlations between neurocognition and ToM were observed, and were strongest in the converters group, both relative to the non-converters group (p = 0.064) and to the HC group (p = 0.002). The correlation between ToM abilities and neurocognition may increase as the condition progresses, especially for individuals who convert to psychosis after a short period. Copyright © 2017. Published by Elsevier B.V.

  2. Climate Model Diagnostic Analyzer

    NASA Technical Reports Server (NTRS)

    Lee, Seungwon; Pan, Lei; Zhai, Chengxing; Tang, Benyang; Kubar, Terry; Zhang, Zia; Wang, Wei

    2015-01-01

    The comprehensive and innovative evaluation of climate models with newly available global observations is critically needed for the improvement of climate model current-state representation and future-state predictability. A climate model diagnostic evaluation process requires physics-based multi-variable analyses that typically involve large-volume and heterogeneous datasets, making them both computation- and data-intensive. Given the exploratory nature of climate data analyses and the explosive growth of datasets and service tools, scientists are struggling to keep track of their datasets, tools, and execution/study history, let alone share them with others. In response, we have developed a cloud-enabled, provenance-supported, web-service system called the Climate Model Diagnostic Analyzer (CMDA). CMDA enables physics-based, multi-variable model performance evaluations and diagnoses through the comprehensive and synergistic use of multiple observational data, reanalysis data, and model outputs. At the same time, CMDA provides a crowd-sourcing space where scientists can organize their work efficiently and share it with others. CMDA is empowered by many current state-of-the-art software packages in web services, provenance, and semantic search.

  3. Quantum-assisted Helmholtz machines: A quantum–classical deep learning framework for industrial datasets in near-term devices

    NASA Astrophysics Data System (ADS)

    Benedetti, Marcello; Realpe-Gómez, John; Perdomo-Ortiz, Alejandro

    2018-07-01

    Machine learning has been presented as one of the key applications for near-term quantum technologies, given its high commercial value and wide range of applicability. In this work, we introduce the quantum-assisted Helmholtz machine: a hybrid quantum–classical framework with the potential of tackling high-dimensional real-world machine learning datasets on continuous variables. Instead of using quantum computers only to assist deep learning, as previous approaches have suggested, we use deep learning to extract a low-dimensional binary representation of data, suitable for processing on relatively small quantum computers. Then, the quantum hardware and deep learning architecture work together to train an unsupervised generative model. We demonstrate this concept using 1644 quantum bits of a D-Wave 2000Q quantum device to model a sub-sampled version of the MNIST handwritten digit dataset with 16 × 16 continuous-valued pixels. Although we illustrate this concept on a quantum annealer, adaptations to other quantum platforms, such as ion-trap technologies or superconducting gate-model architectures, could be explored within this flexible framework.

  4. BABAR: an R package to simplify the normalisation of common reference design microarray-based transcriptomic datasets

    PubMed Central

    2010-01-01

    Background The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log2-ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data, but many data have yet to be analysed as existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR) software package, which uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets.
We show BABAR transforms real and simulated datasets to allow for the correct interpretation of these data, and is the ideal tool to facilitate the identification of differentially expressed genes or network inference analysis from transcriptomic datasets. PMID:20128918
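    The cyclic loess idea at the heart of BABAR's cross-dataset step can be illustrated in a few lines. The sketch below is a minimal Python rendering of the general technique on synthetic data, not the R package's implementation:

```python
# Minimal cyclic loess normalisation sketch: for each pair of arrays,
# fit a loess curve to the M (log-ratio) vs A (average intensity) trend
# and split the correction between the two arrays. Synthetic data.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def cyclic_loess(mat, n_iter=3, frac=0.4):
    """Normalise the columns (arrays) of a log2-expression matrix pairwise."""
    mat = mat.copy()
    k = mat.shape[1]
    for _ in range(n_iter):
        for i in range(k):
            for j in range(i + 1, k):
                m = mat[:, i] - mat[:, j]          # log-ratio
                a = (mat[:, i] + mat[:, j]) / 2.0  # average intensity
                fit = lowess(m, a, frac=frac, return_sorted=False)
                mat[:, i] -= fit / 2.0             # split the correction
                mat[:, j] += fit / 2.0
    return mat

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 4)) + np.array([0.0, 0.5, -0.3, 0.2])
norm = cyclic_loess(data)
print(norm.mean(axis=0))  # column means pulled towards each other
```

    Because every array is normalised against every other array, the whole collection ends up on one common scale, which is exactly what a heterogeneous multi-year dataset needs before a joint analysis.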

  5. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

    PubMed Central

    Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.

    2017-01-01

    Background As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. Methods We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Results Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. 
Discussion These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools—we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines. PMID:29372115

  6. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance.

    PubMed

    Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S

    2017-01-01

    As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. 
    These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.

  7. A hybrid organic-inorganic perovskite dataset

    NASA Astrophysics Data System (ADS)

    Kim, Chiho; Huan, Tran Doan; Krishnan, Sridevi; Ramprasad, Rampi

    2017-05-01

    Hybrid organic-inorganic perovskites (HOIPs) have been attracting a great deal of attention due to their versatile electronic properties and fabrication methods. We prepare a dataset of 1,346 HOIPs, which features 16 organic cations, 3 group-IV cations, and 4 halide anions. Using a combination of an atomic structure search method and density functional theory calculations, the optimized structures, the bandgap, the dielectric constant, and the relative energies of the HOIPs are uniformly prepared and validated by comparison with relevant experimental and/or theoretical data. We make the dataset available at the Dryad Digital Repository, NoMaD Repository, and Khazana Repository (http://khazana.uconn.edu/), in the hope that it will be useful for future data-mining efforts exploring possible structure-property relationships and phenomenological models. Progressive extension of the dataset is expected as new organic cations become appropriate within the HOIP framework, and as additional properties are calculated for the new compounds found.
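    A tabulated dataset of this kind lends itself to simple screening queries. The column names and values in the sketch below are hypothetical placeholders, not the dataset's actual schema; they only illustrate the kind of structure-property query such a table enables:

```python
# Hypothetical usage sketch: screen a HOIP-style property table for
# candidates in a target bandgap window. Schema and values are invented
# for illustration, not taken from the published dataset.
import pandas as pd

df = pd.DataFrame({
    "organic_cation": ["MA", "FA", "MA"],
    "b_site_cation": ["Pb", "Sn", "Ge"],   # group-IV cation
    "halide": ["I", "Br", "Cl"],
    "bandgap_eV": [1.55, 2.10, 3.05],      # illustrative values
    "dielectric_const": [25.0, 18.0, 10.0],
})

# Candidate absorbers in a 1.0-2.0 eV bandgap window.
candidates = df[df.bandgap_eV.between(1.0, 2.0)]
print(candidates[["organic_cation", "b_site_cation", "halide"]])
```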

  8. EnviroAtlas - Austin, TX - Riparian Buffer Land Cover by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset describes the percentage of forested, vegetated, and impervious land within 15 and 50 meters of hydrologically connected streams, rivers, and other water bodies within the EnviroAtlas community area. Forest is defined as Trees & Forest. Vegetated cover is defined as Trees & Forest and Grass & Herbaceous. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  9. Education and Work

    ERIC Educational Resources Information Center

    Trostel, Philip; Walker, Ian

    2006-01-01

    This paper examines the relationship between the incentives to work and to invest in human capital through education in a lifecycle optimizing model. These incentives are shown to be mutually reinforcing in a simple stylized model. This theoretical prediction is investigated empirically using three large micro datasets covering a broad range of…

  10. [German national consensus on wound documentation of leg ulcer : Part 1: Routine care - standard dataset and minimum dataset].

    PubMed

    Heyer, K; Herberger, K; Protz, K; Mayer, A; Dissemond, J; Debus, S; Augustin, M

    2017-09-01

    Standards for basic documentation and the course of treatment increase quality assurance and efficiency in health care. To date, no standards for the treatment of patients with leg ulcers are available in Germany. The aim of the study was to develop standards for the routine-care documentation of patients with leg ulcers. This article presents the recommended variables of a "standard dataset" and a "minimum dataset". Consensus building took place among experts from 38 scientific societies, professional associations, insurance and supply networks (n = 68 experts). After a systematic international literature search, available standards were reviewed and supplemented with the expert group's own considerations. From 2012-2015, documentation standards were defined in multistage online rounds and personal meetings. A consensus was achieved on 18 variables for the minimum dataset and 48 variables for the standard dataset in a total of seven meetings and nine online Delphi rounds. The datasets cover patient baseline data, general health status, wound characteristics, diagnostic and therapeutic interventions, patient-reported outcomes, nutrition, and education status. Based on a multistage, continuous decision-making process, a standard for measuring events in the routine care of patients with leg ulcers was developed.

  11. Automatic liver volume segmentation and fibrosis classification

    NASA Astrophysics Data System (ADS)

    Bal, Evgeny; Klang, Eyal; Amitai, Michal; Greenspan, Hayit

    2018-02-01

    In this work, we present an automatic method for liver segmentation and fibrosis classification in liver computed tomography (CT) portal-phase scans. The input is a full abdomen CT scan with an unknown number of slices, and the output is a liver volume segmentation mask and a fibrosis grade. A multi-stage analysis scheme is applied to each scan, including volume segmentation, texture feature extraction, and SVM-based classification. The data contain portal-phase CT examinations from 80 patients, taken with different scanners. Each examination has a matching Fibroscan grade. The dataset was subdivided into two groups: the first contains healthy cases and mild fibrosis; the second contains moderate fibrosis, severe fibrosis, and cirrhosis. Using our automated algorithm, we achieved an average Dice index of 0.93 ± 0.05 for segmentation, and a sensitivity of 0.92 and specificity of 0.81 for classification. To the best of our knowledge, this is the first end-to-end automatic framework for liver fibrosis classification; an approach that, once validated, can have great potential value in the clinic.
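    The final classification stage can be sketched schematically. The snippet below trains an SVM on synthetic stand-ins for the texture feature vectors and is not the paper's pipeline; it only shows the two-group classification setup:

```python
# Schematic sketch of the classification stage only: an SVM separating the
# two fibrosis groups from precomputed texture feature vectors. The feature
# vectors here are synthetic stand-ins, not real CT texture features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_group, n_features = 40, 10
healthy_mild = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
moderate_plus = rng.normal(1.0, 1.0, size=(n_per_group, n_features))
X = np.vstack([healthy_mild, moderate_plus])
y = np.array([0] * n_per_group + [1] * n_per_group)   # 0: healthy/mild, 1: moderate+

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```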

  12. New quality assurance program integrating "modern radiotherapy" within the German Hodgkin Study Group.

    PubMed

    Kriz, J; Baues, C; Engenhart-Cabillic, R; Haverkamp, U; Herfarth, K; Lukas, P; Schmidberger, H; Marnitz-Schulze, S; Fuchs, M; Engert, A; Eich, H T

    2017-02-01

    In radiotherapy (RT) of Hodgkin's lymphoma (HL), both field design and treatment techniques have changed substantially, from extended-field RT (EF-RT) to involved-field RT (IF-RT) and now to involved-node RT (IN-RT) and involved-site RT (IS-RT). The purpose of this article is to describe the establishment of a quality assurance program (QAP) covering modern RT techniques and field designs within the German Hodgkin Study Group (GHSG). In the era of modern conformal RT, this QAP had to be fundamentally adapted, and a new evaluation process was intensively discussed by the radiotherapeutic expert panel of the GHSG. The expert panel developed guidelines and criteria to analyse "modern" field designs and treatment techniques. This work is based on a dataset of 11 patients treated within the sixth study generation (HD16-17). To develop a QAP for "modern RT", the expert panel defined criteria for analysing current RT procedures. The consensus on a modified QAP in ongoing and future trials is presented. With this schedule, the QAP of the GHSG could serve as a model for other study groups.

  13. Data Basin: Expanding Access to Conservation Data, Tools, and People

    NASA Astrophysics Data System (ADS)

    Comendant, T.; Strittholt, J.; Frost, P.; Ward, B. C.; Bachelet, D. M.; Osborne-Gowey, J.

    2009-12-01

    Mapping and spatial analysis are a fundamental part of problem solving in conservation science, yet spatial data are widely scattered, difficult to locate, and often unavailable. Valuable time and resources are wasted locating and gaining access to important biological, cultural, and economic datasets, scientific analysis, and experts. As conservation problems become more serious and the demand to solve them grows more urgent, a new way to connect science and practice is needed. To meet this need, an open-access web tool called Data Basin (www.databasin.org) has been created by the Conservation Biology Institute in partnership with ESRI and the Wilburforce Foundation. Users of Data Basin can gain quick access to datasets, experts, groups, and tools to help solve real-world problems. Individuals and organizations can perform essential tasks such as exploring and downloading from a vast library of conservation datasets, uploading existing datasets, connecting to other external data sources, creating groups, and producing customized maps that can be easily shared. Data Basin encourages sharing and publishing, but also provides privacy and security for sensitive information when needed. Users can publish projects within Data Basin to tell more complete and rich stories of discovery and solutions. Projects are an ideal way to publish collections of datasets, maps and other information on the internet to reach wider audiences. Data Basin also houses individual centers that provide direct access to data, maps, and experts focused on specific geographic areas or conservation topics. Current centers being developed include the Boreal Information Centre, the Data Basin Climate Center, and proposed Aquatic and Forest Conservation Centers.

  14. Evaluating the evidence for non-monotonic dose-response relationships: A systematic literature review and (re-)analysis of in vivo toxicity data in the area of food safety.

    PubMed

    Varret, C; Beronius, A; Bodin, L; Bokkers, B G H; Boon, P E; Burger, M; De Wit-Bos, L; Fischer, A; Hanberg, A; Litens-Karlsson, S; Slob, W; Wolterink, G; Zilliacus, J; Beausoleil, C; Rousselle, C

    2018-01-15

    This study aims to evaluate the evidence for the existence of non-monotonic dose-responses (NMDRs) of substances in the area of food safety. The review was performed following systematic review methodology with the aim of identifying in vivo studies published between January 2002 and February 2015 containing evidence for potential NMDRs. Inclusion and reliability criteria were defined and used to select relevant and reliable studies. A set of six checkpoints was developed to establish the likelihood that the data retrieved contained evidence for NMDR. In this review, 49 in vivo studies were identified as relevant and reliable, of which 42 were used for dose-response analysis. These studies contained 179 in vivo dose-response datasets with at least five dose groups (and a control group), as fewer doses cannot provide evidence for NMDR. These datasets were extracted and analyzed using the PROAST software package. The resulting dose-response relationships were evaluated for possible evidence of NMDRs by applying the six checkpoints. In total, 10 out of the 179 in vivo datasets fulfilled all six checkpoints. While these datasets could be considered as providing evidence for NMDR, replicated studies would still be needed to check whether the results can be reproduced, to rule out that the non-monotonicity was caused by incidental anomalies in the specific study. This approach, combining a systematic review with a set of checkpoints, is new and appears useful for future evaluations of dose-response datasets for evidence of non-monotonicity. Published by Elsevier Inc.
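    One idea underlying such checkpoints, namely whether a dose-response model that allows a turn fits meaningfully better than a monotonic one, can be illustrated with a toy AIC comparison. This is not the PROAST methodology, and the dose-response data below are synthetic:

```python
# Toy illustration of a non-monotonicity check: compare a monotonic fit
# with a model allowing a turn (quadratic in log-dose) via AIC on a
# synthetic dataset with a control plus five dose groups.
import numpy as np
from scipy.optimize import curve_fit

dose = np.array([0.0, 1.0, 3.0, 10.0, 30.0, 100.0])
resp = np.array([1.0, 1.4, 1.8, 1.5, 1.1, 0.8])   # synthetic, non-monotonic

def monotonic(d, a, b):          # simple log-linear trend
    return a + b * np.log1p(d)

def with_turn(d, a, b, c):       # adds a quadratic term in log-dose
    x = np.log1p(d)
    return a + b * x + c * x**2

def aic(y, yhat, k):             # least-squares AIC (up to a constant)
    n = len(y)
    rss = np.sum((y - yhat) ** 2)
    return n * np.log(rss / n) + 2 * k

p1, _ = curve_fit(monotonic, dose, resp)
p2, _ = curve_fit(with_turn, dose, resp)
aic1 = aic(resp, monotonic(dose, *p1), 2)
aic2 = aic(resp, with_turn(dose, *p2), 3)
print(aic1, aic2)  # a lower AIC for the turning model suggests non-monotonicity
```

    As the abstract notes, a better fit in a single study is not proof of NMDR; replication is still needed to rule out incidental anomalies.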

  15. Mining Gene Regulatory Networks by Neural Modeling of Expression Time-Series.

    PubMed

    Rubiolo, Mariano; Milone, Diego H; Stegmayer, Georgina

    2015-01-01

    Discovering gene regulatory networks from data is one of the most studied topics in recent years. Neural networks can be successfully used to infer an underlying gene network by modeling expression profiles as time series. This work proposes a novel method based on a pool of neural networks for obtaining a gene regulatory network from a gene expression dataset. The networks model each possible interaction between pairs of genes in the dataset, and a set of mining rules is applied to accurately detect the underlying relations among genes. The results obtained on artificial and real datasets confirm the method's effectiveness for discovering regulatory networks from a proper modeling of the temporal dynamics of gene expression profiles.
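    The pairwise modeling idea can be sketched as follows. This simplified stand-in fits one small network per ordered gene pair on a synthetic three-gene time series and thresholds the fit quality, in place of the paper's mining rules:

```python
# Simplified stand-in for the pool-of-networks idea: fit one small regressor
# per ordered gene pair to predict x_j(t+1) from x_i(t), then keep the
# best-scoring pairs as candidate regulatory edges. Synthetic 3-gene series.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
T, G = 200, 3
x = np.zeros((T, G))
x[0] = rng.normal(size=G)
for t in range(T - 1):                       # gene 0 drives gene 1
    x[t + 1, 0] = 0.9 * x[t, 0] + 0.1 * rng.normal()
    x[t + 1, 1] = 0.8 * x[t, 0] + 0.1 * rng.normal()
    x[t + 1, 2] = rng.normal()               # gene 2 is pure noise

scores = {}
for i in range(G):
    for j in range(G):
        if i == j:
            continue
        net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
        net.fit(x[:-1, [i]], x[1:, j])
        scores[(i, j)] = net.score(x[:-1, [i]], x[1:, j])  # R^2 per pair

edges = [pair for pair, r2 in scores.items() if r2 > 0.5]
print(edges)  # expect (0, 1) among the detected edges
```

    The fixed R² threshold is a crude substitute for the paper's mining rules, but it conveys the structure: one model per directed pair, then a decision step that turns fit quality into network edges.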

  16. Optimized hardware framework of MLP with random hidden layers for classification applications

    NASA Astrophysics Data System (ADS)

    Zyarah, Abdullah M.; Ramesh, Abhishek; Merkel, Cory; Kudithipudi, Dhireesha

    2016-05-01

    Multilayer perceptron (MLP) networks with random hidden layers are very efficient at automatic feature extraction and offer significant performance improvements in the training process. They essentially employ a large collection of fixed, random features, and are expedient for form-factor-constrained embedded platforms. In this work, a reconfigurable and scalable architecture is proposed for MLPs with random hidden layers, with a customized building block based on the CORDIC algorithm. The proposed architecture also exploits fixed-point operations for area efficiency. The design is validated for classification on two different datasets: an accuracy of ~90% was observed on the MNIST dataset and 75% for gender classification on the LFW dataset. The hardware achieves a 299× speed-up over the corresponding software realization.
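    The model class the hardware implements, an MLP whose hidden layer is fixed and random so that only the output weights are trained, can be sketched in a few lines. This is a floating-point software stand-in solving the output layer by least squares; it omits the fixed-point CORDIC details of the hardware design:

```python
# MLP with a fixed random hidden layer: the hidden weights are never trained,
# and the output layer is solved in closed form by least squares.
import numpy as np

rng = np.random.default_rng(0)

def fit_random_mlp(X, y_onehot, n_hidden=100):
    W = rng.normal(size=(X.shape[1], n_hidden))   # fixed random features
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                        # hidden activations
    beta, *_ = np.linalg.lstsq(H, y_onehot, rcond=None)
    return W, b, beta

def predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

# Two-class toy problem standing in for the MNIST / LFW gender tasks.
X = np.vstack([rng.normal(-1, 1, (100, 20)), rng.normal(1, 1, (100, 20))])
y = np.array([0] * 100 + [1] * 100)
Y = np.eye(2)[y]                                  # one-hot targets
W, b, beta = fit_random_mlp(X, Y)
acc = (predict(X, W, b, beta) == y).mean()
print(acc)
```

    Since the random features are fixed, "training" reduces to one linear solve, which is what makes the scheme attractive for constrained embedded hardware.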

  17. Crowd counting via region based multi-channel convolution neural network

    NASA Astrophysics Data System (ADS)

    Cao, Xiaoguang; Gao, Siqi; Bai, Xiangzhi

    2017-11-01

    This paper proposes a novel region-based multi-channel convolution neural network architecture for crowd counting. To effectively handle the perspective distortion in crowd datasets with a great diversity of scales, the architecture combines a main channel with three branch channels. These channels extract both global and region features, and the results are used to estimate the density map. Moreover, kernels with ladder-shaped sizes are designed across the branch channels, which generate adaptive region features. The branch channels also use networks of differing depths to achieve a more accurate detector. With these strategies, the proposed architecture achieves state-of-the-art performance on the ShanghaiTech datasets and competitive performance on the UCF_CC_50 dataset.

  18. Re-Construction of Reference Population and Generating Weights by Decision Tree

    DTIC Science & Technology

    2017-07-21

    Claflin University, Orangeburg, SC 29115. Defense Equal Opportunity Management Institute Research, Development, and Strategic… [List-of-figures residue: Figure 1: Flow and Components of Project; Figure 2: Decision Tree; Figure 3: Effects of Weight…] The dataset of this project has the reference population at the unit level for group and gender, against which the sample data can be compared.

  19. Author Correction: Re-examination of the relationship between marine virus and microbial cell abundances.

    PubMed

    Wigington, Charles H; Sonderegger, Derek; Brussaard, Corina P D; Buchan, Alison; Finke, Jan F; Fuhrman, Jed A; Lennon, Jay T; Middelboe, Mathias; Suttle, Curtis A; Stock, Charles; Wilson, William H; Wommack, K Eric; Wilhelm, Steven W; Weitz, Joshua S

    2017-11-01

    The original publication of this Article included analysis of virus and microbial cell abundances and virus-to-microbial cell ratios. Data in the Article came from 25 studies intended to be exclusively from marine sites. However, 3 of the studies included in the original unified dataset were erroneously classified as marine sites during compilation. The records with mis-recorded longitude and latitude values were, in fact, taken from inland, freshwater sources. The three inland, freshwater datasets are ELA, TROUT and SWAT. The data from these three studies represent 163 of the 5,671 records in the original publication. In the updated version of the Article, all analyses have been recalculated using the same statistical analysis pipeline released via GitHub as part of the original publication. Removal of the three studies reduces the unified dataset to 5,508 records. Analyses involving all grouped datasets have been updated with changes noted in each figure. All key results remain qualitatively unchanged. All data and scripts used in this correction have been made available as a new, updated GitHub release to reflect the updated dataset and figures.

  20. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    PubMed

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using the Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimates. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. 
Species distribution dataset mergers, such as the one exemplified here, can serve as a baseline towards comprehensive species distribution datasets.
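    The cell-wise agreement measure is straightforward to compute. The sketch below evaluates the Jaccard similarity of two synthetic binary presence grids, standing in for one species' 50-km occurrence maps from two atlases:

```python
# Jaccard similarity between two binary presence/absence grids for one
# species. The grids here are synthetic, not atlas data.
import numpy as np

rng = np.random.default_rng(0)
atlas_a = rng.random((60, 40)) < 0.3      # presence grid, dataset A
atlas_b = atlas_a.copy()
flip = rng.random((60, 40)) < 0.05        # ~5% of cells disagree
atlas_b[flip] = ~atlas_b[flip]

intersection = np.logical_and(atlas_a, atlas_b).sum()
union = np.logical_or(atlas_a, atlas_b).sum()
jaccard = intersection / union            # 1.0 means identical maps
print(jaccard)
```

    Mapping this index per grid cell or per region, as the study does, turns a single agreement number into a spatially explicit picture of where two datasets disagree.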

  1. MA130301GT catalogue of Martian impact craters and advanced evaluation of crater detection algorithms using diverse topography and image datasets

    NASA Astrophysics Data System (ADS)

    Salamunićcar, Goran; Lončarić, Sven; Pina, Pedro; Bandeira, Lourenço; Saraiva, José

    2011-01-01

    Recently, all the craters from the major currently available manually assembled catalogues have been merged into a single catalogue of 57 633 known Martian impact craters (MA57633GT). In addition, work on a crater detection algorithm (CDA), developed to search for still-uncatalogued impact craters using 1/128° MOLA data, resulted in MA115225GT. In parallel with this work, another CDA was developed, resulting in the Stepinski catalogue of 75 919 craters (MA75919T). The new MA130301GT catalogue presented in this paper is the result of: (1) an overall merger of MA115225GT and MA75919T; (2) 2042 additional craters found using the Shen-Castan-based CDA from previous work and 1/128° MOLA data; and (3) 3129 additional craters found using the CDA for optical images from previous work and selected regions of the 1/256° MDIM, 1/256° THEMIS-DIR, and 1/256° MOC datasets. All craters from MA130301GT are manually aligned with all the used datasets. For all the craters that originate from the used catalogues (Barlow, Rodionova, Boyce, Kuzmin, Stepinski), we integrated all the attributes available in these catalogues. With such an approach, MA130301GT provides everything that was included in these catalogues, plus: (1) the correlation between various morphological descriptors from the used catalogues; (2) the correlation between manually assigned attributes and automated depth/diameter measurements from MA75919T and our CDA; (3) surface dating, which has been improved in resolution globally; (4) average errors and their standard deviations for manually and automatically assigned attributes such as position coordinates, diameter, depth/diameter ratio, etc.; and (5) positional accuracy of features in the used datasets according to the defined coordinate system, referred to as MDIM 2.1, which incorporates 1232 globally distributed ground control points; our catalogue additionally contains 130 301 cross-references between each of the used datasets. Global completeness of MA130301GT extends down to D ≥ 2 km (it contains 85 783 such craters, the smallest being D = 0.924 km). This is a considerable improvement over the completeness of the Rodionova (~10 km), Barlow (~5 km) and Stepinski (~3 km) catalogues. An accompanying result of the new catalogue is a contribution to the evaluation of CDAs; the following methods have been developed: (1) a new context-aware method for advanced automated registration of craters with GT catalogues; (2) a new method for manual registration of newly found craters into GT catalogues; and (3) additional accompanying methods for objective evaluation of CDAs using different datasets, including optical images.

  2. ISRUC-Sleep: A comprehensive public dataset for sleep researchers.

    PubMed

    Khalighi, Sirvan; Sousa, Teresa; Santos, José Moutinho; Nunes, Urbano

    2016-02-01

    Publicly available datasets of high-quality content are important for comparing the performance of new methods for sleep pattern analysis. We introduce an open-access comprehensive sleep dataset, called ISRUC-Sleep. The data were obtained from human adults, including healthy subjects, subjects with sleep disorders, and subjects under the effect of sleep medication. Each recording was randomly selected from among the PSG recordings acquired by the Sleep Medicine Centre of the Hospital of Coimbra University (CHUC). The dataset comprises three groups of data: (1) data from 100 subjects, with one recording session per subject; (2) data from 8 subjects, with two recording sessions per subject; and (3) data from one recording session for each of 10 healthy subjects. The polysomnography (PSG) recordings associated with each subject were visually scored by two human experts. Compared with existing sleep-related public datasets, ISRUC-Sleep provides data from a reasonable number of subjects with varied characteristics, including data useful for studies of changes in PSG signals over time, and data from healthy subjects useful for comparisons with patients suffering from sleep disorders. The dataset was created to complement existing datasets by providing easy-to-apply data with characteristics not yet covered. ISRUC-Sleep can be useful for new contributions (i) in biomedical signal processing; (ii) in the development of automatic sleep stage classification (ASSC) methods; and (iii) in sleep physiology studies. To allow evaluation and comparison of new contributions that use this dataset as a benchmark, results of applying a subject-independent ASSC method to the ISRUC-Sleep dataset are presented. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  3. Patterns, biases and prospects in the distribution and diversity of Neotropical snakes.

    PubMed

    Guedes, Thaís B; Sawaya, Ricardo J; Zizka, Alexander; Laffan, Shawn; Faurby, Søren; Pyron, R Alexander; Bérnils, Renato S; Jansen, Martin; Passos, Paulo; Prudente, Ana L C; Cisneros-Heredia, Diego F; Braz, Henrique B; Nogueira, Cristiano de C; Antonelli, Alexandre; Meiri, Shai

    2018-01-01

    We generated a novel database of Neotropical snakes (one of the world's richest herpetofaunas) combining the most comprehensive, manually compiled distribution dataset with publicly available data. We assess, for the first time, the diversity patterns for all Neotropical snakes as well as sampling density and sampling biases. We compiled three databases of species occurrences: a dataset downloaded from the Global Biodiversity Information Facility (GBIF), a verified dataset built through taxonomic work and specialized literature, and a combined dataset comprising a cleaned version of the GBIF dataset merged with the verified dataset. Location: Neotropics (Behrmann projection, grid cells equivalent to 1° × 1°). Time period: specimens housed in museums during the last 150 years. Major taxa: Squamata: Serpentes. Methods: geographical information system (GIS). The combined dataset provides the most comprehensive distribution database for Neotropical snakes to date. It contains 147,515 records for 886 species across 12 families, representing 74% of all species of snakes, spanning 27 countries in the Americas. Species richness and phylogenetic diversity show overall similar patterns. Amazonia is the least sampled Neotropical region, whereas most well-sampled sites are located near large universities and scientific collections. We provide a list and updated maps of geographical distribution of all snake species surveyed. The biodiversity metrics of Neotropical snakes reflect patterns previously documented for other vertebrates, suggesting that similar factors may determine the diversity of both ectothermic and endothermic animals. We suggest conservation strategies for high-diversity areas and that sampling efforts be directed towards Amazonia and poorly known species.

  4. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments Riparian Buffer for the Conterminous United States: 2010 US Census Housing Unit and Population Density

    EPA Pesticide Factsheets

    This dataset represents the population and housing unit density within individual, local NHDPlusV2 catchments and upstream, contributing watersheds' riparian buffers, based on 2010 US Census data. Densities are calculated for every block group, and watershed averages are calculated for every local NHDPlusV2 catchment (see Data Sources for links to NHDPlusV2 data and Census data). This dataset is derived from the TIGER/Line files and related database (.dbf) files for the conterminous USA, downloaded as Block Group-Level Census 2010 SF1 Data in File Geodatabase Format (ArcGIS version 10.0). The landscape raster (LR) was produced from the data compiled from the questions asked of all people and about every housing unit. The ratios (block-group population / block-group area) and (block-group housing units / block-group area) were summarized by local catchment and by watershed to produce catchment-level and watershed-level metrics as a continuous data type (see Data Structure and Attribute Information for a description).
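
    The density roll-up described above is simple area-weighted arithmetic. The sketch below uses invented numbers and an illustrative layout, not the actual StreamCat schema or values:

```python
# Block-group densities rolled up to an area-weighted watershed summary,
# mirroring the (population / area) and (housing units / area) metrics
# described above. Numbers and layout are illustrative only.

block_groups = [
    # (population, housing_units, area_km2) for block groups in one watershed
    (1200, 500, 2.0),
    (300, 120, 6.0),
]

total_area = sum(area for _, _, area in block_groups)

# Watershed-level densities: total counts over total contributing area
pop_density = sum(pop for pop, _, _ in block_groups) / total_area   # people/km2
hu_density = sum(hu for _, hu, _ in block_groups) / total_area      # units/km2
```

    Summing counts before dividing by total area gives the area-weighted mean of the block-group densities, which is the usual way such watershed metrics are aggregated.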

  5. A CCA+ICA based model for multi-task brain imaging data fusion and its application to schizophrenia.

    PubMed

    Sui, Jing; Adali, Tülay; Pearlson, Godfrey; Yang, Honghui; Sponheim, Scott R; White, Tonya; Calhoun, Vince D

    2010-05-15

    Collection of multiple-task brain imaging data from the same subject has become common practice in medical imaging studies. In this paper, we propose a simple yet effective model, "CCA+ICA", as a powerful tool for multi-task data fusion. This joint blind source separation (BSS) model combines two multivariate methods, canonical correlation analysis and independent component analysis, to achieve high estimation accuracy and to provide the correct connection between two datasets in which sources can have either common or distinct between-dataset correlation. In both simulated and real fMRI applications, we compare the proposed scheme with other joint BSS models and examine the different modeling assumptions. The contrast images of two tasks, sensorimotor (SM) and Sternberg working memory (SB), derived from a general linear model (GLM), served as the real multi-task fMRI data; both were collected from 50 schizophrenia patients and 50 healthy controls. When examining the relationship with duration of illness, CCA+ICA revealed a significant negative correlation with temporal lobe activation. Furthermore, CCA+ICA located the sensorimotor cortex as the group-discriminative region for both tasks and identified the superior temporal gyrus in SM and the prefrontal cortex in SB as task-specific group-discriminative brain networks. In summary, we compared the new approach to several competitive methods with different assumptions and found consistent results regarding each of their hypotheses on connecting the two tasks. Such an approach fills a gap in existing multivariate methods for identifying biomarkers from brain imaging data.
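
    The CCA stage of such a pipeline can be sketched numerically: after centering, the canonical correlations between two datasets are the singular values of Qx^T Qy, where Qx and Qy are orthonormal bases for the two data matrices. The data here are synthetic, and the ICA rotation that follows in the paper's CCA+ICA model is omitted:

```python
import numpy as np

# Two synthetic datasets sharing one source; CCA should recover one strong
# and one weak canonical correlation.
rng = np.random.default_rng(0)
n = 500
shared = rng.standard_normal(n)                      # source common to both
X = np.column_stack([shared + 0.1 * rng.standard_normal(n),
                     rng.standard_normal(n)])
Y = np.column_stack([shared + 0.1 * rng.standard_normal(n),
                     rng.standard_normal(n)])

def canonical_correlations(X, Y):
    """Canonical correlations via orthonormal bases (Bjorck-Golub approach)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)                          # orthonormal basis of col(X)
    Qy, _ = np.linalg.qr(Y)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

corrs = canonical_correlations(X, Y)
# corrs[0] should be close to 1 (shared source); corrs[1] should be small.
```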

  6. Long-term personality data collection in support of spaceflight and analogue research.

    PubMed

    Musson, David M; Helmreich, Robert L

    2005-06-01

    This is a review of past and present research into personality and performance at the University of Texas (UT) Human Factors Research Project. Specifically, personality trait data collected from astronauts, pilots, Antarctic personnel, and other groups over a 15-yr period is discussed with particular emphasis on research in space and space analogue environments. The UT Human Factors Research Project conducts studies in personality and group dynamics in aviation, space, and medicine. Current studies include personality determinants of professional cultures, team effectiveness in both medicine and aviation, and personality predictors of long-term astronaut performance. The Project also studies the design and effectiveness of behavioral strategies used to minimize error and maximize team performance in safety-critical work settings. A multi-year personality and performance dataset presents many opportunities for research, including long-term and follow-up studies of human performance, analyses of trends in recruiting and attrition, and the ability to adapt research design to operational changes and methodological advances. Special problems posed by such long-duration projects include issues of confidentiality and security, as well as practical limitations imposed by current peer-review and short-term funding practices. Practical considerations for ongoing dataset management include consistency of assessment instruments over time, variations in data acquisition from one year to the next, and dealing with changes in theory and practice that occur over the life of the project. A fundamental change in how research into human performance is funded would be required to ensure the ongoing development of such long-duration research databases.

  7. Stereotypes Possess Heterogeneous Directionality: A Theoretical and Empirical Exploration of Stereotype Structure and Content

    PubMed Central

    Cox, William T. L.; Devine, Patricia G.

    2015-01-01

    We advance a theory-driven approach to stereotype structure, informed by connectionist theories of cognition. Whereas traditional models define or tacitly assume that stereotypes possess inherently Group → Attribute activation directionality (e.g., Black activates criminal), our model predicts heterogeneous stereotype directionality. Alongside the classically studied Group → Attribute stereotypes, some stereotypes should be bidirectional (i.e., Group ⇄ Attribute) and others should have Attribute → Group unidirectionality (e.g., fashionable activates gay). We tested this prediction in several large-scale studies with human participants (combined N = 4,817), assessing stereotypic inferences among various groups and attributes. Supporting predictions, we found heterogeneous directionality both among the stereotype links related to a given social group and also between the links of different social groups. These efforts yield rich datasets that map the networks of stereotype links related to several social groups. We make these datasets publicly available, enabling other researchers to explore a number of questions related to stereotypes and stereotyping. Stereotype directionality is an understudied feature of stereotypes and stereotyping with widespread implications for the development, measurement, maintenance, expression, and change of stereotypes, stereotyping, prejudice, and discrimination. PMID:25811181

  8. Stereotypes possess heterogeneous directionality: a theoretical and empirical exploration of stereotype structure and content.

    PubMed

    Cox, William T L; Devine, Patricia G

    2015-01-01

    We advance a theory-driven approach to stereotype structure, informed by connectionist theories of cognition. Whereas traditional models define or tacitly assume that stereotypes possess inherently Group → Attribute activation directionality (e.g., Black activates criminal), our model predicts heterogeneous stereotype directionality. Alongside the classically studied Group → Attribute stereotypes, some stereotypes should be bidirectional (i.e., Group ⇄ Attribute) and others should have Attribute → Group unidirectionality (e.g., fashionable activates gay). We tested this prediction in several large-scale studies with human participants (combined N = 4,817), assessing stereotypic inferences among various groups and attributes. Supporting predictions, we found heterogeneous directionality both among the stereotype links related to a given social group and also between the links of different social groups. These efforts yield rich datasets that map the networks of stereotype links related to several social groups. We make these datasets publicly available, enabling other researchers to explore a number of questions related to stereotypes and stereotyping. Stereotype directionality is an understudied feature of stereotypes and stereotyping with widespread implications for the development, measurement, maintenance, expression, and change of stereotypes, stereotyping, prejudice, and discrimination.

  9. Blockmodels for connectome analysis

    NASA Astrophysics Data System (ADS)

    Moyer, Daniel; Gutman, Boris; Prasad, Gautam; Faskowitz, Joshua; Ver Steeg, Greg; Thompson, Paul

    2015-12-01

    In the present work we study a family of generative network models and their applications for modeling the human connectome. We introduce a minor but novel variant of the Mixed Membership Stochastic Blockmodel and apply it, along with two other related models, to two human connectome datasets (ADNI and a bipolar disorder dataset) with both control and diseased subjects. We further provide a simple generative classifier that, alongside more discriminating methods, provides evidence that blockmodels accurately summarize tractography count networks with respect to a disease classification task.
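
    For intuition about this model family, here is a minimal single-membership stochastic blockmodel sampler; the Mixed Membership variant studied in the paper additionally gives each node its own distribution over blocks. Block sizes and probabilities below are invented:

```python
import numpy as np

# Sample an undirected network from a two-block stochastic blockmodel and
# verify that within-block edge density exceeds between-block density.
rng = np.random.default_rng(42)
z = np.array([0] * 30 + [1] * 30)        # block assignment per node
B = np.array([[0.60, 0.05],
              [0.05, 0.60]])             # block-to-block edge probabilities

P = B[z][:, z]                           # per-pair edge probabilities
A = (rng.random(P.shape) < P).astype(int)
A = np.triu(A, 1)                        # keep upper triangle: undirected,
A = A + A.T                              # no self-loops, symmetric

within = A[:30, :30].mean()              # observed within-block density
between = A[:30, 30:].mean()             # observed between-block density
```

    Fitting reverses this process: given an observed adjacency matrix, one infers the block assignments and the block probability matrix that best explain it.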

  10. Final Report on the Creation of the Wind Integration National Dataset (WIND) Toolkit and API: October 1, 2013 - September 30, 2015

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hodge, Bri-Mathias

    2016-04-08

    The primary objective of this work was to create a state-of-the-art national wind resource data set and to provide detailed wind plant output data for specific sites based on that data set. Corresponding retrospective wind forecasts were also included at all selected locations. The combined information from these activities was used to create the Wind Integration National Dataset (WIND), and an extraction tool was developed to allow web-based data access.

  11. Challenges and Opportunities for Developing Capacity in Earth Observations for Agricultural Monitoring: The GEOGLAM Experience

    NASA Astrophysics Data System (ADS)

    Whitcraft, A. K.; Di Bella, C. M.; Becker Reshef, I.; Deshayes, M.; Justice, C. O.

    2015-12-01

    Since 2011, the Group on Earth Observations Global Agricultural Monitoring (GEOGLAM) Initiative has been working to strengthen the international community's capacity to use Earth observation (EO) data to derive timely, accurate, and transparent information on agriculture, with the goals of reducing market volatility and promoting food security. GEOGLAM aims to develop capacity for EO-based agricultural monitoring at multiple scales, from national to regional to global. This is accomplished through training workshops, developing and transferring best practices, establishing networks of broad and sustainable institutional support, and designing or adapting tools and methodologies to fit localized contexts. Over the past four years, capacity development activities in the context of GEOGLAM have spanned all agriculture-containing continents, with much more work to be done, particularly in promoting access to large, computationally costly datasets. This talk will detail GEOGLAM's experiences, challenges, and opportunities surrounding building international collaboration, ensuring institutional buy-in, and developing sustainable programs.

  12. Chairmanship of the Neptune/Pluto outer planets science working group

    NASA Astrophysics Data System (ADS)

    Stern, S. Alan

    1993-11-01

    The Outer Planets Science Working Group (OPSWG) is the NASA Solar System Exploration Division (SSED) scientific steering committee for the Outer Solar System missions. OPSWG consists of 19 members and is chaired by Dr. S. Alan Stern. This proposal summarizes the FY93 activities of OPSWG, describes a set of objectives for OPSWG in FY94, and outlines the SWG's activities for FY95. As chair of OPSWG, Dr. Stern will be responsible for: organizing priorities, setting agendas, and conducting meetings of the Outer Planets SWG; reporting the results of OPSWG's work to SSED; supporting those activities relating to OPSWG work, such as briefings to the SSES, COMPLEX, and OSS; supporting the JPL/SAIC Pluto study team; and other tasks requested by SSED. As the Scientific Working Group (SWG) for Jupiter and the planets beyond, OPSWG is the SSED SWG chartered to study and develop mission plans for all missions to the giant planets, Pluto, and other distant objects in the remote outer solar system. In that role, OPSWG is responsible for: defining and prioritizing scientific objectives for missions to these bodies; defining and documenting the scientific goals and rationale behind such missions; defining and prioritizing the datasets to be obtained in these missions; defining and prioritizing measurement objectives for these missions; defining and documenting the scientific rationale for strawman instrument payloads; defining and prioritizing the scientific requirements for orbital tour and flyby encounter trajectories; defining a cruise science opportunities plan; providing technical feedback to JPL and SSED on the scientific capabilities of engineering studies for these missions; providing documentation to SSED concerning the scientific goals, objectives, and rationale for the mission; interfacing with other SSED and OSS committees at the request of SSED's Director or those committee chairs; providing input to SSED concerning the structure and content of the Announcement of Opportunity for payload and scientific team selection for such missions; and providing other technical or programmatic inputs concerning outer solar system missions at the request of the Director of SSED.

  14. Imaging the Western Iberia Seismic Structure from the Crust to the Upper Mantle from Ambient Noise Tomography

    NASA Astrophysics Data System (ADS)

    Silveira, Graça; Kiselev, Sergey; Stutzmann, Eleonore; Schimmel, Martin; Haned, Abderrahmane; Dias, Nuno; Morais, Iolanda; Custódio, Susana

    2015-04-01

    Ambient Noise Tomography (ANT) is now widely used to image subsurface seismic structure, with a resolution that depends mainly on the seismic network coverage. Most such studies are limited to Rayleigh waves at periods shorter than 40-45 s and can therefore image only the crust or, at most, the uppermost mantle. Recently, several studies have shown that the analysis can be extended to longer periods, allowing deeper probing. In this work we present the combination of two complementary datasets. The first was obtained from the analysis of ambient noise in the period range 5-50 s for Western Iberia, using a dense temporary seismic network that operated between 2010 and 2012. The second was computed for a global study, in the period range 30-250 s, from the analysis of 150 stations of the global networks GEOSCOPE and GSN. In both datasets, the empirical Green's functions are computed by phase cross-correlation, and the ambient noise phase cross-correlations are stacked using the time-frequency domain phase-weighted stack (Schimmel et al. 2011, Geoph. J. Int., 184, 494-506). A bootstrap approach is used to measure the group velocities between pairs of stations and to estimate the corresponding error. We observed good agreement between the dispersion measurements of the short-period and long-period datasets for most of the grid nodes. They are then inverted to obtain the 3D S-wave model from the crust to the upper mantle, using a Bayesian approach. A simulated annealing method is applied, in which the number of splines describing the model is adapted within the inversion. We compare the S-wave velocity model at selected profiles with the S-wave velocity models obtained from joint inversion of Ps and Sp receiver functions. The results from ambient noise tomography and from body-wave analysis of the crust and upper mantle are consistent.
This work is supported by project AQUAREL (PTDC/CTEGIX/116819/2010) and is a contribution to project QuakeLoc-PT (PTDC/GEO-FIQ/3522/2012).
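
    The bootstrap error estimation mentioned in the abstract above amounts to resampling the measurements with replacement and looking at the spread of the resampled statistic. The dispersion picks below are invented for illustration:

```python
import numpy as np

# Bootstrap mean and standard error for a set of hypothetical inter-station
# group-velocity picks (km/s) at a single period.
rng = np.random.default_rng(1)
picks = np.array([3.02, 2.98, 3.05, 3.01, 2.97, 3.04, 3.00, 2.99])

def bootstrap_mean_and_error(x, n_boot=2000):
    """Resample with replacement; return bootstrap mean and standard error."""
    means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                      for _ in range(n_boot)])
    return means.mean(), means.std(ddof=1)

velocity, sigma = bootstrap_mean_and_error(picks)
```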

  15. Long-term records of global radiation, carbon and water fluxes derived from multi-satellite data and a process-based model

    NASA Astrophysics Data System (ADS)

    Ryu, Youngryel; Jiang, Chongya

    2016-04-01

    To gain insights about the underlying impacts of global climate change on terrestrial ecosystem fluxes, we present a long-term (1982-2015) global radiation, carbon and water fluxes products by integrating multi-satellite data with a process-based model, the Breathing Earth System Simulator (BESS). BESS is a coupled processed model that integrates radiative transfer in the atmosphere and canopy, photosynthesis (GPP), and evapotranspiration (ET). BESS was designed most sensitive to the variables that can be quantified reliably, fully taking advantages of remote sensing atmospheric and land products. Originally, BESS entirely relied on MODIS as input variables to produce global GPP and ET during the MODIS era. This study extends the work to provide a series of long-term products from 1982 to 2015 by incorporating AVHRR data. In addition to GPP and ET, more land surface processes related datasets are mapped to facilitate the discovery of the ecological variations and changes. The CLARA-A1 cloud property datasets, the TOMS aerosol datasets, along with the GLASS land surface albedo datasets, were input to a look-up table derived from an atmospheric radiative transfer model to produce direct and diffuse components of visible and near infrared radiation datasets. Theses radiation components together with the LAI3g datasets and the GLASS land surface albedo datasets, were used to calculate absorbed radiation through a clumping corrected two-stream canopy radiative transfer model. ECMWF ERA interim air temperature data were downscaled by using ALP-II land surface temperature dataset and a region-dependent regression model. The spatial and seasonal variations of CO2 concentration were accounted by OCO-2 datasets, whereas NOAA's global CO2 growth rates data were used to describe interannual variations. All these remote sensing based datasets are used to run the BESS. 
Daily fluxes in 1/12 degree were computed and then aggregated to half-month interval to match with the spatial-temporal resolution of LAI3g dataset. The BESS GPP and ET products were compared to other independent datasets including MPI-BGC and CLM. Overall, the BESS products show good agreement with the other two datasets, indicating a compelling potential for bridging remote sensing and land surface models.

  16. Global relationships in river hydromorphology

    NASA Astrophysics Data System (ADS)

    Pavelsky, T.; Lion, C.; Allen, G. H.; Durand, M. T.; Schumann, G.; Beighley, E.; Yang, X.

    2017-12-01

    Since the widespread adoption of digital elevation models (DEMs) in the 1980s, most global and continental-scale analysis of river flow characteristics has focused on measurements derived from DEMs, such as drainage area, elevation, and slope. These variables (especially drainage area) have been related to other quantities of interest, such as river width, depth, and velocity, via empirical relationships that often take the form of power laws. More recently, a number of groups have developed more direct measurements of river location and some aspects of planform geometry from optical satellite imagery at regional, continental, and global scales. However, these satellite-derived datasets often lack many of the qualities that make DEM-derived datasets attractive, including robust network topology. Here, we present analysis of a dataset that combines the Global River Widths from Landsat (GRWL) database of river location, width, and braiding index with a river database extracted from the Shuttle Radar Topography Mission DEM and the HydroSHEDS dataset. Using these combined tools, we present a dataset that includes measurements of river width, slope, braiding index, upstream drainage area, and other variables. The dataset is available everywhere that both source datasets are available, which includes all continental areas south of 60°N with rivers sufficiently large to be observed with Landsat imagery. We use the dataset to examine patterns and frequencies of river form across continental and global scales, as well as global relationships among variables including width, slope, and drainage area. The results demonstrate the complex relationships among different dimensions of river hydromorphology at the global scale.
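
    Empirical power laws such as width = a · (drainage area)^b are linear in log-log space, so their parameters can be recovered by ordinary least squares on the logged variables. The data and coefficients below are synthetic, not values from GRWL or HydroSHEDS:

```python
import numpy as np

# Fit a hydraulic-geometry power law w = a * A^b by log-log regression.
rng = np.random.default_rng(7)
area = 10 ** rng.uniform(1, 5, 200)                  # drainage areas, km^2
a_true, b_true = 0.5, 0.4                            # made-up coefficients
width = a_true * area ** b_true * np.exp(0.1 * rng.standard_normal(200))

# log w = log a + b log A, so a straight-line fit recovers (b, log a)
b_fit, log_a_fit = np.polyfit(np.log(area), np.log(width), 1)
a_fit = np.exp(log_a_fit)
```

    Multiplicative (lognormal) noise is used because scatter in width-area relations is typically proportional rather than additive.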

  17. Automated grouping of action potentials of human embryonic stem cell-derived cardiomyocytes.

    PubMed

    Gorospe, Giann; Zhu, Renjun; Millrod, Michal A; Zambidis, Elias T; Tung, Leslie; Vidal, Rene

    2014-09-01

    Methods for obtaining cardiomyocytes from human embryonic stem cells (hESCs) are improving at a significant rate. However, the characterization of these cardiomyocytes (CMs) is evolving at a relatively slower rate. In particular, there is still uncertainty in classifying the phenotype (ventricular-like, atrial-like, nodal-like, etc.) of an hESC-derived cardiomyocyte (hESC-CM). While previous studies identified the phenotype of a CM based on electrophysiological features of its action potential, the criteria for classification were typically subjective and differed across studies. In this paper, we use techniques from signal processing and machine learning to develop an automated approach to discriminate the electrophysiological differences between hESC-CMs. Specifically, we propose a spectral grouping-based algorithm to separate a population of CMs into distinct groups based on the similarity of their action potential shapes. We applied this method to a dataset of optical maps of cardiac cell clusters dissected from human embryoid bodies. While some of the nine cell clusters in the dataset are presented with just one phenotype, the majority of the cell clusters are presented with multiple phenotypes. The proposed algorithm is generally applicable to other action potential datasets and could prove useful in investigating the purification of specific types of CMs from an electrophysiological perspective.
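
    The spectral grouping idea, a similarity matrix over waveform shapes followed by clustering in the Laplacian eigenspace, can be sketched as follows. The waveforms and parameters are synthetic stand-ins, not the paper's optical-mapping data or its exact algorithm:

```python
import numpy as np

# Separate two families of toy action-potential shapes (long vs. short
# plateau) by the sign of the Fiedler vector of a normalized graph Laplacian.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 100)

def ap(plateau):
    """Toy action potential: sharp upstroke plus a decaying plateau."""
    return (np.exp(-((t - 0.1) / 0.02) ** 2)
            + plateau * (t > 0.1) * np.exp(-(t - 0.1) / 0.3))

X = np.array([ap(1.0) + 0.02 * rng.standard_normal(t.size) for _ in range(10)]
             + [ap(0.2) + 0.02 * rng.standard_normal(t.size) for _ in range(10)])

# Pairwise waveform distances -> Gaussian affinity -> normalized Laplacian
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
W = np.exp(-(D / D.mean()) ** 2)
deg = W.sum(axis=1)
L = np.eye(len(X)) - W / np.sqrt(np.outer(deg, deg))

# For two well-separated groups, the sign of the second-smallest eigenvector
# (the Fiedler vector) splits the population into the two shape families.
_, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)
```

    For more than two phenotypes one would keep several eigenvectors and run k-means in that embedding, which is the standard spectral clustering recipe.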

  18. Automated Grouping of Action Potentials of Human Embryonic Stem Cell-Derived Cardiomyocytes

    PubMed Central

    Gorospe, Giann; Zhu, Renjun; Millrod, Michal A.; Zambidis, Elias T.; Tung, Leslie; Vidal, René

    2015-01-01

    Methods for obtaining cardiomyocytes from human embryonic stem cells (hESCs) are improving at a significant rate. However, the characterization of these cardiomyocytes is evolving at a relatively slower rate. In particular, there is still uncertainty in classifying the phenotype (ventricular-like, atrial-like, nodal-like, etc.) of an hESC-derived cardiomyocyte (hESC-CM). While previous studies identified the phenotype of a cardiomyocyte based on electrophysiological features of its action potential, the criteria for classification were typically subjective and differed across studies. In this paper, we use techniques from signal processing and machine learning to develop an automated approach to discriminate the electrophysiological differences between hESC-CMs. Specifically, we propose a spectral grouping-based algorithm to separate a population of cardiomyocytes into distinct groups based on the similarity of their action potential shapes. We applied this method to a dataset of optical maps of cardiac cell clusters dissected from human embryoid bodies (hEBs). While some of the 9 cell clusters in the dataset presented with just one phenotype, the majority of the cell clusters presented with multiple phenotypes. The proposed algorithm is generally applicable to other action potential datasets and could prove useful in investigating the purification of specific types of cardiomyocytes from an electrophysiological perspective. PMID:25148658

  19. Igloo-Plot: a tool for visualization of multidimensional datasets.

    PubMed

    Kuntal, Bhusan K; Ghosh, Tarini Shankar; Mande, Sharmila S

    2014-01-01

    Advances in science and technology have resulted in an exponential growth of multivariate (or multidimensional) datasets, generated across research areas and especially in the biological sciences. Visualization and analysis of such data, with the objective of uncovering the hidden patterns therein, is an important and challenging task. We present a tool, called Igloo-Plot, for efficient visualization of multidimensional datasets. The tool addresses some of the key limitations of contemporary multivariate visualization and analysis tools. The visualization layout facilitates easy identification not only of clusters of data points having similar feature compositions, but also of the 'marker features' specific to each of these clusters. The applicability of the various functionalities implemented herein is demonstrated using several well-studied multidimensional datasets. Igloo-Plot is expected to be a valuable resource for researchers working in multivariate data mining studies. Igloo-Plot is available for download from: http://metagenomics.atc.tcs.com/IglooPlot/. Copyright © 2014 Elsevier Inc. All rights reserved.

  20. Comparing Methods to Assess Intraobserver Measurement Error of 3D Craniofacial Landmarks Using Geometric Morphometrics Through a Digitizer Arm.

    PubMed

    Menéndez, Lumila Paula

    2017-05-01

Intraobserver error (INTRA-OE) is the difference between repeated measurements of the same variable made by the same observer. The objective of this work was to evaluate INTRA-OE from 3D landmarks registered with a Microscribe in three different datasets: (A) the 3D coordinates, (B) linear measurements calculated from A, and (C) the first six principal component axes. INTRA-OE was analyzed by digitizing 42 landmarks from 23 skulls in three sessions, two weeks apart from each other. Systematic error was tested through repeated-measures ANOVA (ANOVA-RM), while random error was tested through the intraclass correlation coefficient. Results showed that the largest differences between the three observations were found in the first dataset. Some anatomical points, such as nasion, ectoconchion, temporosphenoparietal, asterion, and temporomandibular, presented the highest INTRA-OE. In the second dataset, local distances had higher INTRA-OE than global distances, while the third dataset showed the lowest INTRA-OE. © 2016 American Academy of Forensic Sciences.

  1. Collaboration-Centred Cities through Urban Apps Based on Open and User-Generated Data

    PubMed Central

    Aguilera, Unai; López-de-Ipiña, Diego; Pérez, Jorge

    2016-01-01

This paper describes the IES Cities platform conceived to streamline the development of urban apps that combine heterogeneous datasets provided by diverse entities, namely, government, citizens, sensor infrastructure and other information data sources. This work pursues the challenge of achieving effective citizen collaboration by empowering citizens to prosume urban data across time. Particularly, this paper focuses on the query mapper, a key component of the IES Cities platform devised to democratize the development of open-data-based mobile urban apps. This component allows developers not only to use available data, but also to contribute to existing datasets with the execution of SQL sentences. In addition, the component allows developers to create ad hoc storages for their applications, publishable as new datasets accessible by other consumers. As multiple users could be contributing to and using a dataset, our solution also provides a data-level permission mechanism to control how the platform manages access to its datasets. We have evaluated the advantages brought forward by IES Cities from the developers’ perspective by describing an exemplary urban app created on top of it. In addition, we include an evaluation of the main functionalities of the query mapper. PMID:27376300

  2. Collaboration-Centred Cities through Urban Apps Based on Open and User-Generated Data.

    PubMed

    Aguilera, Unai; López-de-Ipiña, Diego; Pérez, Jorge

    2016-07-01

This paper describes the IES Cities platform conceived to streamline the development of urban apps that combine heterogeneous datasets provided by diverse entities, namely, government, citizens, sensor infrastructure and other information data sources. This work pursues the challenge of achieving effective citizen collaboration by empowering citizens to prosume urban data across time. Particularly, this paper focuses on the query mapper, a key component of the IES Cities platform devised to democratize the development of open-data-based mobile urban apps. This component allows developers not only to use available data, but also to contribute to existing datasets with the execution of SQL sentences. In addition, the component allows developers to create ad hoc storages for their applications, publishable as new datasets accessible by other consumers. As multiple users could be contributing to and using a dataset, our solution also provides a data-level permission mechanism to control how the platform manages access to its datasets. We have evaluated the advantages brought forward by IES Cities from the developers' perspective by describing an exemplary urban app created on top of it. In addition, we include an evaluation of the main functionalities of the query mapper.

  3. Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation.

    PubMed

    Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi

    2015-01-01

Most popular clustering methods make strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions may no longer be valid. To overcome this weakness, we propose a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density-based isolation criteria, our proposed centroid distance isolation criterion addresses the problems caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method in separating naturally isolated clusters but can also identify clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records containing demographic and behavioral information. The results show that the LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it.
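The k-means assumption the abstract criticizes can be made concrete with a small numeric check: for two elongated, parallel clusters, the within-cluster sum-of-squares objective that k-means minimizes actually prefers a cut across both clusters over the natural grouping. This sketch illustrates only the motivation, not the LASS algorithm itself; the toy points are invented for the demonstration:

```python
import numpy as np

def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster centroid
    (the objective that k-means minimizes)."""
    total = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        total += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return total

# Two elongated, parallel clusters: points along y=0 and y=2.
x = np.arange(10.0)
pts = np.vstack([np.column_stack([x, np.zeros(10)]),
                 np.column_stack([x, np.full(10, 2.0)])])

by_line = np.repeat([0, 1], 10)          # the "natural" clusters
by_half = (pts[:, 0] >= 5).astype(int)   # a left/right cut across both lines

# wcss(pts, by_line) = 165.0 but wcss(pts, by_half) = 60.0, so the
# k-means objective favours the unnatural left/right cut.
```

Because the spherical-covariance bias is baked into the objective, no initialization strategy rescues k-means here, which is the kind of failure a non-spherical isolation criterion targets.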

  4. Classification of foods by transferring knowledge from ImageNet dataset

    NASA Astrophysics Data System (ADS)

    Heravi, Elnaz J.; Aghdam, Hamed H.; Puig, Domenec

    2017-03-01

Automatic classification of foods is a way to control food intake and tackle obesity. However, it is a challenging problem since foods are highly deformable and complex objects. Results on the ImageNet dataset have revealed that Convolutional Neural Networks (ConvNets) have great expressive power for modelling natural objects. Nonetheless, it is not trivial to train a ConvNet from scratch for classification of foods. This is due to the fact that ConvNets require large datasets and, to our knowledge, there is no large public dataset of food for this purpose. An alternative solution is to transfer knowledge from trained ConvNets to the domain of foods. In this work, we study how transferable state-of-the-art ConvNets are to the task of food classification. We also propose a method for transferring knowledge from a bigger ConvNet to a smaller ConvNet while keeping its accuracy similar to that of the bigger ConvNet. Our experiments on the UECFood256 dataset show that GoogLeNet, VGG and residual networks produce comparable results if we start transferring knowledge from an appropriate layer. In addition, we show that our method is able to effectively transfer knowledge to the smaller ConvNet using unlabeled samples.
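The big-to-small transfer the abstract proposes is in the spirit of knowledge distillation, where the small network is trained to match the large network's temperature-softened outputs. The sketch below shows the generic distillation loss, not the paper's exact method; the temperature value and toy logits are assumptions:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """Mean KL(teacher_soft || student_soft) across examples."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

teacher = np.array([[5.0, 1.0, 0.0]])   # big network's logits for one image
matched = np.array([[4.8, 1.1, 0.1]])   # student that tracks the teacher
mismatched = np.array([[0.0, 5.0, 1.0]])

# A student whose softened outputs track the teacher's incurs a lower loss,
# and unlabeled images suffice because the teacher supplies the targets.
```

This also matches the abstract's point about unlabeled samples: the loss uses only the teacher's outputs, never ground-truth labels.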

  5. Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

    PubMed Central

    Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi

    2015-01-01

Most popular clustering methods make strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions may no longer be valid. To overcome this weakness, we propose a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density-based isolation criteria, our proposed centroid distance isolation criterion addresses the problems caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method in separating naturally isolated clusters but can also identify clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records containing demographic and behavioral information. The results show that the LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133

  6. Evolution of Rhizaria: new insights from phylogenomic analysis of uncultivated protists.

    PubMed

    Burki, Fabien; Kudryavtsev, Alexander; Matz, Mikhail V; Aglyamova, Galina V; Bulman, Simon; Fiers, Mark; Keeling, Patrick J; Pawlowski, Jan

    2010-12-02

Recent phylogenomic analyses have revolutionized our view of eukaryote evolution by revealing unexpected relationships between and within the eukaryotic supergroups. However, for several groups of uncultivable protists, only the ribosomal RNA genes and a handful of proteins are available, often leading to unresolved evolutionary relationships. A striking example concerns the supergroup Rhizaria, which comprises several groups of uncultivable free-living protists such as radiolarians, foraminiferans and gromiids, as well as the parasitic plasmodiophorids and haplosporids. Thus far, the relationships within this supergroup have been inferred almost exclusively from rRNA, actin, and polyubiquitin genes, and remain poorly resolved. To address this, we have generated large Expressed Sequence Tag (EST) datasets for 5 species of Rhizaria belonging to 3 important groups: Acantharea (Astrolonche sp., Phyllostaurus sp.), Phytomyxea (Spongospora subterranea, Plasmodiophora brassicae) and Gromiida (Gromia sphaerica). 167 genes were selected for phylogenetic analyses based on the representation of at least one rhizarian species for each gene. Concatenation of these genes produced a supermatrix composed of 36,735 amino acid positions, including 10 rhizarians, 9 stramenopiles, and 9 alveolates. Phylogenomic analyses of this large dataset revealed a strongly supported clade grouping Foraminifera and Acantharea. The position of this clade within Rhizaria was sensitive to the method employed and the taxon sampling: Maximum Likelihood (ML) and Bayesian analyses using an empirical model of evolution favoured an early divergence, whereas the CAT model and ML analyses with fast-evolving sites or the foraminiferan species Reticulomyxa filosa removed suggested a derived position, closely related to Gromia and Phytomyxea. In contrast to what has been previously reported, our analyses also uncovered the presence of the rhizarian-specific polyubiquitin insertion in Acantharea. Finally, this work reveals another possible rhizarian signature in the 60S ribosomal protein L10a. Our study provides new insights into the evolution of Rhizaria based on phylogenomic analyses of ESTs from three groups of previously under-sampled protists. It was enabled through the application of a recently developed method of transcriptome analysis requiring a very small amount of starting material. Our study illustrates the potential of this method to elucidate the early evolution of eukaryotes by providing large amounts of data for uncultivable free-living and parasitic protists.

  7. Astrobites as a Pedagogical Tool in Classrooms

    NASA Astrophysics Data System (ADS)

    Khullar, Gourav; Tsang, Benny Tsz Ho; Sanders, Nathan; Kohler, Susanna; Shipp, Nora; Astrobites Collaboration

    2018-06-01

Astrobites is a graduate-student organization that publishes an online astrophysical literature blog (astrobites.org) and has published brief, accessible summaries of more than 1600 articles from the astrophysical literature since its founding in 2010. Our graduate-student-generated content is being widely used as a pedagogical tool to bring current research into higher-education classrooms. We aim to study the effectiveness of Astrobites in the teaching of current research via an AAS Education & Professional Development Mini-Grant funded in Fall 2017. This talk gives an overview of the functioning of Astrobites and our past pedagogical initiatives, as well as a brief description of the grant proposal. We describe the workings of our teaching workshop at the 231st AAS Meeting in January 2018, as well as a 10-educator focus group assembled to conduct a post-workshop follow-up that serves as a dataset for our research study. We present here a brief analysis of the workshop, the focus group, and preliminary inferences.

  8. New insights about host response to smallpox using microarray data.

    PubMed

    Esteves, Gustavo H; Simoes, Ana C Q; Souza, Estevao; Dias, Rodrigo A; Ospina, Raydonal; Venancio, Thiago M

    2007-08-24

Smallpox is a lethal disease that was endemic in many parts of the world until eradicated by massive immunization. Due to its lethality, there are serious concerns about its use as a bioweapon. Here we analyze publicly available microarray data to further understand survival of smallpox-infected macaques, using systems biology approaches. Our goal is to improve the knowledge about the progression of this disease. We used KEGG pathway annotations to define groups of genes (or modules) and subsequently compared them to macaque survival times. This technique provided additional insights about the host response to this disease, such as increased expression of cytokines and ECM receptors in the individuals with higher survival times. These results may indicate that these gene groups influence an effective response from the host to smallpox. Macaques with higher survival times clearly express some specific pathways previously unidentified using regular gene-by-gene approaches. Our work also shows how third-party analysis of public datasets can be important in supporting new hypotheses for relevant biological problems.

  9. CLustre: semi-automated lineament clustering for palaeo-glacial reconstruction

    NASA Astrophysics Data System (ADS)

    Smith, Mike; Anders, Niels; Keesstra, Saskia

    2016-04-01

Palaeo-glacial reconstructions, or "inversions", using evidence from the palimpsest landscape are increasingly being undertaken with larger and larger databases. Predominant in landform evidence is the lineament (or drumlin), where the biggest datasets number in excess of 50,000 individual forms. One stage in the inversion process requires the identification of lineaments that are generically similar and their subsequent interpretation into a coherent chronology of events. Here we present CLustre, a semi-automated algorithm that clusters lineaments using a locally adaptive, region-growing method. This is initially tested using 1,500 model runs on a synthetic dataset, before application to two case studies (where manual clustering has been undertaken by independent researchers): (1) Dubawnt Lake, Canada and (2) Victoria Island, Canada. Results using the synthetic data show that classifications are robust in most scenarios, although specific cases of cross-cutting lineaments may lead to incorrect clusters. Application to the case studies showed a very good match to existing published work, with differences related to limited numbers of unclassified lineaments and parallel cross-cutting lineaments. The value of CLustre comes from the semi-automated, objective application of a classification method that is repeatable. Once classified, summary statistics of lineament groups can be calculated and then used in the inversion.
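CLustre's exact rules are not given in the abstract, but a region-growing classification of lineaments can be sketched generically: grow a cluster from a seed lineament by absorbing nearby lineaments whose azimuths agree within a tolerance. The radius and tolerance parameters, and the toy coordinates, are assumptions for illustration, not the published algorithm:

```python
import numpy as np
from collections import deque

def grow_regions(xy, azimuth_deg, radius=1.5, tol_deg=15.0):
    """Greedy region growing: neighbours within `radius` whose azimuth
    differs by less than `tol_deg` join the seed's cluster."""
    n = len(xy)
    labels = np.full(n, -1)
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if labels[j] != -1:
                    continue
                near = np.linalg.norm(xy[i] - xy[j]) <= radius
                # Lineament orientations are axial: 178 and 0 degrees
                # are only 2 degrees apart.
                diff = abs(azimuth_deg[i] - azimuth_deg[j]) % 180
                similar = min(diff, 180 - diff) <= tol_deg
                if near and similar:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels

# Two flow sets: an E-W swarm and a cross-cutting N-S swarm.
xy = np.array([[0, 0], [1, 0], [2, 0],
               [0.5, 0.5], [1.5, 0.5], [2.5, 0.5]], float)
az = np.array([88.0, 90.0, 92.0, 2.0, 0.0, 178.0])
labels = grow_regions(xy, az)
```

Because growth is local, two cross-cutting swarms that interpenetrate spatially still separate by orientation, which is the behaviour the case studies describe.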

  10. EarthChem: International Collaboration for Solid Earth Geochemistry in Geoinformatics

    NASA Astrophysics Data System (ADS)

    Walker, J. D.; Lehnert, K. A.; Hofmann, A. W.; Sarbas, B.; Carlson, R. W.

    2005-12-01

The current on-line information systems for igneous rock geochemistry - PetDB, GEOROC, and NAVDAT - convincingly demonstrate the value of rigorous scientific data management of geochemical data for research and education. The next generation of hypothesis formulation and testing can be vastly facilitated by enhancing these electronic resources through integration of available datasets, expansion of data coverage in location, time, and tectonic setting, timely updates with new data, and through intuitive and efficient access and data analysis tools for the broader geosciences community. PetDB, GEOROC, and NAVDAT have therefore formed the EarthChem consortium (www.earthchem.org) as an international collaborative effort to address these needs and serve the larger earth science community by facilitating the compilation, communication, serving, and visualization of geochemical data, and their integration with other geological, geochronological, geophysical, and geodetic information to maximize their scientific application. We report on the status of and future plans for EarthChem activities. EarthChem's development plan includes: (1) expanding the functionality of the web portal to become a 'one-stop shop for geochemical data' with search capability across databases, standardized and integrated data output, generally applicable tools for data quality assessment, and data analysis/visualization including plotting methods and an information-rich map interface; and (2) expanding data holdings by generating new datasets as identified and prioritized through community outreach, and facilitating data contributions from the community by offering web-based data submission capability and technical assistance for design, implementation, and population of new databases and their integration with all EarthChem data holdings. Such federated databases and datasets will retain their identity within the EarthChem system. 
We also plan on working with publishers to ease the assimilation of geochemical data into the EarthChem database. As a community resource, EarthChem will address user concerns and respond to broad scientific and educational needs. EarthChem will hold yearly workshops, town hall meetings, and/or exhibits at major meetings. The group has established a two-tier committee structure to help ease the communication and coordination of database and IT issues between existing data management projects, and to receive feedback and support from individuals and groups from the larger geosciences community.

  11. EST analysis in Ginkgo biloba: an assessment of conserved developmental regulators and gymnosperm specific genes

    PubMed Central

    Brenner, Eric D; Katari, Manpreet S; Stevenson, Dennis W; Rudd, Stephen A; Douglas, Andrew W; Moss, Walter N; Twigg, Richard W; Runko, Suzan J; Stellari, Giulia M; McCombie, WR; Coruzzi, Gloria M

    2005-01-01

    Background Ginkgo biloba L. is the only surviving member of one of the oldest living seed plant groups, with medicinal, spiritual and horticultural importance worldwide. As an evolutionary relic, it displays many characters found in the early, extinct seed plants and extant cycads. To establish a molecular base to understand the evolution of seeds and pollen, we created a cDNA library and EST dataset from the reproductive structures of male (microsporangiate), female (megasporangiate), and vegetative organs (leaves) of Ginkgo biloba. Results RNA from newly emerged male and female reproductive organs and immature leaves was used to create three distinct cDNA libraries from which 6,434 ESTs were generated. These 6,434 ESTs from Ginkgo biloba were clustered into 3,830 unigenes. A comparison of our Ginkgo unigene set against the fully annotated genomes of rice and Arabidopsis, and all available ESTs in GenBank, revealed that 256 Ginkgo unigenes match only genes among the gymnosperms and non-seed plants – many with multiple matches to genes in non-angiosperm plants. Conversely, another group of unigenes in Ginkgo had highly significant homology to transcription factors in angiosperms involved in development, including MADS box genes as well as post-transcriptional regulators. Several of the conserved developmental genes found in Ginkgo had top BLAST homology to cycad genes. We also note here the presence of ESTs in G. biloba similar to genes that to date have only been found in gymnosperms, and an additional 22 Ginkgo genes common only to genes from cycads. Conclusion Our analysis of an EST dataset from G. biloba revealed genes potentially unique to gymnosperms. Many of these genes showed homology to fully sequenced clones from our cycad EST dataset found in common only with gymnosperms. Other Ginkgo ESTs are similar to developmental regulators in higher plants. 
This work sets the stage for future studies on Ginkgo to better understand seed and pollen evolution, and to resolve the ambiguous phylogenetic relationship of G. biloba among the gymnosperms. PMID:16225698

  12. EnviroAtlas - Pittsburgh, PA - Domestic Water Use per Day by U.S. Census Block Group

    EPA Pesticide Factsheets

As included in this EnviroAtlas dataset, the community level domestic water use was calculated using locally available water use data per capita in gallons of water per day (GPD), distributed dasymetrically, and summarized by census block group. Domestic water use, as defined in this case, is intended to represent residential indoor and outdoor water use (e.g., cooking, hygiene, landscaping, pools, etc.) for primary residences (i.e., excluding second homes and tourism rentals). For the purposes of this metric, these publicly-supplied estimates are also applied and considered representative of local self-supplied water use. Domestic water demand was calculated and applied using the Pennsylvania Department of Environmental Protection (PADEP) PWS Service Areas layer, population served per provider, and average water use per provider. Within the EnviroAtlas study area, there are 43 service providers with 2010-2013 estimates ranging from 34 to 102 GPD. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  13. EnviroAtlas - Phoenix, AZ - Domestic Water Demand per Day by U.S. Census Block Group

    EPA Pesticide Factsheets

As included in this EnviroAtlas dataset, community level domestic water demand is calculated using locally available water use data per capita in gallons of water per day (GPD), distributed dasymetrically, and summarized by census block group. Domestic water use, as defined in this case, is intended to represent residential indoor and outdoor water use (e.g., cooking, hygiene, landscaping, pools, etc.) for primary residences (i.e., excluding second homes and tourism rentals). For the purposes of this metric, these publicly-supplied estimates are also applied and considered representative of local self-supplied water use. Within the EnviroAtlas Phoenix boundary, there are 53 service providers with 2000-2009 water use estimates ranging from 108 to 366 GPD. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  14. Breaking barriers and halting rupture: the 2016 Amatrice-Visso-Castelluccio earthquake sequence, central Italy

    NASA Astrophysics Data System (ADS)

    Gregory, L. C.; Walters, R. J.; Wedmore, L. N. J.; Craig, T. J.; McCaffrey, K. J. W.; Wilkinson, M. W.; Livio, F.; Michetti, A.; Goodall, H.; Li, Z.; Chen, J.; De Martini, P. M.

    2017-12-01

In 2016 the Central Italian Apennines were struck by a sequence of normal faulting earthquakes that ruptured in three separate events: on 24th August (Mw 6.2), 26th October (Mw 6.1), and 30th October (Mw 6.6). We reveal the complex nature of the individual events and the time-evolution of the sequence using multiple datasets. We will present an overview of the results from field geology, satellite geodesy, GNSS (including low-cost short-baseline installations), and terrestrial laser scanning (TLS). Sequences of mid-to-high magnitude 6 earthquakes are common in historical and seismological records in Italy and other similar tectonic settings globally. Multi-fault rupture during these sequences can occur in seconds, as in the M 6.9 1980 Irpinia earthquake, or can span days, months, or years (e.g. the 1703 Norcia-L'Aquila sequence). It is critical to determine why the causative faults in the 2016 sequence did not rupture simultaneously, and how this relates to fault segmentation and structural barriers. This is the first sequence of this kind to be observed using modern geodetic techniques, and only with all of the datasets combined can we begin to understand how and why the sequence evolved in time and space. We show that earthquake rupture broke through structural barriers that were thought to exist, but was also inhibited by a previously unknown structure. We will also discuss the logistical challenges in generating datasets on the time-evolving sequence, and show how rapid response and international collaboration within the Open EMERGEO Working Group was critical for gaining a complete picture of the ongoing activity.

  15. Measuring Quality of Healthcare Outcomes in Type 2 Diabetes from Routine Data: a Seven-nation Survey Conducted by the IMIA Primary Health Care Working Group.

    PubMed

    Hinton, W; Liyanage, H; McGovern, A; Liaw, S-T; Kuziemsky, C; Munro, N; de Lusignan, S

    2017-08-01

Background: The Institute of Medicine framework defines six dimensions of quality for healthcare systems: (1) safety, (2) effectiveness, (3) patient centeredness, (4) timeliness of care, (5) efficiency, and (6) equity. Large health datasets provide an opportunity to assess quality in these areas. Objective: To perform an international comparison of the measurability of the delivery of these aims, in people with type 2 diabetes mellitus (T2DM), from large datasets. Method: We conducted a survey to assess the healthcare outcomes data quality of existing databases and disseminated it through professional networks. We examined the data sources used to collect the data, the frequency of data uploads, and the data types used for identifying people with T2DM. We compared data completeness across the six areas of healthcare quality, using selected measures pertinent to T2DM management. Results: We received 14 responses from seven countries (Australia, Canada, Italy, the Netherlands, Norway, Portugal, Turkey and the UK). Most databases reported frequent data uploads and would be capable of near-real-time analysis of healthcare quality. The majority of recorded data related to safety (particularly medication adverse events) and treatment efficacy (glycaemic control and microvascular disease). Data potentially measuring equity were less well recorded. Recording levels were lowest for patient-centred care, timeliness of care, and system efficiency, with the majority of databases containing no data in these areas. Databases using primary care sources had higher data quality across all areas measured. Conclusion: Data quality could be improved, particularly in the areas of patient-centred care, timeliness, and efficiency. Primary care derived datasets may be most suited to healthcare quality assessment. Georg Thieme Verlag KG Stuttgart.

  16. Comparing performance of multinomial logistic regression and discriminant analysis for monitoring access to care for acute myocardial infarction.

    PubMed

    Hossain, Monir; Wright, Steven; Petersen, Laura A

    2002-04-01

    One way to monitor patient access to emergent health care services is to use patient characteristics to predict arrival time at the hospital after onset of symptoms. This predicted arrival time can then be compared with actual arrival time to allow monitoring of access to services. Predicted arrival time could also be used to estimate potential effects of changes in health care service availability, such as closure of an emergency department or an acute care hospital. Our goal was to determine the best statistical method for prediction of arrival intervals for patients with acute myocardial infarction (AMI) symptoms. We compared the performance of multinomial logistic regression (MLR) and discriminant analysis (DA) models. Models for MLR and DA were developed using a dataset of 3,566 male veterans hospitalized with AMI in 81 VA Medical Centers in 1994-1995 throughout the United States. The dataset was randomly divided into a training set (n = 1,846) and a test set (n = 1,720). Arrival times were grouped into three intervals on the basis of treatment considerations: <6 hours, 6-12 hours, and >12 hours. One model for MLR and two models for DA were developed using the training dataset. One DA model had equal prior probabilities, and one DA model had proportional prior probabilities. Predictive performance of the models was compared using the test (n = 1,720) dataset. Using the test dataset, the proportions of patients in the three arrival time groups were 60.9% for <6 hours, 10.3% for 6-12 hours, and 28.8% for >12 hours after symptom onset. Whereas the overall predictive performance by MLR and DA with proportional priors was higher, the DA models with equal priors performed much better in the smaller groups. Correct classifications were 62.6% by MLR, 62.4% by DA using proportional prior probabilities, and 48.1% using equal prior probabilities of the groups. 
The misclassifications by MLR for the three groups were 9.5%, 100.0%, and 74.2% for each time interval, respectively. Misclassifications by the DA models were 9.8%, 100.0%, and 74.4% for the model with proportional priors, and 47.6%, 79.5%, and 51.0% for the model with equal priors. The choice of MLR or DA with proportional priors, or DA with equal priors, for monitoring time intervals of predicted hospital arrival time for a population should depend on the consequences of misclassification errors.
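The effect of prior choice reported in this record can be illustrated with a one-dimensional Gaussian discriminant toy (the means, variance, and group proportions below are invented for illustration, not the VA study's values): proportional priors shift the decision boundary toward the small group, so borderline members of that group are absorbed into the majority class, while equal priors reclaim them at the cost of majority-class accuracy:

```python
import numpy as np

def da_predict(x, means, var, priors):
    """1-D Gaussian discriminant analysis with a shared variance:
    pick the class maximizing log prior + log Gaussian likelihood."""
    scores = [np.log(p) - (x - m) ** 2 / (2 * var)
              for m, p in zip(means, priors)]
    return int(np.argmax(scores))

means, var = [0.0, 2.0], 1.0      # class-conditional means, pooled variance
prop_priors = [0.9, 0.1]          # proportional to group sizes (90 vs 10)
equal_priors = [0.5, 0.5]

# A borderline arrival time drawn from the small (late-arrival) group:
x = 1.5
da_predict(x, means, var, prop_priors)   # the majority class wins
da_predict(x, means, var, equal_priors)  # the minority class is recovered
```

With shared variance the boundary sits at the midpoint of the means plus ln(prior ratio)/2 units, which is exactly the trade-off the abstract describes between overall accuracy and small-group accuracy.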

  17. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets.

    PubMed

    Heath, Allison P; Greenway, Matthew; Powell, Raymond; Spring, Jonathan; Suarez, Rafael; Hanley, David; Bandlamudi, Chai; McNerney, Megan E; White, Kevin P; Grossman, Robert L

    2014-01-01

    As large genomics and phenotypic datasets are becoming more common, it is increasingly difficult for most researchers to access, manage, and analyze them. One possible approach is to provide the research community with several petabyte-scale cloud-based computing platforms containing these data, along with tools and resources to analyze it. Bionimbus is an open source cloud-computing platform that is based primarily upon OpenStack, which manages on-demand virtual machines that provide the required computational resources, and GlusterFS, which is a high-performance clustered file system. Bionimbus also includes Tukey, which is a portal, and associated middleware that provides a single entry point and a single sign on for the various Bionimbus resources; and Yates, which automates the installation, configuration, and maintenance of the software infrastructure required. Bionimbus is used by a variety of projects to process genomics and phenotypic data. For example, it is used by an acute myeloid leukemia resequencing project at the University of Chicago. The project requires several computational pipelines, including pipelines for quality control, alignment, variant calling, and annotation. For each sample, the alignment step requires eight CPUs for about 12 h. BAM file sizes ranged from 5 GB to 10 GB for each sample. Most members of the research community have difficulty downloading large genomics datasets and obtaining sufficient storage and computer resources to manage and analyze the data. Cloud computing platforms, such as Bionimbus, with data commons that contain large genomics datasets, are one choice for broadening access to research data in genomics. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.

  18. Accurate and scalable social recommendation using mixed-membership stochastic block models.

    PubMed

    Godoy-Lorite, Antonia; Guimerà, Roger; Moore, Cristopher; Sales-Pardo, Marta

    2016-12-13

    With increasing amounts of information available, modeling and predicting user preferences-for books or articles, for example-are becoming more important. We present a collaborative filtering model, with an associated scalable algorithm, that makes accurate predictions of users' ratings. Like previous approaches, we assume that there are groups of users and of items and that the rating a user gives an item is determined by their respective group memberships. However, we allow each user and each item to belong simultaneously to mixtures of different groups and, unlike many popular approaches such as matrix factorization, we do not assume that users in each group prefer a single group of items. In particular, we do not assume that ratings depend linearly on a measure of similarity, but allow probability distributions of ratings to depend freely on the user's and item's groups. The resulting overlapping groups and predicted ratings can be inferred with an expectation-maximization algorithm whose running time scales linearly with the number of observed ratings. Our approach enables us to predict user preferences in large datasets and is considerably more accurate than the current algorithms for such large datasets.
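
The model described above can be sketched concretely: each user and each item gets a membership vector over latent groups, each pair of groups gets a free rating distribution, and an EM loop whose cost is linear in the number of observed ratings fits all three. The NumPy toy below is our illustration of that idea, not the authors' code; the group counts, initialization, and lack of a convergence check are simplifications, and it assumes every user and item appears in the ratings.

```python
import numpy as np

def mmsbm_em(ratings, n_users, n_items, K=2, L=2, R=5, n_iter=50, seed=0):
    """Toy EM for a mixed-membership block model of ratings.

    ratings: rows of (user, item, rating), rating in 0..R-1.
    theta[u] / eta[i] are mixed group memberships; p[k, l] is the free
    rating distribution for user-group k meeting item-group l.
    Assumes every user and every item appears at least once.
    """
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.ones(K), size=n_users)
    eta = rng.dirichlet(np.ones(L), size=n_items)
    p = rng.dirichlet(np.ones(R), size=(K, L))
    u, i, r = np.asarray(ratings).T.astype(int)
    for _ in range(n_iter):
        # E-step: responsibility of each (k, l) group pair for each rating
        w = theta[u][:, :, None] * eta[i][:, None, :] * p[:, :, r].transpose(2, 0, 1)
        w /= w.sum(axis=(1, 2), keepdims=True)
        # M-step: re-estimate memberships and rating distributions
        theta_new = np.zeros_like(theta)
        eta_new = np.zeros_like(eta)
        np.add.at(theta_new, u, w.sum(axis=2))
        np.add.at(eta_new, i, w.sum(axis=1))
        theta = theta_new / theta_new.sum(axis=1, keepdims=True)
        eta = eta_new / eta_new.sum(axis=1, keepdims=True)
        p_new = np.zeros((K, L, R))
        for rv in range(R):
            p_new[:, :, rv] = w[r == rv].sum(axis=0)
        p = p_new / p_new.sum(axis=2, keepdims=True)
    return theta, eta, p

def predict(theta, eta, p, user, item):
    """Predicted rating distribution for one (user, item) pair."""
    return np.einsum('k,l,klr->r', theta[user], eta[item], p)
```

Note that, unlike matrix factorization, nothing here forces ratings to depend linearly on a similarity score: p[k, l] is an arbitrary distribution over rating values.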

  19. Accurate and scalable social recommendation using mixed-membership stochastic block models

    PubMed Central

    Godoy-Lorite, Antonia; Moore, Cristopher

    2016-01-01

    With increasing amounts of information available, modeling and predicting user preferences—for books or articles, for example—are becoming more important. We present a collaborative filtering model, with an associated scalable algorithm, that makes accurate predictions of users’ ratings. Like previous approaches, we assume that there are groups of users and of items and that the rating a user gives an item is determined by their respective group memberships. However, we allow each user and each item to belong simultaneously to mixtures of different groups and, unlike many popular approaches such as matrix factorization, we do not assume that users in each group prefer a single group of items. In particular, we do not assume that ratings depend linearly on a measure of similarity, but allow probability distributions of ratings to depend freely on the user’s and item’s groups. The resulting overlapping groups and predicted ratings can be inferred with an expectation-maximization algorithm whose running time scales linearly with the number of observed ratings. Our approach enables us to predict user preferences in large datasets and is considerably more accurate than the current algorithms for such large datasets. PMID:27911773

  20. Data Shared Lasso: A Novel Tool to Discover Uplift.

    PubMed

    Gross, Samuel M; Tibshirani, Robert

    2016-09-01

    A model is presented for the supervised learning problem where the observations come from a fixed number of pre-specified groups, and the regression coefficients may vary sparsely between groups. The model spans the continuum between individual models for each group and one model for all groups. The resulting algorithm is designed with a high dimensional framework in mind. The approach is applied to a sentiment analysis dataset to show its efficacy and interpretability. One particularly useful application is for finding sub-populations in a randomized trial for which an intervention (treatment) is beneficial, often called the uplift problem. Some new concepts are introduced that are useful for uplift analysis. The value is demonstrated in an application to a real world credit card promotion dataset. In this example, although sending the promotion has a very small average effect, by targeting a particular subgroup with the promotion one can obtain a 15% increase in the proportion of people who purchase the new credit card.
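
The continuum between one pooled model and per-group models can be realized by augmenting the design matrix with one zero-padded copy of X per group and fitting a single Lasso, so that a shared coefficient block coexists with sparse group-specific deviations. The scikit-learn sketch below is our illustration of that construction, not the authors' code; the toy data, the alpha value, and the omission of the paper's per-group penalty weighting are all simplifications.

```python
import numpy as np
from sklearn.linear_model import Lasso

def shared_design(X, groups, n_groups):
    """Stack [X | X*1{g=0} | ... | X*1{g=G-1}] so that one Lasso fit
    learns a shared coefficient vector plus sparse per-group deviations."""
    blocks = [X]
    for g in range(n_groups):
        Xg = np.zeros_like(X)
        Xg[groups == g] = X[groups == g]
        blocks.append(Xg)
    return np.hstack(blocks)

# toy data: feature 0 has a shared effect, feature 1 acts only in group 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
groups = np.repeat([0, 1], 100)
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] * (groups == 1) + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.05).fit(shared_design(X, groups, 2), y)
coef = model.coef_.reshape(3, 2)  # rows: shared, group-0 dev, group-1 dev
```

Because the L1 penalty prefers the sparsest split, an effect common to all groups lands in the shared block, while a group-only effect lands in that group's deviation block, which is exactly the property that makes the fit interpretable for uplift-style questions.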

  1. Data Shared Lasso: A Novel Tool to Discover Uplift

    PubMed Central

    Gross, Samuel M.; Tibshirani, Robert

    2017-01-01

    A model is presented for the supervised learning problem where the observations come from a fixed number of pre-specified groups, and the regression coefficients may vary sparsely between groups. The model spans the continuum between individual models for each group and one model for all groups. The resulting algorithm is designed with a high dimensional framework in mind. The approach is applied to a sentiment analysis dataset to show its efficacy and interpretability. One particularly useful application is for finding sub-populations in a randomized trial for which an intervention (treatment) is beneficial, often called the uplift problem. Some new concepts are introduced that are useful for uplift analysis. The value is demonstrated in an application to a real world credit card promotion dataset. In this example, although sending the promotion has a very small average effect, by targeting a particular subgroup with the promotion one can obtain a 15% increase in the proportion of people who purchase the new credit card. PMID:29056802

  2. A web Accessible Framework for Discovery, Visualization and Dissemination of Polar Data

    NASA Astrophysics Data System (ADS)

    Kirsch, P. J.; Breen, P.; Barnes, T. D.

    2007-12-01

    A web-accessible information framework, currently under development within the Physical Sciences Division of the British Antarctic Survey, is described. The datasets accessed are generally heterogeneous in nature, from fields including space physics, meteorology, atmospheric chemistry, ice physics, and oceanography. Many of these are returned in near real time over a 24/7 limited-bandwidth link from remote Antarctic stations and ships. The requirement is to provide various user groups - each with disparate interests and demands - with a system incorporating a browsable and searchable catalogue, bespoke data summary visualization, metadata access facilities, and download utilities. The system allows timely access to raw and processed datasets through an easily navigable discovery interface. Once discovered, a summary of the dataset can be visualized in a manner prescribed by the particular projects and user communities, or the dataset may be downloaded, subject to any accessibility restrictions that exist. In addition, access to related ancillary information - including software, documentation, related URLs, and information concerning non-electronic media (of particular relevance to some legacy datasets) - is made directly available, having automatically been associated with a dataset during the discovery phase. Major components of the framework include the relational database containing the catalogue; the organizational structure of the systems holding the data, which enables automatic updates of the system catalogue and real-time access to data; the user interface design; and the administrative and data management scripts allowing straightforward incorporation of utilities, datasets, and system maintenance.

  3. A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis.

    PubMed

    Kim, Seongsoon; Park, Donghyeon; Choi, Yonghwa; Lee, Kyubum; Kim, Byounggun; Jeon, Minji; Kim, Jihye; Tan, Aik Choon; Kang, Jaewoo

    2018-01-05

    With the development of artificial intelligence (AI) technology centered on deep-learning, the computer has evolved to a point where it can read a given text and answer a question based on the context of the text. Such a specific task is known as the task of machine comprehension. Existing machine comprehension tasks mostly use datasets of general texts, such as news articles or elementary school-level storybooks. However, no attempt has been made to determine whether an up-to-date deep learning-based machine comprehension model can also process scientific literature containing expert-level knowledge, especially in the biomedical domain. This study aims to investigate whether a machine comprehension model can process biomedical articles as well as general texts. Since there is no dataset for the biomedical literature comprehension task, our work includes generating a large-scale question answering dataset using PubMed and manually evaluating the generated dataset. We present an attention-based deep neural model tailored to the biomedical domain. To further enhance the performance of our model, we used a pretrained word vector and biomedical entity type embedding. We also developed an ensemble method of combining the results of several independent models to reduce the variance of the answers from the models. The experimental results showed that our proposed deep neural network model outperformed the baseline model by more than 7% on the new dataset. We also evaluated human performance on the new dataset. The human evaluation result showed that our deep neural model outperformed humans in comprehension by 22% on average. In this work, we introduced a new task of machine comprehension in the biomedical domain using a deep neural model. 
Since there was no large-scale dataset for training deep neural models in the biomedical domain, we created the new cloze-style datasets Biomedical Knowledge Comprehension Title (BMKC_T) and Biomedical Knowledge Comprehension Last Sentence (BMKC_LS) (together referred to as BioMedical Knowledge Comprehension) using the PubMed corpus. The experimental results showed that the performance of our model is much higher than that of humans. We observed that our model performed consistently better regardless of the degree of difficulty of a text, whereas humans have difficulty when performing biomedical literature comprehension tasks that require expert level knowledge. ©Seongsoon Kim, Donghyeon Park, Yonghwa Choi, Kyubum Lee, Byounggun Kim, Minji Jeon, Jihye Kim, Aik Choon Tan, Jaewoo Kang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 05.01.2018.
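
A cloze-style example of the BMKC_T/BMKC_LS kind can be sketched with a toy generator: hold out the title (or last sentence), mask one term in it, and ask the model to recover the term from the remaining context. The masking rule below (longest word) and the placeholder token are our stand-ins; the real pipeline would mask tagged biomedical entities.

```python
def make_cloze(abstract_sentences, title, mask_token="@placeholder"):
    """Build two toy cloze examples from one PubMed-style record:
    mask a keyword in the title (BMKC_T-style) and in the last sentence
    (BMKC_LS-style). Picking the longest word is a crude stand-in for
    the entity tagging a real pipeline would use."""
    def mask_longest(sentence):
        words = sentence.split()
        target = max(words, key=len)
        query = " ".join(mask_token if w == target else w for w in words)
        return query, target

    title_query, title_answer = mask_longest(title)
    ls_query, ls_answer = mask_longest(abstract_sentences[-1])
    return {
        "bmkc_t": {"context": " ".join(abstract_sentences),
                   "query": title_query, "answer": title_answer},
        "bmkc_ls": {"context": " ".join(abstract_sentences[:-1]),
                    "query": ls_query, "answer": ls_answer},
    }
```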

  4. Analysis of hyper-spectral AVIRIS image data over a mixed-conifer forest in Maine

    NASA Technical Reports Server (NTRS)

    Lawrence, William T.; Shimabukuro, Yosio E.; Gao, Bo-Cai

    1993-01-01

    An introduction to some of the potential uses of hyperspectral data for ecosystem analysis is presented. The examples given are derived from a digital dataset acquired over a sub-boreal forest in central Maine in 1990 by the NASA-JPL Airborne Visible and Infrared Imaging Spectrometer (AVIRIS), an instrument that gathers data from 400 to 2500 nm in 224 channels at bandwidths of approximately 10 nm. As a preview of the uses of hyperspectral data, several products were extracted from this dataset. They range from the traditional false-color composite made from simulated Thematic Mapper bands and the well-known normalized difference vegetation index to more exotic products such as fractions of vegetation, soil, and shade based on linear spectral mixing models, and estimates of leaf water content at the landscape level derived using spectrum-matching techniques. Our research and that of many others indicates that hyperspectral datasets carry much important information which is only beginning to be understood. This analysis gives an initial indication of the utility of hyperspectral data. Much work remains to be done in algorithm development and in understanding the physics behind the complex information signal carried in hyperspectral datasets. This work must be carried out to provide the fullest science support for the high-spectral-resolution data to be acquired by many of the instruments launched as part of the Earth Observing System program in the mid-1990s.
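
Two of the products mentioned are easy to sketch: NDVI is a band ratio, and linear spectral unmixing solves a small least-squares system per pixel. The band values and endmember spectra below are invented for illustration, and a real unmixing step would additionally constrain fractions to be non-negative and sum to one.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

def unmix(pixel, endmembers):
    """Least-squares fractions of each endmember (e.g. vegetation, soil,
    shade) in one pixel spectrum. endmembers has shape
    (n_endmembers, n_bands); pixel has shape (n_bands,)."""
    fractions, *_ = np.linalg.lstsq(endmembers.T, pixel, rcond=None)
    return fractions
```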

  5. Localized Segment Based Processing for Automatic Building Extraction from LiDAR Data

    NASA Astrophysics Data System (ADS)

    Parida, G.; Rajan, K. S.

    2017-05-01

    Current methods of object segmentation, extraction, and classification from aerial LiDAR data are manual and tedious. This work proposes a technique for segmenting objects out of LiDAR data. A bottom-up, geometric rule-based approach was used initially to devise a way to segment buildings out of LiDAR datasets. For curved wall surfaces, localized surface normals were compared to segment buildings. The algorithm has been applied both to synthetic datasets and to a real-world dataset of Vaihingen, Germany. Preliminary results show successful segmentation of building objects from a given scene for the synthetic datasets, and promising results for the real-world data. An advantage of the proposed approach is that it requires no data other than LiDAR. It is an unsupervised method of building segmentation, and thus requires no model training, unlike supervised techniques. It focuses on extracting the walls of the buildings to construct the footprint, rather than on the roof; this focus on extracting walls to reconstruct buildings from a LiDAR scene is the crux of the proposed method. The current segmentation approach can be used to obtain 2D footprints of buildings, with further scope to generate 3D models. Thus, the proposed method can be used as a tool to obtain building footprints in urban landscapes, aiding urban planning and the smart-cities endeavour.
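
The localized surface-normal comparison at the heart of the method can be sketched as follows: estimate each point's normal from the covariance of its nearest neighbours, then flag points whose normals are near-horizontal as wall candidates. The brute-force neighbour search, the k value, and the tolerance below are our simplifications; a real implementation would use a spatial index.

```python
import numpy as np

def local_normals(points, k=8):
    """Estimate a unit normal at each 3D point as the eigenvector for the
    smallest eigenvalue of the covariance of its k nearest neighbours
    (brute-force search, adequate for small point clouds)."""
    points = np.asarray(points, dtype=float)
    normals = np.empty_like(points)
    for idx, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(d)[:k]]
        centered = nbrs - nbrs.mean(axis=0)
        _, eigvecs = np.linalg.eigh(centered.T @ centered)
        normals[idx] = eigvecs[:, 0]  # eigh sorts eigenvalues ascending
    return normals

def wall_like(normals, tol=0.1):
    """Wall points have near-horizontal normals (|z component| near 0)."""
    return np.abs(normals[:, 2]) < tol
```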

  6. Global and Hemispheric Temperature Anomalies: Land and Marine Instrumental Records (1850 - 2015)

    DOE Data Explorer

    Jones, P. D. [Climatic Research Unit (CRU), University of East Anglia, Norwich, United Kingdom; Parker, D. E. [Hadley Centre for Climate Prediction and Research, Berkshire, United Kingdom; Osborn, T. J. [Climatic Research Unit (CRU), University of East Anglia, Norwich, United Kingdom; Briffa, K. R. [Climatic Research Unit (CRU), University of East Anglia, Norwich, United Kingdom

    2016-05-01

    These global and hemispheric temperature anomaly time series, which incorporate land and marine data, are continually updated and expanded by P. Jones of the Climatic Research Unit (CRU) with help from colleagues at the CRU and other institutions. Some of the earliest work in producing these temperature series dates back to Jones et al. (1986a,b,c), Jones (1988, 1994), and Jones and Briffa (1992). Most of the discussion of methods given here has been gleaned from the Frequently Asked Questions section of the CRU temperature data web pages. Users are encouraged to visit the CRU Web site for the most comprehensive overview of these data (the "HadCRUT4" dataset), other associated datasets, and the most recent literature references to the work of Jones et al.
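
A temperature anomaly of this kind is simply each value's departure from the mean over a fixed base period; the 1961-1990 base period used below follows the CRU convention, while the numbers themselves are invented for illustration.

```python
def anomalies(series, base_start=1961, base_end=1990):
    """Convert a {year: temperature} mapping to {year: anomaly}, relative
    to the mean over the base period (1961-1990 is the CRU convention)."""
    base = [t for y, t in series.items() if base_start <= y <= base_end]
    baseline = sum(base) / len(base)
    return {y: t - baseline for y, t in series.items()}
```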

  7. Phylogenomic analyses support the monophyly of Excavata and resolve relationships among eukaryotic "supergroups".

    PubMed

    Hampl, Vladimir; Hug, Laura; Leigh, Jessica W; Dacks, Joel B; Lang, B Franz; Simpson, Alastair G B; Roger, Andrew J

    2009-03-10

    Nearly all of eukaryotic diversity has been classified into 6 suprakingdom-level groups (supergroups) based on molecular and morphological/cell-biological evidence; these are Opisthokonta, Amoebozoa, Archaeplastida, Rhizaria, Chromalveolata, and Excavata. However, molecular phylogeny has not provided clear evidence that either Chromalveolata or Excavata is monophyletic, nor has it resolved the relationships among the supergroups. To establish the affinities of Excavata, which contains parasites of global importance and organisms regarded previously as primitive eukaryotes, we conducted a phylogenomic analysis of a dataset of 143 proteins and 48 taxa, including 19 excavates. Previous phylogenomic studies have not included all major subgroups of Excavata, and thus have not definitively addressed their interrelationships. The enigmatic flagellate Andalucia is sister to typical jakobids. Jakobids (including Andalucia), Euglenozoa and Heterolobosea form a major clade that we name Discoba. Analyses of the complete dataset group Discoba with the mitochondrion-lacking excavates or "metamonads" (diplomonads, parabasalids, and Preaxostyla), but not with the final excavate group, Malawimonas. This separation likely results from a long-branch attraction artifact. Gradual removal of rapidly evolving taxa from the dataset leads to moderate bootstrap support (69%) for the monophyly of all Excavata, and 90% support once all metamonads are removed. Most importantly, Excavata robustly emerges between unikonts (Amoebozoa + Opisthokonta) and a "megagroup" of Archaeplastida, Rhizaria, and chromalveolates. Our analyses indicate that Excavata forms a monophyletic suprakingdom-level group that is one of the 3 primary divisions within eukaryotes, along with unikonts and a megagroup of Archaeplastida, Rhizaria, and the chromalveolate lineages.

  8. Big Data Challenges Indexing Large-Volume, Heterogeneous EO Datasets for Effective Data Discovery

    NASA Astrophysics Data System (ADS)

    Waterfall, Alison; Bennett, Victoria; Donegan, Steve; Juckes, Martin; Kershaw, Phil; Petrie, Ruth; Stephens, Ag; Wilson, Antony

    2016-08-01

    This paper describes the importance and challenges faced in making Earth Observation datasets discoverable and accessible by the widest possible user base. Concentrating on data discovery, it details work that is being undertaken by the Centre for Environmental Data Analysis (CEDA), to ensure that the datasets held within its archive are discoverable and searchable. One aspect of this is in indexing the data using controlled vocabularies, based on a Simple Knowledge Organization System (SKOS) ontology, and hosted in a vocabulary server, to ensure that a consistent understanding and approach to a faceted search of the data can be achieved via a variety of different routes. This approach will be illustrated using the example of the development of the ESA CCI Open Data Portal.
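
The role a SKOS-style controlled vocabulary plays in faceted search can be sketched with plain dictionaries: a query term should match a record whose facet value is the same term or any narrower term beneath it. The facet names and broader/narrower links below are invented for illustration; CEDA's actual vocabularies live in a vocabulary server.

```python
# hypothetical controlled vocabulary: each term may have a broader term,
# as in a SKOS concept scheme
BROADER = {"sea surface temperature": "ocean temperature",
           "ocean temperature": "temperature"}

def expand(term):
    """A term matches itself and all of its broader (more general) terms."""
    terms = {term}
    while term in BROADER:
        term = BROADER[term]
        terms.add(term)
    return terms

def faceted_search(records, **facets):
    """Keep records whose facet values match the query, honouring the
    vocabulary hierarchy: a broad query term matches narrower values."""
    return [rec for rec in records
            if all(value in expand(rec.get(facet, ""))
                   for facet, value in facets.items())]
```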

  9. Development and validation of a prognostic nomogram for colorectal cancer after radical resection based on individual patient data from three large-scale phase III trials

    PubMed Central

    Akiyoshi, Takashi; Maeda, Hiromichi; Kashiwabara, Kosuke; Kanda, Mitsuro; Mayanagi, Shuhei; Aoyama, Toru; Hamada, Chikuma; Sadahiro, Sotaro; Fukunaga, Yosuke; Ueno, Masashi; Sakamoto, Junichi; Saji, Shigetoyo; Yoshikawa, Takaki

    2017-01-01

    Background Few prediction models have so far been developed and assessed for the prognosis of patients who undergo curative resection for colorectal cancer (CRC). Materials and Methods We prepared a clinical dataset including 5,530 patients who participated in three major randomized controlled trials as a training dataset, and 2,263 consecutive patients who were treated at a cancer-specialized hospital as a validation dataset. All subjects underwent radical resection for CRC histologically diagnosed as adenocarcinoma. The main outcomes predicted were overall survival (OS) and disease-free survival (DFS). The identification of the variables in this nomogram was based on a Cox regression analysis, and model performance was evaluated by Harrell's c-index. The calibration plot and its slope were also studied. For the external validation assessment, risk group stratification was employed. Results The multivariate Cox model identified the following variables: sex, age, pathological T and N factor, tumor location, size, lymph node dissection, postoperative complications, and adjuvant chemotherapy. The c-index was 0.72 (95% confidence interval [CI] 0.66-0.77) for the OS and 0.74 (95% CI 0.69-0.78) for the DFS. The proposed stratification into risk groups demonstrated a significant distinction between the Kaplan–Meier curves for OS and DFS in the external validation dataset. Conclusions We established a clinically reliable nomogram to predict the OS and DFS in patients with CRC using large-scale and reliable independent patient data from phase III randomized controlled trials. The external validity was also confirmed on the practical dataset. PMID:29228760
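
Harrell's c-index, used above to evaluate the nomogram, is the fraction of comparable patient pairs in which the patient with the higher predicted risk actually experienced the event sooner. A minimal sketch (quadratic-time, handling right-censoring only in the standard "earlier time must be an observed event" sense):

```python
def concordance_index(times, events, risk):
    """Harrell's c-index: among comparable pairs, the fraction where the
    higher predicted risk goes with the shorter observed survival.
    times: follow-up times; events: 1 if event observed, 0 if censored;
    risk: predicted risk scores (higher = worse prognosis)."""
    concordant = ties = comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is comparable only if the earlier time is an event
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported 0.72-0.74 indicates useful but imperfect discrimination.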

  10. Exploring Genetic Divergence in a Species-Rich Insect Genus Using 2790 DNA Barcodes

    PubMed Central

    Lin, Xiaolong; Stur, Elisabeth; Ekrem, Torbjørn

    2015-01-01

    DNA barcoding using a fragment of the mitochondrial cytochrome c oxidase subunit 1 gene (COI) has proven to be successful for species-level identification in many animal groups. However, most studies have focused on relatively small datasets or on large datasets of taxonomically high-ranked groups. We explore the quality of DNA barcodes to delimit species in the diverse chironomid genus Tanytarsus (Diptera: Chironomidae) by using different analytical tools. The genus Tanytarsus is the most species-rich taxon of tribe Tanytarsini (Diptera: Chironomidae), with more than 400 species worldwide, some of which can be notoriously difficult to identify to species level using morphology. Our dataset, based on sequences generated from our own material and on publicly available data in BOLD, consists of 2790 DNA barcodes with a fragment length of at least 500 base pairs. A neighbor-joining tree of this dataset comprises 131 well-separated clusters representing 121 morphological species of Tanytarsus: 77 named, 16 unnamed and 28 unidentified theoretical species. For our geographically widespread dataset, DNA barcodes unambiguously discriminate 94.6% of the Tanytarsus species recognized through prior morphological study. Deep intraspecific divergences exist in some species complexes, and further taxonomic studies using appropriate nuclear markers as well as morphological and ecological data are needed to resolve them. The DNA barcodes cluster into 120–242 molecular operational taxonomic units (OTUs) depending on whether Objective Clustering, Automatic Barcode Gap Discovery (ABGD), the Generalized Mixed Yule Coalescent model (GMYC), the Poisson Tree Process (PTP), subjective evaluation of the neighbor-joining tree, or Barcode Index Numbers (BINs) are used. We suggest that a 4–5% threshold is appropriate to delineate species of Tanytarsus non-biting midges. PMID:26406595
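
Threshold-based OTU delimitation of the kind suggested above (a 4-5% cutoff) can be sketched as single-linkage clustering of pairwise p-distances: any two sequences closer than the threshold fall in the same OTU. The uncorrected p-distance and union-find below are our simplification; real pipelines align sequences and may use corrected distances.

```python
def p_distance(a, b):
    """Uncorrected p-distance: proportion of differing aligned sites."""
    diffs = sum(x != y for x, y in zip(a, b))
    return diffs / min(len(a), len(b))

def count_otus(seqs, threshold=0.045):
    """Single-linkage OTU count: link sequences whose pairwise distance
    is below the threshold (union-find with path halving)."""
    parent = list(range(len(seqs)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if p_distance(seqs[i], seqs[j]) < threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(seqs))})
```

Moving the threshold is exactly what makes the OTU count swing (here, between methods, from 120 to 242).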

  11. A comparative analysis reveals weak relationships between ecological factors and beta diversity of stream insect metacommunities at two spatial levels.

    PubMed

    Heino, Jani; Melo, Adriano S; Bini, Luis Mauricio; Altermatt, Florian; Al-Shami, Salman A; Angeler, David G; Bonada, Núria; Brand, Cecilia; Callisto, Marcos; Cottenie, Karl; Dangles, Olivier; Dudgeon, David; Encalada, Andrea; Göthe, Emma; Grönroos, Mira; Hamada, Neusa; Jacobsen, Dean; Landeiro, Victor L; Ligeiro, Raphael; Martins, Renato T; Miserendino, María Laura; Md Rawi, Che Salmah; Rodrigues, Marciel E; Roque, Fabio de Oliveira; Sandin, Leonard; Schmera, Denes; Sgarbi, Luciano F; Simaika, John P; Siqueira, Tadeu; Thompson, Ross M; Townsend, Colin R

    2015-03-01

    The hypotheses that beta diversity should increase with decreasing latitude and increase with spatial extent of a region have rarely been tested based on a comparative analysis of multiple datasets, and no such study has focused on stream insects. We first assessed how well variability in beta diversity of stream insect metacommunities is predicted by insect group, latitude, spatial extent, altitudinal range, and dataset properties across multiple drainage basins throughout the world. Second, we assessed the relative roles of environmental and spatial factors in driving variation in assemblage composition within each drainage basin. Our analyses were based on a dataset of 95 stream insect metacommunities from 31 drainage basins distributed around the world. We used dissimilarity-based indices to quantify beta diversity for each metacommunity and, subsequently, regressed beta diversity on insect group, latitude, spatial extent, altitudinal range, and dataset properties (e.g., number of sites and percentage of presences). Within each metacommunity, we used a combination of spatial eigenfunction analyses and partial redundancy analysis to partition variation in assemblage structure into environmental, shared, spatial, and unexplained fractions. We found that dataset properties were more important predictors of beta diversity than ecological and geographical factors across multiple drainage basins. In the within-basin analyses, environmental and spatial variables were generally poor predictors of variation in assemblage composition. Our results revealed deviation from general biodiversity patterns because beta diversity did not show the expected decreasing trend with latitude. Our results also call for reconsideration of just how predictable stream assemblages are along ecological gradients, with implications for environmental assessment and conservation decisions. Our findings may also be applicable to other dynamic systems where predictability is low.
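
The dissimilarity-based beta diversity referred to above can be illustrated with presence-absence indices such as Jaccard and Sorensen; taking the mean pairwise dissimilarity across a metacommunity's sites is one common way to get a single beta value per metacommunity.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard dissimilarity between two presence-absence species sets."""
    return 1 - len(a & b) / len(a | b)

def sorensen(a, b):
    """Sorensen dissimilarity between two presence-absence species sets."""
    return 1 - 2 * len(a & b) / (len(a) + len(b))

def beta_diversity(sites, index=jaccard):
    """Mean pairwise dissimilarity across all sites of a metacommunity."""
    pairs = list(combinations(sites, 2))
    return sum(index(a, b) for a, b in pairs) / len(pairs)
```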

  12. Walkability Index

    EPA Pesticide Factsheets

    The Walkability Index dataset characterizes every Census 2010 block group in the U.S. based on its relative walkability. Walkability depends upon characteristics of the built environment that influence the likelihood of walking being used as a mode of travel. The Walkability Index is based on the EPA's previous data product, the Smart Location Database (SLD). Block group data from the SLD was the only input into the Walkability Index, and consisted of four variables from the SLD weighted in a formula to create the new Walkability Index. This dataset shares the SLD's block group boundary definitions from Census 2010. The methodology describing the process of creating the Walkability Index can be found in the documents located at ftp://newftp.epa.gov/EPADataCommons/OP/WalkabilityIndex.zip. You can also learn more about the Smart Location Database at https://edg.epa.gov/data/Public/OP/Smart_Location_DB_v02b.zip.
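
The construction described, four SLD variables combined through a weighted formula, can be sketched generically. The variable names and weights below are illustrative placeholders only, not EPA's actual coefficients; those are documented in the methodology archive linked above.

```python
def walkability_index(block_group, weights=None):
    """Weighted sum of ranked SLD-style inputs for one block group.
    Variable names and weights are hypothetical placeholders, not the
    EPA's published formula."""
    if weights is None:
        weights = {"intersection_density": 1 / 3,
                   "proximity_to_transit": 1 / 3,
                   "employment_mix": 1 / 6,
                   "employment_housing_mix": 1 / 6}
    return sum(block_group[v] * w for v, w in weights.items())
```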

  13. Employed family physician satisfaction and commitment to their practice, work group, and health care organization.

    PubMed

    Karsh, Ben-Tzion; Beasley, John W; Brown, Roger L

    2010-04-01

    Test a model of family physician job satisfaction and commitment. Data were collected from 1,482 family physicians in a Midwest state during 2000-2001. The sampling frame came from the membership listing of the state's family physician association, and the analyzed dataset included family physicians employed by large multispecialty group practices. A cross-sectional survey was used to collect data about physician working conditions, job satisfaction, commitment, and demographic variables. The response rate was 47 percent. Different variables predicted the different measures of satisfaction and commitment. Satisfaction with one's health care organization (HCO) was most strongly predicted by the degree to which physicians perceived that management valued and recognized them and by the extent to which physicians perceived the organization's goals to be compatible with their own. Satisfaction with one's workgroup was most strongly predicted by the social relationship with members of the workgroup; satisfaction with one's practice was most strongly predicted by relationships with patients. Commitment to one's workgroup was predicted by relationships with one's workgroup. Commitment to one's HCO was predicted by relationships with management of the HCO. Social relationships are stronger predictors of employed family physician satisfaction and commitment than staff support, job control, income, or time pressure.

  14. The Incidence and Influencing Factors of College Student Term-Time Working in China

    ERIC Educational Resources Information Center

    Guo, Fei

    2017-01-01

    As the labor market pressure for college graduates keeps rising in the past decade, working while attending college becomes increasingly popular among undergraduate students in China. With a nationally representative dataset of 6,977 students from 49 institutions, this study examines the incidence and influencing factors on undergraduate student…

  15. CHARMe Commentary metadata for Climate Science: collecting, linking and sharing user feedback on climate datasets

    NASA Astrophysics Data System (ADS)

    Blower, Jon; Lawrence, Bryan; Kershaw, Philip; Nagni, Maurizio

    2014-05-01

    The research process can be thought of as an iterative activity, initiated based on prior domain knowledge as well as on a number of external inputs, and producing a range of outputs including datasets, studies and peer-reviewed publications. These outputs may describe the problem under study, the methodology used, the results obtained, etc. In any new publication, the author may cite or comment on other papers or datasets in order to support their research hypothesis. However, as their work progresses, the researcher may draw from many other latent channels of information. These could include, for example, a private conversation following a lecture or during a social dinner, or an opinion expressed concerning some significant event such as an earthquake or a satellite failure. In addition, other public sources of grey literature are important, such as informal papers (e.g. arXiv deposits), reports and studies. The climate science community is no exception to this pattern; the CHARMe project, funded under the European FP7 framework, is developing an online system for collecting and sharing user feedback on climate datasets. This is to help users judge how suitable such climate data are for an intended application. The user feedback could be comments about assessments, citations, or provenance of the dataset, or other information such as descriptions of uncertainty or data quality. We define this as a distinct category of metadata called Commentary or C-metadata. We link C-metadata with target climate datasets using a Linked Data approach via the Open Annotation data model. In the context of Linked Data, C-metadata plays the role of a resource which, depending on its nature, may be accessed as simple text or as more structured content.
The project is implementing a range of software tools to create, search and visualize C-metadata, including a JavaScript plugin enabling this functionality to be integrated in situ with data provider portals. Since commentary metadata may originate from a range of sources, moderation of this information will become a crucial issue. If the project is successful, expert human moderation (analogous to peer review) will become impracticable as annotation numbers increase, and some combination of algorithmic and crowd-sourced evaluation of commentary metadata will be necessary. To that end, future work will need to extend the tools under development to enable access control and input checking, and to deal with scale.
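
A C-metadata record linking a free-text comment to a target dataset via the Open Annotation data model might look like the JSON-LD sketch below. The dataset, person and context URIs are invented examples, not CHARMe's actual identifiers.

```python
import json

# illustrative C-metadata record: an annotation whose body is a free-text
# comment and whose target is a climate dataset URI (all URIs invented)
annotation = {
    "@context": "http://example.org/contexts/oa.jsonld",
    "@type": "oa:Annotation",
    "oa:hasBody": {
        "@type": "cnt:ContentAsText",
        "cnt:chars": "Known cold bias over the Southern Ocean before 1979.",
    },
    "oa:hasTarget": {"@id": "http://example.org/datasets/sst-v2"},
    "oa:annotatedBy": {"@id": "http://example.org/people/jane-doe"},
}

serialized = json.dumps(annotation)
```

Because the annotation is itself a Linked Data resource, it can be stored, searched and moderated independently of the dataset it comments on.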

  16. HLA mismatches and hematopoietic cell transplantation: structural simulations assess the impact of changes in peptide binding specificity on transplant outcome

    PubMed Central

    Yanover, Chen; Petersdorf, Effie W.; Malkki, Mari; Gooley, Ted; Spellman, Stephen; Velardi, Andrea; Bardy, Peter; Madrigal, Alejandro; Bignon, Jean-Denis; Bradley, Philip

    2013-01-01

    The success of hematopoietic cell transplantation from an unrelated donor depends in part on the degree of Human Histocompatibility Leukocyte Antigen (HLA) matching between donor and patient. We present a structure-based analysis of HLA mismatching, focusing on individual amino acid mismatches and their effect on peptide binding specificity. Using molecular modeling simulations of HLA-peptide interactions, we find evidence that amino acid mismatches predicted to perturb peptide binding specificity are associated with higher risk of mortality in a large and diverse dataset of patient-donor pairs assembled by the International Histocompatibility Working Group in Hematopoietic Cell Transplantation consortium. This analysis may represent a first step toward sequence-based prediction of relative risk for HLA allele mismatches. PMID:24482668

  17. Quantifying density cues in grouping displays.

    PubMed

    Machilsen, Bart; Wagemans, Johan; Demeyer, Maarten

    2016-09-01

    Perceptual grouping processes are typically studied using sparse displays of spatially separated elements. Unless the grouping cue of interest is a proximity cue, researchers will want to ascertain that such a cue is absent from the display. Various solutions to this problem have been employed in the literature; however, no validation of these methods exists. Here, we test a number of local density metrics both through their performance as constrained ideal observer models, and through a comparison with a large dataset of human detection trials. We conclude that for the selection of stimuli without a density cue, the Voronoi density metric is preferable, especially if combined with a measurement of the distance to each element's nearest neighbor. We offer the entirety of the dataset as a benchmark for the evaluation of future, possibly improved, metrics. With regard to human processes of grouping by proximity, we found observers to be insensitive to target groupings that are more sparse than the surrounding distractor elements, and less sensitive to regularity cues in element positioning than to local clusterings of target elements. Copyright © 2015 Elsevier Ltd. All rights reserved.
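
The Voronoi density metric recommended above requires a computational-geometry library, but the nearest-neighbour distance component can be sketched directly. A minimal illustration with hypothetical element positions, not the study's stimuli or exact metric definitions:

```python
import math

def nn_distances(points):
    """Distance from each element to its nearest neighbour: a simple
    local-density measure (small distances indicate local clustering)."""
    out = []
    for i, p in enumerate(points):
        out.append(min(math.dist(p, q) for j, q in enumerate(points) if j != i))
    return out

# Hypothetical element positions in a sparse display: three clustered
# elements and one isolated element.
display = [(0, 0), (1, 0), (0, 1), (10, 10)]
print(nn_distances(display))  # the isolated element has the largest distance
```

Selecting stimuli so that target and distractor regions have matched nearest-neighbour distributions is one way to rule out the proximity cue the authors warn about.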

  18. Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.

    PubMed

    Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi

    2015-09-01

    Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world. 
It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the conduct of more complex research. Copyright © 2015 Elsevier Ltd and National Safety Council. All rights reserved.

  19. Towards interoperable and reproducible QSAR analyses: Exchange of datasets.

    PubMed

    Spjuth, Ola; Willighagen, Egon L; Guha, Rajarshi; Eklund, Martin; Wikberg, Jarl Es

    2010-06-30

    QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply.
This makes it easy to join, extend, and combine datasets, and hence to work collectively, but it also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.
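
The abstract describes QSAR-ML as an XML format pairing structures with ontology-referenced, versioned descriptor implementations. A hypothetical sketch of such a record; the element names, attributes, and ontology URI below are illustrative, not the actual QSAR-ML schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical QSAR-ML-style record: a descriptor declaration pointing at
# an ontology identifier and a versioned software implementation, plus one
# structure with a computed descriptor value. Names are illustrative only.
root = ET.Element("qsarDataset")
ET.SubElement(root, "descriptor", {
    "ontologyRef": "http://example.org/descriptor-ontology#xlogP",
    "implementation": "CDK",
    "version": "1.2.3",
})
s = ET.SubElement(root, "structure", {"id": "mol1", "smiles": "CCO"})
ET.SubElement(s, "value", {"descriptor": "xlogP"}).text = "-0.14"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The key reproducibility idea is visible even in this toy version: the record names not just the descriptor, but the exact software and version that computed it.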

  20. The DataBridge: A System For Optimizing The Use Of Dark Data From The Long Tail Of Science

    NASA Astrophysics Data System (ADS)

    Lander, H.; Rajasekar, A.

    2015-12-01

    The DataBridge is a National Science Foundation funded collaborative project (OCI-1247652, OCI-1247602, OCI-1247663) designed to assist in the discovery of dark data sets from the long tail of science. The DataBridge aims to build queryable communities of datasets using sociometric network analysis. This approach is being tested to evaluate the ability to leverage various forms of metadata to facilitate discovery of new knowledge. Each dataset in the DataBridge has an associated name space used as a first-level partitioning. In addition to testing known algorithms for SNA community building, the DataBridge project has built a message-based platform that allows users to provide their own algorithms for each of the stages in the community building process. The stages are: Signature Generation (SG): an SG algorithm creates a metadata signature for a dataset; signature algorithms might use text metadata provided by the dataset creator or derive metadata. Relevance Algorithm (RA): an RA compares a pair of datasets and produces a similarity value between 0 and 1 for the two datasets. Sociometric Network Analysis (SNA): the SNA stage operates on a similarity matrix produced by an RA to partition all of the datasets in the name space into a set of clusters; these clusters represent communities of closely related datasets. The DataBridge also includes a web application that produces a visual representation of the clustering. Future work includes a more complete application that will allow different types of searching of the network of datasets. The DataBridge approach is relevant to geoscience research and informatics. In this presentation we will outline the project, illustrate the deployment of the approach, and discuss other potential applications and next steps for the research, such as applying this approach to models. In addition we will explore the relevance of DataBridge to other geoscience projects such as various EarthCube Building Blocks and DIBBS projects.
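
The three pipeline stages can be sketched under strong simplifying assumptions: token-set signatures for SG, Jaccard similarity for RA, and threshold-based connected components standing in for a real sociometric clustering algorithm. All dataset names and metadata strings are hypothetical:

```python
from itertools import combinations

def signature(text):                      # SG: metadata signature
    return set(text.lower().split())

def relevance(sig_a, sig_b):              # RA: similarity value in [0, 1]
    return len(sig_a & sig_b) / len(sig_a | sig_b)

def communities(datasets, threshold=0.3): # SNA stand-in: connected components
    sigs = {name: signature(meta) for name, meta in datasets.items()}
    parent = {name: name for name in datasets}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in combinations(datasets, 2):
        if relevance(sigs[a], sigs[b]) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for name in datasets:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())

# Hypothetical datasets in one name space.
data = {
    "d1": "ocean temperature sensor array",
    "d2": "ocean temperature buoy records",
    "d3": "genome sequencing reads",
}
print(communities(data))  # d1 and d2 cluster together; d3 stands alone
```

The message-based design described above would let a user swap any of these three functions for their own algorithm without touching the others.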

  1. Towards interoperable and reproducible QSAR analyses: Exchange of datasets

    PubMed Central

    2010-01-01

    Background QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. Results We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Conclusions Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply.
This makes it easy to join, extend, and combine datasets, and hence to work collectively, but it also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community. PMID:20591161

  2. A strategy for evaluating pathway analysis methods.

    PubMed

    Yu, Chenggang; Woo, Hyung Jun; Yu, Xueping; Oyama, Tatsuya; Wallqvist, Anders; Reifman, Jaques

    2017-10-13

    Researchers have previously developed a multitude of methods designed to identify biological pathways associated with specific clinical or experimental conditions of interest, with the aim of facilitating biological interpretation of high-throughput data. Before practically applying such pathway analysis (PA) methods, we must first evaluate their performance and reliability, using datasets where the pathways perturbed by the conditions of interest have been well characterized in advance. However, such 'ground truths' (or gold standards) are often unavailable. Furthermore, previous evaluation strategies that have focused on defining 'true answers' are unable to systematically and objectively assess PA methods under a wide range of conditions. In this work, we propose a novel strategy for evaluating PA methods independently of any gold standard, either established or assumed. The strategy involves the use of two mutually complementary metrics, recall and discrimination. Recall measures the consistency between the perturbed pathways identified by applying a particular analysis method to a large original dataset and those identified by applying the same method to a sub-dataset of the original dataset. In contrast, discrimination measures specificity: the degree to which the perturbed pathways identified by a particular method applied to a dataset from one experiment differ from those identified by the same method applied to a dataset from a different experiment. We used these metrics and 24 datasets to evaluate six widely used PA methods. The results highlighted the common challenge in reliably identifying significant pathways from small datasets. Importantly, we confirmed the effectiveness of our proposed dual-metric strategy by showing that previous comparative studies corroborate the performance evaluations of the six methods obtained by our strategy.
Unlike any previously proposed strategy for evaluating the performance of PA methods, our dual-metric strategy does not rely on any ground truth, either established or assumed, of the pathways perturbed by a specific clinical or experimental condition. As such, our strategy allows researchers to systematically and objectively evaluate pathway analysis methods by employing any number of datasets for a variety of conditions.
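
Treating each PA run as a set of significant pathway identifiers, the two metrics can be sketched as set overlaps. This is a conceptual illustration with hypothetical pathway names; the paper's exact definitions may differ:

```python
def recall(full_run, subset_run):
    """Consistency between pathways found on the full dataset and on a
    sub-dataset drawn from it (higher = more reproducible)."""
    return len(full_run & subset_run) / len(full_run) if full_run else 0.0

def discrimination(run_a, run_b):
    """Degree to which results from two unrelated experiments differ
    (higher = more specific)."""
    union = run_a | run_b
    return 1 - len(run_a & run_b) / len(union) if union else 0.0

# Hypothetical significant-pathway sets from three PA runs.
full  = {"apoptosis", "cell_cycle", "p53_signaling"}
sub   = {"apoptosis", "cell_cycle"}
other = {"glycolysis", "cell_cycle"}
print(recall(full, sub))           # 2 of 3 pathways recovered
print(discrimination(full, other)) # results mostly differ
```

A method that scores well on recall but poorly on discrimination is reporting the same pathways regardless of condition, which is exactly the failure mode the dual-metric design is meant to expose.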

  3. Predictive modeling of treatment resistant depression using data from STAR*D and an independent clinical study.

    PubMed

    Nie, Zhi; Vairavan, Srinivasan; Narayan, Vaibhav A; Ye, Jieping; Li, Qingqin S

    2018-01-01

    Identification of risk factors of treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work in predictive modeling of treatment resistant depression (TRD) via partition of the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort, RIS-INT-93, as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model TRD outcome at level 2. For TRD defined using QIDS-C16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and the RIS-INT-93 independent dataset, with areas under the receiver operating characteristic curve (AUC) of 0.70-0.78 and 0.72-0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. A model developed using the top 30 features identified by a feature selection technique (k-means clustering followed by a χ2 test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using overlapping features between STAR*D and RIS-INT-93 achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in the STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response or symptom severity at week 2. These results indicate that prediction of TRD prior to undergoing a second round of antidepressant treatment could be feasible even in the absence of biomarker data.
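
The abstract reports performance as AUC. A minimal rank-based sketch of that metric (the probability that a randomly chosen positive case outscores a randomly chosen negative one), using hypothetical scores and outcomes rather than STAR*D data:

```python
from itertools import product

def auc(scores, labels):
    """Rank-based AUC: fraction of positive/negative pairs ranked
    correctly, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and TRD outcomes (1 = treatment resistant).
print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1]))
```

An AUC of 0.5 corresponds to chance ranking, so the reported 0.70-0.78 range indicates a usefully better-than-chance, though far from perfect, separation.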

  4. “Teens are from Mars, Adults are from Venus”: Analyzing and Predicting Age Groups with Behavioral Characteristics in Instagram

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Han, Kyungsik; Lee, Sanghack; Jang, Jin

    We present behavioral characteristics of teens and adults in Instagram and prediction of them from their behaviors. Based on two independently created datasets from user profiles and tags, we identify teens and adults, and carry out comparative analyses on their online behaviors. Our study reveals: (1) significant behavioral differences between the two age groups; (2) empirical evidence of classifying teens and adults with up to 82% accuracy, using traditional predictive models, while two baseline methods achieve 68% at best; and (3) the robustness of our models by achieving 76%–81% when tested against an independent dataset obtained without using user profiles or tags.

  5. Air Quality uFIND: User-oriented Tool Set for Air Quality Data Discovery and Access

    NASA Astrophysics Data System (ADS)

    Hoijarvi, K.; Robinson, E. M.; Husar, R. B.; Falke, S. R.; Schultz, M. G.; Keating, T. J.

    2012-12-01

    Historically, there have been major impediments to seamless and effective data usage encountered by both data providers and users. Over the last five years, the international Air Quality (AQ) Community has worked through forums such as the Group on Earth Observations AQ Community of Practice, the ESIP AQ Working Group, and the Task Force on Hemispheric Transport of Air Pollution to converge on data format standards (e.g., netCDF), data access standards (e.g., Open Geospatial Consortium Web Coverage Services), metadata standards (e.g., ISO 19115), as well as other conventions (e.g., the CF Naming Convention) in order to build an Air Quality Data Network. The centerpiece of the AQ Data Network is the web service-based tool set: user-oriented Filtering and Identification of Networked Data (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to: a) discover and choose a desired dataset by navigating the multi-dimensional metadata space using faceted search, b) seamlessly access and browse datasets, and c) use uFIND's facilities as a web service for mashups with other AQ applications and portals. In a user-centric information system such as uFIND, the user experience is improved by metadata that includes the general fields for discovery as well as community-specific metadata to narrow the search beyond space, time, and generic keyword searches. However, even with the community-specific additions, the ISO 19115 records were formed in compliance with the standard, so that other standards-based search interfaces could leverage this additional information. To identify the fields necessary for metadata discovery we started with the ISO 19115 Core Metadata fields and the fields needed for a Catalog Service for the Web (CSW) Record. This fulfilled two goals: to create valid ISO 19115 records, and to be able to retrieve the records through a Catalog Service for the Web query.
Beyond the required set of fields, the AQ Community added additional fields using a combination of keywords and ISO 19115 fields. These extensions allow discovery by measurement platform or observed phenomena. Beyond discovery metadata, the AQ records include service identification objects that allow standards-based clients, such as some brokers, to access the data found via OGC WCS or WMS data access protocols. uFIND is one such smart client; this combination of discovery and access metadata allows the user to preview each registered dataset through spatial and temporal views, observe the data access and usage pattern, and also find links to dataset-specific metadata directly in uFIND. The AQ data providers also benefit from this architecture since their data products are easier to find and re-use, enhancing the relevance and importance of their products. Finally, the earth science community at large benefits from the Service Oriented Architecture of uFIND, since it is a service itself and allows service-based interfacing with providers and users of the metadata, allowing uFIND facets to be further refined for a particular AQ application or completely repurposed for other Earth Science domains that use the same set of data access and metadata standards.
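
The faceted-search idea at the core of uFIND can be sketched as successive filtering over metadata records. The records and facet names below are hypothetical, not actual uFIND metadata:

```python
# Hypothetical dataset metadata records with community-specific facets.
records = [
    {"platform": "ground station", "phenomenon": "ozone",   "format": "netCDF"},
    {"platform": "satellite",      "phenomenon": "ozone",   "format": "netCDF"},
    {"platform": "satellite",      "phenomenon": "aerosol", "format": "CSV"},
]

def facet_filter(recs, **facets):
    """Keep only records matching every selected facet value; each added
    facet narrows the candidate set, as in a faceted-search UI."""
    return [r for r in recs if all(r.get(k) == v for k, v in facets.items())]

hits = facet_filter(records, platform="satellite", phenomenon="ozone")
print(len(hits))  # one dataset matches both facets
```

In a real system the facet values and counts would be computed from the ISO 19115 records themselves, so the UI can show the user which refinements are still possible.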

  6. The Lunar Source Disk: Old Lunar Datasets on a New CD-ROM

    NASA Astrophysics Data System (ADS)

    Hiesinger, H.

    1998-01-01

    A compilation of previously published datasets on CD-ROM is presented. This Lunar Source Disk is intended to be a first step in the improvement/expansion of the Lunar Consortium Disk, in order to create an "image-cube"-like data pool that can be easily accessed and might be useful for a variety of future lunar investigations. All datasets were transformed to a standard map projection that allows direct comparison of different types of information on a pixel-by-pixel basis. Lunar observations have a long history and have been important to mankind for centuries, notably since the work of Plutarch and Galileo. As a consequence of centuries of lunar investigations, knowledge of the characteristics and properties of the Moon has accumulated over time. However, a side effect of this accumulation is that it has become more and more complicated for scientists to review all the datasets obtained through different techniques, to interpret them properly, to recognize their weaknesses and strengths in detail, and to combine them synoptically in geologic interpretations. Such synoptic geologic interpretations are crucial for the study of planetary bodies through remote-sensing data in order to avoid misinterpretation. In addition, many of the modern datasets, derived from Earth-based telescopes as well as from spacecraft missions, are acquired at different geometric and radiometric conditions. These differences make it challenging to compare or combine datasets directly or to extract information from different datasets on a pixel-by-pixel basis. Also, as there is no convention for the presentation of lunar datasets, different authors choose different map projections, depending on the location of the investigated areas and their personal interests. Insufficient or incomplete information on the map parameters used by different authors further complicates the reprojection of these datasets to a standard geometry.
The goal of our efforts was to transfer previously published lunar datasets to a selected standard geometry in order to create an "image-cube"-like data pool for further interpretation. The starting point was a number of datasets on a CD-ROM published by the Lunar Consortium. The task of creating a uniform data pool was further complicated by some missing or wrong references and keys on the Lunar Consortium CD, as well as erroneous reproduction of some datasets in the literature.

  7. Comparison of work related fatal injuries in the United States, Australia, and New Zealand: method and overall findings

    PubMed Central

    Feyer, A; Williamson, A; Stout, N; Driscoll, T; Usher, H; Langley, J

    2001-01-01

    Objectives—To compare the extent, distribution, and nature of fatal occupational injury in New Zealand, Australia, and the United States. Setting—Workplaces in New Zealand, Australia, and the United States. Methods—Data collections based on vital records were used to compare overall rates and distribution of fatal injuries covering the period 1989–92 in Australia and the United States, and 1985–94 in New Zealand. Household labour force data (Australia and the United States) and census data (New Zealand) provided denominator data for calculation of rates. Case definition, case inclusion criteria, and classification of occupation and industry were harmonised across the three datasets. Results—New Zealand had the highest average annual rate (4.9/100 000), Australia an intermediate rate (3.8/100 000), and the United States the lowest rate (3.2/100 000) of fatal occupational injury. Much of the difference between countries was accounted for by differences in industry distribution. In each country, male workers, older workers, and those working in agriculture, forestry and fishing, in mining and in construction, were consistently at higher risk. Intentional fatal injury was more common in the United States, being rare in both Australia and New Zealand. This difference is likely to be reflected in the more common incidence of work related fatal injuries for sales workers in the United States compared with Australia and New Zealand. Conclusions—The present results contrasted with those obtained by a recent study that used published omnibus statistics, both in terms of absolute rates and relative ranking of the three countries. Such differences underscore the importance of using like datasets for international comparisons. The consistency of high risk areas across comparable data from comparable nations provides clear targets for further attention. 
At this stage, however, it is unclear whether the same specific occupations and/or hazards are contributing to the aggregated industry and occupation group rates reported here. PMID:11289530
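
The rates quoted above are annual deaths per 100,000 workers. A one-line sketch of that calculation, with hypothetical counts rather than the study's actual numerators and denominators:

```python
def rate_per_100k(deaths, worker_years):
    """Annual fatal-injury rate per 100,000 workers: deaths divided by
    person-years of employment, scaled to a 100,000-worker base."""
    return deaths / worker_years * 100_000

# Hypothetical example: 245 fatalities over 5,000,000 person-years.
print(rate_per_100k(245, 5_000_000))  # about 4.9 per 100,000
```

The abstract's point about harmonisation is that both the numerator (case definition and inclusion criteria) and the denominator (labour force vs. census data) must be defined the same way before such rates are comparable across countries.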

  8. Automatic Beam Path Analysis of Laser Wakefield Particle Acceleration Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rubel, Oliver; Geddes, Cameron G.R.; Cormier-Michel, Estelle

    2009-10-19

    Numerical simulations of laser wakefield particle accelerators play a key role in the understanding of the complex acceleration process and in the design of expensive experimental facilities. As the size and complexity of simulation output grows, an increasingly acute challenge is the practical need for computational techniques that aid in scientific knowledge discovery. To that end, we present a set of data-understanding algorithms that work in concert in a pipeline fashion to automatically locate and analyze high energy particle bunches undergoing acceleration in very large simulation datasets. These techniques work cooperatively by first identifying features of interest in individual timesteps, then integrating features across timesteps, and, based on the information derived, performing analysis of temporally dynamic features. This combination of techniques supports accurate detection of particle beams, enabling a deeper level of scientific understanding of physical phenomena than has been possible before. By combining efficient data analysis algorithms and state-of-the-art data management we enable high-performance analysis of extremely large particle datasets in 3D. We demonstrate the usefulness of our methods for a variety of 2D and 3D datasets and discuss the performance of our analysis pipeline.

  9. e-Bitter: Bitterant Prediction by the Consensus Voting From the Machine-Learning Methods

    PubMed Central

    Zheng, Suqing; Jiang, Mengying; Zhao, Chengwei; Zhu, Rui; Hu, Zhicheng; Xu, Yong; Lin, Fu

    2018-01-01

    In-silico bitterant prediction has received considerable attention due to the expensive and laborious experimental screening of bitterants. In this work, we collect a fully experimental dataset containing 707 bitterants and 592 non-bitterants, which is distinct from the fully or partially hypothetical non-bitterant datasets used in previous works. Based on this experimental dataset, we harness consensus votes from multiple machine-learning methods (e.g., deep learning) combined with molecular fingerprints to build bitter/bitterless classification models with five-fold cross-validation, which are further inspected by the Y-randomization test and applicability domain analysis. One of the best consensus models affords an accuracy, precision, specificity, sensitivity, F1-score, and Matthews correlation coefficient (MCC) of 0.929, 0.918, 0.898, 0.954, 0.936, and 0.856, respectively, on our test set. For the automatic prediction of bitterants, a graphical program, “e-Bitter,” is developed so that users can obtain predictions with a simple mouse click. To the best of our knowledge, this is the first time a consensus model has been adopted for bitterant prediction, and e-Bitter is the first free stand-alone software of its kind for experimental food scientists. PMID:29651416
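
All six reported figures derive from the binary confusion matrix. A sketch of those standard definitions with hypothetical counts, not the paper's actual test-set results:

```python
import math

def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts:
    accuracy, precision, specificity, sensitivity, F1, and MCC."""
    acc  = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    spec = tn / (tn + fp)
    sens = tp / (tp + fn)
    f1   = 2 * prec * sens / (prec + sens)
    mcc  = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, prec, spec, sens, f1, mcc

# Hypothetical counts: 90 true bitter, 80 true bitterless, 10 and 5 errors.
print(metrics(tp=90, tn=80, fp=10, fn=5))
```

MCC is the most demanding of the six because it balances all four cells of the matrix, which is why a value of 0.856 alongside 0.929 accuracy indicates genuinely balanced performance rather than a majority-class artifact.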

  10. Computational approaches for predicting biomedical research collaborations.

    PubMed

    Zhang, Qing; Yu, Hong

    2014-01-01

    Biomedical research is increasingly collaborative, and successful collaborations often produce high impact work. Computational approaches can be developed for automatically predicting biomedical research collaborations. Previous work on collaboration prediction mainly explored the topological structures of research collaboration networks, leaving out rich semantic information from the publications themselves. In this paper, we propose supervised machine learning approaches to predict research collaborations in the biomedical field. We explored both semantic features extracted from author research interest profiles and author network topological features. We found that the most informative semantic features for author collaborations are related to research interest, including similarity of out-citing citations and similarity of abstracts. Of the four supervised machine learning models (naïve Bayes, naïve Bayes multinomial, SVMs, and logistic regression), the best performing model is logistic regression, with an area under the ROC curve ranging from 0.766 to 0.980 on different datasets. To our knowledge we are the first to study in depth how research interests and productivity can be used for collaboration prediction. Our approach is computationally efficient, scalable, and yet simple to implement. The datasets of this study are available at https://github.com/qingzhanggithub/medline-collaboration-datasets.
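
One of the semantic features named above is similarity of abstracts. A minimal sketch of such a feature as cosine similarity over raw term counts, with hypothetical input text (the paper's actual feature engineering is not specified here):

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between two texts represented as bags of words;
    one simple way to quantify overlap of research interests."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical abstract fragments from two authors.
print(cosine("protein folding simulation", "protein structure simulation"))
```

In practice such a score would be one column in the feature matrix fed to the logistic regression model, alongside the network-topology features.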

  11. e-Bitter: Bitterant Prediction by the Consensus Voting From the Machine-learning Methods

    NASA Astrophysics Data System (ADS)

    Zheng, Suqing; Jiang, Mengying; Zhao, Chengwei; Zhu, Rui; Hu, Zhicheng; Xu, Yong; Lin, Fu

    2018-03-01

    In-silico bitterant prediction has received considerable attention due to the expensive and laborious experimental screening of bitterants. In this work, we collect a fully experimental dataset containing 707 bitterants and 592 non-bitterants, which is distinct from the fully or partially hypothetical non-bitterant datasets used in previous works. Based on this experimental dataset, we harness consensus votes from multiple machine-learning methods (e.g., deep learning) combined with molecular fingerprints to build bitter/bitterless classification models with five-fold cross-validation, which are further inspected by the Y-randomization test and applicability domain analysis. One of the best consensus models affords an accuracy, precision, specificity, sensitivity, F1-score, and Matthews correlation coefficient (MCC) of 0.929, 0.918, 0.898, 0.954, 0.936, and 0.856, respectively, on our test set. For the automatic prediction of bitterants, a graphical program, “e-Bitter,” is developed so that users can obtain predictions with a simple mouse click. To the best of our knowledge, this is the first time a consensus model has been adopted for bitterant prediction, and e-Bitter is the first free stand-alone software of its kind for experimental food scientists.

  12. EnviroAtlas - Average Annual Precipitation 1981-2010 by HUC12 for the Conterminous United States

    EPA Pesticide Factsheets

    This EnviroAtlas dataset provides the average annual precipitation by 12-digit Hydrologic Unit (HUC). The values were estimated from maps produced by the PRISM Climate Group, Oregon State University. The original data was at the scale of 800 m grid cells representing average precipitation from 1981-2010 in mm. The data was converted to inches of precipitation and then zonal statistics were estimated for a final value of average annual precipitation for each 12-digit HUC. For more information about the original dataset, please refer to the PRISM website at http://www.prism.oregonstate.edu/. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
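
The processing described consists of a unit conversion followed by zonal averaging. A sketch with hypothetical cell values and HUC codes (real zonal statistics would run over the 800 m PRISM raster):

```python
# Convert 800 m grid values from mm to inches, then average per 12-digit HUC.
MM_PER_INCH = 25.4

cells = [  # (huc12, precipitation_mm) -- hypothetical values
    ("010100020101", 1016.0),
    ("010100020101", 1270.0),
    ("010100020102", 508.0),
]

totals = {}
for huc, mm in cells:
    totals.setdefault(huc, []).append(mm / MM_PER_INCH)

zonal_mean = {huc: sum(v) / len(v) for huc, v in totals.items()}
print(zonal_mean)  # roughly 45 inches for the first HUC, 20 for the second
```

The order of operations matters only for presentation here: since mm-to-inches is a linear scaling, converting before or after averaging gives the same result.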

  13. Multi-site genetic analysis of diffusion images and voxelwise heritability analysis: A pilot project of the ENIGMA–DTI working group

    PubMed Central

    Jahanshad, Neda; Kochunov, Peter; Sprooten, Emma; Mandl, René C.; Nichols, Thomas E.; Almassy, Laura; Blangero, John; Brouwer, Rachel M.; Curran, Joanne E.; de Zubicaray, Greig I.; Duggirala, Ravi; Fox, Peter T.; Hong, L. Elliot; Landman, Bennett A.; Martin, Nicholas G.; McMahon, Katie L.; Medland, Sarah E.; Mitchell, Braxton D.; Olvera, Rene L.; Peterson, Charles P.; Starr, John M.; Sussmann, Jessika E.; Toga, Arthur W.; Wardlaw, Joanna M.; Wright, Margaret J.; Hulshoff Pol, Hilleke E.; Bastin, Mark E.; McIntosh, Andrew M.; Deary, Ian J.; Thompson, Paul M.; Glahn, David C.

    2013-01-01

    The ENIGMA (Enhancing NeuroImaging Genetics through Meta-Analysis) Consortium was set up to analyze brain measures and genotypes from multiple sites across the world to improve the power to detect genetic variants that influence the brain. Diffusion tensor imaging (DTI) yields quantitative measures sensitive to brain development and degeneration, and some common genetic variants may be associated with white matter integrity or connectivity. DTI measures, such as the fractional anisotropy (FA) of water diffusion, may be useful for identifying genetic variants that influence brain microstructure. However, genome-wide association studies (GWAS) require large populations to obtain sufficient power to detect and replicate significant effects, motivating a multi-site consortium effort. As part of an ENIGMA–DTI working group, we analyzed high-resolution FA images from multiple imaging sites across North America, Australia, and Europe, to address the challenge of harmonizing imaging data collected at multiple sites. Four hundred images of healthy adults aged 18–85 from four sites were used to create a template and corresponding skeletonized FA image as a common reference space. Using twin and pedigree samples of different ethnicities, we used our common template to evaluate the heritability of tract-derived FA measures. We show that our template is reliable for integrating multiple datasets by combining results through meta-analysis and unifying the data through exploratory mega-analyses. Our results may help prioritize regions of the FA map that are consistently influenced by additive genetic factors for future genetic discovery studies. Protocols and templates are publicly available at (http://enigma.loni.ucla.edu/ongoing/dti-working-group/). PMID:23629049
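
    Site-level results in such consortia are typically pooled by inverse-variance weighting; a minimal fixed-effects sketch (hypothetical per-site heritability estimates, not the ENIGMA-DTI pipeline itself):

```python
import math

def fixed_effects_meta(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se
```

    More precise sites (smaller standard errors) dominate the pooled estimate, which is the point of meta-analysis over simple averaging.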

  14. Crystal cryocooling distorts conformational heterogeneity in a model Michaelis complex of DHFR

    PubMed Central

    Keedy, Daniel A.; van den Bedem, Henry; Sivak, David A.; Petsko, Gregory A.; Ringe, Dagmar; Wilson, Mark A.; Fraser, James S.

    2014-01-01

    Most macromolecular X-ray structures are determined from cryocooled crystals, but it is unclear whether cryocooling distorts functionally relevant flexibility. Here we compare independently acquired pairs of high-resolution datasets of a model Michaelis complex of dihydrofolate reductase (DHFR), collected by separate groups at both room and cryogenic temperatures. These datasets allow us to isolate the differences between experimental procedures and between temperatures. Our analyses of multiconformer models and time-averaged ensembles suggest that cryocooling suppresses and otherwise modifies sidechain and mainchain conformational heterogeneity, quenching dynamic contact networks. Despite some idiosyncratic differences, most changes from room temperature to cryogenic temperature are conserved, and likely reflect temperature-dependent solvent remodeling. Both cryogenic datasets point to additional conformations not evident in the corresponding room-temperature datasets, suggesting that cryocooling does not merely trap pre-existing conformational heterogeneity. Our results demonstrate that crystal cryocooling consistently distorts the energy landscape of DHFR, a paragon for understanding functional protein dynamics. PMID:24882744

  15. A formal concept analysis approach to consensus clustering of multi-experiment expression data

    PubMed Central

    2014-01-01

    Background: Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results, since they are based on a larger number of samples and the effects of individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. One approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution, which increases confidence in the common features of all the datasets and reveals the important differences among them. Results: We propose a novel generic consensus clustering technique that applies a Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group, resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA, which allows valuable insights to be extracted from the data and a gene partition over all the experiments to be generated. In order to validate the FCA-enhanced approach, two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from a multi-experiment study examining the global cell-cycle control of fission yeast.
    The FCA results derived from both methods demonstrate that, although the two algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve relevant biological signals. Conclusions: The proposed FCA-enhanced consensus clustering technique is a general approach to combining clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce a good-quality clustering solution that is representative of the whole set of expression matrices. PMID:24885407
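
    The paper's FCA machinery is more involved, but the underlying idea of a consensus over several clusterings can be sketched with a standard co-association baseline (illustrative labels; not the authors' algorithm):

```python
def co_association(clusterings, n):
    """Fraction of clusterings in which items i and j share a cluster."""
    m = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(clusterings)
    return m

def consensus_groups(m, threshold=0.5):
    """Connected components of the thresholded co-association graph."""
    n = len(m)
    seen, groups = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], set()
        while stack:
            k = stack.pop()
            if k in comp:
                continue
            comp.add(k)
            stack.extend(j for j in range(n)
                         if m[k][j] >= threshold and j not in comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups
```

    Genes that co-cluster in most experiments end up in the same consensus group; the FCA step in the paper additionally exposes which experiments support each grouping.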

  16. Evaluation and comparison of bioinformatic tools for the enrichment analysis of metabolomics data.

    PubMed

    Marco-Ramell, Anna; Palau-Rodriguez, Magali; Alay, Ania; Tulipani, Sara; Urpi-Sarda, Mireia; Sanchez-Pla, Alex; Andres-Lacueva, Cristina

    2018-01-02

    Bioinformatic tools for the enrichment of 'omics' datasets facilitate the interpretation and understanding of data. To date, few are suitable for metabolomics datasets. The main objective of this work is to give a critical overview, for the first time, of the performance of these tools. To that aim, datasets from metabolomics repositories were selected and enriched versions of them were created. Both types of data were analysed with these tools and the outputs were thoroughly examined. An exploratory multivariate analysis of the most widely used tools for the enrichment of metabolite sets, based on non-metric multidimensional scaling (NMDS) of Jaccard distances, was performed and mirrored their diversity. Codes (identifiers) of the metabolites in the datasets were searched in different metabolite databases (HMDB, KEGG, PubChem, ChEBI, BioCyc/HumanCyc, LipidMAPS, ChemSpider, METLIN and Recon2). The databases containing the most identifiers for the metabolites in the datasets were PubChem, followed by METLIN and ChEBI; however, these databases contain duplicated entries and may yield false positives. The performance of over-representation analysis (ORA) tools, including BioCyc/HumanCyc, ConsensusPathDB, IMPaLA, MBRole, MetaboAnalyst, Metabox, MetExplore, MPEA, PathVisio and Reactome, and of the mapping tool KEGGREST, was examined. Results were mostly consistent among tools and between real and enriched data despite the variability of the tools. Nevertheless, a few discrepancies, such as differences in the total number of metabolites, were also found. Disease-based enrichment analyses were also assessed, but they were not found to be accurate, probably because metabolite disease sets are not up to date and because predicting diseases from a list of metabolites is difficult. We have extensively reviewed the state of the art of the available tools for metabolomics datasets, the completeness of metabolite databases, the performance of ORA methods and disease-based analyses.
    Despite their variability, the tools provided consistent results independent of their analytic approach. However, more work on the completeness of metabolite and pathway databases is required, as this strongly affects the accuracy of enrichment analyses. Such improvements will translate into more accurate and global insights into the metabolome.
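
    ORA tools of the kind surveyed here typically score a metabolite set against a pathway with a hypergeometric tail probability; a minimal sketch (generic counts, not any specific tool's implementation):

```python
from math import comb

def ora_pvalue(hits, selected, pathway_size, total):
    """Hypergeometric upper tail P(X >= hits): the chance of drawing at
    least `hits` pathway members when `selected` metabolites are drawn
    from a background of `total`, of which `pathway_size` are in the
    pathway."""
    denom = comb(total, selected)
    return sum(comb(pathway_size, k)
               * comb(total - pathway_size, selected - k)
               for k in range(hits, min(selected, pathway_size) + 1)) / denom
```

    The choice of background (`total`) is one of the main sources of the tool-to-tool variability the authors observe, since incomplete databases shrink the background and shift the p-values.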

  17. The impact of integrating WorldView-2 sensor and environmental variables in estimating plantation forest species aboveground biomass and carbon stocks in uMgeni Catchment, South Africa

    NASA Astrophysics Data System (ADS)

    Dube, Timothy; Mutanga, Onisimo

    2016-09-01

    Reliable and accurate mapping and extraction of key forest indicators of ecosystem development and health, such as aboveground biomass (AGB) and aboveground carbon stocks (AGCS), is critical to understanding the contribution of forests to the local, regional and global carbon cycle. This information is also essential in assessing the contribution of forests to ecosystem functioning and services, as well as their conservation status. This work assessed the applicability of the high-resolution 8-band WorldView-2 multispectral dataset, together with environmental variables, in quantifying AGB and aboveground carbon stocks for three plantation forest species, i.e. Eucalyptus dunii (ED), Eucalyptus grandis (EG) and Pinus taeda (PT), in uMgeni Catchment, South Africa. Specifically, the strength of the WorldView-2 sensor, in terms of its improved imaging capabilities, is examined both as an independent dataset and in conjunction with selected environmental variables. The results demonstrate that integrating the high-resolution 8-band WorldView-2 multispectral data with environmental variables provides improved AGB and AGCS estimates compared to using the spectral data alone. The integrated datasets yielded a high R2 value of 0.88 and RMSEs of 10.05 t ha-1 and 5.03 t C ha-1 for E. dunii AGB and carbon stocks, whereas the spectral data alone yielded slightly weaker results, with an R2 value of 0.73 and RMSEs of 18.57 t ha-1 and 9.29 t C ha-1. Similarly accurate results (R2 value of 0.73 and RMSE values of 27.30 t ha-1 and 13.65 t C ha-1) were observed for the estimation of inter-species AGB and carbon stocks. Overall, the findings of this work show that integrating new-generation multispectral datasets with environmental variables provides a robust toolset for the accurate and reliable retrieval of forest aboveground biomass and carbon stocks in densely forested terrestrial ecosystems.
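
    The reported R2 and RMSE figures are standard goodness-of-fit statistics; a minimal sketch of how they are computed (illustrative values, not the study's data):

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error of predictions against observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

    RMSE carries the units of the target (here t ha-1 or t C ha-1), while R2 is unitless, which is why both are usually reported together.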

  18. Applying a 2D based CAD scheme for detecting micro-calcification clusters using digital breast tomosynthesis images: an assessment

    NASA Astrophysics Data System (ADS)

    Park, Sang Cheol; Zheng, Bin; Wang, Xiao-Hui; Gur, David

    2008-03-01

    Digital breast tomosynthesis (DBT) has emerged as a promising imaging modality for screening mammography. However, visually detecting micro-calcification clusters depicted on DBT images is a difficult task. Computer-aided detection (CAD) schemes for detecting micro-calcification clusters depicted on mammograms can achieve high performance, and the use of CAD results can assist radiologists in detecting subtle micro-calcification clusters. In this study, we compared the performance of an available 2D-based CAD scheme with one that includes a new grouping and scoring method when applied to both projection and reconstructed DBT images. We selected a dataset of 96 DBT examinations acquired from 45 women. Each DBT image set included 11 low-dose projection images and a varying number of reconstructed image slices, ranging from 18 to 87. In this dataset, 20 true-positive micro-calcification clusters were visually detected on the projection images and 40 on the reconstructed images. We first applied the CAD scheme previously developed in our laboratory to the DBT dataset. We then tested a new grouping method that defines an independent cluster by grouping detections of the same cluster on different projection or reconstructed images, and compared four scoring methods to assess CAD performance. The maximum sensitivity levels observed across the different grouping and scoring methods were 70% and 88% for the projection and reconstructed images, with maximum false-positive rates of 4.0 and 15.9 per examination, respectively.
    This preliminary study demonstrates that (1) among the maximum, minimum, and average CAD-generated scores, using the maximum score of the grouped cluster regions achieved the highest performance; (2) the histogram-based scoring method is reasonably effective in reducing false-positive detections on the projection images, although overall CAD sensitivity there is lower due to the lower signal-to-noise ratio; and (3) CAD achieved both higher sensitivity and a higher false-positive rate (per examination) on the reconstructed images. We conclude that, without changing the detection threshold or performing pre-filtering to increase detection sensitivity, current CAD schemes developed and optimized for 2D mammograms perform relatively poorly on DBT; they need to be re-optimized using DBT datasets, and new grouping and scoring methods need to be incorporated if they are to be used on DBT examinations.
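
    The grouping-and-scoring idea, collapsing detections of the same physical cluster across projection or reconstructed images and keeping the maximum CAD score, can be sketched as follows (hypothetical cluster IDs and scores):

```python
def group_detections(detections):
    """Collapse per-image CAD hits into one entry per physical cluster,
    scored by the maximum CAD score across the images it appears in.

    detections: iterable of (cluster_id, image_index, score) tuples.
    """
    best = {}
    for cluster_id, _image, score in detections:
        if cluster_id not in best or score > best[cluster_id]:
            best[cluster_id] = score
    return best
```

    Taking the maximum (rather than the minimum or the average) is the choice the study found to perform best, since a cluster only needs to be conspicuous on one image to be real.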

  19. How Art Works: The National Endowment for the Arts' Five-Year Research Agenda, with a System Map and Measurement Model. Appendix A & B

    ERIC Educational Resources Information Center

    National Endowment for the Arts, 2012

    2012-01-01

    This paper presents two appendices supporting the "How Art Works: The National Endowment for the Arts' Five-Year Research Agenda, with a System Map and Measurement Model" report. In Appendix A, brief descriptions of relevant studies and datasets for each node in the "How Art Works" system map are presented. This appendix is meant to supply…

  20. Covariance estimation using conjugate gradient for 3D classification in cryo-EM.

    PubMed

    Andén, Joakim; Katsevich, Eugene; Singer, Amit

    2015-04-01

    Classifying structural variability in noisy projections of biological macromolecules is a central problem in Cryo-EM. In this work, we build on a previous method for estimating the covariance matrix of the three-dimensional structure present in the molecules being imaged. Our proposed method allows for incorporation of contrast transfer function and non-uniform distribution of viewing angles, making it more suitable for real-world data. We evaluate its performance on a synthetic dataset and an experimental dataset obtained by imaging a 70S ribosome complex.
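
    Conjugate gradient solves large symmetric positive-definite linear systems using only matrix-vector products, which is what makes it attractive when a covariance matrix is too large to form explicitly; a minimal sketch on a toy 2x2 system (not the cryo-EM estimator itself):

```python
def conjugate_gradient(apply_a, b, tol=1e-12, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given only the
    matrix-vector product apply_a; A is never formed explicitly."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(b)          # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        ap = apply_a(p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

    In the covariance-estimation setting, `apply_a` would encapsulate the projection, CTF, and backprojection operators applied to a candidate solution, so only their actions, not the full matrix, are ever needed.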
