Sample records for generated clinical datasets

  1. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.

    PubMed

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-07-02

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of these data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach that can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rule creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool reduces the time and effort of experts and knowledge engineers in creating unified datasets by 94.1% on average.
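
    As an illustration of the priority-based unification step described in this abstract, the following minimal Python sketch merges toy sources under a user-defined priority order; all column and source names are hypothetical and are not taken from the GUDM paper.

```python
# Minimal sketch of a priority-based unification step: when the same attribute
# appears in several sources, the value from the highest-priority source wins.
# Column and source names are hypothetical, not from the GUDM paper.
import pandas as pd

# Three toy sources sharing the patient identifier "pid".
clinical = pd.DataFrame({"pid": [1, 2], "glucose": [7.2, 8.1], "age": [54, 61]})
social   = pd.DataFrame({"pid": [1, 2], "age": [53, 61], "mood": ["low", "ok"]})
sensors  = pd.DataFrame({"pid": [1, 2], "steps": [4200, 8900]})

# User-defined priority: earlier sources override later ones on overlapping attributes.
sources_by_priority = [clinical, social, sensors]

unified = sources_by_priority[0].set_index("pid")
for src in sources_by_priority[1:]:
    src = src.set_index("pid")
    # combine_first keeps existing (higher-priority) values and fills the rest.
    unified = unified.combine_first(src)

print(unified.reset_index())
```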

  2. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare

    PubMed Central

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-01-01

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of these data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a “data modeler” tool. The proposed tool implements a user-centric, priority-based approach that can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rule creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool reduces the time and effort of experts and knowledge engineers in creating unified datasets by 94.1% on average. PMID:26147731

  3. Development of a Core Clinical Dataset to Characterize Serious Illness, Injuries, and Resource Requirements for Acute Medical Responses to Public Health Emergencies.

    PubMed

    Murphy, David J; Rubinson, Lewis; Blum, James; Isakov, Alexander; Bhagwanjee, Statish; Cairns, Charles B; Cobb, J Perren; Sevransky, Jonathan E

    2015-11-01

    In developed countries, public health systems have become adept at rapidly identifying the etiology and impact of public health emergencies. However, within the time course of clinical responses, shortfalls in readily analyzable patient-level data limit capabilities to understand clinical course, predict outcomes, ensure resource availability, and evaluate the effectiveness of diagnostic and therapeutic strategies for seriously ill and injured patients. To be useful in the timeline of a public health emergency, multi-institutional clinical investigation systems must be in place to rapidly collect, analyze, and disseminate detailed clinical information regarding patients across prehospital, emergency department, and acute care hospital settings, including ICUs. As an initial step to near real-time clinical learning during public health emergencies, we sought to develop an "all-hazards" core dataset to characterize serious illness and injuries and the resource requirements for acute medical response across the care continuum. A multidisciplinary panel of clinicians, public health professionals, and researchers with expertise in public health emergencies used a group consensus process that included regularly scheduled conference calls, electronic communications, and an in-person meeting to generate candidate variables. Candidate variables were then reviewed by the group to meet the competing criteria of utility and feasibility, resulting in the core dataset. The 40-member panel generated 215 candidate variables for potential dataset inclusion. The final dataset includes 140 patient-level variables in the domains of demographics and anthropometrics (7), prehospital (11), emergency department (13), diagnosis (8), severity of illness (54), medications and interventions (38), and outcomes (9). The resulting all-hazard core dataset for seriously ill and injured persons provides a foundation to facilitate rapid collection, analyses, and dissemination of information necessary for clinicians, public health officials, and policymakers to optimize public health emergency response. Further work is needed to validate the effectiveness of the dataset in a variety of emergency settings.

  4. WE-G-207-06: 3D Fluoroscopic Image Generation From Patient-Specific 4DCBCT-Based Motion Models Derived From Physical Phantom and Clinical Patient Images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dhou, S; Cai, W; Hurwitz, M

    2015-06-15

    Purpose: Respiratory-correlated cone-beam CT (4DCBCT) images acquired immediately prior to treatment have the potential to represent patient motion patterns and anatomy during treatment, including both intra- and inter-fractional changes. We develop a method to generate patient-specific motion models based on 4DCBCT images acquired with existing clinical equipment and to use them to generate time-varying volumetric images (3D fluoroscopic images) representing motion during treatment delivery. Methods: Motion models are derived by deformably registering each 4DCBCT phase to a reference phase, and performing principal component analysis (PCA) on the resulting displacement vector fields. 3D fluoroscopic images are estimated by optimizing the resulting PCA coefficients iteratively through comparison of the cone-beam projections simulating kV treatment imaging and digitally reconstructed radiographs generated from the motion model. Patient and physical phantom datasets are used to evaluate the method in terms of tumor localization error compared to manually defined ground truth positions. Results: 4DCBCT-based motion models were derived and used to generate 3D fluoroscopic images at treatment time. For the patient datasets, the average tumor localization error and the 95th percentile were 1.57 and 3.13, respectively, in subsets of four patient datasets. For the physical phantom datasets, the average tumor localization error and the 95th percentile were 1.14 and 2.78, respectively, in two datasets. 4DCBCT motion models are shown to perform well in the context of generating 3D fluoroscopic images due to their ability to reproduce anatomical changes at treatment time. Conclusion: This study showed the feasibility of deriving 4DCBCT-based motion models and using them to generate 3D fluoroscopic images at treatment time in real clinical settings. 4DCBCT-based motion models were found to account for the 3D non-rigid motion of the patient anatomy during treatment and have the potential to localize tumor and other patient anatomical structures at treatment time even when inter-fractional changes occur. This project was supported, in part, through a Master Research Agreement with Varian Medical Systems, Inc., Palo Alto, CA. The project was also supported, in part, by Award Number R21CA156068 from the National Cancer Institute.
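
    The PCA motion-model idea summarized above can be sketched generically as follows; array shapes and data are synthetic, and the optimization of the coefficients against kV projections is not reproduced.

```python
# Sketch of the generic PCA motion-model idea: flatten per-phase displacement
# vector fields (DVFs), run PCA, and rebuild a new DVF from a few coefficients.
# Shapes and data are synthetic; this is not the authors' implementation.
import numpy as np
from sklearn.decomposition import PCA

n_phases, nx, ny, nz = 10, 32, 32, 16               # 4DCBCT phases and grid size
dvfs = np.random.randn(n_phases, nx * ny * nz * 3)  # each row: one flattened 3-component DVF

pca = PCA(n_components=3)
coeffs = pca.fit_transform(dvfs)                    # per-phase PCA coefficients

# A candidate DVF for treatment time is the mean field plus a weighted sum of
# eigenvectors; in the paper the weights are optimized against kV projections.
w = np.array([0.5, -0.2, 0.1])
dvf_estimate = pca.mean_ + w @ pca.components_
dvf_estimate = dvf_estimate.reshape(nx, ny, nz, 3)
print(dvf_estimate.shape)
```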

  5. Effect of a Noise-Optimized Second-Generation Monoenergetic Algorithm on Image Noise and Conspicuity of Hypervascular Liver Tumors: An In Vitro and In Vivo Study.

    PubMed

    Marin, Daniele; Ramirez-Giraldo, Juan Carlos; Gupta, Sonia; Fu, Wanyi; Stinnett, Sandra S; Mileto, Achille; Bellini, Davide; Patel, Bhavik; Samei, Ehsan; Nelson, Rendon C

    2016-06-01

    The purpose of this study is to investigate whether the reduction in noise using a second-generation monoenergetic algorithm can improve the conspicuity of hypervascular liver tumors on dual-energy CT (DECT) images of the liver. An anthropomorphic liver phantom in three body sizes and iodine-containing inserts simulating hypervascular lesions was imaged with DECT and single-energy CT at various energy levels (80-140 kV). In addition, a retrospective clinical study was performed in 31 patients with 66 hypervascular liver tumors who underwent DECT during the late hepatic arterial phase. Datasets at energy levels ranging from 40 to 80 keV were reconstructed using first- and second-generation monoenergetic algorithms. Noise, tumor-to-liver contrast-to-noise ratio (CNR), and CNR with a noise constraint (CNRNC) set with a maximum noise increase of 50% were calculated and compared among the different reconstructed datasets. The maximum CNR for the second-generation monoenergetic algorithm, which was attained at 40 keV in both phantom and clinical datasets, was statistically significantly higher than the maximum CNR for the first-generation monoenergetic algorithm (p < 0.001) or single-energy CT acquisitions across a wide range of kilovoltage values. With the second-generation monoenergetic algorithm, the optimal CNRNC occurred at 55 keV, corresponding to lower energy levels compared with first-generation algorithm (predominantly at 70 keV). Patient body size did not substantially affect the selection of the optimal energy level to attain maximal CNR and CNRNC using the second-generation monoenergetic algorithm. A noise-optimized second-generation monoenergetic algorithm significantly improves the conspicuity of hypervascular liver tumors.
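
    A toy illustration of the reported metrics, the tumor-to-liver contrast-to-noise ratio (CNR) and the noise-constrained CNR, follows; all attenuation and noise values are invented for illustration only.

```python
# Toy illustration of the reported metrics: tumor-to-liver contrast-to-noise
# ratio (CNR) and CNR under a noise constraint (noise increase capped at 50%).
# HU values and noise levels are invented for illustration only.
import numpy as np

kev = np.array([40, 55, 70, 80])
tumor_hu = np.array([180.0, 140.0, 110.0, 95.0])   # hypothetical tumor attenuation
liver_hu = np.array([120.0, 100.0, 85.0, 78.0])    # hypothetical liver attenuation
noise    = np.array([30.0, 18.0, 12.0, 11.0])      # hypothetical image noise (SD)
noise_ref = 12.0                                   # reference noise, e.g. at 70 keV

cnr = (tumor_hu - liver_hu) / noise
allowed = noise <= 1.5 * noise_ref                 # noise constraint: <= +50% vs reference
cnr_nc = np.where(allowed, cnr, -np.inf)

print("best CNR    at", kev[np.argmax(cnr)], "keV")
print("best CNR-NC at", kev[np.argmax(cnr_nc)], "keV")
```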

  6. A curated compendium of monocyte transcriptome datasets of relevance to human monocyte immunobiology research

    PubMed Central

    Rinchai, Darawan; Boughorbel, Sabri; Presnell, Scott; Quinn, Charlie; Chaussabel, Damien

    2016-01-01

    Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp. PMID:27158452

  7. Standardized data sharing in a paediatric oncology research network--a proof-of-concept study.

    PubMed

    Hochedlinger, Nina; Nitzlnader, Michael; Falgenhauer, Markus; Welte, Stefan; Hayn, Dieter; Koumakis, Lefteris; Potamias, George; Tsiknakis, Manolis; Saraceno, Davide; Rinaldi, Eugenia; Ladenstein, Ruth; Schreier, Günter

    2015-01-01

    Data that have been collected in the course of clinical trials are potentially valuable for additional scientific research questions in so-called secondary use scenarios. This is of particular importance in rare disease areas like paediatric oncology. If data from several research projects need to be connected, so-called Core Datasets can be used to define which information needs to be extracted from every involved source system. In this work, the utility of the Clinical Data Interchange Standards Consortium (CDISC) Operational Data Model (ODM) as a format for Core Datasets was evaluated, and a web tool was developed that receives source ODM XML files and, via Extensible Stylesheet Language Transformation (XSLT), generates standardized Core Dataset ODM XML files. Using this tool, data from different source systems were extracted and pooled for joint analysis in a proof-of-concept study, facilitating both basic syntactic and semantic interoperability.
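
    A minimal sketch of the ODM-to-Core-Dataset transformation step using lxml is shown below; the file names and the stylesheet are placeholders, not the project's actual artifacts.

```python
# Minimal sketch of the ODM-to-Core-Dataset step: apply an XSLT stylesheet to a
# source ODM XML file with lxml. File names and the stylesheet are placeholders.
from lxml import etree

source_doc = etree.parse("source_odm.xml")          # ODM export from a source system
stylesheet = etree.parse("core_dataset.xslt")       # mapping to the Core Dataset structure

transform = etree.XSLT(stylesheet)
core_doc = transform(source_doc)

with open("core_dataset_odm.xml", "wb") as fh:
    fh.write(etree.tostring(core_doc, pretty_print=True,
                            xml_declaration=True, encoding="UTF-8"))
```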

  8. Nursing Theory, Terminology, and Big Data: Data-Driven Discovery of Novel Patterns in Archival Randomized Clinical Trial Data.

    PubMed

    Monsen, Karen A; Kelechi, Teresa J; McRae, Marion E; Mathiason, Michelle A; Martin, Karen S

    The growth and diversification of nursing theory, nursing terminology, and nursing data enable a convergence of theory- and data-driven discovery in the era of big data research. Existing datasets can be viewed through theoretical and terminology perspectives using visualization techniques in order to reveal new patterns and generate hypotheses. The Omaha System is a standardized terminology and metamodel that makes explicit the theoretical perspective of the nursing discipline and enables terminology-theory testing research. The purpose of this paper is to illustrate the approach by exploring a large research dataset consisting of 95 variables (demographics, temperature measures, anthropometrics, and standardized instruments measuring quality of life and self-efficacy) from a theory-based perspective using the Omaha System. Aims were to (a) examine the Omaha System dataset to understand the sample at baseline relative to Omaha System problem terms and outcome measures, (b) examine relationships within the normalized Omaha System dataset at baseline in predicting adherence, and (c) examine relationships within the normalized Omaha System dataset at baseline in predicting incident venous ulcer. Variables from a randomized clinical trial of a cryotherapy intervention for the prevention of venous ulcers were mapped onto Omaha System terms and measures to derive a theoretical framework for the terminology-theory testing study. The original dataset was recoded using the mapping to create an Omaha System dataset, which was then examined using visualization to generate hypotheses. The hypotheses were tested using standard inferential statistics. Logistic regression was used to predict adherence and incident venous ulcer. Findings revealed novel patterns in the psychosocial characteristics of the sample that were discovered to be drivers of both adherence (Mental health Behavior: OR = 1.28, 95% CI [1.02, 1.60]; AUC = .56) and incident venous ulcer (Mental health Behavior: OR = 0.65, 95% CI [0.45, 0.93]; Neuro-musculo-skeletal function Status: OR = 0.69, 95% CI [0.47, 1.00]; male: OR = 3.08, 95% CI [1.15, 8.24]; not married: OR = 2.70, 95% CI [1.00, 7.26]; AUC = .76). The Omaha System was employed as ontology, nursing theory, and terminology to bridge data and theory and may be considered a data-driven theorizing methodology. Novel findings suggest a relationship between psychosocial factors and incident venous ulcer outcomes. There is potential to employ this method in further research, which is needed to generate and test hypotheses from other datasets to extend scientific investigations from existing data.
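
    A generic sketch of the analysis pattern described above (logistic regression on mapped baseline variables, reported as odds ratios with an AUC) is given below on simulated data; the variable names only echo the Omaha System domains and are not the study's dataset.

```python
# Generic sketch: logistic regression on baseline variables, reported as odds
# ratios (OR) with 95% CIs and an AUC. Data are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "mental_health_behavior": rng.normal(size=n),
    "neuro_musculo_skeletal_status": rng.normal(size=n),
    "male": rng.integers(0, 2, size=n),
})
logit = 0.25 * df["mental_health_behavior"] - 0.4 * df["male"]
df["adherent"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(df[["mental_health_behavior", "neuro_musculo_skeletal_status", "male"]])
fit = sm.Logit(df["adherent"], X).fit(disp=False)

odds_ratios = np.exp(fit.params)                    # OR per predictor
ci = np.exp(fit.conf_int())                         # 95% CI on the OR scale
auc = roc_auc_score(df["adherent"], fit.predict(X))
print(odds_ratios.round(2), "\nAUC =", round(auc, 2))
```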

  9. Expert Consensus Contouring Guidelines for Intensity Modulated Radiation Therapy in Esophageal and Gastroesophageal Junction Cancer.

    PubMed

    Wu, Abraham J; Bosch, Walter R; Chang, Daniel T; Hong, Theodore S; Jabbour, Salma K; Kleinberg, Lawrence R; Mamon, Harvey J; Thomas, Charles R; Goodman, Karyn A

    2015-07-15

    Current guidelines for esophageal cancer contouring are derived from traditional 2-dimensional fields based on bony landmarks, and they do not provide sufficient anatomic detail to ensure consistent contouring for more conformal radiation therapy techniques such as intensity modulated radiation therapy (IMRT). Therefore, we convened an expert panel with the specific aim to derive contouring guidelines and generate an atlas for the clinical target volume (CTV) in esophageal or gastroesophageal junction (GEJ) cancer. Eight expert academically based gastrointestinal radiation oncologists participated. Three sample cases were chosen: a GEJ cancer, a distal esophageal cancer, and a mid-upper esophageal cancer. Uniform computed tomographic (CT) simulation datasets and accompanying diagnostic positron emission tomographic/CT images were distributed to each expert, and the expert was instructed to generate gross tumor volume (GTV) and CTV contours for each case. All contours were aggregated and subjected to quantitative analysis to assess the degree of concordance between experts and to generate draft consensus contours. The panel then refined these contours to generate the contouring atlas. The κ statistics indicated substantial agreement between panelists for each of the 3 test cases. A consensus CTV atlas was generated for the 3 test cases, each representing common anatomic presentations of esophageal cancer. The panel agreed on guidelines and principles to facilitate the generalizability of the atlas to individual cases. This expert panel successfully reached agreement on contouring guidelines for esophageal and GEJ IMRT and generated a reference CTV atlas. This atlas will serve as a reference for IMRT contours for clinical practice and prospective trial design. Subsequent patterns of failure analyses of clinical datasets using these guidelines may require modification in the future. Copyright © 2015 Elsevier Inc. All rights reserved.
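
    For illustration, pairwise Cohen's kappa over rasterized contour masks is one simple way to quantify inter-expert agreement; the masks below are random stand-ins, and the paper's exact concordance pipeline is not reproduced.

```python
# Illustrative agreement calculation for rasterized expert contours: pairwise
# Cohen's kappa over binary voxel masks. Masks here are random stand-ins.
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
n_experts, shape = 4, (32, 32, 16)
masks = [rng.random(shape) > 0.7 for _ in range(n_experts)]   # one CTV mask per expert

kappas = [
    cohen_kappa_score(a.ravel(), b.ravel())
    for a, b in combinations(masks, 2)
]
print("mean pairwise kappa:", round(float(np.mean(kappas)), 3))
```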

  10. Expert consensus contouring guidelines for IMRT in esophageal and gastroesophageal junction cancer

    PubMed Central

    Wu, Abraham J.; Bosch, Walter R.; Chang, Daniel T.; Hong, Theodore S.; Jabbour, Salma K.; Kleinberg, Lawrence R.; Mamon, Harvey J.; Thomas, Charles R.; Goodman, Karyn A.

    2015-01-01

    Purpose/Objective(s) Current guidelines for esophageal cancer contouring are derived from traditional two-dimensional fields based on bony landmarks, and do not provide sufficient anatomical detail to ensure consistent contouring for more conformal radiotherapy techniques such as intensity-modulated radiation therapy (IMRT). Therefore, we convened an expert panel with the specific aim to derive contouring guidelines and generate an atlas for the clinical target volume (CTV) in esophageal or gastroesophageal junction (GEJ) cancer. Methods and Materials Eight expert academically-based gastrointestinal radiation oncologists participated. Three sample cases were chosen: a GEJ cancer, a distal esophageal cancer, and a mid-upper esophageal cancer. Uniform CT simulation datasets and an accompanying diagnostic PET-CT were distributed to each expert, and he/she was instructed to generate gross tumor volume (GTV) and CTV contours for each case. All contours were aggregated and subjected to quantitative analysis to assess the degree of concordance between experts and generate draft consensus contours. The panel then refined these contours to generate the contouring atlas. Results Kappa statistics indicated substantial agreement between panelists for each of the three test cases. A consensus CTV atlas was generated for the three test cases, each representing common anatomic presentations of esophageal cancer. The panel agreed on guidelines and principles to facilitate the generalizability of the atlas to individual cases. Conclusions This expert panel successfully reached agreement on contouring guidelines for esophageal and GEJ IMRT and generated a reference CTV atlas. This atlas will serve as a reference for IMRT contours for clinical practice and prospective trial design. Subsequent patterns of failure analyses of clinical datasets utilizing these guidelines may require modification in the future. PMID:26104943

  11. Expert Consensus Contouring Guidelines for Intensity Modulated Radiation Therapy in Esophageal and Gastroesophageal Junction Cancer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wu, Abraham J., E-mail: wua@mskcc.org; Bosch, Walter R.; Chang, Daniel T.

    Purpose/Objective(s): Current guidelines for esophageal cancer contouring are derived from traditional 2-dimensional fields based on bony landmarks, and they do not provide sufficient anatomic detail to ensure consistent contouring for more conformal radiation therapy techniques such as intensity modulated radiation therapy (IMRT). Therefore, we convened an expert panel with the specific aim to derive contouring guidelines and generate an atlas for the clinical target volume (CTV) in esophageal or gastroesophageal junction (GEJ) cancer. Methods and Materials: Eight expert academically based gastrointestinal radiation oncologists participated. Three sample cases were chosen: a GEJ cancer, a distal esophageal cancer, and a mid-upper esophageal cancer. Uniform computed tomographic (CT) simulation datasets and accompanying diagnostic positron emission tomographic/CT images were distributed to each expert, and the expert was instructed to generate gross tumor volume (GTV) and CTV contours for each case. All contours were aggregated and subjected to quantitative analysis to assess the degree of concordance between experts and to generate draft consensus contours. The panel then refined these contours to generate the contouring atlas. Results: The κ statistics indicated substantial agreement between panelists for each of the 3 test cases. A consensus CTV atlas was generated for the 3 test cases, each representing common anatomic presentations of esophageal cancer. The panel agreed on guidelines and principles to facilitate the generalizability of the atlas to individual cases. Conclusions: This expert panel successfully reached agreement on contouring guidelines for esophageal and GEJ IMRT and generated a reference CTV atlas. This atlas will serve as a reference for IMRT contours for clinical practice and prospective trial design. Subsequent patterns of failure analyses of clinical datasets using these guidelines may require modification in the future.

  12. Supervised learning based multimodal MRI brain tumour segmentation using texture features from supervoxels.

    PubMed

    Soltaninejad, Mohammadreza; Yang, Guang; Lambrou, Tryphon; Allinson, Nigel; Jones, Timothy L; Barrick, Thomas R; Howe, Franklyn A; Ye, Xujiong

    2018-04-01

    Accurate segmentation of brain tumour in magnetic resonance images (MRI) is a difficult task due to various tumour types. Using information and features from multimodal MRI including structural MRI and isotropic (p) and anisotropic (q) components derived from the diffusion tensor imaging (DTI) may result in a more accurate analysis of brain images. We propose a novel 3D supervoxel based learning method for segmentation of tumour in multimodal MRI brain images (conventional MRI and DTI). Supervoxels are generated using the information across the multimodal MRI dataset. For each supervoxel, a variety of features including histograms of texton descriptor, calculated using a set of Gabor filters with different sizes and orientations, and first order intensity statistical features are extracted. Those features are fed into a random forests (RF) classifier to classify each supervoxel into tumour core, oedema or healthy brain tissue. The method is evaluated on two datasets: 1) Our clinical dataset: 11 multimodal images of patients and 2) BRATS 2013 clinical dataset: 30 multimodal images. For our clinical dataset, the average detection sensitivity of tumour (including tumour core and oedema) using multimodal MRI is 86% with balanced error rate (BER) 7%; while the Dice score for automatic tumour segmentation against ground truth is 0.84. The corresponding results of the BRATS 2013 dataset are 96%, 2% and 0.89, respectively. The method demonstrates promising results in the segmentation of brain tumour. Adding features from multimodal MRI images can largely increase the segmentation accuracy. The method provides a close match to expert delineation across all tumour grades, leading to a faster and more reproducible method of brain tumour detection and delineation to aid patient management. Copyright © 2018 Elsevier B.V. All rights reserved.
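
    A simplified sketch of the classification stage (per-supervoxel first-order intensity statistics fed to a random forest) is given below on synthetic data; texton/Gabor features and the supervoxel generation itself are omitted.

```python
# Simplified version of the classification stage: per-supervoxel first-order
# intensity statistics fed to a random forest. Supervoxel labels, images and
# tissue labels are synthetic; texton/Gabor features are omitted.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64, 32))                 # one MRI modality
supervoxels = rng.integers(0, 200, size=image.shape)  # stand-in supervoxel label map

def supervoxel_features(img, labels):
    feats = []
    for sv in np.unique(labels):
        vals = img[labels == sv]
        feats.append([vals.mean(), vals.std(), vals.min(), vals.max()])
    return np.asarray(feats)

X = supervoxel_features(image, supervoxels)
y = rng.integers(0, 3, size=X.shape[0])               # 0=healthy, 1=oedema, 2=core (toy labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```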

  13. A hybrid approach for fusing 4D-MRI temporal information with 3D-CT for the study of lung and lung tumor motion.

    PubMed

    Yang, Y X; Teo, S-K; Van Reeth, E; Tan, C H; Tham, I W K; Poh, C L

    2015-08-01

    Accurate visualization of lung motion is important in many clinical applications, such as radiotherapy of lung cancer. Advancement in imaging modalities [e.g., computed tomography (CT) and MRI] has allowed dynamic imaging of lung and lung tumor motion. However, each imaging modality has its advantages and disadvantages. The study presented in this paper aims at generating synthetic 4D-CT dataset for lung cancer patients by combining both continuous three-dimensional (3D) motion captured by 4D-MRI and the high spatial resolution captured by CT using the authors' proposed approach. A novel hybrid approach based on deformable image registration (DIR) and finite element method simulation was developed to fuse a static 3D-CT volume (acquired under breath-hold) and the 3D motion information extracted from 4D-MRI dataset, creating a synthetic 4D-CT dataset. The study focuses on imaging of lung and lung tumor. Comparing the synthetic 4D-CT dataset with the acquired 4D-CT dataset of six lung cancer patients based on 420 landmarks, accurate results (average error <2 mm) were achieved using the authors' proposed approach. Their hybrid approach achieved a 40% error reduction (based on landmarks assessment) over using only DIR techniques. The synthetic 4D-CT dataset generated has high spatial resolution, has excellent lung details, and is able to show movement of lung and lung tumor over multiple breathing cycles.
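
    The core operation behind any DIR-based fusion, warping a static CT volume with a displacement vector field, can be sketched as follows; the field here is random and purely illustrative.

```python
# Warping a static 3D-CT volume with a displacement vector field (toy data).
import numpy as np
from scipy.ndimage import map_coordinates

ct = np.random.randn(64, 64, 32)                    # breath-hold 3D-CT (toy data)
dvf = 0.5 * np.random.randn(3, *ct.shape)           # displacement field, e.g. from 4D-MRI (toy)

grid = np.indices(ct.shape).astype(float)           # voxel coordinates, shape (3, x, y, z)
warped = map_coordinates(ct, grid + dvf, order=1, mode="nearest")
print(warped.shape)                                 # one synthetic phase of the 4D-CT
```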

  14. Isosemantic rendering of clinical information using formal ontologies and RDF.

    PubMed

    Martínez-Costa, Catalina; Bosca, Diego; Legaz-García, Mari Carmen; Tao, Cui; Fernández Breis, Jesualdo Tomás; Schulz, Stefan; Chute, Christopher G

    2013-01-01

    The generation of a semantic clinical infostructure requires linking ontologies, clinical models and terminologies [1]. Here we describe an approach that would permit data coming from different sources and represented in different standards to be queried in a homogeneous and integrated way. Our assumption is that data providers should be able to agree and share the meaning of the data they want to exchange and to exploit. We will describe how Clinical Element Model (CEM) and OpenEHR datasets can be jointly exploited in Semantic Web environments.
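
    A tiny, hypothetical illustration of representing a clinical observation in RDF and querying it with SPARQL is shown below; the namespace and property names are invented placeholders, not the CEM or openEHR bindings used by the authors.

```python
# Representing a clinical observation in RDF and querying it with SPARQL.
# The namespace and property names are invented placeholders.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/clinical#")
g = Graph()
patient = URIRef("http://example.org/patient/123")

g.add((patient, RDF.type, EX.Patient))
g.add((patient, EX.hasSystolicBP, Literal(142, datatype=XSD.integer)))

query = """
PREFIX ex: <http://example.org/clinical#>
SELECT ?p ?bp WHERE { ?p a ex:Patient ; ex:hasSystolicBP ?bp . }
"""
for row in g.query(query):
    print(row.p, row.bp)
```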

  15. Automatic brain tissue segmentation based on graph filter.

    PubMed

    Kong, Youyong; Chen, Xiaopeng; Wu, Jiasong; Zhang, Pinzheng; Chen, Yang; Shu, Huazhong

    2018-05-09

    Accurate segmentation of brain tissues from magnetic resonance imaging (MRI) is of significant importance in clinical applications and neuroscience research. Accurate segmentation is challenging due to tissue heterogeneity, which is caused by noise, bias field and partial volume effects. To address these challenges, this paper presents a novel algorithm for brain tissue segmentation based on supervoxels and graph filtering. Firstly, an effective supervoxel method is employed to generate supervoxels for the 3D MRI image. Secondly, the supervoxels are classified into different types of tissue based on filtering of graph signals. The performance is evaluated on the BrainWeb 18 dataset and the Internet Brain Segmentation Repository (IBSR) 18 dataset. The proposed method achieves a mean Dice similarity coefficient (DSC) of 0.94, 0.92 and 0.90 for the segmentation of white matter (WM), grey matter (GM) and cerebrospinal fluid (CSF) on the BrainWeb 18 dataset, and a mean DSC of 0.85, 0.87 and 0.57 for the segmentation of WM, GM and CSF on the IBSR 18 dataset. The proposed approach discriminates well between different types of brain tissue in brain MRI and has high potential for clinical application.
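
    The evaluation metric reported above, the Dice similarity coefficient, is straightforward to compute; the masks below are toy arrays.

```python
# Dice similarity coefficient (DSC) between a predicted binary tissue mask and
# a ground-truth mask (toy arrays).
import numpy as np

def dice(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * inter / denom if denom else 1.0

rng = np.random.default_rng(0)
pred  = rng.random((64, 64, 32)) > 0.5
truth = rng.random((64, 64, 32)) > 0.5
print("DSC =", round(dice(pred, truth), 3))
```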

  16. Nuclear Receptor Signaling Atlas: Opening Access to the Biology of Nuclear Receptor Signaling Pathways.

    PubMed

    Becnel, Lauren B; Darlington, Yolanda F; Ochsner, Scott A; Easton-Marks, Jeremy R; Watkins, Christopher M; McOwiti, Apollo; Kankanamge, Wasula H; Wise, Michael W; DeHart, Michael; Margolis, Ronald N; McKenna, Neil J

    2015-01-01

    Signaling pathways involving nuclear receptors (NRs), their ligands and coregulators, regulate tissue-specific transcriptomes in diverse processes, including development, metabolism, reproduction, the immune response and neuronal function, as well as in their associated pathologies. The Nuclear Receptor Signaling Atlas (NURSA) is a Consortium focused around a Hub website (www.nursa.org) that annotates and integrates diverse 'omics datasets originating from the published literature and NURSA-funded Data Source Projects (NDSPs). These datasets are then exposed to the scientific community on an Open Access basis through user-friendly data browsing and search interfaces. Here, we describe the redesign of the Hub, version 3.0, to deploy "Web 2.0" technologies and add richer, more diverse content. The Molecule Pages, which aggregate information relevant to NR signaling pathways from myriad external databases, have been enhanced to include resources for basic scientists, such as post-translational modification sites and targeting miRNAs, and for clinicians, such as clinical trials. A portal to NURSA's Open Access, PubMed-indexed journal Nuclear Receptor Signaling has been added to facilitate manuscript submissions. Datasets and information on reagents generated by NDSPs are available, as is information concerning periodic new NDSP funding solicitations. Finally, the new website integrates the Transcriptomine analysis tool, which allows for mining of millions of richly annotated public transcriptomic data points in the field, providing an environment for dataset re-use and citation, bench data validation and hypothesis generation. We anticipate that this new release of the NURSA database will have tangible, long term benefits for both basic and clinical research in this field.

  17. SurvExpress: an online biomarker validation tool and database for cancer gene expression data using survival analysis.

    PubMed

    Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G; Treviño, Victor

    2013-01-01

    Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature along datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed in http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R.
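
    Below is a sketch of the kind of analysis SurvExpress automates, fitting a Cox model to a multigene prognostic index and comparing median-split risk groups; the data are simulated and the Python lifelines package is used here rather than the site's R backend.

```python
# Cox regression on a simple multigene prognostic index plus a log-rank test
# between median-split risk groups. Data are simulated for illustration.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
n = 200
expr = rng.normal(size=(n, 3))                      # expression of a 3-gene biomarker
risk_score = expr @ np.array([0.8, -0.5, 0.3])      # simple prognostic index
time = rng.exponential(scale=np.exp(-0.3 * risk_score) * 24)
event = rng.binomial(1, 0.7, size=n)

df = pd.DataFrame({"time": time, "event": event, "risk_score": risk_score})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()

high = df["risk_score"] > df["risk_score"].median()
res = logrank_test(df.loc[high, "time"], df.loc[~high, "time"],
                   df.loc[high, "event"], df.loc[~high, "event"])
print("log-rank p =", res.p_value)
```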

  18. SurvExpress: An Online Biomarker Validation Tool and Database for Cancer Gene Expression Data Using Survival Analysis

    PubMed Central

    Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G.; Treviño, Victor

    2013-01-01

    Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature along datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed in http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R. PMID:24066126

  19. The SysteMHC Atlas project

    PubMed Central

    Shao, Wenguang; Pedrioli, Patrick G A; Wolski, Witold; Scurtescu, Cristian; Schmid, Emanuel; Courcelles, Mathieu; Schuster, Heiko; Kowalewski, Daniel; Marino, Fabio; Arlehamn, Cecilia S L; Vaughan, Kerrie; Peters, Bjoern; Sette, Alessandro; Ottenhoff, Tom H M; Meijgaarden, Krista E; Nieuwenhuizen, Natalie; Kaufmann, Stefan H E; Schlapbach, Ralph; Castle, John C; Nesvizhskii, Alexey I; Nielsen, Morten; Deutsch, Eric W; Campbell, David S; Moritz, Robert L; Zubarev, Roman A; Ytterberg, Anders Jimmy; Purcell, Anthony W; Marcilla, Miguel; Paradela, Alberto; Wang, Qi; Costello, Catherine E; Ternette, Nicola; van Veelen, Peter A; van Els, Cécile A C M; de Souza, Gustavo A; Sollid, Ludvig M; Admon, Arie; Stevanovic, Stefan; Rammensee, Hans-Georg; Thibault, Pierre; Perreault, Claude; Bassani-Sternberg, Michal

    2018-01-01

    Mass spectrometry (MS)-based immunopeptidomics investigates the repertoire of peptides presented at the cell surface by major histocompatibility complex (MHC) molecules. The broad clinical relevance of MHC-associated peptides, e.g. in precision medicine, provides a strong rationale for the large-scale generation of immunopeptidomic datasets and recent developments in MS-based peptide analysis technologies now support the generation of the required data. Importantly, the availability of diverse immunopeptidomic datasets has resulted in an increasing need to standardize, store and exchange this type of data to enable better collaborations among researchers, to advance the field more efficiently and to establish quality measures required for the meaningful comparison of datasets. Here we present the SysteMHC Atlas (https://systemhcatlas.org), a public database that aims at collecting, organizing, sharing, visualizing and exploring immunopeptidomic data generated by MS. The Atlas includes raw mass spectrometer output files collected from several laboratories around the globe, a catalog of context-specific datasets of MHC class I and class II peptides, standardized MHC allele-specific peptide spectral libraries consisting of consensus spectra calculated from repeat measurements of the same peptide sequence, and links to other proteomics and immunology databases. The SysteMHC Atlas project was created and will be further expanded using a uniform and open computational pipeline that controls the quality of peptide identifications and peptide annotations. Thus, the SysteMHC Atlas disseminates quality controlled immunopeptidomic information to the public domain and serves as a community resource toward the generation of a high-quality comprehensive map of the human immunopeptidome and the support of consistent measurement of immunopeptidomic sample cohorts. PMID:28985418

  20. External validation of a publicly available computer assisted diagnostic tool for mammographic mass lesions with two high prevalence research datasets.

    PubMed

    Benndorf, Matthias; Burnside, Elizabeth S; Herda, Christoph; Langer, Mathias; Kotter, Elmar

    2015-08-01

    Lesions detected at mammography are described with a highly standardized terminology: the breast imaging-reporting and data system (BI-RADS) lexicon. Up to now, no validated semantic computer assisted classification algorithm exists to interactively link combinations of morphological descriptors from the lexicon to a probabilistic risk estimate of malignancy. The authors therefore aim at the external validation of the mammographic mass diagnosis (MMassDx) algorithm. A classification algorithm like MMassDx must perform well in a variety of clinical circumstances and in datasets that were not used to generate the algorithm in order to ultimately become accepted in clinical routine. The MMassDx algorithm uses a naïve Bayes network and calculates post-test probabilities of malignancy based on two distinct sets of variables, (a) BI-RADS descriptors and age ("descriptor model") and (b) BI-RADS descriptors, age, and BI-RADS assessment categories ("inclusive model"). The authors evaluate both the MMassDx (descriptor) and MMassDx (inclusive) models using two large publicly available datasets of mammographic mass lesions: the digital database for screening mammography (DDSM) dataset, which contains two subsets from the same examinations-a medio-lateral oblique (MLO) view and cranio-caudal (CC) view dataset-and the mammographic mass (MM) dataset. The DDSM contains 1220 mass lesions and the MM dataset contains 961 mass lesions. The authors evaluate discriminative performance using area under the receiver-operating-characteristic curve (AUC) and compare this to the BI-RADS assessment categories alone (i.e., the clinical performance) using the DeLong method. The authors also evaluate whether assigned probabilistic risk estimates reflect the lesions' true risk of malignancy using calibration curves. The authors demonstrate that the MMassDx algorithms show good discriminatory performance. AUC for the MMassDx (descriptor) model in the DDSM data is 0.876/0.895 (MLO/CC view) and AUC for the MMassDx (inclusive) model in the DDSM data is 0.891/0.900 (MLO/CC view). AUC for the MMassDx (descriptor) model in the MM data is 0.862 and AUC for the MMassDx (inclusive) model in the MM data is 0.900. In all scenarios, MMassDx performs significantly better than clinical performance, P < 0.05 each. The authors furthermore demonstrate that the MMassDx algorithm systematically underestimates the risk of malignancy in the DDSM and MM datasets, especially when low probabilities of malignancy are assigned. The authors' results reveal that the MMassDx algorithms have good discriminatory performance but less accurate calibration when tested on two independent validation datasets. Improvement in calibration and testing in a prospective clinical population will be important steps in the pursuit of translation of these algorithms to the clinic.
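
    A generic sketch of a naive Bayes classifier over categorical BI-RADS-style descriptors with an AUC check, mirroring the kind of evaluation described, is shown below; the descriptor coding and data are invented, and this is not MMassDx itself.

```python
# Naive Bayes over categorical descriptors plus an AUC check. The descriptor
# coding and data are invented for illustration; this is not MMassDx.
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(0, 4, n),     # e.g. mass shape category
    rng.integers(0, 5, n),     # e.g. mass margin category
    rng.integers(0, 3, n),     # e.g. age group
])
p_malig = 1 / (1 + np.exp(-(0.6 * X[:, 1] - 2.0)))   # simulated risk tied to margin
y = rng.binomial(1, p_malig)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
nb = CategoricalNB().fit(X_tr, y_tr)
probs = nb.predict_proba(X_te)[:, 1]
print("AUC =", round(roc_auc_score(y_te, probs), 3))
```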

  1. Semantics-based plausible reasoning to extend the knowledge coverage of medical knowledge bases for improved clinical decision support.

    PubMed

    Mohammadhassanzadeh, Hossein; Van Woensel, William; Abidi, Samina Raza; Abidi, Syed Sibte Raza

    2017-01-01

    Capturing complete medical knowledge is challenging, often due to incomplete patient Electronic Health Records (EHR), but also because of valuable, tacit medical knowledge hidden away in physicians' experiences. To extend the coverage of incomplete medical knowledge-based systems beyond their deductive closure, and thus enhance their decision-support capabilities, we argue that innovative, multi-strategy reasoning approaches should be applied. In particular, plausible reasoning mechanisms apply patterns from human thought processes, such as generalization, similarity and interpolation, based on attributional, hierarchical, and relational knowledge. Plausible reasoning mechanisms include inductive reasoning, which generalizes the commonalities among the data to induce new rules, and analogical reasoning, which is guided by data similarities to infer new facts. By further leveraging rich, biomedical Semantic Web ontologies to represent medical knowledge, both known and tentative, we increase the accuracy and expressivity of plausible reasoning, and cope with issues such as data heterogeneity, inconsistency and interoperability. In this paper, we present a Semantic Web-based, multi-strategy reasoning approach, which integrates deductive and plausible reasoning and exploits Semantic Web technology to solve complex clinical decision support queries. We evaluated our system using a real-world medical dataset of patients with hepatitis, from which we randomly removed different percentages of data (5%, 10%, 15%, and 20%) to reflect scenarios with increasing amounts of incomplete medical knowledge. To increase the reliability of the results, we generated 5 independent datasets for each percentage of missing values, which resulted in 20 experimental datasets (in addition to the original dataset). The results show that plausibly inferred knowledge extends the coverage of the knowledge base by, on average, 2%, 7%, 12%, and 16% for datasets with, respectively, 5%, 10%, 15%, and 20% of missing values. This expansion in the knowledge base coverage allowed solving complex disease diagnostic queries that were previously unresolvable, without losing the correctness of the answers. However, compared to deductive reasoning, data-intensive plausible reasoning mechanisms yield a significant performance overhead. We observed that plausible reasoning approaches, by generating tentative inferences and leveraging domain knowledge of experts, allow us to extend the coverage of medical knowledge bases, resulting in improved clinical decision support. Second, by leveraging OWL ontological knowledge, we are able to increase the expressivity and accuracy of plausible reasoning methods. Third, our approach is applicable to clinical decision support systems for a range of chronic diseases.

  2. Lessons learned in the generation of biomedical research datasets using Semantic Open Data technologies.

    PubMed

    Legaz-García, María del Carmen; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2015-01-01

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources. Such heterogeneity makes not only the generation of research-oriented datasets but also their exploitation difficult. In recent years, the Open Data paradigm has proposed new ways of making data available so that sharing and integration are facilitated. Open Data approaches may produce content readable only by humans or by both humans and machines; the latter is of interest in our work. The Semantic Web provides a natural technological space for data integration and exploitation and offers a range of technologies for generating not only Open Datasets but also Linked Datasets, that is, open datasets linked to other open datasets. According to Berners-Lee's classification, each open dataset can be given a rating of between one and five stars. In recent years, we have developed and applied our SWIT tool, which automates the generation of semantic datasets from heterogeneous data sources. SWIT produces four-star datasets; the fifth star can be obtained by having the dataset linked from external ones. In this paper, we describe how we have applied the tool in two projects related to health care records and orthology data, as well as the major lessons learned from such efforts.

  3. Iterative dataset optimization in automated planning: Implementation for breast and rectal cancer radiotherapy.

    PubMed

    Fan, Jiawei; Wang, Jiazhou; Zhang, Zhen; Hu, Weigang

    2017-06-01

    To develop a new automated treatment planning solution for breast and rectal cancer radiotherapy. The automated treatment planning solution developed in this study includes selection of an iteratively optimized training dataset, dose volume histogram (DVH) prediction for the organs at risk (OARs), and automatic generation of clinically acceptable treatment plans. The training dataset is selected by iterative optimization from 40 treatment plans for left-breast and rectal cancer patients who received radiation therapy. A two-dimensional kernel density estimation algorithm (denoted two-parameter KDE), which incorporates two predictive features, was implemented to produce the predicted DVHs. Finally, 10 additional left-breast treatment plans are re-planned using the Pinnacle 3 Auto-Planning (AP) module (version 9.10, Philips Medical Systems) with the objective functions derived from the predicted DVH curves. Automatically generated re-optimized treatment plans are compared with the original manually optimized plans. By combining the iteratively optimized training dataset methodology and the two-parameter KDE prediction algorithm, our proposed automated planning strategy improves the accuracy of the DVH prediction. The automatically generated treatment plans using the dose derived from the predicted DVHs can achieve better dose sparing for some OARs without compromising other metrics of plan quality. The proposed new automated treatment planning solution can be used to efficiently evaluate and improve the quality and consistency of the treatment plans for intensity-modulated breast and rectal cancer radiation therapy. © 2017 American Association of Physicists in Medicine.
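
    A loose analogue of feature-based DVH prediction (not the paper's exact two-parameter KDE implementation) is sketched below: training-plan DVHs are weighted by a Gaussian kernel over two geometric features of a new case; all numbers are synthetic.

```python
# Kernel-weighted DVH prediction from two geometric features (loose analogue,
# not the paper's two-parameter KDE). All numbers are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_bins = 40, 100
features = rng.uniform(0, 1, size=(n_train, 2))     # e.g. OAR-target distance, overlap volume
dvhs = np.sort(rng.uniform(0, 1, size=(n_train, n_bins)))[:, ::-1]  # monotone toy DVH curves

def predict_dvh(new_feat, bandwidth=0.15):
    d2 = np.sum((features - new_feat) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    w /= w.sum()
    return w @ dvhs                                  # kernel-weighted mean DVH

predicted = predict_dvh(np.array([0.4, 0.6]))
print(predicted[:5])
```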

  4. Residential load and rooftop PV generation: an Australian distribution network dataset

    NASA Astrophysics Data System (ADS)

    Ratnam, Elizabeth L.; Weller, Steven R.; Kellett, Christopher M.; Murray, Alan T.

    2017-09-01

    Despite the rapid uptake of small-scale solar photovoltaic (PV) systems in recent years, public availability of generation and load data at the household level remains very limited. Moreover, such data are typically measured using bi-directional meters recording only PV generation in excess of residential load rather than recording generation and load separately. In this paper, we report a publicly available dataset consisting of load and rooftop PV generation for 300 de-identified residential customers in an Australian distribution network, with load centres covering metropolitan Sydney and surrounding regional areas. The dataset spans a 3-year period, with separately reported measurements of load and PV generation at 30-min intervals. Following a detailed description of the dataset, we describe several means by which anomalous records (e.g. due to inverter failure) are identified and excised. With the resulting 'clean' dataset, we identify key customer-specific and aggregated characteristics of rooftop PV generation and residential load.

  5. Using network projections to explore co-incidence and context in large clinical datasets: Application to homelessness among U.S. Veterans.

    PubMed

    Pettey, Warren B P; Toth, Damon J A; Redd, Andrew; Carter, Marjorie E; Samore, Matthew H; Gundlapalli, Adi V

    2016-06-01

    Network projections of data can provide an efficient format for data exploration of co-incidence in large clinical datasets. We present and explore the utility of a network projection approach to finding patterns in health care data that could be exploited to prevent homelessness among U.S. Veterans. We divided Veteran ICD-9-CM (ICD9) data into two time periods (0-59 and 60-364 days prior to the first evidence of homelessness) and then used Pajek social network analysis software to visualize these data as three different networks. A multi-relational network simultaneously displayed the magnitude of ties between the most frequent ICD9 pairings. A new association network visualized ICD9 pairings that greatly increased or decreased. A signed, subtraction network visualized the presence, absence, and magnitude difference between ICD9 associations by time period. A cohort of 9468 U.S. Veterans was identified as having administrative evidence of homelessness and visits in both time periods. They were seen in 222,599 outpatient visits that generated 484,339 ICD9 codes (average of 11.4 (range 1-23) visits and 2.2 (range 1-60) ICD9 codes per visit). Using the three network projection methods, we were able to show distinct differences in the pattern of co-morbidities in the two time periods. In the more distant time period preceding homelessness, the network was dominated by routine health maintenance visits and physical ailment diagnoses. In the 59 days immediately prior to the homelessness identification, alcohol-related diagnoses were noted, along with economic circumstances such as unemployment, legal issues, and housing instability. Network visualizations of large clinical datasets traditionally treated as tabular and difficult to manipulate reveal rich, previously hidden connections between data variables related to homelessness. A key feature is the ability to visualize changes in variables with temporality and in proximity to the event of interest. These visualizations lend support to cognitive tasks such as exploration of large clinical datasets as a prelude to hypothesis generation. Published by Elsevier Inc.
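
    A sketch of the co-incidence projection idea, counting how often pairs of ICD-9 codes appear in the same visit and building a weighted graph, is shown below; the codes and visits are toy examples, and networkx is used here instead of Pajek.

```python
# Build a weighted co-occurrence graph of ICD-9 codes appearing in the same
# visit. Codes and visits are toy examples; the study used Pajek, not networkx.
from itertools import combinations
from collections import Counter
import networkx as nx

visits = [
    ["303.90", "V60.0"],             # toy visit: two codes recorded together
    ["303.90", "309.81", "V60.0"],
    ["401.9", "V70.0"],
]

pair_counts = Counter()
for codes in visits:
    pair_counts.update(combinations(sorted(set(codes)), 2))

G = nx.Graph()
for (a, b), w in pair_counts.items():
    G.add_edge(a, b, weight=w)

for a, b, data in G.edges(data=True):
    print(a, "--", b, "weight", data["weight"])
```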

  6. A hybrid approach for fusing 4D-MRI temporal information with 3D-CT for the study of lung and lung tumor motion

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yang, Y. X.; Van Reeth, E.; Poh, C. L., E-mail: clpoh@ntu.edu.sg

    2015-08-15

    Purpose: Accurate visualization of lung motion is important in many clinical applications, such as radiotherapy of lung cancer. Advancement in imaging modalities [e.g., computed tomography (CT) and MRI] has allowed dynamic imaging of lung and lung tumor motion. However, each imaging modality has its advantages and disadvantages. The study presented in this paper aims at generating synthetic 4D-CT dataset for lung cancer patients by combining both continuous three-dimensional (3D) motion captured by 4D-MRI and the high spatial resolution captured by CT using the authors’ proposed approach. Methods: A novel hybrid approach based on deformable image registration (DIR) and finite element method simulation was developed to fuse a static 3D-CT volume (acquired under breath-hold) and the 3D motion information extracted from 4D-MRI dataset, creating a synthetic 4D-CT dataset. Results: The study focuses on imaging of lung and lung tumor. Comparing the synthetic 4D-CT dataset with the acquired 4D-CT dataset of six lung cancer patients based on 420 landmarks, accurate results (average error <2 mm) were achieved using the authors’ proposed approach. Their hybrid approach achieved a 40% error reduction (based on landmarks assessment) over using only DIR techniques. Conclusions: The synthetic 4D-CT dataset generated has high spatial resolution, has excellent lung details, and is able to show movement of lung and lung tumor over multiple breathing cycles.

  7. A virtual dosimetry audit - Towards transferability of gamma index analysis between clinical trial QA groups.

    PubMed

    Hussein, Mohammad; Clementel, Enrico; Eaton, David J; Greer, Peter B; Haworth, Annette; Ishikura, Satoshi; Kry, Stephen F; Lehmann, Joerg; Lye, Jessica; Monti, Angelo F; Nakamura, Mitsuhiro; Hurkmans, Coen; Clark, Catharine H

    2017-12-01

    Quality assurance (QA) for clinical trials is important. Lack of compliance can affect trial outcome. Clinical trial QA groups have different methods of dose distribution verification and analysis, all with the ultimate aim of ensuring trial compliance. The aim of this study was to gain a better understanding of different processes to inform future dosimetry audit reciprocity. Six clinical trial QA groups participated. Intensity modulated treatment plans were generated for three different cases. A range of 17 virtual 'measurements' were generated by introducing a variety of simulated perturbations (such as MLC position deviations, dose differences, gantry rotation errors, Gaussian noise) to three different treatment plan cases. Participants were blinded to the 'measured' data details. Each group analysed the datasets using their own gamma index (γ) technique and using standardised parameters for passing criteria, lower dose threshold, γ normalisation and global γ. For the same virtual 'measured' datasets, different results were observed using local techniques. For the standardised γ, differences in the percentage of points passing with γ < 1 were also found, however these differences were less pronounced than for each clinical trial QA group's analysis. These variations may be due to different software implementations of γ. This virtual dosimetry audit has been an informative step in understanding differences in the verification of measured dose distributions between different clinical trial QA groups. This work lays the foundations for audit reciprocity between groups, particularly with more clinical trials being open to international recruitment. Copyright © 2017 Elsevier B.V. All rights reserved.
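
    A compact 1D global gamma-index calculation is given below to illustrate the metric the QA groups standardized on (3%/3 mm here); clinical tools operate on 2D/3D dose grids with many refinements.

```python
# 1D global gamma-index calculation (3%/3 mm), purely illustrative; clinical
# implementations work on 2D/3D dose grids with interpolation and thresholds.
import numpy as np

def gamma_1d(ref_dose, eval_dose, spacing_mm, dose_crit=0.03, dta_mm=3.0):
    x = np.arange(ref_dose.size) * spacing_mm
    norm = dose_crit * ref_dose.max()                # global dose normalization
    gammas = np.empty(eval_dose.size)
    for i, (xe, de) in enumerate(zip(x, eval_dose)):
        dd = (de - ref_dose) / norm                  # dose-difference term
        dr = (xe - x) / dta_mm                       # distance-to-agreement term
        gammas[i] = np.sqrt(dd ** 2 + dr ** 2).min()
    return gammas

ref = np.exp(-((np.arange(100) - 50) / 15.0) ** 2)   # toy reference dose profile
ev = ref * 1.02                                      # simulated 2% dose perturbation
g = gamma_1d(ref, ev, spacing_mm=1.0)
print("pass rate (gamma < 1):", round(float((g < 1).mean()) * 100, 1), "%")
```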

  8. An imprinted non-coding genomic cluster at 14q32 defines clinically relevant molecular subtypes in osteosarcoma across multiple independent datasets.

    PubMed

    Hill, Katherine E; Kelly, Andrew D; Kuijjer, Marieke L; Barry, William; Rattani, Ahmed; Garbutt, Cassandra C; Kissick, Haydn; Janeway, Katherine; Perez-Atayde, Antonio; Goldsmith, Jeffrey; Gebhardt, Mark C; Arredouani, Mohamed S; Cote, Greg; Hornicek, Francis; Choy, Edwin; Duan, Zhenfeng; Quackenbush, John; Haibe-Kains, Benjamin; Spentzos, Dimitrios

    2017-05-15

    A microRNA (miRNA) collection on the imprinted 14q32 MEG3 region has been associated with outcome in osteosarcoma. We assessed the clinical utility of this miRNA set and their association with methylation status. We integrated coding and non-coding RNA data from three independent annotated clinical osteosarcoma cohorts (n = 65, n = 27, and n = 25) and miRNA and methylation data from one in vitro (19 cell lines) and one clinical (NCI Therapeutically Applicable Research to Generate Effective Treatments (TARGET) osteosarcoma dataset, n = 80) dataset. We used time-dependent receiver operating characteristic (tdROC) analysis to evaluate the clinical value of candidate miRNA profiles and machine learning approaches to compare the coding and non-coding transcriptional programs of high- and low-risk osteosarcoma tumors and high- versus low-aggressiveness cell lines. In the cell line and TARGET datasets, we also studied the methylation patterns of the MEG3 imprinting control region on 14q32 and their association with miRNA expression and tumor aggressiveness. In the tdROC analysis, miRNA sets on 14q32 showed strong discriminatory power for recurrence and survival in the three clinical datasets. High- or low-risk tumor classification was robust to using different microRNA sets or classification methods. Machine learning approaches showed that genome-wide miRNA profiles and miRNA regulatory networks were quite different between the two outcome groups and mRNA profiles categorized the samples in a manner concordant with the miRNAs, suggesting potential molecular subtypes. Further, miRNA expression patterns were reproducible in comparing high-aggressiveness versus low-aggressiveness cell lines. Methylation patterns in the MEG3 differentially methylated region (DMR) also distinguished high-aggressiveness from low-aggressiveness cell lines and were associated with expression of several 14q32 miRNAs in both the cell lines and the large TARGET clinical dataset. Within the limits of available CpG array coverage, we observed a potential methylation-sensitive regulation of the non-coding RNA cluster by CTCF, a known enhancer-blocking factor. Loss of imprinting/methylation changes in the 14q32 non-coding region defines reproducible previously unrecognized osteosarcoma subtypes with distinct transcriptional programs and biologic and clinical behavior. Future studies will define the precise relationship between 14q32 imprinting, non-coding RNA expression, genomic enhancer binding, and tumor aggressiveness, with possible therapeutic implications for both early- and advanced-stage patients.

  9. Method of generating features optimal to a dataset and classifier

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bruillard, Paul J.; Gosink, Luke J.; Jarman, Kenneth D.

    A method of generating features optimal to a particular dataset and classifier is disclosed. A dataset of messages is inputted and a classifier is selected. An algebra of features is encoded. Computable features that are capable of describing the dataset from the algebra of features are selected. Irredundant features that are optimal for the classifier and the dataset are selected.

  10. The SysteMHC Atlas project.

    PubMed

    Shao, Wenguang; Pedrioli, Patrick G A; Wolski, Witold; Scurtescu, Cristian; Schmid, Emanuel; Vizcaíno, Juan A; Courcelles, Mathieu; Schuster, Heiko; Kowalewski, Daniel; Marino, Fabio; Arlehamn, Cecilia S L; Vaughan, Kerrie; Peters, Bjoern; Sette, Alessandro; Ottenhoff, Tom H M; Meijgaarden, Krista E; Nieuwenhuizen, Natalie; Kaufmann, Stefan H E; Schlapbach, Ralph; Castle, John C; Nesvizhskii, Alexey I; Nielsen, Morten; Deutsch, Eric W; Campbell, David S; Moritz, Robert L; Zubarev, Roman A; Ytterberg, Anders Jimmy; Purcell, Anthony W; Marcilla, Miguel; Paradela, Alberto; Wang, Qi; Costello, Catherine E; Ternette, Nicola; van Veelen, Peter A; van Els, Cécile A C M; Heck, Albert J R; de Souza, Gustavo A; Sollid, Ludvig M; Admon, Arie; Stevanovic, Stefan; Rammensee, Hans-Georg; Thibault, Pierre; Perreault, Claude; Bassani-Sternberg, Michal; Aebersold, Ruedi; Caron, Etienne

    2018-01-04

    Mass spectrometry (MS)-based immunopeptidomics investigates the repertoire of peptides presented at the cell surface by major histocompatibility complex (MHC) molecules. The broad clinical relevance of MHC-associated peptides, e.g. in precision medicine, provides a strong rationale for the large-scale generation of immunopeptidomic datasets and recent developments in MS-based peptide analysis technologies now support the generation of the required data. Importantly, the availability of diverse immunopeptidomic datasets has resulted in an increasing need to standardize, store and exchange this type of data to enable better collaborations among researchers, to advance the field more efficiently and to establish quality measures required for the meaningful comparison of datasets. Here we present the SysteMHC Atlas (https://systemhcatlas.org), a public database that aims at collecting, organizing, sharing, visualizing and exploring immunopeptidomic data generated by MS. The Atlas includes raw mass spectrometer output files collected from several laboratories around the globe, a catalog of context-specific datasets of MHC class I and class II peptides, standardized MHC allele-specific peptide spectral libraries consisting of consensus spectra calculated from repeat measurements of the same peptide sequence, and links to other proteomics and immunology databases. The SysteMHC Atlas project was created and will be further expanded using a uniform and open computational pipeline that controls the quality of peptide identifications and peptide annotations. Thus, the SysteMHC Atlas disseminates quality controlled immunopeptidomic information to the public domain and serves as a community resource toward the generation of a high-quality comprehensive map of the human immunopeptidome and the support of consistent measurement of immunopeptidomic sample cohorts. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  11. Nuclear Receptor Signaling Atlas: Opening Access to the Biology of Nuclear Receptor Signaling Pathways

    PubMed Central

    Becnel, Lauren B.; Darlington, Yolanda F.; Ochsner, Scott A.; Easton-Marks, Jeremy R.; Watkins, Christopher M.; McOwiti, Apollo; Kankanamge, Wasula H.; Wise, Michael W.; DeHart, Michael; Margolis, Ronald N.; McKenna, Neil J.

    2015-01-01

    Signaling pathways involving nuclear receptors (NRs), their ligands and coregulators, regulate tissue-specific transcriptomes in diverse processes, including development, metabolism, reproduction, the immune response and neuronal function, as well as in their associated pathologies. The Nuclear Receptor Signaling Atlas (NURSA) is a Consortium focused around a Hub website (www.nursa.org) that annotates and integrates diverse ‘omics datasets originating from the published literature and NURSA-funded Data Source Projects (NDSPs). These datasets are then exposed to the scientific community on an Open Access basis through user-friendly data browsing and search interfaces. Here, we describe the redesign of the Hub, version 3.0, to deploy “Web 2.0” technologies and add richer, more diverse content. The Molecule Pages, which aggregate information relevant to NR signaling pathways from myriad external databases, have been enhanced to include resources for basic scientists, such as post-translational modification sites and targeting miRNAs, and for clinicians, such as clinical trials. A portal to NURSA’s Open Access, PubMed-indexed journal Nuclear Receptor Signaling has been added to facilitate manuscript submissions. Datasets and information on reagents generated by NDSPs are available, as is information concerning periodic new NDSP funding solicitations. Finally, the new website integrates the Transcriptomine analysis tool, which allows for mining of millions of richly annotated public transcriptomic data points in the field, providing an environment for dataset re-use and citation, bench data validation and hypothesis generation. We anticipate that this new release of the NURSA database will have tangible, long term benefits for both basic and clinical research in this field. PMID:26325041

  12. How Next-Generation Sequencing and Multiscale Data Analysis Will Transform Infectious Disease Management

    PubMed Central

    Pak, Theodore R.; Kasarskis, Andrew

    2015-01-01

    Recent reviews have examined the extent to which routine next-generation sequencing (NGS) on clinical specimens will improve the capabilities of clinical microbiology laboratories in the short term, but do not explore integrating NGS with clinical data from electronic medical records (EMRs), immune profiling data, and other rich datasets to create multiscale predictive models. This review introduces a range of “omics” and patient data sources relevant to managing infections and proposes 3 potentially disruptive applications for these data in the clinical workflow. The combined threats of healthcare-associated infections and multidrug-resistant organisms may be addressed by multiscale analysis of NGS and EMR data that is ideally updated and refined over time within each healthcare organization. Such data and analysis should form the cornerstone of future learning health systems for infectious disease. PMID:26251049

  13. Predicting health-related quality of life (EQ-5D-5L) and capability wellbeing (ICECAP-A) in the context of opiate dependence using routine clinical outcome measures: CORE-OM, LDQ and TOP.

    PubMed

    Peak, Jasmine; Goranitis, Ilias; Day, Ed; Copello, Alex; Freemantle, Nick; Frew, Emma

    2018-05-30

    Economic evaluation normally requires information to be collected on outcome improvement using utility values. This information is often not collected during the treatment of substance use disorders, making cost-effectiveness evaluations of therapy difficult. One potential solution is the use of mapping to generate utility values from clinical measures. This study develops and evaluates mapping algorithms that could be used to predict the EuroQol-5D (EQ-5D-5L) and the ICEpop CAPability measure for Adults (ICECAP-A) from three commonly used clinical measures: the CORE-OM, the LDQ and the TOP. Models were estimated using pilot trial data of heroin users in opiate substitution treatment. In the trial, the EQ-5D-5L, ICECAP-A, CORE-OM, LDQ and TOP were administered at baseline and at three and twelve months. Mapping was conducted using estimation and validation datasets. The standard estimation dataset, which comprised the baseline sample data, was modelled using ordinary least squares (OLS) and tobit regression methods. Data from the baseline and three-month time periods were combined to create a pooled estimation dataset, from which cluster and mixed regression methods were used to map. Predictive accuracy of the models was assessed using the root mean square error (RMSE) and the mean absolute error (MAE). Algorithms were validated using sample data from the follow-up time periods. Mapping algorithms can be used to predict the ICECAP-A and the EQ-5D-5L in the context of opiate dependence. Although both measures can be predicted, the ICECAP-A was better predicted by the clinical measures. There was no advantage to pooling the data. Six mapping algorithms were chosen, with MAE scores ranging from 0.100 to 0.138 and RMSE scores ranging from 0.134 to 0.178. It is possible to predict the scores of the ICECAP-A and the EQ-5D-5L with the use of mapping. In the context of opiate dependence, these algorithms provide the possibility of generating utility values from clinical measures, thus enabling economic evaluation of alternative therapy options. ISRCTN22608399. Date of registration: 27/04/2012. Date of first randomisation: 14/08/2012.
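
    A minimal sketch of the OLS arm of such a mapping exercise with scikit-learn is shown below; the predictor and target column names are hypothetical, and the tobit, cluster and mixed models used in the study are not reproduced.

        import numpy as np
        import pandas as pd
        from sklearn.linear_model import LinearRegression

        predictors = ["core_om_total", "ldq_total", "top_psych_health"]   # hypothetical columns
        target = "eq5d5l_index"                                           # utility value to predict

        def fit_ols_mapping(estimation: pd.DataFrame) -> LinearRegression:
            """Fit an OLS mapping from clinical measure scores to a utility index."""
            return LinearRegression().fit(estimation[predictors], estimation[target])

        def validate_mapping(model: LinearRegression, validation: pd.DataFrame):
            """Assess predictive accuracy with MAE and RMSE on a follow-up dataset."""
            err = model.predict(validation[predictors]) - validation[target]
            mae = float(np.mean(np.abs(err)))
            rmse = float(np.sqrt(np.mean(err ** 2)))
            return mae, rmse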

  14. Using Common Table Expressions to Build a Scalable Boolean Query Generator for Clinical Data Warehouses

    PubMed Central

    Harris, Daniel R.; Henderson, Darren W.; Kavuluru, Ramakanth; Stromberg, Arnold J.; Johnson, Todd R.

    2015-01-01

    We present a custom Boolean query generator utilizing common table expressions (CTEs) that is capable of scaling to big datasets. The generator maps user-defined Boolean queries, such as those interactively created in clinical-research and general-purpose healthcare tools, into SQL. We demonstrate the effectiveness of this generator by integrating our work into the Informatics for Integrating Biology and the Bedside (i2b2) query tool and show that it scales. Our custom generator replaces and outperforms the default query generator found within the Clinical Research Chart (CRC) cell of i2b2. In our experiments, sixteen different types of i2b2 queries were identified by varying four constraints: date, frequency, exclusion criteria, and whether selected concepts occurred in the same encounter. We generated non-trivial, random Boolean queries based on these 16 types; the corresponding SQL queries produced by both generators were compared by execution time. The CTE-based solution significantly outperformed the default query generator and provided a much more consistent response time across all query types (M=2.03, SD=6.64 vs. M=75.82, SD=238.88 seconds). Without costly hardware upgrades, we provide a scalable solution based on CTEs with very promising empirical results centered on performance gains. The evaluation methodology used here also provides a means of profiling clinical data warehouse performance. PMID:25192572
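
    The sketch below shows the general idea of turning a Boolean concept query into SQL built from CTEs: one CTE per OR-group of concepts, joined to express the AND. The table and column names loosely follow the i2b2 fact table (observation_fact, concept_cd, patient_num), but the generator itself is an illustration, not the implementation described here.

        def boolean_query_to_sql(groups):
            """groups: list of OR-groups of concept codes that are AND-ed together,
            e.g. [["ICD9:250.00", "ICD9:250.02"], ["LOINC:4548-4"]]."""
            ctes, names = [], []
            for i, concepts in enumerate(groups):
                name = f"g{i}"
                in_list = ", ".join(f"'{c}'" for c in concepts)
                ctes.append(f"{name} AS (SELECT DISTINCT patient_num FROM observation_fact "
                            f"WHERE concept_cd IN ({in_list}))")
                names.append(name)
            joins = "\n  ".join(f"JOIN {n} ON {n}.patient_num = {names[0]}.patient_num"
                                for n in names[1:])
            return ("WITH " + ",\n     ".join(ctes) + "\n"
                    f"SELECT {names[0]}.patient_num FROM {names[0]}" +
                    (f"\n  {joins}" if joins else "") + ";")

        print(boolean_query_to_sql([["ICD9:250.00", "ICD9:250.02"], ["LOINC:4548-4"]]))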

  15. Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples

    PubMed Central

    White, James Robert; Nagarajan, Niranjan; Pop, Mihai

    2009-01-01

    Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries is the availability of computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/. PMID:19360128
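
    The sketch below is not Metastats itself, but it illustrates two of its key ingredients on a features-by-samples count matrix: Fisher's exact test for sparse features and Benjamini-Hochberg false discovery rate control across all features. A Welch t-test on relative abundances stands in for the nonparametric test used by the method, so the details are assumptions.

        import numpy as np
        from scipy.stats import fisher_exact, ttest_ind
        from statsmodels.stats.multitest import multipletests

        def compare_populations(counts_a, counts_b, sparse_cutoff=10, alpha=0.05):
            """counts_a, counts_b: integer count matrices (features x samples) for two populations."""
            tot_a, tot_b = int(counts_a.sum()), int(counts_b.sum())
            pvals = []
            for fa, fb in zip(counts_a, counts_b):
                if fa.sum() + fb.sum() < sparse_cutoff:
                    # Sparse feature: pool counts across samples and use Fisher's exact test.
                    table = [[int(fa.sum()), tot_a - int(fa.sum())],
                             [int(fb.sum()), tot_b - int(fb.sum())]]
                    pvals.append(fisher_exact(table)[1])
                else:
                    # Abundant feature: compare per-sample relative abundances (Welch's t-test).
                    pvals.append(ttest_ind(fa / counts_a.sum(axis=0),
                                           fb / counts_b.sum(axis=0), equal_var=False)[1])
            reject, qvals, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
            return np.array(pvals), qvals, reject   # q-values control the false discovery rate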

  16. Big Data Transforms Discovery-Utilization Therapeutics Continuum.

    PubMed

    Waldman, S A; Terzic, A

    2016-03-01

    Enabling omic technologies adopt a holistic view to produce unprecedented insights into the molecular underpinnings of health and disease, in part, by generating massive high-dimensional biological data. Leveraging these systems-level insights as an engine driving the healthcare evolution is maximized through integration with medical, demographic, and environmental datasets from individuals to populations. Big data analytics has accordingly emerged to add value to the technical aspects of storage, transfer, and analysis required for merging vast arrays of omic-, clinical-, and eco-datasets. In turn, this new field at the interface of biology, medicine, and information science is systematically transforming modern therapeutics across discovery, development, regulation, and utilization. © 2015 ASCPT.

  17. P-MartCancer: A New Online Platform to Access CPTAC Datasets and Enable New Analyses | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The November 1, 2017 issue of Cancer Research is dedicated to a collection of computational resource papers in genomics, proteomics, animal models, imaging, and clinical subjects for non-bioinformaticists looking to incorporate computing tools into their work. Scientists at Pacific Northwest National Laboratory have developed P-MartCancer, an open, web-based interactive software tool that enables statistical analyses of peptide or protein data generated from mass-spectrometry (MS)-based global proteomics experiments.

  18. How next-generation sequencing and multiscale data analysis will transform infectious disease management.

    PubMed

    Pak, Theodore R; Kasarskis, Andrew

    2015-12-01

    Recent reviews have examined the extent to which routine next-generation sequencing (NGS) on clinical specimens will improve the capabilities of clinical microbiology laboratories in the short term, but do not explore integrating NGS with clinical data from electronic medical records (EMRs), immune profiling data, and other rich datasets to create multiscale predictive models. This review introduces a range of "omics" and patient data sources relevant to managing infections and proposes 3 potentially disruptive applications for these data in the clinical workflow. The combined threats of healthcare-associated infections and multidrug-resistant organisms may be addressed by multiscale analysis of NGS and EMR data that is ideally updated and refined over time within each healthcare organization. Such data and analysis should form the cornerstone of future learning health systems for infectious disease. © The Author 2015. Published by Oxford University Press on behalf of the Infectious Diseases Society of America.

  19. Deep neural networks show an equivalent and often superior performance to dermatologists in onychomycosis diagnosis: Automatic construction of onychomycosis datasets by region-based convolutional deep neural network.

    PubMed

    Han, Seung Seog; Park, Gyeong Hun; Lim, Woohyung; Kim, Myoung Shin; Na, Jung Im; Park, Ilwoo; Chang, Sung Eun

    2018-01-01

    Although there have been reports of the successful diagnosis of skin disorders using deep learning, unrealistically large clinical image datasets are required for artificial intelligence (AI) training. We created datasets of standardized nail images using a region-based convolutional neural network (R-CNN) trained to distinguish the nail from the background. We used R-CNN to generate training datasets of 49,567 images, which we then used to fine-tune the ResNet-152 and VGG-19 models. The validation datasets comprised 100 and 194 images from Inje University (B1 and B2 datasets, respectively), 125 images from Hallym University (C dataset), and 939 images from Seoul National University (D dataset). The AI (ensemble model; ResNet-152 + VGG-19 + feedforward neural networks) results showed test sensitivity/specificity/area under the curve values of (96.0/94.7/0.98), (82.7/96.7/0.95), (92.3/79.3/0.93), and (87.7/69.3/0.82) for the B1, B2, C, and D datasets, respectively. With a combination of the B1 and C datasets, the AI Youden index was significantly (p = 0.01) higher than that of 42 dermatologists doing the same assessment manually. For the B1+C and B2+D dataset combinations, almost none of the dermatologists performed as well as the AI. By training with a dataset comprising 49,567 images, we achieved a diagnostic accuracy for onychomycosis using deep learning that was superior to that of most of the dermatologists who participated in this study.
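
    A minimal sketch of the ensembling step, averaging softmax outputs of fine-tuned ResNet-152 and VGG-19 backbones with PyTorch/torchvision, is given below. It is an assumption-laden simplification: the paper's ensemble also includes feedforward networks, and the class count, trained weights and training loop are omitted.

        import torch
        import torch.nn as nn
        from torchvision import models

        NUM_CLASSES = 2   # e.g. onychomycosis vs. other, purely for illustration

        def build_backbones():
            """Replace the final layers so both backbones output NUM_CLASSES logits."""
            resnet = models.resnet152(weights=None)            # torchvision >= 0.13 API
            resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)
            vgg = models.vgg19(weights=None)
            vgg.classifier[6] = nn.Linear(vgg.classifier[6].in_features, NUM_CLASSES)
            return resnet, vgg

        @torch.no_grad()
        def ensemble_predict(resnet, vgg, batch):
            """Average the class probabilities of the two fine-tuned backbones."""
            resnet.eval()
            vgg.eval()
            p_resnet = torch.softmax(resnet(batch), dim=1)
            p_vgg = torch.softmax(vgg(batch), dim=1)
            return (p_resnet + p_vgg) / 2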

  20. A photogrammetric technique for generation of an accurate multispectral optical flow dataset

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.

    2017-06-01

    The availability of an accurate dataset is a key requirement for the successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets were developed in recent years and gave rise to many powerful algorithms. However, most of these datasets include only images captured in the visible spectrum. This paper is focused on the creation of a multispectral optical flow dataset with an accurate ground truth. The generation of an accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement has been developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling or laser scanning. Such techniques either work only with synthetic optical flow or provide only a sparse ground truth. In this paper, a new photogrammetric method for generating an accurate ground truth optical flow is proposed. The method combines the accuracy and density of synthetic optical flow datasets with the flexibility of laser-scanning-based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.
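
    To make the role of such ground truth concrete, the sketch below computes the average endpoint error typically used to score an estimated flow field against a dataset's ground truth; the (H, W, 2) array layout and the optional validity mask are assumptions on my part.

        import numpy as np

        def average_endpoint_error(flow_est, flow_gt, valid_mask=None):
            """Mean Euclidean distance between estimated and ground-truth flow vectors."""
            diff = flow_est - flow_gt                      # shape (H, W, 2): u and v components
            epe = np.sqrt((diff ** 2).sum(axis=-1))        # per-pixel endpoint error
            if valid_mask is not None:
                epe = epe[valid_mask]                      # e.g. exclude occluded or unlabelled pixels
            return float(epe.mean())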

  1. Automatic segmentation and co-registration of gated CT angiography datasets: measuring abdominal aortic pulsatility

    NASA Astrophysics Data System (ADS)

    Wentz, Robert; Manduca, Armando; Fletcher, J. G.; Siddiki, Hassan; Shields, Raymond C.; Vrtiska, Terri; Spencer, Garrett; Primak, Andrew N.; Zhang, Jie; Nielson, Theresa; McCollough, Cynthia; Yu, Lifeng

    2007-03-01

    Purpose: To develop robust, novel segmentation and co-registration software to analyze temporally overlapping CT angiography datasets, with the aim of permitting automated measurement of regional aortic pulsatility in patients with abdominal aortic aneurysms. Methods: We perform retrospective gated CT angiography in patients with abdominal aortic aneurysms. Multiple, temporally overlapping, time-resolved CT angiography datasets are reconstructed over the cardiac cycle, with aortic segmentation performed using a priori anatomic assumptions for the aorta and heart. Visual quality assessment is performed following automatic segmentation, with manual editing. Following subsequent centerline generation, centerlines are cross-registered across phases, with internal validation of co-registration performed by examining registration at the regions of greatest diameter change (i.e. when the second derivative is maximal). Results: We have performed gated CT angiography in 60 patients. Automatic seed placement is successful in 79% of datasets, requiring either no editing (70%) or minimal editing (less than 1 minute; 12%). Causes of error include segmentation into adjacent, high-attenuating, nonvascular tissues; small segmentation errors associated with calcified plaque; and segmentation of non-renal, small paralumbar arteries. Internal validation of cross-registration demonstrates appropriate registration in our patient population. In general, we observed that aortic pulsatility can vary along the course of the abdominal aorta. Pulsation can also vary within an aneurysm as well as between aneurysms, but the clinical significance of these findings remains unknown. Conclusions: Visualization of large vessel pulsatility is possible using ECG-gated CT angiography, partial scan reconstruction, automatic segmentation, centerline generation, and co-registration of temporally resolved datasets.

  2. Generation of open biomedical datasets through ontology-driven transformation and integration processes.

    PubMed

    Carmen Legaz-García, María Del; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2016-06-03

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources, which makes the integrated exploitation of such data difficult. The Semantic Web paradigm offers a natural technological space for data integration and exploitation by generating content readable by machines. Linked Open Data is a Semantic Web initiative that promotes the publication and sharing of data in machine-readable semantic formats. We present an approach for the transformation and integration of heterogeneous biomedical data with the objective of generating open biomedical datasets in Semantic Web formats. The transformation of the data is based on the mappings between the entities of the data schema and the ontological infrastructure that provides the meaning to the content. Our approach permits different types of mappings and includes the possibility of defining complex transformation patterns. Once the mappings are defined, they can be automatically applied to datasets to generate logically consistent content, and they can be reused in further transformation processes. The results of our research are (1) a common transformation and integration process for heterogeneous biomedical data; (2) the application of Linked Open Data principles to generate interoperable, open, biomedical datasets; (3) a software tool, called SWIT, that implements the approach. In this paper we also describe how we have applied SWIT in different biomedical scenarios and the lessons learned. We have presented an approach that is able to generate open biomedical repositories in Semantic Web formats. SWIT applies the Linked Open Data principles when generating datasets, allowing their content to be linked to external repositories and thus creating linked open datasets. SWIT datasets may contain data from multiple sources and schemas, thus becoming integrated datasets.
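
    In the same spirit (though without using SWIT or its mapping language), the sketch below shows how one tabular record could be turned into RDF triples with rdflib; the ontology namespace, properties and field names are placeholders.

        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDF, XSD

        EX = Namespace("http://example.org/onto/")    # placeholder ontology namespace
        DATA = Namespace("http://example.org/data/")  # placeholder instance namespace

        def record_to_graph(record, graph=None):
            """Map a dict such as {'id': 'p1', 'diagnosis': 'E11', 'hba1c': 7.8} to RDF triples."""
            g = graph if graph is not None else Graph()
            subject = DATA[record["id"]]
            g.add((subject, RDF.type, EX.Patient))
            g.add((subject, EX.hasDiagnosisCode, Literal(record["diagnosis"])))
            g.add((subject, EX.hasHbA1c, Literal(record["hba1c"], datatype=XSD.decimal)))
            return g

        g = record_to_graph({"id": "p1", "diagnosis": "E11", "hba1c": 7.8})
        print(g.serialize(format="turtle"))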

  3. Systems-based biological concordance and predictive reproducibility of gene set discovery methods in cardiovascular disease.

    PubMed

    Azuaje, Francisco; Zheng, Huiru; Camargo, Anyela; Wang, Haiying

    2011-08-01

    The discovery of novel disease biomarkers is a crucial challenge for translational bioinformatics. Demonstrating both their classification power and their reproducibility across independent datasets is essential to assessing their potential clinical relevance. Small datasets and the multiplicity of putative biomarker sets may explain the lack of predictive reproducibility. Studies based on pathway-driven discovery approaches have suggested that, despite such discrepancies, the resulting putative biomarkers tend to be implicated in common biological processes. Investigations of this problem have been mainly focused on datasets derived from cancer research. We investigated the predictive and functional concordance of five methods for discovering putative biomarkers in four independently generated datasets from the cardiovascular disease domain. A diversity of biosignatures was identified by the different methods. However, we found strong biological process concordance between them, especially in the case of methods based on gene set analysis. With a few exceptions, we observed a lack of classification reproducibility across independent datasets. Partial overlaps exist between our putative biomarker sets and those reported in the primary studies. Despite the observed limitations, pathway-driven or gene set analysis can predict potentially novel biomarkers and can jointly point to biomedically relevant underlying molecular mechanisms. Copyright © 2011 Elsevier Inc. All rights reserved.

  4. Knowledge mining from clinical datasets using rough sets and backpropagation neural network.

    PubMed

    Nahato, Kindie Biredagn; Harichandran, Khanna Nehemiah; Arputharaj, Kannan

    2015-01-01

    The availability of clinical datasets and knowledge mining methodologies encourages researchers to pursue research on extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that will predict the presence or absence of a disease by learning from the minimal set of attributes extracted from the clinical dataset. In this work, a rough set indiscernibility relation method combined with a backpropagation neural network (RS-BPNN) is used. This work has two stages. The first stage handles missing values to obtain a smooth dataset and selects appropriate attributes from the clinical dataset using the indiscernibility relation method. The second stage performs classification using a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.
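
    The sketch below stands in a simpler filter method (mutual information) for the rough-set reduct step and then trains a backpropagation network with scikit-learn; it illustrates the two-stage pipeline on hypothetical arrays X and y rather than reproducing RS-BPNN.

        from sklearn.feature_selection import SelectKBest, mutual_info_classif
        from sklearn.impute import SimpleImputer
        from sklearn.model_selection import cross_val_score
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # Stage 1: handle missing values and reduce attributes (a filter method stands in
        # for the rough-set indiscernibility reduct here). Stage 2: backpropagation network.
        clf = make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            SelectKBest(mutual_info_classif, k=10),
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
        )

        # scores = cross_val_score(clf, X, y, cv=10)   # X, y: a clinical dataset such as UCI hepatitis
        # print(scores.mean())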

  5. OpenSHS: Open Smart Home Simulator.

    PubMed

    Alshammari, Nasser; Alshammari, Talal; Sedky, Mohamed; Champion, Justin; Bauer, Carolin

    2017-05-02

    This paper develops a new hybrid, open-source, cross-platform 3D smart home simulator, OpenSHS, for dataset generation. OpenSHS offers an opportunity for researchers in the field of the Internet of Things (IoT) and machine learning to test and evaluate their models. Following a hybrid approach, OpenSHS combines advantages from both interactive and model-based approaches. This approach reduces the time and effort required to generate simulated smart home datasets. We have designed a replication algorithm for extending and expanding a dataset. A small sample dataset produced by OpenSHS can be extended without affecting the logical order of the events. The replication provides a solution for generating large representative smart home datasets. We have built an extensible library of smart devices that facilitates the simulation of current and future smart home environments. Our tool divides the dataset generation process into three distinct phases: first, design, in which the researcher builds the initial virtual environment by constructing the home, importing smart devices and creating contexts; second, simulation, in which the participant simulates his/her context-specific events; and third, aggregation, in which the researcher applies the replication algorithm to generate the final dataset. We conducted a study to assess the ease of use of our tool using the System Usability Scale (SUS).

  6. OpenSHS: Open Smart Home Simulator

    PubMed Central

    Alshammari, Nasser; Alshammari, Talal; Sedky, Mohamed; Champion, Justin; Bauer, Carolin

    2017-01-01

    This paper develops a new hybrid, open-source, cross-platform 3D smart home simulator, OpenSHS, for dataset generation. OpenSHS offers an opportunity for researchers in the field of the Internet of Things (IoT) and machine learning to test and evaluate their models. Following a hybrid approach, OpenSHS combines advantages from both interactive and model-based approaches. This approach reduces the time and effort required to generate simulated smart home datasets. We have designed a replication algorithm for extending and expanding a dataset. A small sample dataset produced by OpenSHS can be extended without affecting the logical order of the events. The replication provides a solution for generating large representative smart home datasets. We have built an extensible library of smart devices that facilitates the simulation of current and future smart home environments. Our tool divides the dataset generation process into three distinct phases: first, design, in which the researcher builds the initial virtual environment by constructing the home, importing smart devices and creating contexts; second, simulation, in which the participant simulates his/her context-specific events; and third, aggregation, in which the researcher applies the replication algorithm to generate the final dataset. We conducted a study to assess the ease of use of our tool using the System Usability Scale (SUS). PMID:28468330

  7. LINC00472 expression is regulated by promoter methylation and associated with disease-free survival in patients with grade 2 breast cancer

    PubMed Central

    Shen, Yi; Wang, Zhanwei; Loo, Lenora WM; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A.; Katsaros, Dionyssios; Yu, Herbert

    2015-01-01

    Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To further evaluate the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of the additional datasets confirmed our previous findings that high LINC00472 expression is associated with ER-positive status, low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management. PMID:26564482

  8. LINC00472 expression is regulated by promoter methylation and associated with disease-free survival in patients with grade 2 breast cancer.

    PubMed

    Shen, Yi; Wang, Zhanwei; Loo, Lenora W M; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A; Katsaros, Dionyssios; Yu, Herbert

    2015-12-01

    Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To further evaluate the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of the additional datasets confirmed our previous findings that high LINC00472 expression is associated with ER-positive status, low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management.

  9. ProDaMa: an open source Python library to generate protein structure datasets.

    PubMed

    Armano, Giuliano; Manconi, Andrea

    2009-10-02

    The huge difference between the number of known sequences and known tertiary structures has justified the use of automated methods for protein analysis. Although a general methodology to solve these problems has not yet been devised, researchers are engaged in developing more accurate techniques and algorithms whose training plays a relevant role in determining their performance. From this perspective, particular importance is given to the training data used in experiments, and researchers are often engaged in the generation of specialized datasets that meet their requirements. To facilitate the task of generating specialized datasets we devised and implemented ProDaMa, an open source Python library that provides classes for retrieving, organizing, updating, analyzing, and filtering protein data. ProDaMa has been used to generate specialized datasets useful for secondary structure prediction and to develop a collaborative web application aimed at generating and sharing protein structure datasets. The library, the related database, and the documentation are freely available at the URL http://iasc.diee.unica.it/prodama.

  10. Multivariate Protein Signatures of Pre-Clinical Alzheimer's Disease in the Alzheimer's Disease Neuroimaging Initiative (ADNI) Plasma Proteome Dataset

    PubMed Central

    Johnstone, Daniel; Milward, Elizabeth A.; Berretta, Regina; Moscato, Pablo

    2012-01-01

    Background Recent Alzheimer's disease (AD) research has focused on finding biomarkers to identify disease at the pre-clinical stage of mild cognitive impairment (MCI), allowing treatment to be initiated before irreversible damage occurs. Many studies have examined brain imaging or cerebrospinal fluid but there is also growing interest in blood biomarkers. The Alzheimer's Disease Neuroimaging Initiative (ADNI) has generated data on 190 plasma analytes in 566 individuals with MCI, AD or normal cognition. We conducted independent analyses of this dataset to identify plasma protein signatures predicting pre-clinical AD. Methods and Findings We focused on identifying signatures that discriminate cognitively normal controls (n = 54) from individuals with MCI who subsequently progress to AD (n = 163). Based on p value, apolipoprotein E (APOE) showed the strongest difference between these groups (p = 2.3 × 10⁻¹³). We applied a multivariate approach based on combinatorial optimization ((α,β)-k Feature Set Selection), which retains information about individual participants and maintains the context of interrelationships between different analytes, to identify the optimal set of analytes (signature) to discriminate these two groups. We identified 11-analyte signatures achieving values of sensitivity and specificity between 65% and 86% for both MCI and AD groups, depending on whether APOE was included and other factors. Classification accuracy was improved by considering “meta-features,” representing the difference in relative abundance of two analytes, with an 8-meta-feature signature consistently achieving sensitivity and specificity both over 85%. Generating signatures based on longitudinal rather than cross-sectional data further improved classification accuracy, returning sensitivities and specificities of approximately 90%. Conclusions Applying these novel analysis approaches to the powerful and well-characterized ADNI dataset has identified sets of plasma biomarkers for pre-clinical AD. While studies of independent test sets are required to validate the signatures, these analyses provide a starting point for developing a cost-effective and minimally invasive test capable of diagnosing AD in its pre-clinical stages. PMID:22485168
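
    The "meta-feature" construction is simple to illustrate: each meta-feature is the difference in abundance between a pair of analytes. The sketch below builds them with NumPy and substitutes a generic greedy selector for the combinatorial (α,β)-k feature set selection, so the selection method and variable names are assumptions rather than the paper's procedure.

        import numpy as np
        from itertools import combinations
        from sklearn.feature_selection import SequentialFeatureSelector
        from sklearn.linear_model import LogisticRegression

        def build_meta_features(X, analyte_names):
            """Each meta-feature is the difference in abundance between two analytes."""
            pairs = list(combinations(range(X.shape[1]), 2))
            meta = np.column_stack([X[:, i] - X[:, j] for i, j in pairs])
            names = [f"{analyte_names[i]}-{analyte_names[j]}" for i, j in pairs]
            return meta, names

        # Greedy stand-in for the combinatorial selection (slow when many pairs exist):
        # selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
        #                                      n_features_to_select=8, direction="forward")
        # selector.fit(meta, y)   # y: control vs. MCI-converter labels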

  11. Quantifying the tibiofemoral joint space using x-ray tomosynthesis.

    PubMed

    Kalinosky, Benjamin; Sabol, John M; Piacsek, Kelly; Heckel, Beth; Gilat Schmidt, Taly

    2011-12-01

    Digital x-ray tomosynthesis (DTS) has the potential to provide 3D information about the knee joint in a load-bearing posture, which may improve diagnosis and monitoring of knee osteoarthritis compared with projection radiography, the current standard of care. Manually quantifying and visualizing the joint space width (JSW) from 3D tomosynthesis datasets may be challenging. This work developed a semiautomated algorithm for quantifying the 3D tibiofemoral JSW from reconstructed DTS images. The algorithm was validated through anthropomorphic phantom experiments and applied to three clinical datasets. A user-selected volume of interest within the reconstructed DTS volume was enhanced with 1D multiscale gradient kernels. The edge-enhanced volumes were divided by polarity into tibial and femoral edge maps and combined across kernel scales. A 2D connected components algorithm was performed to determine candidate tibial and femoral edges. A 2D joint space width map (JSW) was constructed to represent the 3D tibiofemoral joint space. To quantify the algorithm accuracy, an adjustable knee phantom was constructed, and eleven posterior-anterior (PA) and lateral DTS scans were acquired with the medial minimum JSW of the phantom set to 0-5 mm in 0.5 mm increments (VolumeRad™, GE Healthcare, Chalfont St. Giles, United Kingdom). The accuracy of the algorithm was quantified by comparing the minimum JSW in a region of interest in the medial compartment of the JSW map to the measured phantom setting for each trial. In addition, the algorithm was applied to DTS scans of a static knee phantom and the JSW map compared to values estimated from a manually segmented computed tomography (CT) dataset. The algorithm was also applied to three clinical DTS datasets of osteoarthritic patients. The algorithm segmented the JSW and generated a JSW map for all phantom and clinical datasets. For the adjustable phantom, the estimated minimum JSW values were plotted against the measured values for all trials. A linear fit estimated a slope of 0.887 (R² = 0.962) and a mean error across all trials of 0.34 mm for the PA phantom data. The estimated minimum JSW values for the lateral adjustable phantom acquisitions were found to have low correlation to the measured values (R² = 0.377), with a mean error of 2.13 mm. The error in the lateral adjustable-phantom datasets appeared to be caused by artifacts due to unrealistic features in the phantom bones. JSW maps generated by DTS and CT varied by a mean of 0.6 mm and 0.8 mm across the knee joint, for PA and lateral scans. The tibial and femoral edges were successfully segmented and JSW maps determined for PA and lateral clinical DTS datasets. A semiautomated method is presented for quantifying the 3D joint space in a 2D JSW map using tomosynthesis images. The proposed algorithm quantified the JSW across the knee joint to sub-millimeter accuracy for PA tomosynthesis acquisitions. Overall, the results suggest that x-ray tomosynthesis may be beneficial for diagnosing and monitoring disease progression or treatment of osteoarthritis by providing quantitative images of JSW in the load-bearing knee.

  12. Logic Learning Machine creates explicit and stable rules stratifying neuroblastoma patients

    PubMed Central

    2013-01-01

    Background Neuroblastoma is the most common pediatric solid tumor. About fifty percent of high-risk patients die despite treatment, making the exploration of new and more effective strategies for improving stratification mandatory. Hypoxia, a condition of low oxygen tension occurring in poorly vascularized areas of the tumor, is associated with poor prognosis. We had previously defined a robust gene expression signature measuring the hypoxic component of neuroblastoma tumors (NB-hypo), which is a molecular risk factor. We wanted to develop a prognostic classifier of neuroblastoma patients' outcome blending existing knowledge on clinical and molecular risk factors with the prognostic NB-hypo signature. Furthermore, we were interested in classifiers outputting explicit rules that could be easily translated into the clinical setting. Results The Shadow Clustering (SC) technique, which leads to final models called Logic Learning Machines (LLM), exhibits good accuracy and promises to fulfill the aims of this work. We utilized this algorithm to classify NB patients on the basis of the following risk factors: age at diagnosis, INSS stage, MYCN amplification and NB-hypo. The algorithm generated explicit classification rules in good agreement with existing clinical knowledge. Through an iterative procedure we identified and removed from the dataset those examples which caused instability in the rules. This workflow generated a stable classifier that is very accurate in predicting good- and poor-outcome patients. The good performance of the classifier was validated in an independent dataset. NB-hypo was an important component of the rules, with a strength similar to that of tumor staging. Conclusions The novelty of our work lies in identifying stability, explicit rules and the blending of molecular and clinical risk factors as the key features for generating classification rules for NB patients that can be conveyed to the clinic and used to design new therapies. We derived, through LLM, a set of four stable rules identifying a new class of poor-outcome patients that could benefit from new therapies potentially targeting tumor hypoxia or its consequences. PMID:23815266

  13. Logic Learning Machine creates explicit and stable rules stratifying neuroblastoma patients.

    PubMed

    Cangelosi, Davide; Blengio, Fabiola; Versteeg, Rogier; Eggert, Angelika; Garaventa, Alberto; Gambini, Claudio; Conte, Massimo; Eva, Alessandra; Muselli, Marco; Varesio, Luigi

    2013-01-01

    Neuroblastoma is the most common pediatric solid tumor. About fifty percent of high-risk patients die despite treatment, making the exploration of new and more effective strategies for improving stratification mandatory. Hypoxia, a condition of low oxygen tension occurring in poorly vascularized areas of the tumor, is associated with poor prognosis. We had previously defined a robust gene expression signature measuring the hypoxic component of neuroblastoma tumors (NB-hypo), which is a molecular risk factor. We wanted to develop a prognostic classifier of neuroblastoma patients' outcome blending existing knowledge on clinical and molecular risk factors with the prognostic NB-hypo signature. Furthermore, we were interested in classifiers outputting explicit rules that could be easily translated into the clinical setting. The Shadow Clustering (SC) technique, which leads to final models called Logic Learning Machines (LLM), exhibits good accuracy and promises to fulfill the aims of this work. We utilized this algorithm to classify NB patients on the basis of the following risk factors: age at diagnosis, INSS stage, MYCN amplification and NB-hypo. The algorithm generated explicit classification rules in good agreement with existing clinical knowledge. Through an iterative procedure we identified and removed from the dataset those examples which caused instability in the rules. This workflow generated a stable classifier that is very accurate in predicting good- and poor-outcome patients. The good performance of the classifier was validated in an independent dataset. NB-hypo was an important component of the rules, with a strength similar to that of tumor staging. The novelty of our work lies in identifying stability, explicit rules and the blending of molecular and clinical risk factors as the key features for generating classification rules for NB patients that can be conveyed to the clinic and used to design new therapies. We derived, through LLM, a set of four stable rules identifying a new class of poor-outcome patients that could benefit from new therapies potentially targeting tumor hypoxia or its consequences.

  14. Aneurysmal subarachnoid hemorrhage prognostic decision-making algorithm using classification and regression tree analysis.

    PubMed

    Lo, Benjamin W Y; Fukuda, Hitoshi; Angle, Mark; Teitelbaum, Jeanne; Macdonald, R Loch; Farrokhyar, Forough; Thabane, Lehana; Levine, Mitchell A H

    2016-01-01

    Classification and regression tree analysis involves the creation of a decision tree by recursive partitioning of a dataset into more homogeneous subgroups. Thus far, there is scarce literature on using this technique to create clinical prediction tools for aneurysmal subarachnoid hemorrhage (SAH). The classification and regression tree analysis technique was applied to the multicenter Tirilazad database (3551 patients) in order to create the decision-making algorithm. In order to elucidate prognostic subgroups in aneurysmal SAH, neurologic, systemic, and demographic factors were taken into account. The dependent variable used for analysis was the dichotomized Glasgow Outcome Score at 3 months. Classification and regression tree analysis revealed seven prognostic subgroups. Neurological grade, occurrence of post-admission stroke, occurrence of post-admission fever, and age represented the explanatory nodes of this decision tree. Split sample validation revealed classification accuracy of 79% for the training dataset and 77% for the testing dataset. In addition, the occurrence of fever at 1-week post-aneurysmal SAH is associated with increased odds of post-admission stroke (odds ratio: 1.83, 95% confidence interval: 1.56-2.45, P < 0.01). A clinically useful classification tree was generated, which serves as a prediction tool to guide bedside prognostication and clinical treatment decision making. This prognostic decision-making algorithm also shed light on the complex interactions between a number of risk factors in determining outcome after aneurysmal SAH.
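
    A minimal sketch of the recursive-partitioning idea with scikit-learn's CART implementation is shown below; the predictor names, tree depth and split-sample proportions are illustrative, not those of the Tirilazad analysis.

        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier, export_text

        # Hypothetical predictors: neurological grade, post-admission stroke/fever, age.
        # Outcome: dichotomized 3-month Glasgow Outcome Score (favorable = 1).
        features = ["neuro_grade", "post_admission_stroke", "post_admission_fever", "age"]

        def fit_prognostic_tree(X, y):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.3, stratify=y, random_state=0)    # split-sample validation
            tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=0)
            tree.fit(X_train, y_train)
            print(export_text(tree, feature_names=features))         # explicit decision rules
            return tree, tree.score(X_train, y_train), tree.score(X_test, y_test)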

  15. Developing a minimum dataset for nursing team leader handover in the intensive care unit: A focus group study.

    PubMed

    Spooner, Amy J; Aitken, Leanne M; Corley, Amanda; Chaboyer, Wendy

    2018-01-01

    Despite increasing demand for structured processes to guide clinical handover, nursing handover tools are limited in the intensive care unit. The study aim was to identify key items to include in a minimum dataset for intensive care nursing team leader shift-to-shift handover. This focus group study was conducted in a 21-bed medical/surgical intensive care unit in Australia. Senior registered nurses involved in team leader handovers were recruited. Focus groups were conducted using a nominal group technique to generate and prioritise minimum dataset items. Nurses were presented with content from previous team leader handovers and asked to select which content items to include in a minimum dataset. Participant responses were summarised as frequencies and percentages. Seventeen senior nurses participated in three focus groups. Participants agreed that ISBAR (Identify-Situation-Background-Assessment-Recommendations) was a useful tool to guide clinical handover. Items recommended to be included in the minimum dataset (≥65% agreement) included Identify (name, age, days in intensive care), Situation (diagnosis, surgical procedure), Background (significant event(s), management of significant event(s)) and Recommendations (patient plan for next shift, tasks to follow up for next shift). Overall, 30 of the 67 (45%) items in the Assessment category were considered important to include in the minimum dataset and focused on relevant observations and treatment within each body system. Other non-ISBAR items considered important to include related to the ICU (admissions to ICU, staffing/skill mix, theatre cases) and patients (infectious status, site of infection, end of life plan). Items were further categorised into those to include in all handovers and those to discuss only when relevant to the patient. The findings suggest a minimum dataset for intensive care nursing team leader shift-to-shift handover should contain items within ISBAR along with unit and patient specific information to maintain continuity of care and patient safety across shift changes. Copyright © 2017 Australian College of Critical Care Nurses Ltd. All rights reserved.
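
    One way to encode such a minimum dataset is as a typed record, as in the sketch below; the field names merely summarise the items listed above and do not constitute a validated handover instrument.

        from dataclasses import dataclass, field
        from typing import List, Optional

        @dataclass
        class TeamLeaderHandover:
            """ISBAR-structured minimum dataset for ICU team leader shift-to-shift handover."""
            # Identify
            name: str
            age: int
            days_in_icu: int
            # Situation
            diagnosis: str
            surgical_procedure: Optional[str] = None
            # Background
            significant_events: List[str] = field(default_factory=list)
            event_management: List[str] = field(default_factory=list)
            # Assessment: relevant observations/treatment per body system, only when relevant
            assessment_by_system: dict = field(default_factory=dict)
            # Recommendations
            plan_next_shift: List[str] = field(default_factory=list)
            tasks_to_follow_up: List[str] = field(default_factory=list)
            # Unit- and patient-specific items
            infectious_status: Optional[str] = None
            end_of_life_plan: Optional[str] = None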

  16. Active learning for clinical text classification: is it better than random sampling?

    PubMed

    Figueroa, Rosa L; Zeng-Treitler, Qing; Ngo, Long H; Goryachev, Sergey; Wiechmann, Eduardo P

    2012-01-01

    This study explores active learning algorithms as a way to reduce the requirements for large training sets in medical text classification tasks. Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results. Classification accuracy and area under receiver operating characteristics (ROC) curves for each algorithm at different sample sizes were generated. The performance of active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why the performance varies on different datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated the results with the performance differences. The DIST and CMB algorithms performed better than passive learning. With a statistical significance level set at 0.05, DIST outperformed passive learning in all five datasets, while CMB was found to be better than passive learning in four datasets. We found strong correlations between the dataset diversity and the DIV performance, as well as the dataset uncertainty and the performance of the DIST algorithm. For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.
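
    As a generic illustration of a distance-based query strategy (not the specific DIST algorithm evaluated here), the sketch below selects the unlabelled documents closest to the current decision boundary using scikit-learn.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def distance_based_query(X_labeled, y_labeled, X_pool, batch_size=10):
            """Return indices of pool documents nearest the current decision boundary."""
            clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
            margin = np.abs(clf.decision_function(X_pool))   # small margin = most uncertain
            return np.argsort(margin)[:batch_size]

        # Loop: label the queried documents, add them to the training set, refit, and
        # compare accuracy/AUC against a passive (random sampling) baseline.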

  17. Active learning for clinical text classification: is it better than random sampling?

    PubMed Central

    Figueroa, Rosa L; Ngo, Long H; Goryachev, Sergey; Wiechmann, Eduardo P

    2012-01-01

    Objective This study explores active learning algorithms as a way to reduce the requirements for large training sets in medical text classification tasks. Design Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results. Measurements Classification accuracy and area under receiver operating characteristics (ROC) curves for each algorithm at different sample sizes were generated. The performance of active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why the performance varies on different datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated the results with the performance differences. Results The DIST and CMB algorithms performed better than passive learning. With a statistical significance level set at 0.05, DIST outperformed passive learning in all five datasets, while CMB was found to be better than passive learning in four datasets. We found strong correlations between the dataset diversity and the DIV performance, as well as the dataset uncertainty and the performance of the DIST algorithm. Conclusion For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty. PMID:22707743

  18. SAFE: SPARQL Federation over RDF Data Cubes with Access Control.

    PubMed

    Khan, Yasar; Saleem, Muhammad; Mehdi, Muntazir; Hogan, Aidan; Mehmood, Qaiser; Rebholz-Schuhmann, Dietrich; Sahay, Ratnesh

    2017-02-01

    Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains, resources are sensitive and access to these resources is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain, real-world datasets contain sensitive statistical information: strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisations, etc. Therefore, the legal and ethical concerns of (i) preserving the anonymity of patients (or clinical subjects) and (ii) respecting data ownership through access control are key challenges faced by the data analytics community working within the HCLS domain. Likewise, statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has looked to optimise federated queries over such statistical data. We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE's indexing system does not hold any data instances; it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change. We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces the source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.

  19. Clinical Utility of Risk Models to Refer Patients with Adnexal Masses to Specialized Oncology Care: Multicenter External Validation Using Decision Curve Analysis.

    PubMed

    Wynants, Laure; Timmerman, Dirk; Verbakel, Jan Y; Testa, Antonia; Savelli, Luca; Fischerova, Daniela; Franchi, Dorella; Van Holsbeke, Caroline; Epstein, Elisabeth; Froyman, Wouter; Guerriero, Stefano; Rossi, Alberto; Fruscio, Robert; Leone, Francesco Pg; Bourne, Tom; Valentin, Lil; Van Calster, Ben

    2017-09-01

Purpose: To evaluate the utility of preoperative diagnostic models for ovarian cancer based on ultrasound and/or biomarkers for referring patients to specialized oncology care. The investigated models were RMI, ROMA, and 3 models from the International Ovarian Tumor Analysis (IOTA) group [LR2, ADNEX, and the Simple Rules risk score (SRRisk)]. Experimental Design: A secondary analysis of prospectively collected data from 2 cross-sectional cohort studies was performed to externally validate diagnostic models. A total of 2,763 patients (2,403 in dataset 1 and 360 in dataset 2) from 18 centers (11 oncology centers and 7 nononcology hospitals) in 6 countries participated. Excised tissue was histologically classified as benign or malignant. The clinical utility of the preoperative diagnostic models was assessed with net benefit (NB) at a range of risk thresholds (5%-50% risk of malignancy) to refer patients to specialized oncology care. We visualized results with decision curves and generated bootstrap confidence intervals. Results: The prevalence of malignancy was 41% in dataset 1 and 40% in dataset 2. For thresholds up to 10% to 15%, RMI and ROMA had a lower NB than referring all patients. SRRisk and ADNEX demonstrated the highest NB. At a threshold of 20%, the NBs of ADNEX, SRRisk, and RMI were 0.348, 0.350, and 0.270, respectively. Results by menopausal status and type of center (oncology vs. nononcology) were similar. Conclusions: All tested IOTA methods, especially ADNEX and SRRisk, are clinically more useful than RMI and ROMA to select patients with adnexal masses for specialized oncology care. Clin Cancer Res; 23(17); 5082-90. ©2017 American Association for Cancer Research.
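
    A hedged sketch of the net-benefit calculation that underlies a decision curve, evaluated at several referral thresholds against a "refer all" strategy; the risks and outcomes below are simulated rather than taken from the IOTA validation cohorts.

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.binomial(1, 0.4, size=2000)                          # ~40% malignancy prevalence
    risk = np.clip(0.4 * y + rng.normal(0.3, 0.2, 2000), 0, 1)   # toy risk model output

    def net_benefit(y, risk, pt):
        """NB = TP/n - FP/n * pt/(1-pt) at referral threshold pt."""
        refer = risk >= pt
        n = len(y)
        tp = np.sum(refer & (y == 1))
        fp = np.sum(refer & (y == 0))
        return tp / n - fp / n * pt / (1 - pt)

    for pt in (0.05, 0.10, 0.20, 0.30, 0.50):
        nb_model = net_benefit(y, risk, pt)
        nb_all = y.mean() - (1 - y.mean()) * pt / (1 - pt)       # refer every patient
        print(f"threshold {pt:.2f}: model NB={nb_model:.3f}, refer-all NB={nb_all:.3f}")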

  20. Gaussian diffusion sinogram inpainting for X-ray CT metal artifact reduction.

    PubMed

    Peng, Chengtao; Qiu, Bensheng; Li, Ming; Guan, Yihui; Zhang, Cheng; Wu, Zhongyi; Zheng, Jian

    2017-01-05

Metal objects implanted in the bodies of patients usually generate severe streaking artifacts in reconstructed images of X-ray computed tomography, which degrade the image quality and affect the diagnosis of disease. Therefore, it is essential to reduce these artifacts to meet the clinical demands. In this work, we propose a Gaussian diffusion sinogram inpainting metal artifact reduction algorithm based on prior images to reduce these artifacts for fan-beam computed tomography reconstruction. In this algorithm, prior information that originated from a tissue-classified prior image is used for the inpainting of metal-corrupted projections, and it is incorporated into a Gaussian diffusion function. The prior knowledge is particularly designed to locate the diffusion position and improve the sparsity of the subtraction sinogram, which is obtained by subtracting the prior sinogram of the metal regions from the original sinogram. The sinogram inpainting algorithm is implemented through an approach of diffusing prior energy and is then solved by gradient descent. The performance of the proposed metal artifact reduction algorithm is compared with two conventional metal artifact reduction algorithms, namely the interpolation metal artifact reduction algorithm and normalized metal artifact reduction algorithm. The experimental datasets used included both simulated and clinical datasets. By evaluating the results subjectively, the proposed metal artifact reduction algorithm causes fewer secondary artifacts than the two conventional metal artifact reduction algorithms, which lead to severe secondary artifacts resulting from inappropriate interpolation and normalization. Additionally, the objective evaluation shows the proposed approach has the smallest normalized mean absolute deviation and the highest signal-to-noise ratio, indicating that the proposed method produced images of the best quality. For both the simulated and the clinical datasets, the proposed algorithm noticeably reduced the metal artifacts.
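
    For orientation, the sketch below implements the conventional interpolation-based metal artifact reduction baseline that the proposed Gaussian diffusion method is compared against (not the proposed method itself): detector bins inside an assumed metal trace are replaced view by view with linear interpolation from neighbouring bins, using synthetic projection data.

    import numpy as np

    n_views, n_bins = 360, 512
    sinogram = np.random.rand(n_views, n_bins)          # stand-in projection data
    metal_trace = np.zeros_like(sinogram, dtype=bool)
    metal_trace[:, 250:262] = True                      # assumed metal shadow

    def interpolate_metal_trace(sino, trace):
        """Replace metal-corrupted bins with linear interpolation, per view."""
        out = sino.copy()
        bins = np.arange(sino.shape[1])
        for v in range(sino.shape[0]):
            bad = trace[v]
            if bad.any():
                out[v, bad] = np.interp(bins[bad], bins[~bad], sino[v, ~bad])
        return out

    corrected = interpolate_metal_trace(sinogram, metal_trace)
    print("corrected sinogram shape:", corrected.shape)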

  1. Robust optimization methods for cardiac sparing in tangential breast IMRT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mahmoudzadeh, Houra, E-mail: houra@mie.utoronto.ca; Lee, Jenny; Chan, Timothy C. Y.

Purpose: In left-sided tangential breast intensity modulated radiation therapy (IMRT), the heart may enter the radiation field and receive excessive radiation while the patient is breathing. The patient’s breathing pattern is often irregular and unpredictable. We verify the clinical applicability of a heart-sparing robust optimization approach for breast IMRT. We compare robust optimized plans with clinical plans at free-breathing and clinical plans at deep inspiration breath-hold (DIBH) using active breathing control (ABC). Methods: Eight patients were included in the study with each patient simulated using 4D-CT. The 4D-CT image acquisition generated ten breathing phase datasets. An average scan was constructed using all the phase datasets. Two of the eight patients were also imaged at breath-hold using ABC. The 4D-CT datasets were used to calculate the accumulated dose for robust optimized and clinical plans based on deformable registration. We generated a set of simulated breathing probability mass functions, which represent the fraction of time patients spend in different breathing phases. The robust optimization method was applied to each patient using a set of dose-influence matrices extracted from the 4D-CT data and a model of the breathing motion uncertainty. The goal of the optimization models was to minimize the dose to the heart while ensuring dose constraints on the target were achieved under breathing motion uncertainty. Results: Robust optimized plans were improved or equivalent to the clinical plans in terms of heart sparing for all patients studied. The robust method reduced the accumulated heart dose (D10cc) by up to 801 cGy compared to the clinical method while also improving the coverage of the accumulated whole breast target volume. On average, the robust method reduced the heart dose (D10cc) by 364 cGy and improved the optBreast dose (D99%) by 477 cGy. In addition, the robust method had smaller deviations from the planned dose to the accumulated dose. The deviation of the accumulated dose from the planned dose for the optBreast (D99%) was 12 cGy for robust versus 445 cGy for clinical. The deviation for the heart (D10cc) was 41 cGy for robust and 320 cGy for clinical. Conclusions: The robust optimization approach can reduce heart dose compared to the clinical method at free-breathing and can potentially reduce the need for breath-hold techniques.
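
    A toy linear-programming sketch of the robust idea described above: minimise the worst-case mean heart dose over a small set of breathing probability mass functions while enforcing target coverage in every scenario. The dose-influence matrices, scenario set and prescription are random stand-ins, not the 4D-CT data or the authors' full model.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(2)
    n_beamlets, n_phases, n_target, n_heart = 30, 4, 40, 15
    D_target = rng.random((n_phases, n_target, n_beamlets))   # dose influence per phase
    D_heart = 0.3 * rng.random((n_phases, n_heart, n_beamlets))
    pmfs = [np.array([0.4, 0.3, 0.2, 0.1]),                   # breathing-motion scenarios
            np.array([0.1, 0.2, 0.3, 0.4]),
            np.full(4, 0.25)]
    prescription = 1.0

    A_ub, b_ub = [], []
    for q in pmfs:
        Dt = np.tensordot(q, D_target, axes=1)                # accumulated target dose matrix
        Dh = np.tensordot(q, D_heart, axes=1).mean(axis=0)    # mean heart dose row vector
        # coverage in this scenario: Dt @ w >= prescription  ->  -Dt @ w <= -prescription
        A_ub.append(np.hstack([-Dt, np.zeros((n_target, 1))]))
        b_ub.append(np.full(n_target, -prescription))
        # worst case: mean heart dose <= t  ->  Dh @ w - t <= 0
        A_ub.append(np.hstack([Dh, [-1.0]])[None, :])
        b_ub.append([0.0])

    c = np.zeros(n_beamlets + 1)
    c[-1] = 1.0                                               # minimise t, the worst-case heart dose
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub))
    print("worst-case mean heart dose:", round(res.x[-1], 3))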

  2. Building a national Infection Intelligence Platform to improve antimicrobial stewardship and drive better patient outcomes: the Scottish experience.

    PubMed

    Bennie, Marion; Malcolm, William; Marwick, Charis A; Kavanagh, Kimberley; Sneddon, Jean; Nathwani, Dilip

    2017-10-01

    The better use of new and emerging data streams to understand the epidemiology of infectious disease and to inform and evaluate antimicrobial stewardship improvement programmes is paramount in the global fight against antimicrobial resistance. To create a national informatics platform that synergizes the wealth of disjointed, infection-related health data, building an intelligence capability that allows rapid enquiry, generation of new knowledge and feedback to clinicians and policy makers. A multi-stakeholder community, led by the Scottish Antimicrobial Prescribing Group, secured government funding to deliver a national programme of work centred on three key aspects: (i) technical platform development with record linkage capability across multiple datasets; (ii) a proportionate governance approach to enhance responsiveness; and (iii) generation of new evidence to guide clinical practice. The National Health Service Scotland Infection Intelligence Platform (IIP) is now hosted within the national health data repository to assure resilience and sustainability. New technical solutions include simplified 'data views' of complex, linked datasets and embedded statistical programs to enhance capability. These developments have enabled responsiveness, flexibility and robustness in conducting population-based studies including a focus on intended and unintended effects of antimicrobial stewardship interventions and quantification of infection risk factors and clinical outcomes. We have completed the build and test phase of IIP, overcoming the technical and governance challenges, and produced new capability in infection informatics, generating new evidence for improved clinical practice. This provides a foundation for expansion and opportunity for global collaborations. © The Author 2017. Published by Oxford University Press on behalf of the British Society for Antimicrobial Chemotherapy. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  3. Applications of the LBA-ECO Metadata Warehouse

    NASA Astrophysics Data System (ADS)

    Wilcox, L.; Morrell, A.; Griffith, P. C.

    2006-05-01

    The LBA-ECO Project Office has developed a system to harvest and warehouse metadata resulting from the Large-Scale Biosphere Atmosphere Experiment in Amazonia. The harvested metadata is used to create dynamically generated reports, available at www.lbaeco.org, which facilitate access to LBA-ECO datasets. The reports are generated for specific controlled vocabulary terms (such as an investigation team or a geospatial region), and are cross-linked with one another via these terms. This approach creates a rich contextual framework enabling researchers to find datasets relevant to their research. It maximizes data discovery by association and provides a greater understanding of the scientific and social context of each dataset. For example, our website provides a profile (e.g. participants, abstract(s), study sites, and publications) for each LBA-ECO investigation. Linked from each profile is a list of associated registered dataset titles, each of which link to a dataset profile that describes the metadata in a user-friendly way. The dataset profiles are generated from the harvested metadata, and are cross-linked with associated reports via controlled vocabulary terms such as geospatial region. The region name appears on the dataset profile as a hyperlinked term. When researchers click on this link, they find a list of reports relevant to that region, including a list of dataset titles associated with that region. Each dataset title in this list is hyperlinked to its corresponding dataset profile. Moreover, each dataset profile contains hyperlinks to each associated data file at its home data repository and to publications that have used the dataset. We also use the harvested metadata in administrative applications to assist quality assurance efforts. These include processes to check for broken hyperlinks to data files, automated emails that inform our administrators when critical metadata fields are updated, dynamically generated reports of metadata records that link to datasets with questionable file formats, and dynamically generated region/site coordinate quality assurance reports. These applications are as important as those that facilitate access to information because they help ensure a high standard of quality for the information. This presentation will discuss reports currently in use, provide a technical overview of the system, and discuss plans to extend this system to harvest metadata resulting from the North American Carbon Program by drawing on datasets in many different formats, residing in many thematic data centers and also distributed among hundreds of investigators.

  4. Using classification models for the generation of disease-specific medications from biomedical literature and clinical data repository.

    PubMed

    Wang, Liqin; Haug, Peter J; Del Fiol, Guilherme

    2017-05-01

Mining disease-specific associations from existing knowledge resources can be useful for building disease-specific ontologies and supporting knowledge-based applications. Many association mining techniques have been exploited. However, the challenge remains when those extracted associations contain much noise. It is unreliable to determine the relevance of the association by simply setting up arbitrary cut-off points on multiple scores of relevance; and it would be expensive to ask human experts to manually review a large number of associations. We propose that machine-learning-based classification can be used to separate the signal from the noise, and to provide a feasible approach to create and maintain disease-specific vocabularies. We initially focused on disease-medication associations for the purpose of simplicity. For a disease of interest, we extracted potentially treatment-related drug concepts from biomedical literature citations and from a local clinical data repository. Each concept was associated with multiple measures of relevance (i.e., features) such as frequency of occurrence. For the purpose of machine learning, we formed nine datasets for three diseases, with each disease having two single-source datasets and one from the combination of the previous two datasets. All the datasets were labeled using existing reference standards. Thereafter, we conducted two experiments: (1) to test if adding features from the clinical data repository would improve the performance of classification achieved using features from the biomedical literature only, and (2) to determine if classifier(s) trained with known medication-disease data sets would be generalizable to new disease(s). Simple logistic regression and LogitBoost were two classifiers identified as the preferred models separately for the biomedical-literature datasets and combined datasets. The performance of the classification using combined features provided significant improvement beyond that using biomedical-literature features alone (p-value<0.001). The performance of the classifier built from known diseases to predict associated concepts for new diseases showed no significant difference from the performance of the classifier built and tested using the new disease's dataset. It is feasible to use classification approaches to automatically predict the relevance of a concept to a disease of interest. It is useful to combine features from disparate sources for the task of classification. Classifiers built from known diseases were generalizable to new diseases. Copyright © 2017 Elsevier Inc. All rights reserved.
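
    A hedged sketch of the classification step: each candidate drug concept carries relevance features from the literature and from a clinical repository, and a plain logistic-regression classifier is cross-validated on literature-only versus combined features. The features and reference-standard labels are simulated, not the study's data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    n_concepts = 500
    lit_features = rng.poisson(3, size=(n_concepts, 4)).astype(float)   # e.g. citation counts
    ehr_features = rng.poisson(2, size=(n_concepts, 3)).astype(float)   # e.g. order frequencies
    labels = rng.binomial(1, 1 / (1 + np.exp(-(lit_features[:, 0] + ehr_features[:, 0] - 5))))

    X_lit = lit_features
    X_combined = np.hstack([lit_features, ehr_features])

    for name, X in [("literature only", X_lit), ("combined", X_combined)]:
        auc = cross_val_score(LogisticRegression(max_iter=1000), X, labels,
                              cv=5, scoring="roc_auc").mean()
        print(f"{name}: mean cross-validated AUC = {auc:.3f}")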

  5. Generation of openEHR Test Datasets for Benchmarking.

    PubMed

    El Helou, Samar; Karvonen, Tuukka; Yamamoto, Goshiro; Kume, Naoto; Kobayashi, Shinji; Kondo, Eiji; Hiragi, Shusuke; Okamoto, Kazuya; Tamura, Hiroshi; Kuroda, Tomohiro

    2017-01-01

    openEHR is a widely used EHR specification. Given its technology-independent nature, different approaches for implementing openEHR data repositories exist. Public openEHR datasets are needed to conduct benchmark analyses over different implementations. To address their current unavailability, we propose a method for generating openEHR test datasets that can be publicly shared and used.

  6. Molecular Subtypes of Glioblastoma Are Relevant to Lower Grade Glioma

    PubMed Central

    Sloan, Andrew E.; Chen, Yanwen; Brat, Daniel J.; O’Neill, Brian Patrick; de Groot, John; Yust-Katz, Shlomit; Yung, Wai-Kwan Alfred; Cohen, Mark L.; Aldape, Kenneth D.; Rosenfeld, Steven; Verhaak, Roeland G. W.; Barnholtz-Sloan, Jill S.

    2014-01-01

Background Gliomas are the most common primary malignant brain tumors in adults with great heterogeneity in histopathology and clinical course. The intent was to evaluate the relevance of known glioblastoma (GBM) expression and methylation based subtypes to grade II and III gliomas (i.e., lower grade gliomas). Methods Gene expression array, single nucleotide polymorphism (SNP) array and clinical data were obtained for 228 GBMs and 176 grade II/III gliomas (GII/III) from the publicly available Rembrandt dataset. Two additional datasets with IDH1 mutation status were utilized as validation datasets (one publicly available dataset and one newly generated dataset from MD Anderson). Unsupervised clustering was performed and compared to gene expression subtypes assigned using the Verhaak et al 840-gene classifier. The glioma-CpG Island Methylator Phenotype (G-CIMP) was assigned using prediction models by Fine et al. Results Unsupervised clustering by gene expression aligned with the Verhaak 840-gene subtype group assignments. GII/IIIs were preferentially assigned to the proneural subtype with IDH1 mutation and G-CIMP. GBMs were evenly distributed among the four subtypes. Proneural, IDH1 mutant, G-CIMP GII/IIIs had significantly better survival than other molecular subtypes. Only 6% of GBMs were proneural and had either IDH1 mutation or G-CIMP but these tumors had significantly better survival than other GBMs. Copy number changes in chromosomes 1p and 19q were associated with GII/IIIs, while these changes in CDKN2A, PTEN and EGFR were more commonly associated with GBMs. Conclusions GBM gene-expression and methylation based subtypes are relevant for GII/IIIs and associate with overall survival differences. A better understanding of the association between these subtypes and GII/IIIs could further knowledge regarding prognosis and mechanisms of glioma progression. PMID:24614622

  7. Countering imbalanced datasets to improve adverse drug event predictive models in labor and delivery.

    PubMed

    Taft, L M; Evans, R S; Shyu, C R; Egger, M J; Chawla, N; Mitchell, J A; Thornton, S N; Bray, B; Varner, M

    2009-04-01

    The IOM report, Preventing Medication Errors, emphasizes the overall lack of knowledge of the incidence of adverse drug events (ADE). Operating rooms, emergency departments and intensive care units are known to have a higher incidence of ADE. Labor and delivery (L&D) is an emergency care unit that could have an increased risk of ADE, where reported rates remain low and under-reporting is suspected. Risk factor identification with electronic pattern recognition techniques could improve ADE detection rates. The objective of the present study is to apply Synthetic Minority Over Sampling Technique (SMOTE) as an enhanced sampling method in a sparse dataset to generate prediction models to identify ADE in women admitted for labor and delivery based on patient risk factors and comorbidities. By creating synthetic cases with the SMOTE algorithm and using a 10-fold cross-validation technique, we demonstrated improved performance of the Naïve Bayes and the decision tree algorithms. The true positive rate (TPR) of 0.32 in the raw dataset increased to 0.67 in the 800% over-sampled dataset. Enhanced performance from classification algorithms can be attained with the use of synthetic minority class oversampling techniques in sparse clinical datasets. Predictive models created in this manner can be used to develop evidence based ADE monitoring systems.
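
    A minimal sketch of the SMOTE-plus-classifier pipeline with 10-fold cross-validation, using the imbalanced-learn package and simulated data in place of the labor and delivery records; the point is the mechanics of synthetic minority over-sampling, not the study's results.

    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(4)
    X = rng.normal(size=(1000, 12))                       # risk factors / comorbidities
    y = (rng.random(1000) < 0.05).astype(int)             # ~5% adverse drug events

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # add synthetic minority cases

    for name, data in [("raw", (X, y)), ("SMOTE-balanced", (X_res, y_res))]:
        recall = cross_val_score(GaussianNB(), *data, cv=10, scoring="recall").mean()
        print(f"{name}: mean 10-fold recall (true positive rate) = {recall:.2f}")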

  8. Combining users' activity survey and simulators to evaluate human activity recognition systems.

    PubMed

    Azkune, Gorka; Almeida, Aitor; López-de-Ipiña, Diego; Chen, Liming

    2015-04-08

    Evaluating human activity recognition systems usually implies following expensive and time-consuming methodologies, where experiments with humans are run with the consequent ethical and legal issues. We propose a novel evaluation methodology to overcome the enumerated problems, which is based on surveys for users and a synthetic dataset generator tool. Surveys allow capturing how different users perform activities of daily living, while the synthetic dataset generator is used to create properly labelled activity datasets modelled with the information extracted from surveys. Important aspects, such as sensor noise, varying time lapses and user erratic behaviour, can also be simulated using the tool. The proposed methodology is shown to have very important advantages that allow researchers to carry out their work more efficiently. To evaluate the approach, a synthetic dataset generated following the proposed methodology is compared to a real dataset computing the similarity between sensor occurrence frequencies. It is concluded that the similarity between both datasets is more than significant.
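
    A small sketch of the evaluation idea: compare sensor occurrence frequencies between a synthetic and a real activity dataset, here with cosine similarity over invented sensor names and counts.

    import numpy as np

    sensors = ["kettle", "fridge", "tap", "toothbrush", "bed_pressure"]
    real_counts = np.array([120, 340, 410, 95, 200], dtype=float)
    synthetic_counts = np.array([110, 360, 395, 80, 215], dtype=float)

    real_freq = real_counts / real_counts.sum()
    synth_freq = synthetic_counts / synthetic_counts.sum()

    cosine = real_freq @ synth_freq / (np.linalg.norm(real_freq) * np.linalg.norm(synth_freq))
    print(f"cosine similarity of sensor occurrence frequencies: {cosine:.3f}")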

  9. Generation of a suite of 3D computer-generated breast phantoms from a limited set of human subject data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hsu, Christina M. L.; Palmeri, Mark L.; Department of Anesthesiology, Duke University Medical Center, Durham, North Carolina 27710

    2013-04-15

Purpose: The authors previously reported on a three-dimensional computer-generated breast phantom, based on empirical human image data, including a realistic finite-element based compression model that was capable of simulating multimodality imaging data. The computerized breast phantoms are a hybrid of two phantom generation techniques, combining empirical breast CT (bCT) data with flexible computer graphics techniques. However, to date, these phantoms have been based on single human subjects. In this paper, the authors report on a new method to generate multiple phantoms, simulating additional subjects from the limited set of original dedicated breast CT data. The authors developed an image morphing technique to construct new phantoms by gradually transitioning between two human subject datasets, with the potential to generate hundreds of additional pseudoindependent phantoms from the limited bCT cases. The authors conducted a preliminary subjective assessment with a limited number of observers (n = 4) to illustrate how realistic the simulated images generated with the pseudoindependent phantoms appeared. Methods: Several mesh-based geometric transformations were developed to generate distorted breast datasets from the original human subject data. Segmented bCT data from two different human subjects were used as the 'base' and 'target' for morphing. Several combinations of transformations were applied to morph between the 'base' and 'target' datasets such as changing the breast shape, rotating the glandular data, and changing the distribution of the glandular tissue. Following the morphing, regions of skin and fat were assigned to the morphed dataset in order to appropriately assign mechanical properties during the compression simulation. The resulting morphed breast was compressed using a finite element algorithm and simulated mammograms were generated using techniques described previously. Sixty-two simulated mammograms, generated from morphing three human subject datasets, were used in a preliminary observer evaluation where four board certified breast radiologists with varying amounts of experience ranked the level of realism (from 1 = 'fake' to 10 = 'real') of the simulated images. Results: The morphing technique was able to successfully generate new and unique morphed datasets from the original human subject data. The radiologists evaluated the realism of simulated mammograms generated from the morphed and unmorphed human subject datasets and scored the realism with an average ranking of 5.87 ± 1.99, confirming that overall the phantom image datasets appeared more 'real' than 'fake.' Moreover, there was not a significant difference (p > 0.1) between the realism of the unmorphed datasets (6.0 ± 1.95) compared to the morphed datasets (5.86 ± 1.99). Three of the four observers had overall average rankings of 6.89 ± 0.89, 6.9 ± 1.24, 6.76 ± 1.22, whereas the fourth observer ranked them noticeably lower at 2.94 ± 0.7. Conclusions: This work presents a technique that can be used to generate a suite of realistic computerized breast phantoms from a limited number of human subjects. This suite of flexible breast phantoms can be used for multimodality imaging research to provide a known truth while concurrently producing realistic simulated imaging data.

  10. TH-AB-207A-05: A Fully-Automated Pipeline for Generating CT Images Across a Range of Doses and Reconstruction Methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Young, S; Lo, P; Hoffman, J

Purpose: To evaluate the robustness of CAD or Quantitative Imaging methods, they should be tested on a variety of cases and under a variety of image acquisition and reconstruction conditions that represent the heterogeneity encountered in clinical practice. The purpose of this work was to develop a fully-automated pipeline for generating CT images that represent a wide range of dose and reconstruction conditions. Methods: The pipeline consists of three main modules: reduced-dose simulation, image reconstruction, and quantitative analysis. The first two modules of the pipeline can be operated in a completely automated fashion, using configuration files and running the modules in a batch queue. The input to the pipeline is raw projection CT data; this data is used to simulate different levels of dose reduction using a previously-published algorithm. Filtered-backprojection reconstructions are then performed using FreeCT-wFBP, a freely-available reconstruction software for helical CT. We also added support for an in-house, model-based iterative reconstruction algorithm using iterative coordinate-descent optimization, which may be run in tandem with the more conventional recon methods. The reduced-dose simulations and image reconstructions are controlled automatically by a single script, and they can be run in parallel on our research cluster. The pipeline was tested on phantom and lung screening datasets from a clinical scanner (Definition AS, Siemens Healthcare). Results: The images generated from our test datasets appeared to represent a realistic range of acquisition and reconstruction conditions that we would expect to find clinically. The time to generate images was approximately 30 minutes per dose/reconstruction combination on a hybrid CPU/GPU architecture. Conclusion: The automated research pipeline promises to be a useful tool for either training or evaluating performance of quantitative imaging software such as classifiers and CAD algorithms across the range of acquisition and reconstruction parameters present in the clinical environment. Funding support: NIH U01 CA181156; Disclosures (McNitt-Gray): Institutional research agreement, Siemens Healthcare; Past recipient, research grant support, Siemens Healthcare; Consultant, Toshiba America Medical Systems; Consultant, Samsung Electronics.

  11. Context-Aware Generative Adversarial Privacy

    NASA Astrophysics Data System (ADS)

    Huang, Chong; Kairouz, Peter; Chen, Xiao; Sankar, Lalitha; Rajagopal, Ram

    2017-12-01

    Preserving the utility of published datasets while simultaneously providing provable privacy guarantees is a well-known challenge. On the one hand, context-free privacy solutions, such as differential privacy, provide strong privacy guarantees, but often lead to a significant reduction in utility. On the other hand, context-aware privacy solutions, such as information theoretic privacy, achieve an improved privacy-utility tradeoff, but assume that the data holder has access to dataset statistics. We circumvent these limitations by introducing a novel context-aware privacy framework called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to allow the data holder to learn privatization schemes from the dataset itself. Under GAP, learning the privacy mechanism is formulated as a constrained minimax game between two players: a privatizer that sanitizes the dataset in a way that limits the risk of inference attacks on the individuals' private variables, and an adversary that tries to infer the private variables from the sanitized dataset. To evaluate GAP's performance, we investigate two simple (yet canonical) statistical dataset models: (a) the binary data model, and (b) the binary Gaussian mixture model. For both models, we derive game-theoretically optimal minimax privacy mechanisms, and show that the privacy mechanisms learned from data (in a generative adversarial fashion) match the theoretically optimal ones. This demonstrates that our framework can be easily applied in practice, even in the absence of dataset statistics.
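
    A compact sketch of the GAP training loop on a toy binary data model: a privatizer perturbs the public record, an adversary tries to recover the private bit from the sanitized release, and the two networks are updated in alternation. The architectures, distortion penalty and data model are illustrative choices, not the paper's derivations.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    n = 4096
    S = torch.randint(0, 2, (n, 1)).float()              # private variable
    X = S + 0.5 * torch.randn(n, 1)                      # correlated public data

    privatizer = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))
    adversary = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))
    opt_p = torch.optim.Adam(privatizer.parameters(), lr=1e-2)
    opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-2)
    bce = nn.BCEWithLogitsLoss()
    rho = 1.0                                            # distortion penalty weight (assumed)

    for step in range(2000):
        X_hat = X + privatizer(X)                        # sanitized release
        # adversary step: minimise its inference loss on the sanitized data
        loss_a = bce(adversary(X_hat.detach()), S)
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()
        # privatizer step: maximise adversary loss while limiting distortion
        loss_p = -bce(adversary(X_hat), S) + rho * ((X_hat - X) ** 2).mean()
        opt_p.zero_grad(); loss_p.backward(); opt_p.step()

    print("final adversary loss (higher = harder to infer the private bit):", loss_a.item())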

  12. Prediction of clinical behaviour and treatment for cancers.

    PubMed

    Futschik, Matthias E; Sullivan, Mike; Reeve, Anthony; Kasabov, Nikola

    2003-01-01

    Prediction of clinical behaviour and treatment for cancers is based on the integration of clinical and pathological parameters. Recent reports have demonstrated that gene expression profiling provides a powerful new approach for determining disease outcome. If clinical and microarray data each contain independent information then it should be possible to combine these datasets to gain more accurate prognostic information. Here, we have used existing clinical information and microarray data to generate a combined prognostic model for outcome prediction for diffuse large B-cell lymphoma (DLBCL). A prediction accuracy of 87.5% was achieved. This constitutes a significant improvement compared to the previously most accurate prognostic model with an accuracy of 77.6%. The model introduced here may be generally applicable to the combination of various types of molecular and clinical data for improving medical decision support systems and individualising patient care.

  13. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    PubMed Central

    Jensen, Tue V.; Pinson, Pierre

    2017-01-01

Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation. PMID:29182600

  14. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system.

    PubMed

    Jensen, Tue V; Pinson, Pierre

    2017-11-28

Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  15. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    NASA Astrophysics Data System (ADS)

    Jensen, Tue V.; Pinson, Pierre

    2017-11-01

Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  16. Integrating multiple immunogenetic data sources for feature extraction and mining somatic hypermutation patterns: the case of "towards analysis" in chronic lymphocytic leukaemia.

    PubMed

    Kavakiotis, Ioannis; Xochelli, Aliki; Agathangelidis, Andreas; Tsoumakas, Grigorios; Maglaveras, Nicos; Stamatopoulos, Kostas; Hadzidimitriou, Anastasia; Vlahavas, Ioannis; Chouvarda, Ioanna

    2016-06-06

Somatic Hypermutation (SHM) refers to the introduction of mutations within rearranged V(D)J genes, a process that increases the diversity of Immunoglobulins (IGs). The analysis of SHM has offered critical insight into the physiology and pathology of B cells, leading to strong prognostication markers for clinical outcome in chronic lymphocytic leukaemia (CLL), the most frequent adult B-cell malignancy. In this paper we present a methodology for integrating multiple immunogenetic and clinicobiological data sources in order to extract features and create high quality datasets for SHM analysis in IG receptors of CLL patients. This dataset is used as the basis for a higher level integration procedure, inspired from social choice theory. This is applied in the Towards Analysis, our attempt to investigate the potential ontogenetic transformation of genes belonging to specific stereotyped CLL subsets towards other genes or gene families, through SHM. The data integration process, followed by feature extraction, resulted in the generation of a dataset containing information about mutations occurring through SHM. The Towards analysis, performed on the integrated dataset applying voting techniques, revealed the distinct behaviour of subset #201 compared to other subsets, as regards SHM related movements among gene clans, both in allele-conserved and non-conserved gene areas. With respect to movement between genes, a high percentage movement towards pseudo genes was found in all CLL subsets. This data integration and feature extraction process can set the basis for exploratory analysis or a fully automated computational data mining approach on many as yet unanswered, clinically relevant biological questions.

  17. Data-driven decision support for radiologists: re-using the National Lung Screening Trial dataset for pulmonary nodule management.

    PubMed

    Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L

    2015-02-01

    Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used for both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
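
    A sketch of the cohort-matching idea using an in-memory SQL table queried in real time for patients similar to an index case; the schema, column names and rows are invented for illustration and are not the NLST data model.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE participants (
        id INTEGER PRIMARY KEY, age INTEGER, pack_years REAL,
        nodule_size_mm REAL, nodule_type TEXT, cancer INTEGER)""")
    conn.executemany(
        "INSERT INTO participants VALUES (?,?,?,?,?,?)",
        [(1, 62, 40, 6.0, "solid", 0), (2, 66, 55, 9.5, "part-solid", 1),
         (3, 59, 30, 7.2, "solid", 0), (4, 70, 60, 11.0, "solid", 1)])

    # index patient: 64 years old with an 8 mm solid nodule -> pull the matching cohort
    row = conn.execute("""
        SELECT COUNT(*) AS n, AVG(cancer) AS cancer_rate
        FROM participants
        WHERE age BETWEEN 59 AND 69
          AND nodule_size_mm BETWEEN 6 AND 10
          AND nodule_type = 'solid'""").fetchone()
    print(f"cohort size={row[0]}, observed cancer rate={row[1]:.2f}")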

  18. GEnomes Management Application (GEM.app): a new software tool for large-scale collaborative genome analysis.

    PubMed

    Gonzalez, Michael A; Lebrigio, Rafael F Acosta; Van Booven, Derek; Ulloa, Rick H; Powell, Eric; Speziani, Fiorella; Tekin, Mustafa; Schüle, Rebecca; Züchner, Stephan

    2013-06-01

    Novel genes are now identified at a rapid pace for many Mendelian disorders, and increasingly, for genetically complex phenotypes. However, new challenges have also become evident: (1) effectively managing larger exome and/or genome datasets, especially for smaller labs; (2) direct hands-on analysis and contextual interpretation of variant data in large genomic datasets; and (3) many small and medium-sized clinical and research-based investigative teams around the world are generating data that, if combined and shared, will significantly increase the opportunities for the entire community to identify new genes. To address these challenges, we have developed GEnomes Management Application (GEM.app), a software tool to annotate, manage, visualize, and analyze large genomic datasets (https://genomics.med.miami.edu/). GEM.app currently contains ∼1,600 whole exomes from 50 different phenotypes studied by 40 principal investigators from 15 different countries. The focus of GEM.app is on user-friendly analysis for nonbioinformaticians to make next-generation sequencing data directly accessible. Yet, GEM.app provides powerful and flexible filter options, including single family filtering, across family/phenotype queries, nested filtering, and evaluation of segregation in families. In addition, the system is fast, obtaining results within 4 sec across ∼1,200 exomes. We believe that this system will further enhance identification of genetic causes of human disease. © 2013 Wiley Periodicals, Inc.

  19. TH-A-9A-01: Active Optical Flow Model: Predicting Voxel-Level Dose Prediction in Spine SBRT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, J; Wu, Q.J.; Yin, F

    2014-06-15

Purpose: To predict voxel-level dose distribution and enable effective evaluation of cord dose sparing in spine SBRT. Methods: We present an active optical flow model (AOFM) to statistically describe cord dose variations and train a predictive model to represent correlations between AOFM and PTV contours. Thirty clinically accepted spine SBRT plans are evenly divided into training and testing datasets. The development of the predictive model consists of 1) collecting a sequence of dose maps including PTV and OAR (spinal cord) as well as a set of associated PTV contours adjacent to OAR from the training dataset, 2) classifying data into five groups based on PTV's locations relative to OAR, two “Top”s, “Left”, “Right”, and “Bottom”, 3) randomly selecting a dose map as the reference in each group and applying rigid registration and optical flow deformation to match all other maps to the reference, 4) building AOFM by importing optical flow vectors and dose values into the principal component analysis (PCA), 5) applying another PCA to features of PTV and OAR contours to generate an active shape model (ASM), and 6) computing a linear regression model of correlations between AOFM and ASM. When predicting dose distribution of a new case in the testing dataset, the PTV is first assigned to a group based on its contour characteristics. Contour features are then transformed into ASM's principal coordinates of the selected group. Finally, voxel-level dose distribution is determined by mapping from the ASM space to the AOFM space using the predictive model. Results: The DVHs predicted by the AOFM-based model and those in clinical plans are comparable in training and testing datasets. At 2% volume the dose difference between predicted and clinical plans is 4.2±4.4% and 3.3±3.5% in the training and testing datasets, respectively. Conclusion: The AOFM is effective in predicting voxel-level dose distribution for spine SBRT. Partially supported by NIH/NCI under grant #R21CA161389 and a master research grant by Varian Medical System.
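
    A hedged sketch of the prediction chain: one PCA over dose-map features stands in for the active optical flow model, another PCA over contour features stands in for the active shape model, and a linear regression links the two spaces so that a new case's contours can be mapped to a predicted dose distribution. All arrays are random placeholders rather than registered dose maps.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(5)
    n_plans = 15
    dose_features = rng.random((n_plans, 400))     # flattened registered dose maps + flow vectors
    contour_features = rng.random((n_plans, 60))   # PTV/OAR shape descriptors

    aofm = PCA(n_components=5).fit(dose_features)      # "AOFM" space (stand-in)
    asm = PCA(n_components=5).fit(contour_features)    # "ASM" space (stand-in)
    regressor = LinearRegression().fit(asm.transform(contour_features),
                                       aofm.transform(dose_features))

    new_contours = rng.random((1, 60))                 # unseen case
    predicted_dose = aofm.inverse_transform(
        regressor.predict(asm.transform(new_contours)))
    print("predicted voxel-level dose map shape:", predicted_dose.shape)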

  20. Handling a Small Dataset Problem in Prediction Model by employ Artificial Data Generation Approach: A Review

    NASA Astrophysics Data System (ADS)

    Lateh, Masitah Abdul; Kamilah Muda, Azah; Yusof, Zeratul Izzah Mohd; Azilah Muda, Noor; Sanusi Azmi, Mohd

    2017-09-01

The emerging era of big data over the past few years has led to large and complex data that demand faster and better decision making. However, small dataset problems still arise in certain areas, making analysis and decision making difficult. In order to build a prediction model, a large sample is required as a training set, and a small dataset is insufficient to produce an accurate prediction model. This paper reviews artificial data generation approaches as one solution to the small dataset problem.
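
    One simple artificial-data-generation idea in this family, sketched below under stated assumptions: virtual samples are created by perturbing real observations within the observed attribute ranges (a noise-based stand-in for approaches such as mega-trend diffusion), and the enlarged set is used to train the prediction model. The data and noise scale are illustrative.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(6)
    X_small = rng.random((15, 4))                        # only 15 real observations
    y_small = X_small @ np.array([1.5, -2.0, 0.7, 0.0]) + 0.1 * rng.normal(size=15)

    def generate_virtual_samples(X, y, n_new, noise=0.05):
        """Perturb randomly chosen real samples within the observed attribute span."""
        idx = rng.integers(0, len(X), size=n_new)
        span = X.max(axis=0) - X.min(axis=0)
        X_new = X[idx] + noise * span * rng.normal(size=(n_new, X.shape[1]))
        return np.vstack([X, X_new]), np.concatenate([y, y[idx]])

    X_aug, y_aug = generate_virtual_samples(X_small, y_small, n_new=85)
    model = Ridge().fit(X_aug, y_aug)                    # train on real plus virtual samples
    print("training set grew from", len(X_small), "to", len(X_aug), "samples")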

  1. Integrated dataset of impact of dissolved organic matter on particle behavior and phototoxicity of titanium dioxide nanoparticles

    EPA Pesticide Factsheets

This dataset is generated to both qualitatively and quantitatively examine the interactions between nano-TiO2 and natural organic matter (NOM). This integrated dataset assembles all data generated in this project through a series of experiments. This dataset is associated with the following publication: Li, S., H. Ma, L. Wallis, M. Etterson, B. Riley, D. Hoff, and S. Diamond. Impact of natural organic matter on particle behavior and phototoxicity of titanium dioxide nanoparticles. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 542: 324-333, (2016).

  2. Quantification of ultrasonic texture intra-heterogeneity via volumetric stochastic modeling for tissue characterization.

    PubMed

    Al-Kadi, Omar S; Chung, Daniel Y F; Carlisle, Robert C; Coussios, Constantin C; Noble, J Alison

    2015-04-01

Intensity variations in image texture can provide powerful quantitative information about physical properties of biological tissue. However, tissue patterns can vary according to the utilized imaging system and are intrinsically correlated to the scale of analysis. In the case of ultrasound, the Nakagami distribution is a general model of the ultrasonic backscattering envelope under various scattering conditions and densities where it can be employed for characterizing image texture, but the subtle intra-heterogeneities within a given mass are difficult to capture via this model as it works at a single spatial scale. This paper proposes a locally adaptive 3D multi-resolution Nakagami-based fractal feature descriptor that extends Nakagami-based texture analysis to accommodate subtle speckle spatial frequency tissue intensity variability in volumetric scans. Local textural fractal descriptors - which are invariant to affine intensity changes - are extracted from volumetric patches at different spatial resolutions from voxel lattice-based generated shape and scale Nakagami parameters. Using ultrasound radio-frequency datasets, we found that after applying an adaptive fractal decomposition label transfer approach on top of the generated Nakagami voxels, tissue characterization results were superior to the state of the art. Experimental results on real 3D ultrasonic pre-clinical and clinical datasets suggest that describing tumor intra-heterogeneity via this descriptor may facilitate improved prediction of therapy response and disease characterization. Copyright © 2014 The Authors. Published by Elsevier B.V. All rights reserved.
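
    A sketch of moment-based estimation of the Nakagami shape (m) and scale (Omega) parameters over local patches of an ultrasound envelope volume, the kind of voxel-lattice parameters the proposed descriptor builds on; the envelope data are simulated rather than clinical radio-frequency scans.

    import numpy as np

    rng = np.random.default_rng(7)
    envelope = rng.rayleigh(scale=1.0, size=(64, 64, 64))   # stand-in RF envelope volume

    def nakagami_parameters(patch):
        """Moment estimators: Omega = E[r^2], m = E[r^2]^2 / Var(r^2)."""
        r2 = patch.ravel() ** 2
        omega = r2.mean()
        m = omega ** 2 / r2.var()
        return m, omega

    patch_size = 16
    for z in range(0, 64, patch_size):
        m, omega = nakagami_parameters(envelope[z:z + patch_size, :patch_size, :patch_size])
        print(f"patch at z={z:2d}: m={m:.2f}, Omega={omega:.2f}")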

  3. Predicting the risk of suicide by analyzing the text of clinical notes.

    PubMed

    Poulin, Chris; Shiner, Brian; Thompson, Paul; Vepstas, Linas; Young-Xu, Yinong; Goertzel, Benjamin; Watts, Bradley; Flashman, Laura; McAllister, Thomas

    2014-01-01

    We developed linguistics-driven prediction models to estimate the risk of suicide. These models were generated from unstructured clinical notes taken from a national sample of U.S. Veterans Administration (VA) medical records. We created three matched cohorts: veterans who committed suicide, veterans who used mental health services and did not commit suicide, and veterans who did not use mental health services and did not commit suicide during the observation period (n = 70 in each group). From the clinical notes, we generated datasets of single keywords and multi-word phrases, and constructed prediction models using a machine-learning algorithm based on a genetic programming framework. The resulting inference accuracy was consistently 65% or more. Our data therefore suggests that computerized text analytics can be applied to unstructured medical records to estimate the risk of suicide. The resulting system could allow clinicians to potentially screen seemingly healthy patients at the primary care level, and to continuously evaluate the suicide risk among psychiatric patients.
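
    A much-simplified stand-in for the linguistics-driven models described above: single-keyword and two-word-phrase counts from a few invented notes feed an off-the-shelf logistic-regression classifier instead of the paper's genetic-programming framework.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    notes = [
        "patient reports hopelessness and prior attempt",
        "routine follow up, mood stable, no concerns",
        "expresses wish to die, poor sleep, isolating",
        "medication refill, denies distress",
    ]
    labels = [1, 0, 1, 0]   # 1 = high-risk cohort, 0 = comparison cohort (invented)

    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # single keywords and two-word phrases
        LogisticRegression(max_iter=1000),
    ).fit(notes, labels)

    print(model.predict(["notes mention hopelessness and poor sleep"]))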

  4. Predicting the Risk of Suicide by Analyzing the Text of Clinical Notes

    PubMed Central

    Thompson, Paul; Vepstas, Linas; Young-Xu, Yinong; Goertzel, Benjamin; Watts, Bradley; Flashman, Laura; McAllister, Thomas

    2014-01-01

    We developed linguistics-driven prediction models to estimate the risk of suicide. These models were generated from unstructured clinical notes taken from a national sample of U.S. Veterans Administration (VA) medical records. We created three matched cohorts: veterans who committed suicide, veterans who used mental health services and did not commit suicide, and veterans who did not use mental health services and did not commit suicide during the observation period (n = 70 in each group). From the clinical notes, we generated datasets of single keywords and multi-word phrases, and constructed prediction models using a machine-learning algorithm based on a genetic programming framework. The resulting inference accuracy was consistently 65% or more. Our data therefore suggests that computerized text analytics can be applied to unstructured medical records to estimate the risk of suicide. The resulting system could allow clinicians to potentially screen seemingly healthy patients at the primary care level, and to continuously evaluate the suicide risk among psychiatric patients. PMID:24489669

  5. Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium.

    PubMed

    Ellis, Matthew J; Gillette, Michael; Carr, Steven A; Paulovich, Amanda G; Smith, Richard D; Rodland, Karin K; Townsend, R Reid; Kinsinger, Christopher; Mesri, Mehdi; Rodriguez, Henry; Liebler, Daniel C

    2013-10-01

    The National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium is applying the latest generation of proteomic technologies to genomically annotated tumors from The Cancer Genome Atlas (TCGA) program, a joint initiative of the NCI and the National Human Genome Research Institute. By providing a fully integrated accounting of DNA, RNA, and protein abnormalities in individual tumors, these datasets will illuminate the complex relationship between genomic abnormalities and cancer phenotypes, thus producing biologic insights as well as a wave of novel candidate biomarkers and therapeutic targets amenable to verification using targeted mass spectrometry methods. ©2013 AACR.

  6. Big data, miniregistries: a rapid-turnaround solution to get quality improvement data into the hands of medical specialists.

    PubMed

    Herrinton, Lisa J; Liu, Liyan; Altschuler, Andrea; Dell, Richard; Rabrenovich, Violeta; Compton-Phillips, Amy L

    2015-01-01

The cost to build and to maintain traditional registries for many dire, complex, low-frequency conditions is prohibitive. The authors used accessible technology to develop a platform that would generate miniregistries (small, routinely updated datasets) for surveillance, to identify patients who were missing expected utilization and to influence clinicians to change practices to improve care. The platform, tested in 5 medical specialty departments, enabled the specialists to rapidly and effectively communicate clinical questions, knowledge of disease, clinical workflows, and improvement opportunities. Each miniregistry required 1 to 2 hours of collaboration by a specialist. Turnaround was 1 to 14 days.

  7. Exploring heterogeneity in clinical trials with latent class analysis

    PubMed Central

    Abarda, Abdallah; Contractor, Ateka A.; Wang, Juan; Dayton, C. Mitchell

    2018-01-01

    Case-mix is common in clinical trials and treatment effect can vary across different subgroups. Conventionally, a subgroup analysis is performed by dividing the overall study population by one or two grouping variables. It is usually impossible to explore complex high-order intersections among confounding variables. Latent class analysis (LCA) provides a framework to identify latent classes by observed manifest variables. Distal clinical outcomes and treatment effect can be different across these classes. This paper provides a step-by-step tutorial on how to perform LCA with R. A simulated dataset is generated to illustrate the process. In the example, the classify-analyze approach is employed to explore the differential treatment effects on distal outcomes across latent classes. PMID:29955579

  8. Evolving hard problems: Generating human genetics datasets with a complex etiology.

    PubMed

    Himmelstein, Daniel S; Greene, Casey S; Moore, Jason H

    2011-07-07

    A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Here we develop and evaluate a model free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight-hundred Pareto fronts; one for each independent run of our algorithm. In each run the predictiveness of single genetic variation and pairs of genetic variants have been minimized, while the predictiveness of third, fourth, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive four or five order interactions and minimized lower-level effects. This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.

  9. A curated collection of tissue microarray images and clinical outcome data of prostate cancer patients

    PubMed Central

    Zhong, Qing; Guo, Tiannan; Rechsteiner, Markus; Rüschoff, Jan H.; Rupp, Niels; Fankhauser, Christian; Saba, Karim; Mortezavi, Ashkan; Poyet, Cédric; Hermanns, Thomas; Zhu, Yi; Moch, Holger; Aebersold, Ruedi; Wild, Peter J.

    2017-01-01

    Microscopy image data of human cancers provide detailed phenotypes of spatially and morphologically intact tissues at single-cell resolution, thus complementing large-scale molecular analyses, e.g., next generation sequencing or proteomic profiling. Here we describe a high-resolution tissue microarray (TMA) image dataset from a cohort of 71 prostate tissue samples, which was hybridized with bright-field dual colour chromogenic and silver in situ hybridization probes for the tumour suppressor gene PTEN. These tissue samples were digitized and supplemented with expert annotations, clinical information, statistical models of PTEN genetic status, and computer source codes. For validation, we constructed an additional TMA dataset for 424 prostate tissues, hybridized with FISH probes for PTEN, and performed survival analysis on a subset of 339 radical prostatectomy specimens with overall, disease-specific and recurrence-free survival (maximum 167 months). For application, we further produced 6,036 image patches derived from two whole slides. Our curated collection of prostate cancer data sets provides reuse potential for both biomedical and computational studies. PMID:28291248

  10. Artificial neural network for suppression of banding artifacts in balanced steady-state free precession MRI.

    PubMed

    Kim, Ki Hwan; Park, Sung-Hong

    2017-04-01

    The balanced steady-state free precession (bSSFP) MR sequence is frequently used in clinics, but is sensitive to off-resonance effects, which can cause banding artifacts. Often multiple bSSFP datasets are acquired at different phase cycling (PC) angles and then combined in a special way for banding artifact suppression. Many strategies of combining the datasets have been suggested for banding artifact suppression, but there are still limitations in their performance, especially when the number of phase-cycled bSSFP datasets is small. The purpose of this study is to develop a learning-based model to combine the multiple phase-cycled bSSFP datasets for better banding artifact suppression. Multilayer perceptron (MLP) is a feedforward artificial neural network consisting of three layers of input, hidden, and output layers. MLP models were trained by input bSSFP datasets acquired from human brain and knee at 3T, which were separately performed for two and four PC angles. Banding-free bSSFP images were generated by maximum-intensity projection (MIP) of 8 or 12 phase-cycled datasets and were used as targets for training the output layer. The trained MLP models were applied to another brain and knee datasets acquired with different scan parameters and also to multiple phase-cycled bSSFP functional MRI datasets acquired on rat brain at 9.4T, in comparison with the conventional MIP method. Simulations were also performed to validate the MLP approach. Both the simulations and human experiments demonstrated that MLP suppressed banding artifacts significantly, superior to MIP in both banding artifact suppression and SNR efficiency. MLP demonstrated superior performance over MIP for the 9.4T fMRI data as well, which was not used for training the models, while visually preserving the fMRI maps very well. Artificial neural network is a promising technique for combining multiple phase-cycled bSSFP datasets for banding artifact suppression. Copyright © 2016 Elsevier Inc. All rights reserved.
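
    A sketch of the voxel-wise learning setup: a multilayer perceptron maps the intensities of two phase-cycled acquisitions to a banding-free target (the maximum-intensity projection over many phase cycles), using a crude simulated bSSFP banding profile rather than scanner data.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(8)
    n_voxels, n_full = 20000, 8
    phases = np.linspace(0, 2 * np.pi, n_full, endpoint=False)
    off_res = rng.uniform(0, 2 * np.pi, size=(n_voxels, 1))        # off-resonance per voxel
    # crude stand-in for the bSSFP banding profile plus noise
    profiles = np.abs(np.sin((phases + off_res) / 2)) + 0.02 * rng.normal(size=(n_voxels, n_full))

    X = profiles[:, ::4]                   # two phase-cycled inputs (0 and 180 degrees)
    y = profiles.max(axis=1)               # MIP of all eight cycles as the training target

    mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0).fit(X, y)
    print("voxel-wise R^2 against the 8-cycle MIP target:", round(mlp.score(X, y), 3))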

  11. Evaluating Variant Calling Tools for Non-Matched Next-Generation Sequencing Data

    NASA Astrophysics Data System (ADS)

    Sandmann, Sarah; de Graaf, Aniek O.; Karimi, Mohsen; van der Reijden, Bert A.; Hellström-Lindberg, Eva; Jansen, Joop H.; Dugas, Martin

    2017-02-01

    Valid variant calling results are crucial for the use of next-generation sequencing in clinical routine. However, there are numerous variant calling tools that usually differ in algorithms, filtering strategies, recommendations and thus, also in the output. We evaluated eight open-source tools regarding their ability to call single nucleotide variants and short indels with allelic frequencies as low as 1% in non-matched next-generation sequencing data: GATK HaplotypeCaller, Platypus, VarScan, LoFreq, FreeBayes, SNVer, SAMtools and VarDict. We analysed two real datasets from patients with myelodysplastic syndrome, covering 54 Illumina HiSeq samples and 111 Illumina NextSeq samples. Mutations were validated by re-sequencing on the same platform, re-sequencing on a different platform, and expert review. In addition, we considered two simulated datasets with varying coverage and error profiles, covering 50 samples each. In all cases an identical target region consisting of 19 genes (42,322 bp) was analysed. Altogether, no tool succeeded in calling all mutations. High sensitivity was always accompanied by low precision. The influence of varying coverage and background noise on variant calling was generally low. Taking everything into account, VarDict performed best. However, our results indicate a need to improve the reproducibility of results in the context of multithreading.
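    A minimal sketch of the kind of per-caller benchmarking implied above, computing sensitivity and precision of a call set against validated mutations; the variant tuples are illustrative placeholders.

```python
# Hedged sketch: sensitivity/precision bookkeeping for comparing a caller's
# output against a validated mutation list. Variant keys are (chrom, pos, ref, alt).
def benchmark(called, validated):
    called, validated = set(called), set(validated)
    tp = len(called & validated)                     # validated mutations recovered
    fp = len(called - validated)                     # extra calls
    fn = len(validated - called)                     # missed mutations
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision

# Illustrative example sets (not real study data).
validated = {("chr17", 7577539, "G", "A"), ("chr4", 55599321, "A", "T")}
caller_output = {("chr17", 7577539, "G", "A"), ("chr4", 55599321, "A", "T"),
                 ("chr2", 25457242, "G", "A")}
print(benchmark(caller_output, validated))           # high sensitivity, lower precision
```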

  12. A hybrid approach for efficient anomaly detection using metaheuristic methods

    PubMed Central

    Ghanem, Tamer F.; Elkilani, Wail S.; Abdul-kader, Hatem M.

    2014-01-01

    Network intrusion detection based on anomaly detection techniques has a significant role in protecting networks and systems against harmful activities. Different metaheuristic techniques have been used for anomaly detector generation. Yet, the reported literature has not studied the use of the multi-start metaheuristic method for detector generation. This paper proposes a hybrid approach for anomaly detection in large-scale datasets using detectors generated by a multi-start metaheuristic method and genetic algorithms. The proposed approach takes inspiration from negative selection-based detector generation. The approach is evaluated on the NSL-KDD dataset, a modified version of the widely used KDD CUP 99 dataset. The results show its effectiveness in generating a suitable number of detectors with an accuracy of 96.1% compared to competing machine learning algorithms. PMID:26199752
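    A minimal sketch of negative-selection-style detector generation in the spirit of the approach above; the matching radius, feature dimensionality and random candidate generator are illustrative assumptions, and the metaheuristic/genetic optimisation described in the paper is omitted.

```python
# Hedged sketch: keep random candidate detectors only if they do not match
# "self" (normal) samples, then use them to flag anomalous traffic.
import numpy as np

rng = np.random.default_rng(1)
self_set = rng.normal(0.5, 0.05, size=(200, 4))     # normal traffic features in [0, 1]
radius = 0.15                                        # detector matching radius (assumed)

detectors = []
while len(detectors) < 50:
    candidate = rng.random(4)
    # Reject candidates that cover any self sample.
    if np.linalg.norm(self_set - candidate, axis=1).min() > radius:
        detectors.append(candidate)
detectors = np.array(detectors)

def is_anomalous(sample):
    """A sample is flagged if any detector covers it."""
    return bool((np.linalg.norm(detectors - sample, axis=1) < radius).any())

print(is_anomalous(np.array([0.9, 0.1, 0.9, 0.1])))  # point far from normal traffic
print(is_anomalous(np.array([0.5, 0.5, 0.5, 0.5])))  # point typical of normal traffic
```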

  13. A hybrid approach for efficient anomaly detection using metaheuristic methods.

    PubMed

    Ghanem, Tamer F; Elkilani, Wail S; Abdul-Kader, Hatem M

    2015-07-01

    Network intrusion detection based on anomaly detection techniques has a significant role in protecting networks and systems against harmful activities. Different metaheuristic techniques have been used for anomaly detector generation. Yet, the reported literature has not studied the use of the multi-start metaheuristic method for detector generation. This paper proposes a hybrid approach for anomaly detection in large-scale datasets using detectors generated by a multi-start metaheuristic method and genetic algorithms. The proposed approach takes inspiration from negative selection-based detector generation. The approach is evaluated on the NSL-KDD dataset, a modified version of the widely used KDD CUP 99 dataset. The results show its effectiveness in generating a suitable number of detectors with an accuracy of 96.1% compared to competing machine learning algorithms.

  14. A machine learning heuristic to identify biologically relevant and minimal biomarker panels from omics data

    PubMed Central

    2015-01-01

    Background Investigations into novel biomarkers using omics techniques generate large amounts of data. Due to their size and numbers of attributes, these data are suitable for analysis with machine learning methods. A key component of typical machine learning pipelines for omics data is feature selection, which is used to reduce the raw high-dimensional data into a tractable number of features. Feature selection needs to balance the objective of using as few features as possible, while maintaining high predictive power. This balance is crucial when the goal of data analysis is the identification of highly accurate but small panels of biomarkers with potential clinical utility. In this paper we propose a heuristic for the selection of very small feature subsets, via an iterative feature elimination process that is guided by rule-based machine learning, called RGIFE (Rule-guided Iterative Feature Elimination). We use this heuristic to identify putative biomarkers of osteoarthritis (OA), articular cartilage degradation and synovial inflammation, using both proteomic and transcriptomic datasets. Results and discussion Our RGIFE heuristic increased the classification accuracies achieved on all datasets relative to using no feature selection, and performed well in a comparison with other feature selection methods. Using this method the datasets were reduced to a smaller number of genes or proteins, including those known to be relevant to OA, cartilage degradation and joint inflammation. The results have shown the RGIFE feature reduction method to be suitable for analysing both proteomic and transcriptomic data. Methods that generate large ‘omics’ datasets are increasingly being used in the area of rheumatology. Conclusions Feature reduction methods are advantageous for the analysis of omics data in the field of rheumatology, as the applications of such techniques are likely to result in improvements in diagnosis, treatment and drug discovery. PMID:25923811
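    A simplified sketch of iterative feature elimination in the spirit of RGIFE, using a random-forest importance ranking and cross-validated accuracy in place of the published rule-based heuristic; the data, tolerance and model choices are illustrative assumptions.

```python
# Hedged sketch: repeatedly drop the least important feature and keep the
# reduction only if cross-validated accuracy does not degrade beyond a tolerance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=120, n_features=50, n_informative=5, random_state=0)
features = list(range(X.shape[1]))
clf = RandomForestClassifier(n_estimators=100, random_state=0)

best_score = cross_val_score(clf, X[:, features], y, cv=5).mean()
while len(features) > 1:
    clf.fit(X[:, features], y)
    weakest = features[int(np.argmin(clf.feature_importances_))]
    trial = [f for f in features if f != weakest]
    score = cross_val_score(clf, X[:, trial], y, cv=5).mean()
    if score + 0.01 < best_score:          # stop if accuracy drops beyond tolerance
        break
    features, best_score = trial, max(best_score, score)

print(f"{len(features)} features retained, CV accuracy ~{best_score:.2f}")
```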

  15. Estimation of parameters of dose volume models and their confidence limits

    NASA Astrophysics Data System (ADS)

    van Luijk, P.; Delvigne, T. C.; Schilstra, C.; Schippers, J. M.

    2003-07-01

    Predictions of the normal-tissue complication probability (NTCP) for the ranking of treatment plans are based on fits of dose-volume models to clinical and/or experimental data. In the literature several different fit methods are used. In this work, frequently used methods and techniques for fitting NTCP models to dose-response data to establish dose-volume effects are discussed. The techniques are tested for their usability with dose-volume data and NTCP models. Different methods to estimate the confidence intervals of the model parameters are part of this study. From a critical-volume (CV) model with biologically realistic parameters, a primary dataset was generated, serving as the reference for this study and, by construction, describable by the NTCP model. The CV model was fitted to this dataset. From the resulting parameters and the CV model, 1000 secondary datasets were generated by Monte Carlo simulation. All secondary datasets were fitted to obtain 1000 parameter sets of the CV model. Thus the 'real' spread in fit results due to statistical spread in the data was obtained and compared with estimates of the confidence intervals obtained by different methods applied to the primary dataset. The confidence limits of the parameters of one dataset were estimated using three methods: the covariance matrix, the jackknife method, and direct inspection of the likelihood landscape. These results were compared with the spread of the parameters obtained from the secondary parameter sets. For the estimation of confidence intervals on NTCP predictions, three methods were tested. Firstly, propagation of errors using the covariance matrix was used. Secondly, we investigated the width of the bundle of curves produced by parameter sets lying within the one-standard-deviation region of the likelihood space. Thirdly, many parameter sets and their likelihoods were used to create a likelihood-weighted probability distribution of the NTCP. It is concluded that for the type of dose-response data used here, only a full likelihood analysis will produce reliable results. Often-used approximations, such as the covariance matrix, produce inconsistent confidence limits on both the parameter sets and the resulting NTCP values.
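    The Monte Carlo procedure described above can be sketched with a simplified probit dose-response model standing in for the critical-volume model: fit the primary dataset, regenerate secondary datasets from the fit, refit each, and read confidence limits from the spread of refitted parameters. All values below are illustrative.

```python
# Hedged sketch: parametric-bootstrap confidence limits for a simple probit
# (Lyman-type) dose-response fit, as a stand-in for the CV-model study design.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
dose = np.linspace(10, 80, 15)                     # Gy
n_subjects = np.full(dose.size, 20)
p_true = norm.cdf((dose - 55.0) / (0.25 * 55.0))   # assumed "true" D50 and m
events = rng.binomial(n_subjects, p_true)          # primary dataset

def neg_log_like(params, events, n):
    d50, m = params
    p = np.clip(norm.cdf((dose - d50) / (m * d50)), 1e-9, 1 - 1e-9)
    return -np.sum(events * np.log(p) + (n - events) * np.log(1 - p))

def fit(events, n):
    return minimize(neg_log_like, x0=[50.0, 0.3], args=(events, n),
                    method="Nelder-Mead").x

d50_hat, m_hat = fit(events, n_subjects)

# 1000 secondary datasets generated from the fitted model, then refitted.
p_hat = norm.cdf((dose - d50_hat) / (m_hat * d50_hat))
boot = np.array([fit(rng.binomial(n_subjects, p_hat), n_subjects) for _ in range(1000)])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print(f"D50 = {d50_hat:.1f} Gy (95% CI {lo[0]:.1f}-{hi[0]:.1f})")
```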

  16. Online Tools for Bioinformatics Analyses in Nutrition Sciences

    PubMed Central

    Malkaram, Sridhar A.; Hassan, Yousef I.; Zempleni, Janos

    2012-01-01

    Recent advances in “omics” research have resulted in the creation of large datasets that were generated by consortiums and centers, small datasets that were generated by individual investigators, and bioinformatics tools for mining these datasets. It is important for nutrition laboratories to take full advantage of the analysis tools to interrogate datasets for information relevant to genomics, epigenomics, transcriptomics, proteomics, and metabolomics. This review provides guidance regarding bioinformatics resources that are currently available in the public domain, with the intent to provide a starting point for investigators who want to take advantage of the opportunities provided by the bioinformatics field. PMID:22983844

  17. Comparative factor analysis models for an empirical study of EEG data, II: A data-guided resolution of the rotation indeterminacy.

    PubMed

    Rogers, L J; Douglas, R R

    1984-02-01

    In this paper (the second in a series), we consider a (generic) pair of datasets, which have been analyzed by the techniques of the previous paper. Thus, their "stable subspaces" have been established by comparative factor analysis. The pair of datasets must satisfy two confirmable conditions. The first is the "Inclusion Condition," which requires that the stable subspace of one of the datasets is nearly identical to a subspace of the other dataset's stable subspace. On the basis of that, we have assumed the pair to have similar generating signals, with stochastically independent generators. The second verifiable condition is that the (presumed same) generating signals have distinct ratios of variances for the two datasets. Under these conditions a small elaboration of some elementary linear algebra reduces the rotation problem to several eigenvalue-eigenvector problems. Finally, we emphasize that an analysis of each dataset by the method of Douglas and Rogers (1983) is an essential prerequisite for the useful application of the techniques in this paper. Nonempirical methods of estimating the number of factors simply will not suffice, as confirmed by simulations reported in the previous paper.
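    One way to read the linear-algebra step alluded to above is as a generalized eigendecomposition of the two covariance matrices, which separates independent generators whose variance ratios differ between the datasets; the sketch below illustrates that idea under assumed mixing and variances, and is not taken from the paper.

```python
# Hedged sketch: the same independent generators, mixed identically into two
# datasets but with distinct variance ratios, are recovered (up to scale) by a
# generalized eigendecomposition of the two covariance matrices.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))                        # common mixing (channels x generators)

def covariance(generator_vars, n=20000):
    S = rng.normal(size=(n, 3)) * np.sqrt(generator_vars)
    return np.cov(S @ A.T, rowvar=False)

C1 = covariance([4.0, 1.0, 0.25])                  # dataset 1 generator variances
C2 = covariance([1.0, 1.0, 1.0])                   # dataset 2: distinct variance ratios

ratios, directions = eigh(C1, C2)                  # solves C1 v = w C2 v
print(np.round(ratios, 2))                         # approx [0.25, 1.0, 4.0]: the variance ratios
```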

  18. 101 Labeled Brain Images and a Consistent Human Cortical Labeling Protocol

    PubMed Central

    Klein, Arno; Tourville, Jason

    2012-01-01

    We introduce the Mindboggle-101 dataset, the largest and most complete set of free, publicly accessible, manually labeled human brain images. To manually label the macroscopic anatomy in magnetic resonance images of 101 healthy participants, we created a new cortical labeling protocol that relies on robust anatomical landmarks and minimal manual edits after initialization with automated labels. The “Desikan–Killiany–Tourville” (DKT) protocol is intended to improve the ease, consistency, and accuracy of labeling human cortical areas. Given how difficult it is to label brains, the Mindboggle-101 dataset is intended to serve as a set of brain atlases for use in labeling other brains, as a normative dataset to establish morphometric variation in a healthy population for comparison against clinical populations, and to contribute to the development, training, testing, and evaluation of automated registration and labeling algorithms. To this end, we also introduce benchmarks for the evaluation of such algorithms by comparing our manual labels with labels automatically generated by probabilistic and multi-atlas registration-based approaches. All data, related software, and updated information are available on the http://mindboggle.info/data website. PMID:23227001

  19. Patient-dependent count-rate adaptive normalization for PET detector efficiency with delayed-window coincidence events

    NASA Astrophysics Data System (ADS)

    Niu, Xiaofeng; Ye, Hongwei; Xia, Ting; Asma, Evren; Winkler, Mark; Gagnon, Daniel; Wang, Wenli

    2015-07-01

    Quantitative PET imaging is widely used in clinical diagnosis in oncology and neuroimaging. Accurate normalization correction for the efficiency of each line-of-response is essential for accurate quantitative PET image reconstruction. In this paper, we propose a normalization calibration method that uses the delayed-window coincidence events from the scanned phantom or patient. The proposed method could dramatically reduce the ‘ring’ artifacts caused by mismatched system count-rates between the calibration and phantom/patient datasets. Moreover, a modified algorithm for mean detector efficiency estimation is proposed, which could generate crystal efficiency maps with more uniform variance. Both phantom and real patient datasets are used for evaluation. The results show that the proposed method leads to better uniformity in reconstructed images by removing ring artifacts, and to more uniform axial variance profiles, especially around the axial edge slices of the scanner. The proposed method also has the potential to simplify the normalization calibration procedure, since the calibration can be performed using the on-the-fly acquired delayed-window dataset.
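    A toy fan-sum-style sketch of estimating crystal efficiencies from a matrix of delayed-coincidence counts; the geometry-free count model, fixed-point iteration and rate constant are illustrative assumptions rather than the modified estimator proposed in the paper.

```python
# Hedged sketch: toy delayed-window counts proportional to the product of pair
# efficiencies, inverted with a simple fan-sum fixed-point iteration.
import numpy as np

rng = np.random.default_rng(0)
n_crystals = 64
true_eff = rng.uniform(0.8, 1.2, n_crystals)

expected = np.outer(true_eff, true_eff) * 100.0     # assumed pairwise rate constant
np.fill_diagonal(expected, 0.0)
counts = rng.poisson(expected)                      # delayed-window counts per crystal pair

fan_sum = counts.sum(axis=1).astype(float)
eff = np.ones(n_crystals)
for _ in range(20):                                 # fixed-point update of efficiencies
    eff = fan_sum / (eff.sum() - eff) / 100.0
eff /= eff.mean()                                   # efficiencies defined up to a global scale

print(round(float(np.corrcoef(eff, true_eff)[0, 1]), 3))   # close to 1 for this toy model
```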

  20. Airways, vasculature, and interstitial tissue: anatomically informed computational modeling of human lungs for virtual clinical trials

    NASA Astrophysics Data System (ADS)

    Abadi, Ehsan; Sturgeon, Gregory M.; Agasthya, Greeshma; Harrawood, Brian; Hoeschen, Christoph; Kapadia, Anuj; Segars, W. P.; Samei, Ehsan

    2017-03-01

    This study aimed to model virtual human lung phantoms including both non-parenchymal and parenchymal structures. Initial branches of the non-parenchymal structures (airways, arteries, and veins) were segmented from anatomical data in each lobe separately. A volume-filling branching algorithm was utilized to grow the higher generations of the airways and vessels to the level of terminal branches. The diameters of the airways and vessels were estimated using established relationships between flow rates and diameters. The parenchyma was modeled based on secondary pulmonary lobule units. Polyhedral shapes with variable sizes were modeled, and the borders were assigned to interlobular septa. A heterogeneous background was added inside these units using a non-parametric texture synthesis algorithm informed by a high-resolution CT lung specimen dataset. A voxel-based CT simulator was developed to create synthetic helical CT images of the phantom with different pitch values. Results showed progressive degradation in the depiction of lung details with increasing pitch. Overall, the enhanced lung models combined with the XCAT phantoms provide a powerful toolset to perform virtual clinical trials in the context of thoracic imaging. Such trials, not practical using clinical datasets or simplistic phantoms, can quantitatively evaluate and optimize advanced imaging techniques towards patient-based care.

  1. SPANG: a SPARQL client supporting generation and reuse of queries for distributed RDF databases.

    PubMed

    Chiba, Hirokazu; Uchiyama, Ikuo

    2017-02-08

    Toward improved interoperability of distributed biological databases, an increasing number of datasets have been published in the standardized Resource Description Framework (RDF). Although the powerful SPARQL Protocol and RDF Query Language (SPARQL) provides a basis for exploiting RDF databases, writing SPARQL code is burdensome for users, including bioinformaticians. Thus, an easy-to-use interface is necessary. We developed SPANG, a SPARQL client that has unique features for querying RDF datasets. SPANG dynamically generates typical SPARQL queries according to specified arguments. It can also call SPARQL template libraries constructed in a local system or published on the Web. Further, it enables combinatorial execution of multiple queries, each with a distinct target database. These features facilitate easy and effective access to RDF datasets and integrative analysis of distributed data. SPANG helps users to exploit RDF datasets through the generation and reuse of SPARQL queries via a simple interface. This client will enhance integrative exploitation of biological RDF datasets distributed across the Web. This software package is freely available at http://purl.org/net/spang.
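    A minimal sketch of the kind of SPARQL access that SPANG streamlines, issued here directly with the SPARQLWrapper library; the endpoint URL and query are illustrative placeholders rather than SPANG commands.

```python
# Hedged sketch: query a public RDF endpoint with a hand-written SPARQL query,
# the step that clients like SPANG aim to generate and reuse automatically.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql")   # example public endpoint
endpoint.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>
    SELECT ?protein WHERE { ?protein a up:Protein } LIMIT 5
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["protein"]["value"])
```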

  2. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research

    PubMed Central

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R

    2018-01-01

    Objectives This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. Methods A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. Results A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Conclusions Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. PMID:29084729

  3. An efficient and scalable graph modeling approach for capturing information at different levels in next generation sequencing reads

    PubMed Central

    2013-01-01

    Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies. PMID:24564333
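    A toy sketch of the underlying idea: build a read overlap graph and coarsen it by contracting strongly overlapping reads. The naive suffix-prefix overlap check and single-pass contraction are simplifications of the multilevel coarsening scheme described above.

```python
# Hedged sketch: overlap graph construction plus one round of coarsening by
# contracting the heaviest-overlap edges into cluster nodes.
import networkx as nx

reads = ["ACGTACGTGG", "ACGTGGTTCA", "GGTTCAAACT", "TTCAAACTGG"]

def overlap(a, b, min_len=5):
    """Longest suffix of a that is a prefix of b (at least min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

G = nx.DiGraph()
G.add_nodes_from(range(len(reads)))
for i, a in enumerate(reads):
    for j, b in enumerate(reads):
        if i != j and (k := overlap(a, b)):
            G.add_edge(i, j, weight=k)

coarse = G.copy()
for u, v, _ in sorted(G.edges(data="weight"), key=lambda e: -e[2]):
    if coarse.has_node(u) and coarse.has_node(v) and coarse.has_edge(u, v):
        coarse = nx.contracted_nodes(coarse, u, v, self_loops=False)

print(G.number_of_nodes(), "->", coarse.number_of_nodes(), "nodes after coarsening")
```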

  4. An algorithm for direct causal learning of influences on patient outcomes.

    PubMed

    Rathnam, Chandramouli; Lee, Sanghoon; Jiang, Xia

    2017-01-01

    This study aims at developing and introducing a new algorithm, called direct causal learner (DCL), for learning the direct causal influences of a single target. We applied it to both simulated and real clinical and genome wide association study (GWAS) datasets and compared its performance to classic causal learning algorithms. The DCL algorithm learns the causes of a single target from passive data using Bayesian-scoring, instead of using independence checks, and a novel deletion algorithm. We generate 14,400 simulated datasets and measure the number of datasets for which DCL correctly and partially predicts the direct causes. We then compare its performance with the constraint-based path consistency (PC) and conservative PC (CPC) algorithms, the Bayesian-score based fast greedy search (FGS) algorithm, and the partial ancestral graphs algorithm fast causal inference (FCI). In addition, we extend our comparison of all five algorithms to both a real GWAS dataset and real breast cancer datasets over various time-points in order to observe how effective they are at predicting the causal influences of Alzheimer's disease and breast cancer survival. DCL consistently outperforms FGS, PC, CPC, and FCI in discovering the parents of the target for the datasets simulated using a simple network. Overall, DCL predicts significantly more datasets correctly (McNemar's test significance: p<0.0001) than any of the other algorithms for these network types. For example, when assessing overall performance (simple and complex network results combined), DCL correctly predicts approximately 1400 more datasets than the top FGS method, 1600 more datasets than the top CPC method, 4500 more datasets than the top PC method, and 5600 more datasets than the top FCI method. Although FGS did correctly predict more datasets than DCL for the complex networks, and DCL correctly predicted only a few more datasets than CPC for these networks, there is no significant difference in performance between these three algorithms for this network type. However, when we use a more continuous measure of accuracy, we find that all the DCL methods are able to better partially predict more direct causes than FGS and CPC for the complex networks. In addition, DCL consistently had faster runtimes than the other algorithms. In the application to the real datasets, DCL identified rs6784615, located on the NISCH gene, and rs10824310, located on the PRKG1 gene, as direct causes of late onset Alzheimer's disease (LOAD) development. In addition, DCL identified ER category as a direct predictor of breast cancer mortality within 5 years, and HER2 status as a direct predictor of 10-year breast cancer mortality. These predictors have been identified in previous studies to have a direct causal relationship with their respective phenotypes, supporting the predictive power of DCL. When the other algorithms discovered predictors from the real datasets, these predictors were either also found by DCL or could not be supported by previous studies. Our results show that DCL outperforms FGS, PC, CPC, and FCI in almost every case, demonstrating its potential to advance causal learning. Furthermore, our DCL algorithm effectively identifies direct causes in the LOAD and Metabric GWAS datasets, which indicates its potential for clinical applications. Copyright © 2016 Elsevier B.V. All rights reserved.
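    A simplified stand-in for score-based direct-cause search on a single target: greedily add the candidate whose inclusion most improves a BIC-style score over synthetic binary data. This illustrates the scoring idea only and is not the DCL algorithm's actual scoring or deletion procedure.

```python
# Hedged sketch: greedy parent selection for a single target using a BIC-style
# score over discrete data; the data-generating model is a noisy AND of X0, X1.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 2, size=(n, 4))                    # candidate causes X0..X3
noise = rng.random(n) < 0.05
target = ((X[:, 0] & X[:, 1]) == 1) ^ noise            # depends only on X0 and X1

def bic(parents):
    """BIC-style score of P(target | parents) with per-configuration Bernoulli terms."""
    loglik, k = 0.0, 0
    for combo in product([0, 1], repeat=len(parents)):
        mask = (X[:, parents] == combo).all(axis=1) if parents else np.ones(n, dtype=bool)
        m = int(mask.sum())
        if m == 0:
            continue
        p1 = np.clip(target[mask].mean(), 1e-6, 1 - 1e-6)
        loglik += m * (p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))
        k += 1
    return loglik - 0.5 * k * np.log(n)

parents, score = [], bic([])
improved = True
while improved:
    improved = False
    for cand in [c for c in range(4) if c not in parents]:
        s = bic(parents + [cand])
        if s > score:
            parents, score, improved = parents + [cand], s, True

print("Learned direct causes of the target:", sorted(parents))    # expect [0, 1]
```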

  5. Consensus Analysis of Whole Transcriptome Profiles from Two Breast Cancer Patient Cohorts Reveals Long Non-Coding RNAs Associated with Intrinsic Subtype and the Tumour Microenvironment.

    PubMed

    Bradford, James R; Cox, Angela; Bernard, Philip; Camp, Nicola J

    2016-01-01

    Long non-coding RNAs (lncRNAs) are emerging as crucial regulators of cellular processes and diseases such as cancer; however, their functions remain poorly characterised. Several studies have demonstrated that lncRNAs are typically disease and tumour subtype specific, particularly in breast cancer where lncRNA expression alone is sufficient to discriminate samples based on hormone status and molecular intrinsic subtype. However, little attempt has been made to assess the reproducibility of lncRNA signatures across more than one dataset. In this work, we derive consensus lncRNA signatures indicative of breast cancer subtype based on two clinical RNA-Seq datasets: the Utah Breast Cancer Study and The Cancer Genome Atlas, through integration of differential expression and hypothesis-free clustering analyses. The most consistent signature is associated with breast cancers of the basal-like subtype, leading us to generate a putative set of six lncRNA basal-like breast cancer markers, at least two of which may have a role in cis-regulation of known poor prognosis markers. Through in silico functional characterization of individual signatures and integration of expression data from pre-clinical cancer models, we discover that discordance between signatures derived from different clinical cohorts can arise from the strong influence of non-cancerous cells in tumour samples. As a consequence, we identify nine lncRNAs putatively associated with breast cancer associated fibroblasts, or the immune response. Overall, our study establishes the confounding effects of tumour purity on lncRNA signature derivation, and generates several novel hypotheses on the role of lncRNAs in basal-like breast cancers and the tumour microenvironment.

  6. Viral pathogen discovery

    PubMed Central

    Chiu, Charles Y

    2015-01-01

    Viral pathogen discovery is of critical importance to clinical microbiology, infectious diseases, and public health. Genomic approaches for pathogen discovery, including consensus polymerase chain reaction (PCR), microarrays, and unbiased next-generation sequencing (NGS), have the capacity to comprehensively identify novel microbes present in clinical samples. Although numerous challenges remain to be addressed, including the bioinformatics analysis and interpretation of large datasets, these technologies have been successful in rapidly identifying emerging outbreak threats, screening vaccines and other biological products for microbial contamination, and discovering novel viruses associated with both acute and chronic illnesses. Downstream studies such as genome assembly, epidemiologic screening, and a culture system or animal model of infection are necessary to establish an association of a candidate pathogen with disease. PMID:23725672

  7. A probabilistic topic model for clinical risk stratification from electronic health records.

    PubMed

    Huang, Zhengxing; Dong, Wei; Duan, Huilong

    2015-12-01

    Risk stratification aims to provide physicians with an accurate assessment of a patient's clinical risk such that an individualized prevention or management strategy can be developed and delivered. Existing risk stratification techniques mainly focus on predicting the overall risk of an individual patient in a supervised manner, and, at the cohort level, often offer little insight beyond a flat score-based segmentation of the labeled clinical dataset. To this end, in this paper, we propose a new approach to risk stratification that explores a large volume of electronic health records (EHRs) in an unsupervised fashion. Along this line, this paper proposes a novel probabilistic topic modeling framework called the probabilistic risk stratification model (PRSM), based on Latent Dirichlet Allocation (LDA). The proposed PRSM recognizes a patient's clinical state as a probabilistic combination of latent sub-profiles, and generates sub-profile-specific risk tiers of patients from their EHRs in a fully unsupervised fashion. The achieved stratification results can be easily recognized as high-, medium- and low-risk, respectively. In addition, we present an extension of PRSM, called weakly supervised PRSM (WS-PRSM), which incorporates minimum prior information into the model in order to improve risk stratification accuracy and to make our models highly portable to risk stratification tasks for various diseases. We verify the effectiveness of the proposed approach on a clinical dataset containing 3463 coronary heart disease (CHD) patient instances. Both PRSM and WS-PRSM were compared with two established supervised risk stratification algorithms, logistic regression and support vector machines; area under the receiver operating characteristic curve (AUC) analysis showed the effectiveness of our models in risk stratification of CHD. Moreover, WS-PRSM achieved over a 2% performance gain relative to PRSM on the experimental dataset, demonstrating that incorporating risk-scoring knowledge as prior information can improve risk stratification performance. Experimental results reveal that our models achieve competitive performance in risk stratification in comparison with existing supervised approaches. In addition, the unsupervised nature of our models makes them highly portable to the risk stratification tasks of various diseases. Moreover, patient sub-profiles and sub-profile-specific risk tiers generated by our models are coherent and informative, and provide significant potential to be explored for further tasks, such as patient cohort analysis. We hypothesize that the proposed framework can readily meet the demand for risk stratification from a large volume of EHRs in an open-ended fashion. Copyright © 2015 Elsevier Inc. All rights reserved.
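    A minimal sketch of the unsupervised idea behind PRSM: factor an EHR code-count matrix with LDA into latent sub-profiles and summarize each sub-profile's empirical risk into tiers. The synthetic counts, scikit-learn's LatentDirichletAllocation and the tiering rule are illustrative assumptions, not the published model.

```python
# Hedged sketch: LDA sub-profiles from EHR code counts, then empirical risk tiers.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
n_patients, n_codes = 300, 40
counts = rng.poisson(1.0, size=(n_patients, n_codes))        # EHR code counts
outcome = rng.random(n_patients) < 0.2                         # observed adverse events

lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(counts)                              # patient sub-profile mixtures
assignment = theta.argmax(axis=1)

# Empirical risk per sub-profile, then rank into low/medium/high tiers.
risk = np.array([outcome[assignment == k].mean() if (assignment == k).any() else 0.0
                 for k in range(3)])
tiers = {int(k): tier for tier, k in zip(["low", "medium", "high"], np.argsort(risk))}
print(risk.round(2), tiers)
```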

  8. Parameter estimation of the Farquhar-von Caemmerer-Berry biochemical model from photosynthetic carbon dioxide response curves

    USDA-ARS?s Scientific Manuscript database

    The methods of Sharkey and Gu for estimating the eight parameters of the Farquhar-von Caemmerer-Berry (FvBC) model were examined using generated photosynthesis versus intercellular carbon dioxide concentration (A/Ci) datasets. The generated datasets included data with (A) high accuracy, (B) normal ...
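    Although the abstract above is truncated, A/Ci fitting of this kind can be sketched with nonlinear least squares on a simplified FvCB model; the kinetic constants, toy data and three-parameter reduction below are illustrative assumptions rather than the Sharkey or Gu procedures being evaluated.

```python
# Hedged sketch: fit Vcmax, J and Rd of a simplified FvCB model to an A/Ci curve.
import numpy as np
from scipy.optimize import curve_fit

KC, KO, O, GAMMA = 270.0, 165.0, 210.0, 37.0     # assumed kinetic constants

def fvcb(ci, vcmax, j, rd):
    ac = vcmax * (ci - GAMMA) / (ci + KC * (1 + O / KO))   # Rubisco-limited rate
    aj = j * (ci - GAMMA) / (4 * ci + 8 * GAMMA)           # RuBP-regeneration-limited rate
    return np.minimum(ac, aj) - rd

ci = np.array([50, 100, 150, 200, 300, 400, 600, 800, 1000.0])
a_obs = fvcb(ci, 60.0, 120.0, 1.5) + np.random.default_rng(0).normal(0, 0.3, ci.size)

params, cov = curve_fit(fvcb, ci, a_obs, p0=[50.0, 100.0, 1.0])
print("Vcmax, J, Rd =", np.round(params, 1), "+/-", np.round(np.sqrt(np.diag(cov)), 1))
```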

  9. Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach.

    PubMed

    Weng, Wei-Hung; Wagholikar, Kavishwar B; McCray, Alexa T; Szolovits, Peter; Chueh, Henry C

    2017-12-01

    The medical subdomain of a clinical note, such as cardiology or neurology, is useful content-derived metadata for developing machine learning downstream applications. To classify the medical subdomain of a note accurately, we constructed a machine learning-based natural language processing (NLP) pipeline and developed medical subdomain classifiers based on the content of the note. We constructed the pipeline using the clinical NLP system, clinical Text Analysis and Knowledge Extraction System (cTAKES), the Unified Medical Language System (UMLS) Metathesaurus, Semantic Network, and learning algorithms to extract features from two datasets - clinical notes from the Integrating Data for Analysis, Anonymization, and Sharing (iDASH) data repository (n = 431) and Massachusetts General Hospital (MGH) (n = 91,237) - and built medical subdomain classifiers with different combinations of data representation methods and supervised learning algorithms. We evaluated the performance of the classifiers and their portability across the two datasets. The medical subdomain classifier trained using a convolutional recurrent neural network with neural word embeddings yielded the best performance on the iDASH and MGH datasets, with areas under the receiver operating characteristic curve (AUC) of 0.975 and 0.991, and F1 scores of 0.845 and 0.870, respectively. Considering clinical interpretability, a linear support vector machine-trained medical subdomain classifier using hybrid bag-of-words and clinically relevant UMLS concepts as the feature representation, with term frequency-inverse document frequency (tf-idf) weighting, outperformed other shallow learning classifiers on the iDASH and MGH datasets, with AUCs of 0.957 and 0.964, and F1 scores of 0.932 and 0.934, respectively. When classifiers trained on one dataset were applied to the other, F1 scores of at least 0.7 were reached for half of the medical subdomains we studied. Our study shows that a supervised learning-based NLP approach is useful for developing medical subdomain classifiers. The deep learning algorithm with distributed word representation yields better performance, yet shallow learning algorithms with word and concept representations achieve comparable performance with better clinical interpretability. Portable classifiers may also be used across datasets from different institutions.
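    A minimal sketch of the interpretable shallow-learning pipeline favoured above: tf-idf bag-of-words features feeding a linear SVM. The toy notes and labels stand in for the iDASH/MGH corpora, and the UMLS concept features are omitted for brevity.

```python
# Hedged sketch: tf-idf features plus a linear SVM for subdomain classification.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

notes = [
    "Patient with atrial fibrillation started on anticoagulation.",
    "EEG shows focal epileptiform discharges; seizure precautions.",
    "Echocardiogram demonstrates reduced ejection fraction.",
    "MRI brain reveals acute ischemic stroke in the MCA territory.",
]
subdomains = ["cardiology", "neurology", "cardiology", "neurology"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(notes, subdomains)
print(clf.predict(["Troponin elevated, concern for myocardial infarction."]))
```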

  10. Federated Tensor Factorization for Computational Phenotyping

    PubMed Central

    Kim, Yejin; Sun, Jimeng; Yu, Hwanjo; Jiang, Xiaoqian

    2017-01-01

    Tensor factorization models offer an effective approach to convert massive electronic health records into meaningful clinical concepts (phenotypes) for data analysis. These models need a large number of diverse samples to avoid population bias. An open challenge is how to derive phenotypes jointly across multiple hospitals, in which direct patient-level data sharing is not possible (e.g., due to institutional policies). In this paper, we developed a novel solution to enable federated tensor factorization for computational phenotyping without sharing patient-level data. We developed secure data harmonization and federated computation procedures based on the alternating direction method of multipliers (ADMM). Using this method, the multiple hospitals iteratively update tensors and transfer secure summarized information to a central server, and the server aggregates the information to generate phenotypes. We demonstrated with real medical datasets that our method resembles the centralized training model (based on combined datasets) in terms of accuracy and phenotype discovery while respecting privacy. PMID:29071165

  11. P-MartCancer-Interactive Online Software to Enable Analysis of Shotgun Cancer Proteomic Datasets.

    PubMed

    Webb-Robertson, Bobbie-Jo M; Bramer, Lisa M; Jensen, Jeffrey L; Kobold, Markus A; Stratton, Kelly G; White, Amanda M; Rodland, Karin D

    2017-11-01

    P-MartCancer is an interactive web-based software environment that enables statistical analyses of peptide or protein data, quantitated from mass spectrometry-based global proteomics experiments, without requiring in-depth knowledge of statistical programming. P-MartCancer offers a series of statistical modules associated with quality assessment, peptide and protein statistics, protein quantification, and exploratory data analyses driven by the user via customized workflows and interactive visualization. Currently, P-MartCancer offers access and the capability to analyze multiple cancer proteomic datasets generated through the Clinical Proteomics Tumor Analysis Consortium at the peptide, gene, and protein levels. P-MartCancer is deployed as a web service (https://pmart.labworks.org/cptac.html), alternatively available via Docker Hub (https://hub.docker.com/r/pnnl/pmart-web/). Cancer Res; 77(21); e47-50. ©2017 AACR . ©2017 American Association for Cancer Research.

  12. Big Data Transforms Discovery-Utilization Therapeutics Continuum

    PubMed Central

    Waldman, SA; Terzic, A

    2015-01-01

    Enabling omic technologies adopt a holistic view to produce unprecedented insights into the molecular underpinnings of health and disease, in part, by generating massive high-dimensional biological data. Leveraging these systems-level insights as an engine driving the healthcare evolution is maximized through integration with medical, demographic, and environmental datasets from individuals to populations. Big data analytics has accordingly emerged to add value to the technical aspects of storage, transfer, and analysis required for merging vast arrays of omic-, clinical- and eco-datasets. In turn, this new field at the interface of biology, medicine, and information science is systematically transforming modern therapeutics across discovery, development, regulation, and utilization. “…a man's discourse was like to a rich Persian carpet, the beautiful figures and patterns of which can be shown only by spreading and extending it out; when it is contracted and folded up, they are obscured and lost” Themistocles quoted by Plutarch AD 46 – AD 120 PMID:26888297

  13. Using routine clinical and administrative data to produce a dataset of attendances at Emergency Departments following self-harm.

    PubMed

    Polling, C; Tulloch, A; Banerjee, S; Cross, S; Dutta, R; Wood, D M; Dargan, P I; Hotopf, M

    2015-07-16

    Self-harm is a significant public health concern in the UK. This is reflected in the recent addition to the English Public Health Outcomes Framework of rates of attendance at Emergency Departments (EDs) following self-harm. However, there is currently no source of data to measure this outcome. Routinely available data for inpatient admissions following self-harm miss the majority of cases presenting to services. We aimed to investigate (i) whether a dataset of ED presentations could be produced using a combination of routinely collected clinical and administrative data and (ii) whether this dataset could be validated against another one produced using methods similar to those used in previous studies. Using the Clinical Record Interactive Search system, the electronic health records (EHRs) used in four EDs were linked to Hospital Episode Statistics to create a dataset of attendances following self-harm. This dataset was compared with an audit dataset of ED attendances created by manual searching of ED records. The proportion of total cases detected by each dataset was compared. There were 1932 attendances detected by the EHR dataset and 1906 by the audit. The EHR and audit datasets detected 77% and 76% of all attendances respectively, and both detected 82% of individual patients. There were no differences in terms of age, sex, ethnicity or marital status between those detected and those missed using the EHR method. Both datasets revealed more than double the number of self-harm incidents that could be identified from inpatient admission records. It was possible to use routinely collected EHR data to create a dataset of attendances at EDs following self-harm. The dataset detected the same proportion of attendances and individuals as the audit dataset, proved more comprehensive than the use of inpatient admission records, and did not show a systematic bias in those cases it missed.

  14. CPTAC Releases Largest-Ever Ovarian Cancer Proteome Dataset from Previously Genome Characterized Tumors | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) scientists have just released a comprehensive dataset of the proteomic analysis of high grade serous ovarian tumor samples, previously genomically analyzed by The Cancer Genome Atlas (TCGA).  This is one of the largest public datasets covering the proteome, phosphoproteome and glycoproteome with complementary deep genomic sequencing data on the same tumor.

  15. Evaluation of outbreak detection performance using multi-stream syndromic surveillance for influenza-like illness in rural Hubei Province, China: a temporal simulation model based on healthcare-seeking behaviors.

    PubMed

    Fan, Yunzhou; Wang, Ying; Jiang, Hongbo; Yang, Wenwen; Yu, Miao; Yan, Weirong; Diwan, Vinod K; Xu, Biao; Dong, Hengjin; Palm, Lars; Nie, Shaofa

    2014-01-01

    Syndromic surveillance promotes the early detection of disease outbreaks. Although syndromic surveillance has increased in developing countries, its performance for outbreak detection, particularly in the case of multi-stream surveillance, has scarcely been evaluated in rural areas. This study introduces a temporal simulation model based on healthcare-seeking behaviors to evaluate the performance of multi-stream syndromic surveillance for influenza-like illness. Data were obtained in six towns of rural Hubei Province, China, from April 2012 to June 2013. A Susceptible-Exposed-Infectious-Recovered model generated 27 scenarios of simulated influenza A (H1N1) outbreaks, which were converted into corresponding simulated syndromic datasets through the healthcare-seeking behavior model. We then superimposed the converted syndromic datasets onto the obtained baselines to create the testing datasets. Outbreak detection performance of single-stream surveillance of clinic visits, over-the-counter drug purchase frequency, and school absenteeism, and of multi-stream surveillance of their combinations, was evaluated using receiver operating characteristic curves and activity monitoring operation curves. In the six towns examined, clinic visit surveillance and school absenteeism surveillance exhibited better outbreak detection performance than over-the-counter drug purchase frequency surveillance; the performance of multi-stream surveillance was preferable to single-stream surveillance, particularly at low specificity (Sp <90%). The temporal simulation model based on healthcare-seeking behaviors offers an accessible method for evaluating the performance of multi-stream surveillance.
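    A minimal sketch of the simulation idea: integrate standard SEIR equations and convert incident cases into a syndromic count stream through an assumed care-seeking probability. Parameter values are illustrative, not those calibrated for rural Hubei.

```python
# Hedged sketch: SEIR outbreak curve converted into a simulated clinic-visit stream.
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, n):
    s, e, i, r = y
    return [-beta * s * i / n, beta * s * i / n - sigma * e,
            sigma * e - gamma * i, gamma * i]

n, beta, sigma, gamma = 10000, 0.6, 1 / 2.0, 1 / 3.0    # assumed parameters
t = np.arange(0, 120)
s, e, i, r = odeint(seir, [n - 1, 0, 1, 0], t, args=(beta, sigma, gamma, n)).T

incidence = sigma * e                                    # new symptomatic cases per day
p_clinic_visit = 0.4                                     # assumed care-seeking probability
clinic_counts = np.random.default_rng(0).poisson(incidence * p_clinic_visit)
print(clinic_counts[:14])                                # simulated syndromic stream, first 2 weeks
```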

  16. BubbleGUM: automatic extraction of phenotype molecular signatures and comprehensive visualization of multiple Gene Set Enrichment Analyses.

    PubMed

    Spinelli, Lionel; Carpentier, Sabrina; Montañana Sanchis, Frédéric; Dalod, Marc; Vu Manh, Thien-Phong

    2015-10-19

    Recent advances in the analysis of high-throughput expression data have led to the development of tools that scale up their focus from the single-gene to the gene-set level. For example, the popular Gene Set Enrichment Analysis (GSEA) algorithm can detect moderate but coordinated expression changes of groups of presumably related genes between pairs of experimental conditions. This considerably improves extraction of information from high-throughput gene expression data. However, although many gene sets covering a large panel of biological fields are available in public databases, the ability to generate home-made gene sets relevant to one's biological question is crucial but remains a substantial challenge to most biologists lacking statistical or bioinformatic expertise. This is all the more the case when attempting to define a gene set specific to one condition compared with many others. Thus, there is a crucial need for easy-to-use software for the generation of relevant home-made gene sets from complex datasets, their use in GSEA, and the correction of the results when applied to multiple comparisons of many experimental conditions. We developed BubbleGUM (GSEA Unlimited Map), a tool that automatically extracts molecular signatures from transcriptomic data and performs exhaustive GSEA with multiple testing correction. One original feature of BubbleGUM notably resides in its capacity to integrate and compare numerous GSEA results into an easy-to-grasp graphical representation. We applied our method to generate transcriptomic fingerprints for murine cell types and to assess their enrichments in human cell types. This analysis allowed us to confirm homologies between mouse and human immunocytes. BubbleGUM is open-source software that automatically generates molecular signatures from complex expression datasets and directly assesses their enrichment by GSEA on independent datasets. Enrichments are displayed in a graphical output that helps interpret the results. This innovative methodology has recently been used to answer important questions in functional genomics, such as the degree of similarity between microarray datasets from different laboratories or with different experimental models or clinical cohorts. BubbleGUM is executable through an intuitive interface so that both bioinformaticians and biologists can use it. It is available at http://www.ciml.univ-mrs.fr/applications/BubbleGUM/index.html .

  17. Transcriptomics hit the target: Monitoring of ligand-activated and stress response pathways for chemical testing.

    PubMed

    Limonciel, Alice; Moenks, Konrad; Stanzel, Sven; Truisi, Germaine L; Parmentier, Céline; Aschauer, Lydia; Wilmes, Anja; Richert, Lysiane; Hewitt, Philip; Mueller, Stefan O; Lukas, Arno; Kopp-Schneider, Annette; Leonard, Martin O; Jennings, Paul

    2015-12-25

    High-content omic methods provide deep insight into cellular events occurring upon chemical exposure of a cell population or tissue. However, this improvement in analytic precision is not yet matched by a thorough understanding of molecular mechanisms that would allow an optimal interpretation of these biological changes. For transcriptomics (TCX), one type of molecular effect that can already be assessed is the modulation of the transcriptional activity of a transcription factor (TF). As more ChIP-seq datasets reporting genes specifically bound by a TF become publicly available for mining, the generation of target gene lists for TFs of toxicological relevance becomes possible, based on actual protein-DNA interaction and modulation of gene expression. In this study, we generated target gene signatures for Nrf2, ATF4, XBP1, p53, HIF1a, AhR and PPAR gamma and tracked TF modulation in a large collection of in vitro TCX datasets from renal and hepatic cell models exposed to clinical nephro- and hepato-toxins. The result is a global monitoring of TF modulation with great promise as a mechanistically based tool for chemical hazard identification. Copyright © 2014 Elsevier Ltd. All rights reserved.

  18. Antibiotic Resistome: Improving Detection and Quantification Accuracy for Comparative Metagenomics.

    PubMed

    Elbehery, Ali H A; Aziz, Ramy K; Siam, Rania

    2016-04-01

    The unprecedented rise of life-threatening antibiotic resistance (AR), combined with the unparalleled advances in DNA sequencing of genomes and metagenomes, has pushed the need for in silico detection of the resistance potential of clinical and environmental metagenomic samples through the quantification of AR genes (i.e., genes conferring antibiotic resistance). Therefore, determining an optimal methodology to quantitatively and accurately assess AR genes in a given environment is pivotal. Here, we optimized and improved existing AR detection methodologies from metagenomic datasets to properly consider AR-generating mutations in antibiotic target genes. Through comparative metagenomic analysis of previously published AR gene abundance in three publicly available metagenomes, we illustrate how mutation-generated resistance genes are either falsely assigned or neglected, which alters the detection and quantitation of the antibiotic resistome. In addition, we inspected factors influencing the outcome of AR gene quantification using metagenome simulation experiments, and identified that genome size, AR gene length, total number of metagenomics reads and selected sequencing platforms had pronounced effects on the level of detected AR. In conclusion, our proposed improvements in the current methodologies for accurate AR detection and resistome assessment show reliable results when tested on real and simulated metagenomic datasets.

  19. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research.

    PubMed

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R; Beresford, Michael W

    2018-02-01

    This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  20. Simulation of Smart Home Activity Datasets

    PubMed Central

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-01-01

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation. PMID:26087371

  1. Simulation of Smart Home Activity Datasets.

    PubMed

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  2. Object detection approach using generative sparse, hierarchical networks with top-down and lateral connections for combining texture/color detection and shape/contour detection

    DOEpatents

    Paiton, Dylan M.; Kenyon, Garrett T.; Brumby, Steven P.; Schultz, Peter F.; George, John S.

    2015-07-28

    An approach to detecting objects in an image dataset may combine texture/color detection, shape/contour detection, and/or motion detection using sparse, generative, hierarchical models with lateral and top-down connections. A first independent representation of objects in an image dataset may be produced using a color/texture detection algorithm. A second independent representation of objects in the image dataset may be produced using a shape/contour detection algorithm. A third independent representation of objects in the image dataset may be produced using a motion detection algorithm. The first, second, and third independent representations may then be combined into a single coherent output using a combinatorial algorithm.

  3. A comparison of three clustering methods for finding subgroups in MRI, SMS or clinical data: SPSS TwoStep Cluster analysis, Latent Gold and SNOB.

    PubMed

    Kent, Peter; Jensen, Rikke K; Kongsted, Alice

    2014-10-02

    There are various methodological approaches to identifying clinically important subgroups and one method is to identify clusters of characteristics that differentiate people in cross-sectional and/or longitudinal data using Cluster Analysis (CA) or Latent Class Analysis (LCA). There is a scarcity of head-to-head comparisons that can inform the choice of which clustering method might be suitable for particular clinical datasets and research questions. Therefore, the aim of this study was to perform a head-to-head comparison of three commonly available methods (SPSS TwoStep CA, Latent Gold LCA and SNOB LCA). The performance of these three methods was compared: (i) quantitatively using the number of subgroups detected, the classification probability of individuals into subgroups, the reproducibility of results, and (ii) qualitatively using subjective judgments about each program's ease of use and interpretability of the presentation of results. We analysed five real datasets of varying complexity in a secondary analysis of data from other research projects. Three datasets contained only MRI findings (n = 2,060 to 20,810 vertebral disc levels), one dataset contained only pain intensity data collected for 52 weeks by text (SMS) messaging (n = 1,121 people), and the last dataset contained a range of clinical variables measured in low back pain patients (n = 543 people). Four artificial datasets (n = 1,000 each) containing subgroups of varying complexity were also analysed testing the ability of these clustering methods to detect subgroups and correctly classify individuals when subgroup membership was known. The results from the real clinical datasets indicated that the number of subgroups detected varied, the certainty of classifying individuals into those subgroups varied, the findings had perfect reproducibility, some programs were easier to use and the interpretability of the presentation of their findings also varied. The results from the artificial datasets indicated that all three clustering methods showed a near-perfect ability to detect known subgroups and correctly classify individuals into those subgroups. Our subjective judgement was that Latent Gold offered the best balance of sensitivity to subgroups, ease of use and presentation of results with these datasets but we recognise that different clustering methods may suit other types of data and clinical research questions.
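    Mirroring the artificial-dataset test above with open-source stand-ins (k-means for cluster analysis and a Gaussian mixture as a latent-class-style model, rather than SPSS TwoStep, Latent Gold or SNOB), one can generate data with known subgroups and check how well each method recovers them.

```python
# Hedged sketch: recover known subgroups with two clustering methods and score
# agreement with the true labels using the adjusted Rand index.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, truth = make_blobs(n_samples=1000, centers=4, cluster_std=1.5, random_state=0)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
gm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)

print("k-means ARI:", round(adjusted_rand_score(truth, km_labels), 2))
print("mixture ARI:", round(adjusted_rand_score(truth, gm_labels), 2))
```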

  4. Extracting the normal lung dose-response curve from clinical DVH data: a possible role for low dose hyper-radiosensitivity, increased radioresistance

    NASA Astrophysics Data System (ADS)

    Gordon, J. J.; Snyder, K.; Zhong, H.; Barton, K.; Sun, Z.; Chetty, I. J.; Matuszak, M.; Ten Haken, R. K.

    2015-09-01

    In conventionally fractionated radiation therapy for lung cancer, the dependence of radiation pneumonitis (RP) on the normal lung dose-volume histogram (DVH) is not well understood. Complication models alternatively make RP a function of a summary statistic, such as mean lung dose (MLD). This work searches over damage profiles, which quantify sub-volume damage as a function of dose. Profiles that achieve the best RP predictive accuracy on a clinical dataset are hypothesized to approximate DVH dependence. Step function damage rate profiles R(D) are generated, having discrete steps at several dose points. A range of profiles is sampled by varying the step heights and dose point locations. Normal lung damage is the integral of R(D) with the cumulative DVH. Each profile is used in conjunction with a damage cutoff to predict grade 2 plus (G2+) RP for DVHs from a University of Michigan clinical trial dataset consisting of 89 CFRT patients, of which 17 were diagnosed with G2+ RP. Optimal profiles achieve a modest increase in predictive accuracy: erroneous RP predictions are reduced from 11 (using MLD) to 8. A novel result is that optimal profiles have a similar distinctive shape: enhanced damage contribution from low doses (<20 Gy), a flat contribution from doses in the range ~20-40 Gy, then a further enhanced contribution from doses above 40 Gy. These features resemble the hyper-radiosensitivity / increased radioresistance (HRS/IRR) observed in some cell survival curves, which can be modeled using Joiner's induced repair model. A novel search strategy is employed, which has the potential to estimate RP dependence on the normal lung DVH. When applied to a clinical dataset, identified profiles share a characteristic shape, which resembles HRS/IRR. This suggests that normal lung may have enhanced sensitivity to low doses, and that this sensitivity can affect RP risk.
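
    A minimal numerical sketch of evaluating one step-function damage profile against a differential DVH is shown below; the dose grid, step heights and damage cutoff are hypothetical, not the fitted values from the study.

    ```python
    import numpy as np

    def lung_damage(dose_bins, dvh_fraction, step_doses, step_rates):
        """Integrate a step-function damage rate R(D) against a differential DVH.

        dose_bins    : bin-centre doses in Gy
        dvh_fraction : fraction of lung volume in each dose bin (sums to 1)
        step_doses   : dose points where R(D) steps (Gy), in ascending order
        step_rates   : damage rate applied at and above each step dose
        """
        rates = np.zeros_like(dose_bins, dtype=float)
        for d, r in zip(step_doses, step_rates):
            rates[dose_bins >= d] = r
        return float(np.sum(rates * dvh_fraction))

    # Hypothetical profile with enhanced low-dose and high-dose contributions (HRS/IRR-like shape).
    dose_bins = np.arange(2.5, 70, 5.0)  # 5 Gy bins
    dvh_fraction = np.array([0.35, 0.20, 0.12, 0.08, 0.06, 0.05, 0.04,
                             0.03, 0.03, 0.02, 0.01, 0.005, 0.005, 0.0])
    step_doses = [0.0, 20.0, 40.0]       # enhanced <20 Gy, flat 20-40 Gy, enhanced >40 Gy
    step_rates = [0.6, 0.3, 0.9]

    damage = lung_damage(dose_bins, dvh_fraction, step_doses, step_rates)
    predict_g2_rp = damage > 0.35        # hypothetical damage cutoff
    print(f"damage = {damage:.3f}, predict G2+ RP: {predict_g2_rp}")
    ```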

  5. Conducting high-value secondary dataset analysis: an introductory guide and resources.

    PubMed

    Smith, Alexander K; Ayanian, John Z; Covinsky, Kenneth E; Landon, Bruce E; McCarthy, Ellen P; Wee, Christina C; Steinman, Michael A

    2011-08-01

    Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium ( www.sgim.org/go/datasets ). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.

  6. Segmentation of Unstructured Datasets

    NASA Technical Reports Server (NTRS)

    Bhat, Smitha

    1996-01-01

    Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.

  7. ASSISTments Dataset from Multiple Randomized Controlled Experiments

    ERIC Educational Resources Information Center

    Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil

    2016-01-01

    In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…

  8. The Clinical Next-Generation Sequencing Database: A Tool for the Unified Management of Clinical Information and Genetic Variants to Accelerate Variant Pathogenicity Classification.

    PubMed

    Nishio, Shin-Ya; Usami, Shin-Ichi

    2017-03-01

    Recent advances in next-generation sequencing (NGS) have given rise to new challenges due to the difficulties in variant pathogenicity interpretation and large dataset management, including many kinds of public population databases as well as public or commercial disease-specific databases. Here, we report a new database development tool, named the "Clinical NGS Database," for improving clinical NGS workflow through the unified management of variant information and clinical information. This database software offers a two-feature approach to variant pathogenicity classification. The first of these approaches is a phenotype similarity-based approach. This database allows the easy comparison of the detailed phenotype of each patient with the average phenotype of the same gene mutation at the variant or gene level. It is also possible to browse patients with the same gene mutation quickly. The other approach is a statistical approach to variant pathogenicity classification based on the use of the odds ratio for comparisons between the case and the control for each inheritance mode (families with apparently autosomal dominant inheritance vs. control, and families with apparently autosomal recessive inheritance vs. control). A number of case studies are also presented to illustrate the utility of this database. © 2016 The Authors. Human Mutation published by Wiley Periodicals, Inc.
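
    The statistical approach is described only at a high level. As a rough sketch of the per-variant comparison involved, the snippet below computes an odds ratio with Fisher's exact test for hypothetical carrier counts in dominant-inheritance families versus controls; the counts are invented.

    ```python
    from scipy.stats import fisher_exact

    def variant_odds_ratio(case_carriers, case_total, control_carriers, control_total):
        """Odds ratio and p-value for carrying a variant in cases vs. controls."""
        table = [
            [case_carriers, case_total - case_carriers],
            [control_carriers, control_total - control_carriers],
        ]
        odds_ratio, p_value = fisher_exact(table)
        return odds_ratio, p_value

    # Hypothetical counts: families with apparently autosomal dominant inheritance vs. controls.
    or_ad, p_ad = variant_odds_ratio(case_carriers=9, case_total=120,
                                     control_carriers=3, control_total=1200)
    print(f"autosomal dominant vs. control: OR = {or_ad:.1f}, p = {p_ad:.2e}")
    ```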

  9. Developing a provisional, international minimal dataset for Juvenile Dermatomyositis: for use in clinical practice to inform research.

    PubMed

    McCann, Liza J; Arnold, Katie; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Beard, Laura; Beresford, Michael W; Wedderburn, Lucy R

    2014-01-01

    Juvenile dermatomyositis (JDM) is a rare but severe autoimmune inflammatory myositis of childhood. International collaboration is essential in order to undertake clinical trials, understand the disease and improve long-term outcome. The aim of this study was to propose from existing collaborative initiatives a preliminary minimal dataset for JDM. This will form the basis of the future development of an international consensus-approved minimum core dataset to be used both in clinical care and inform research, allowing integration of data between centres. A working group of internationally-representative JDM experts was formed to develop a provisional minimal dataset. Clinical and laboratory variables contained within current national and international collaborative databases of patients with idiopathic inflammatory myopathies were scrutinised. Judgements were informed by published literature and a more detailed analysis of the Juvenile Dermatomyositis Cohort Biomarker Study and Repository, UK and Ireland. A provisional minimal JDM dataset has been produced, with an associated glossary of definitions. The provisional minimal dataset will request information at time of patient diagnosis and during on-going prospective follow up. At time of patient diagnosis, information will be requested on patient demographics, diagnostic criteria and treatments given prior to diagnosis. During on-going prospective follow-up, variables will include the presence of active muscle or skin disease, major organ involvement or constitutional symptoms, investigations, treatment, physician global assessments and patient reported outcome measures. An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients to provide a greater understanding of this disease. This preliminary dataset can now be developed into a consensus-approved minimum core dataset and tested in a wider setting with the aim of achieving international agreement.

  10. Developing a provisional, international Minimal Dataset for Juvenile Dermatomyositis: for use in clinical practice to inform research

    PubMed Central

    2014-01-01

    Background Juvenile dermatomyositis (JDM) is a rare but severe autoimmune inflammatory myositis of childhood. International collaboration is essential in order to undertake clinical trials, understand the disease and improve long-term outcome. The aim of this study was to propose from existing collaborative initiatives a preliminary minimal dataset for JDM. This will form the basis of the future development of an international consensus-approved minimum core dataset to be used both in clinical care and inform research, allowing integration of data between centres. Methods A working group of internationally-representative JDM experts was formed to develop a provisional minimal dataset. Clinical and laboratory variables contained within current national and international collaborative databases of patients with idiopathic inflammatory myopathies were scrutinised. Judgements were informed by published literature and a more detailed analysis of the Juvenile Dermatomyositis Cohort Biomarker Study and Repository, UK and Ireland. Results A provisional minimal JDM dataset has been produced, with an associated glossary of definitions. The provisional minimal dataset will request information at time of patient diagnosis and during on-going prospective follow up. At time of patient diagnosis, information will be requested on patient demographics, diagnostic criteria and treatments given prior to diagnosis. During on-going prospective follow-up, variables will include the presence of active muscle or skin disease, major organ involvement or constitutional symptoms, investigations, treatment, physician global assessments and patient reported outcome measures. Conclusions An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients to provide a greater understanding of this disease. This preliminary dataset can now be developed into a consensus-approved minimum core dataset and tested in a wider setting with the aim of achieving international agreement. PMID:25075205

  11. An interactive web application for the dissemination of human systems immunology data.

    PubMed

    Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien

    2015-06-19

    Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State of the art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets were loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples are displayed dynamically; if desired the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page ( https://gxb.benaroyaresearch.org/dm3/landing.gsp )]. The source code is also available openly [Gene Expression Browser Source Code ( https://github.com/BenaroyaResearch/gxbrowser )]. We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.

  12. CIFAR10-DVS: An Event-Stream Dataset for Object Classification

    PubMed Central

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task that requires recording with neuromorphic cameras, and only a limited number of event-stream datasets are currently available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named “CIFAR10-DVS.” The conversion to an event-stream dataset was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversion approaches that move the camera over static images, moving the images themselves is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification. PMID:28611582

  13. CIFAR10-DVS: An Event-Stream Dataset for Object Classification.

    PubMed

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task that requires recording with neuromorphic cameras, and only a limited number of event-stream datasets are currently available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named "CIFAR10-DVS." The conversion to an event-stream dataset was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversion approaches that move the camera over static images, moving the images themselves is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification.

  14. Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study

    PubMed Central

    2015-01-01

    Objective This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are “invisible” or not deposited in a known repository. Methods We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article. Results About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects. Conclusion In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a “dataset,” determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets. PMID:26207759

  15. Realistic computer network simulation for network intrusion detection dataset generation

    NASA Astrophysics Data System (ADS)

    Payer, Garrett

    2015-05-01

    The KDD-99 Cup dataset is dead. While it can continue to be used as a toy example, the age of this dataset makes it all but useless for intrusion detection research and data mining. Many of the attacks used within the dataset are obsolete and do not reflect the features important for intrusion detection in today's networks. Creating a new dataset encompassing a large cross section of the attacks found on the Internet today could be useful, but would eventually fall to the same problem as the KDD-99 Cup; its usefulness would diminish after a period of time. To continue research into intrusion detection, the generation of new datasets needs to be as dynamic and as quick as the attacker. Simply examining existing network traffic and using domain experts such as intrusion analysts to label traffic is inefficient, expensive, and not scalable. The only viable methodology is simulation using technologies including virtualization, attack-toolsets such as Metasploit and Armitage, and sophisticated emulation of threat and user behavior. Simulating actual user behavior and network intrusion events dynamically not only allows researchers to vary scenarios quickly, but enables online testing of intrusion detection mechanisms by interacting with data as it is generated. As new threat behaviors are identified, they can be added to the simulation to make quicker determinations as to the effectiveness of existing and ongoing network intrusion technology, methodology and models.

  16. The next generation of melanocyte data: Genetic, epigenetic, and transcriptional resource datasets and analysis tools.

    PubMed

    Loftus, Stacie K

    2018-05-01

    The number of melanocyte- and melanoma-derived next generation sequence genome-scale datasets have rapidly expanded over the past several years. This resource guide provides a summary of publicly available sources of melanocyte cell derived whole genome, exome, mRNA and miRNA transcriptome, chromatin accessibility and epigenetic datasets. Also highlighted are bioinformatic resources and tools for visualization and data queries which allow researchers a genome-scale view of the melanocyte. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.

  17. The Function Biomedical Informatics Research Network Data Repository

    PubMed Central

    Keator, David B.; van Erp, Theo G.M.; Turner, Jessica A.; Glover, Gary H.; Mueller, Bryon A.; Liu, Thomas T.; Voyvodic, James T.; Rasmussen, Jerod; Calhoun, Vince D.; Lee, Hyo Jong; Toga, Arthur W.; McEwen, Sarah; Ford, Judith M.; Mathalon, Daniel H.; Diaz, Michele; O’Leary, Daniel S.; Bockholt, H. Jeremy; Gadde, Syam; Preda, Adrian; Wible, Cynthia G.; Stern, Hal S.; Belger, Aysenil; McCarthy, Gregory; Ozyurt, Burak; Potkin, Steven G.

    2015-01-01

    The Function Biomedical Informatics Research Network (FBIRN) developed methods and tools for conducting multi-scanner functional magnetic resonance imaging (fMRI) studies. Method and tool development were based on two major goals: 1) to assess the major sources of variation in fMRI studies conducted across scanners, including instrumentation, acquisition protocols, challenge tasks, and analysis methods, and 2) to provide a distributed network infrastructure and an associated federated database to host and query large, multi-site, fMRI and clinical datasets. In the process of achieving these goals the FBIRN test bed generated several multi-scanner brain imaging data sets to be shared with the wider scientific community via the BIRN Data Repository (BDR). The FBIRN Phase 1 dataset consists of a traveling subject study of 5 healthy subjects, each scanned on 10 different 1.5 to 4 Tesla scanners. The FBIRN Phase 2 and Phase 3 datasets consist of subjects with schizophrenia or schizoaffective disorder along with healthy comparison subjects scanned at multiple sites. In this paper, we provide concise descriptions of FBIRN’s multi-scanner brain imaging data sets and details about the BIRN Data Repository instance of the Human Imaging Database (HID) used to publicly share the data. PMID:26364863

  18. Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems

    NASA Astrophysics Data System (ADS)

    Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille

    2016-04-01

    Accurate analysis of meteorological and pyranometric data over the long term is the basis of decision-making for banks and investors regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Year (TMY) datasets. The method most used for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978), considering a specific weighted combination of different meteorological variables, notably global, diffuse horizontal and direct normal irradiances, air temperature, wind speed and relative humidity. In 2012, a new approach was proposed in the framework of the European project FP7 ENDORSE. It introduced the concept of a "driver", defined by the user as an explicit function of the relevant pyranometric and meteorological variables, to improve the representativeness of the TMY datasets with respect to the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time series of high-quality meteorological and pyranometric ground measurements, three types of TMY datasets were generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable, and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not suitable for CPV systems. For these systems, the TMY datasets obtained using dedicated drivers (DNI only, or the more precise one) are more representative, even when derived from a limited long-term meteorological dataset.
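
    As a toy illustration of the "driver" idea with DNI as the only representative variable, the sketch below picks, for each calendar month, the historical year whose DNI distribution is closest to the long-term distribution using a Finkelstein-Schafer-style statistic. The data are simulated, and the actual ENDORSE drivers and the Sandia weighting scheme are considerably more elaborate.

    ```python
    import numpy as np

    def fs_statistic(candidate, long_term):
        """Finkelstein-Schafer style statistic: mean absolute difference between the
        empirical CDF of one candidate month and the long-term CDF."""
        grid = np.sort(long_term)
        cdf_candidate = np.searchsorted(np.sort(candidate), grid, side="right") / len(candidate)
        cdf_long_term = np.arange(1, len(grid) + 1) / len(grid)
        return np.mean(np.abs(cdf_candidate - cdf_long_term))

    def select_typical_months(dni_by_year_month):
        """dni_by_year_month: dict {(year, month): array of daily DNI sums}.
        Returns {month: year}, choosing per month the year minimising the FS statistic."""
        months = sorted({m for (_, m) in dni_by_year_month})
        selection = {}
        for m in months:
            long_term = np.concatenate([v for (y, mm), v in dni_by_year_month.items() if mm == m])
            years = [y for (y, mm) in dni_by_year_month if mm == m]
            selection[m] = min(years, key=lambda y: fs_statistic(dni_by_year_month[(y, m)], long_term))
        return selection

    # Hypothetical 3-year record of daily DNI (kWh/m2) for January only.
    rng = np.random.default_rng(1)
    data = {(y, 1): rng.normal(4.5 + 0.2 * i, 1.0, 31).clip(0) for i, y in enumerate([2001, 2002, 2003])}
    print(select_typical_months(data))  # typical year chosen for January
    ```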

  19. Accuracy of five intraoral scanners compared to indirect digitalization.

    PubMed

    Güth, Jan-Frederik; Runkel, Cornelius; Beuer, Florian; Stimmelmayr, Michael; Edelhoff, Daniel; Keul, Christine

    2017-06-01

    Direct and indirect digitalization offer two options for computer-aided design (CAD)/computer-aided manufacturing (CAM)-generated restorations. The aim of this study was to evaluate the accuracy of different intraoral scanners and compare them to the process of indirect digitalization. A titanium testing model was directly digitized 12 times with each intraoral scanner: (1) CS 3500 (CS), (2) Zfx Intrascan (ZFX), (3) CEREC AC Bluecam (BLU), (4) CEREC AC Omnicam (OC) and (5) True Definition (TD). As a control, 12 polyether impressions were taken and the corresponding plaster casts were digitized indirectly with the D-810 laboratory scanner (CON). The accuracy (trueness/precision) of the datasets was evaluated with analysis software (Geomagic Qualify 12.1) using a "best fit alignment" of the datasets with a highly accurate reference dataset of the testing model, obtained from industrial computed tomography. Direct digitalization using the TD showed the significantly highest overall "trueness", followed by CS. Both performed better than CON. BLU, ZFX and OC showed higher differences from the reference dataset than CON. Regarding the overall "precision", the CS 3500 intraoral scanner and the True Definition showed the best performance. CON, BLU and OC resulted in significantly higher precision than ZFX did. Within the limitations of this in vitro study, the accuracy of the acquired datasets was dependent on the scanning system. Direct digitalization was not superior to indirect digitalization for all tested systems. Regarding accuracy, all tested intraoral scanning technologies seem able to reproduce a single quadrant within clinically acceptable accuracy. However, differences were detected between the tested systems.
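
    The best-fit alignment in the study is performed with Geomagic on dense surface data. The sketch below illustrates the same idea on point clouds with known correspondences (a simplification), using a rigid Kabsch alignment followed by mean point-to-point deviation as a trueness-style measure.

    ```python
    import numpy as np

    def best_fit_align(scan, reference):
        """Rigidly align `scan` (N x 3) to `reference` (N x 3) with the Kabsch algorithm.
        Assumes point correspondences are already known (a simplification)."""
        scan_c = scan - scan.mean(axis=0)
        ref_c = reference - reference.mean(axis=0)
        u, _, vt = np.linalg.svd(scan_c.T @ ref_c)
        d = np.sign(np.linalg.det(vt.T @ u.T))
        rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
        return scan_c @ rotation.T + reference.mean(axis=0)

    def trueness(scan, reference):
        """Mean point-to-point deviation after best-fit alignment (a simplified
        stand-in for the surface-deviation analysis in the study)."""
        aligned = best_fit_align(scan, reference)
        return float(np.mean(np.linalg.norm(aligned - reference, axis=1)))

    # Toy example: reference dataset and a noisy, rotated, translated "scan" of it.
    rng = np.random.default_rng(3)
    reference = rng.random((500, 3)) * 10.0
    angle = np.deg2rad(5.0)
    rot = np.array([[np.cos(angle), -np.sin(angle), 0], [np.sin(angle), np.cos(angle), 0], [0, 0, 1]])
    scan = reference @ rot.T + rng.normal(0, 0.02, reference.shape) + np.array([1.0, -2.0, 0.5])
    print(f"trueness ~ {trueness(scan, reference):.3f} (same units as the model, e.g. mm)")
    ```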

  20. Gene-Expression Signature Predicts Postoperative Recurrence in Stage I Non-Small Cell Lung Cancer Patients

    PubMed Central

    Lu, Yan; Wang, Liang; Liu, Pengyuan; Yang, Ping; You, Ming

    2012-01-01

    About 30% of stage I non-small cell lung cancer (NSCLC) patients undergoing resection will recur. Robust prognostic markers are required to better manage therapy options. The purpose of this study was to develop and validate a novel gene-expression signature that can predict tumor recurrence in stage I NSCLC patients. Cox proportional hazards regression analysis was performed to identify recurrence-related genes, and a partial Cox regression model was used to generate a gene signature of recurrence in the training dataset of 142 stage I lung adenocarcinomas without adjunctive therapy from the Director's Challenge Consortium. Four independent validation datasets, including GSE5843, GSE8894, and two other datasets provided by Mayo Clinic and Washington University, were used to assess prediction accuracy by calculating the correlation between the risk score estimated from gene expression and the actual recurrence-free survival time, and the AUC of time-dependent ROC analysis. Pathway-based survival analyses were also performed. 104 probesets correlated with recurrence in the training dataset. They are enriched in cell adhesion, apoptosis and regulation of cell proliferation. A 51-gene expression signature was identified to distinguish patients likely to develop tumor recurrence (Dxy = −0.83, P<1e-16), and this signature was validated in four independent datasets with AUC >85%. Multiple pathways, including leukocyte transendothelial migration and cell adhesion, were highly correlated with recurrence-free survival. The gene signature is highly predictive of recurrence in stage I NSCLC patients, which has important prognostic and therapeutic implications for the future management of these patients.
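
    The signature construction itself (a partial Cox regression over microarray probesets) cannot be reproduced from the abstract, but the general pattern of screening genes by Cox regression and summing weighted expression into a risk score can be sketched with the lifelines package. All data, column names and thresholds below are simulated or arbitrary.

    ```python
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(7)
    n_patients, n_genes = 142, 200          # sizes echo the training set; data are simulated
    expr = pd.DataFrame(rng.normal(size=(n_patients, n_genes)),
                        columns=[f"gene_{i}" for i in range(n_genes)])
    time = rng.exponential(36, n_patients)   # recurrence-free survival time (months, simulated)
    event = rng.integers(0, 2, n_patients)   # 1 = recurrence observed

    # Univariate Cox screen: keep genes whose expression is associated with recurrence.
    selected, weights = [], {}
    for g in expr.columns:
        df = pd.DataFrame({"T": time, "E": event, g: expr[g]})
        cph = CoxPHFitter().fit(df, duration_col="T", event_col="E")
        if cph.summary.loc[g, "p"] < 0.01:   # screening threshold (arbitrary here)
            selected.append(g)
            weights[g] = cph.summary.loc[g, "coef"]

    # Risk score = weighted sum of expression over the selected genes.
    risk_score = sum(weights[g] * expr[g] for g in selected) if selected else pd.Series(0.0, index=expr.index)
    high_risk = risk_score > np.median(risk_score)
    print(f"{len(selected)} genes selected; {int(high_risk.sum())} patients flagged high risk")
    ```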

  1. Robust Statistical Fusion of Image Labels

    PubMed Central

    Landman, Bennett A.; Asman, Andrew J.; Scoggins, Andrew G.; Bogovic, John A.; Xing, Fangxu; Prince, Jerry L.

    2011-01-01

    Image labeling and parcellation (i.e. assigning structure to a collection of voxels) are critical tasks for the assessment of volumetric and morphometric features in medical imaging data. The process of image labeling is inherently error prone as images are corrupted by noise and artifacts. Even expert interpretations are subject to subjectivity and the precision of the individual raters. Hence, all labels must be considered imperfect with some degree of inherent variability. One may seek multiple independent assessments to both reduce this variability and quantify the degree of uncertainty. Existing techniques have exploited maximum a posteriori statistics to combine data from multiple raters and simultaneously estimate rater reliabilities. Although quite successful, wide-scale application has been hampered by unstable estimation with practical datasets, for example, with label sets with small or thin objects to be labeled or with partial or limited datasets. As well, these approaches have required each rater to generate a complete dataset, which is often impossible given both human foibles and the typical turnover rate of raters in a research or clinical environment. Herein, we propose a robust approach to improve estimation performance with small anatomical structures, allow for missing data, account for repeated label sets, and utilize training/catch trial data. With this approach, numerous raters can label small, overlapping portions of a large dataset, and rater heterogeneity can be robustly controlled while simultaneously estimating a single, reliable label set and characterizing uncertainty. The proposed approach enables many individuals to collaborate in the construction of large datasets for labeling tasks (e.g., human parallel processing) and reduces the otherwise detrimental impact of rater unavailability. PMID:22010145

  2. Object detection approach using generative sparse, hierarchical networks with top-down and lateral connections for combining texture/color detection and shape/contour detection

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Paiton, Dylan M.; Kenyon, Garrett T.; Brumby, Steven P.

    An approach to detecting objects in an image dataset may combine texture/color detection, shape/contour detection, and/or motion detection using sparse, generative, hierarchical models with lateral and top-down connections. A first independent representation of objects in an image dataset may be produced using a color/texture detection algorithm. A second independent representation of objects in the image dataset may be produced using a shape/contour detection algorithm. A third independent representation of objects in the image dataset may be produced using a motion detection algorithm. The first, second, and third independent representations may then be combined into a single coherent output using a combinatorial algorithm.

  3. Ub-ISAP: a streamlined UNIX pipeline for mining unique viral vector integration sites from next generation sequencing data.

    PubMed

    Kamboj, Atul; Hallwirth, Claus V; Alexander, Ian E; McCowage, Geoffrey B; Kramer, Belinda

    2017-06-17

    The analysis of viral vector genomic integration sites is an important component in assessing the safety and efficiency of patient treatment using gene therapy. Alongside this clinical application, integration site identification is a key step in the genetic mapping of viral elements in mutagenesis screens that aim to elucidate gene function. We have developed a UNIX-based vector integration site analysis pipeline (Ub-ISAP) that utilises a UNIX-based workflow for automated integration site identification and annotation of both single and paired-end sequencing reads. Reads that contain viral sequences of interest are selected and aligned to the host genome, and unique integration sites are then classified as transcription start site-proximal, intragenic or intergenic. Ub-ISAP provides a reliable and efficient pipeline to generate large datasets for assessing the safety and efficiency of integrating vectors in clinical settings, with broader applications in cancer research. Ub-ISAP is available as an open source software package at https://sourceforge.net/projects/ub-isap/ .
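
    The final classification step (TSS-proximal vs. intragenic vs. intergenic) is conceptually simple. The sketch below uses a hypothetical gene annotation table and window size and is not the pipeline's actual code.

    ```python
    def classify_integration_site(chrom, pos, genes, tss_window=5000):
        """Classify an integration site relative to a gene annotation.

        genes: list of dicts with keys 'chrom', 'start', 'end', 'strand', 'name';
               the TSS is 'start' for '+' genes and 'end' for '-' genes.
        """
        for gene in genes:
            if gene["chrom"] != chrom:
                continue
            tss = gene["start"] if gene["strand"] == "+" else gene["end"]
            if abs(pos - tss) <= tss_window:
                return f"TSS-proximal ({gene['name']})"
            if gene["start"] <= pos <= gene["end"]:
                return f"intragenic ({gene['name']})"
        return "intergenic"

    # Hypothetical annotation and integration sites.
    genes = [
        {"chrom": "chr1", "start": 100_000, "end": 150_000, "strand": "+", "name": "GENE_A"},
        {"chrom": "chr1", "start": 300_000, "end": 320_000, "strand": "-", "name": "GENE_B"},
    ]
    for site in [("chr1", 98_500), ("chr1", 140_000), ("chr1", 500_000)]:
        print(site, "->", classify_integration_site(*site, genes))
    ```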

  4. Big Data and the Future of Radiology Informatics.

    PubMed

    Kansagra, Akash P; Yu, John-Paul J; Chatterjee, Arindam R; Lenchik, Leon; Chow, Daniel S; Prater, Adam B; Yeh, Jean; Doshi, Ankur M; Hawkins, C Matthew; Heilbrun, Marta E; Smith, Stacy E; Oselkin, Martin; Gupta, Pushpender; Ali, Sayed

    2016-01-01

    Rapid growth in the amount of data that is electronically recorded as part of routine clinical operations has generated great interest in the use of Big Data methodologies to address clinical and research questions. These methods can efficiently analyze and deliver insights from high-volume, high-variety, and high-growth rate datasets generated across the continuum of care, thereby forgoing the time, cost, and effort of more focused and controlled hypothesis-driven research. By virtue of an existing robust information technology infrastructure and years of archived digital data, radiology departments are particularly well positioned to take advantage of emerging Big Data techniques. In this review, we describe four areas in which Big Data is poised to have an immediate impact on radiology practice, research, and operations. In addition, we provide an overview of the Big Data adoption cycle and describe how academic radiology departments can promote Big Data development. Copyright © 2016 The Association of University Radiologists. Published by Elsevier Inc. All rights reserved.

  5. Validation of Short-Term Noise Assessment Procedures: FY16 Summary of Procedures, Progress, and Preliminary Results

    DTIC Science & Technology

    Validation project. This report describes the procedure used to generate the noise models' output dataset, and then compares that dataset to the ... benchmark, the Engineer Research and Development Center's Long-Range Sound Propagation dataset. It was found that the models consistently underpredict the ...

  6. Systematic chemical-genetic and chemical-chemical interaction datasets for prediction of compound synergism

    PubMed Central

    Wildenhain, Jan; Spitzer, Michaela; Dolma, Sonam; Jarvik, Nick; White, Rachel; Roy, Marcia; Griffiths, Emma; Bellows, David S.; Wright, Gerard D.; Tyers, Mike

    2016-01-01

    The network structure of biological systems suggests that effective therapeutic intervention may require combinations of agents that act synergistically. However, a dearth of systematic chemical combination datasets has limited the development of predictive algorithms for chemical synergism. Here, we report two large datasets of linked chemical-genetic and chemical-chemical interactions in the budding yeast Saccharomyces cerevisiae. We screened 5,518 unique compounds against 242 diverse yeast gene deletion strains to generate an extended chemical-genetic matrix (CGM) of 492,126 chemical-gene interaction measurements. This CGM dataset contained 1,434 genotype-specific inhibitors, termed cryptagens. We selected 128 structurally diverse cryptagens and tested all pairwise combinations to generate a benchmark dataset of 8,128 pairwise chemical-chemical interaction tests for synergy prediction, termed the cryptagen matrix (CM). An accompanying database resource called ChemGRID was developed to enable analysis, visualisation and downloads of all data. The CGM and CM datasets will facilitate the benchmarking of computational approaches for synergy prediction, as well as chemical structure-activity relationship models for anti-fungal drug discovery. PMID:27874849

  7. Exploration of the association rules mining technique for the signal detection of adverse drug events in spontaneous reporting systems.

    PubMed

    Wang, Chao; Guo, Xiao-Jing; Xu, Jin-Fang; Wu, Cheng; Sun, Ya-Lin; Ye, Xiao-Fei; Qian, Wei; Ma, Xiu-Qiang; Du, Wen-Min; He, Jia

    2012-01-01

    The detection of signals of adverse drug events (ADEs) has increased because of the use of data mining algorithms in spontaneous reporting systems (SRSs). However, different data mining algorithms have different traits and conditions for application. The objective of our study was to explore the application of association rule (AR) mining in ADE signal detection and to compare its performance with that of other algorithms. Monte Carlo simulation was applied to generate drug-ADE reports randomly according to the characteristics of SRS datasets. One thousand simulated datasets were mined by AR and other algorithms. On average, 108,337 reports were generated by the Monte Carlo simulation. Based on the predefined criterion that 10% of the drug-ADE combinations were true signals, with RR equal to 10, 4.9, 1.5, and 1.2, AR detected, on average, 284 suspected associations with a minimum support of 3 and a minimum lift of 1.2. The area under the receiver operating characteristic (ROC) curve of the AR was 0.788, which was equivalent to that shown for other algorithms. Additionally, AR was applied to reports submitted to the Shanghai SRS in 2009. Five hundred seventy combinations were detected using AR from 24,297 SRS reports, and they were compared with recognized ADEs identified by clinical experts and various other sources. AR appears to be an effective method for ADE signal detection, both in simulated and real SRS datasets. The limitations of this method exposed in our study, i.e., non-uniform threshold settings and redundant rules, require further research.
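
    The quoted thresholds (minimum support of 3 and minimum lift of 1.2) can be illustrated by counting drug-ADE co-occurrences in a toy report table, as below; the reports are invented, and a real SRS analysis would use a dedicated mining implementation.

    ```python
    from collections import Counter

    # Each spontaneous report is a (drug, adverse event) pair; data here are invented.
    reports = [
        ("drug_A", "rash"), ("drug_A", "rash"), ("drug_A", "rash"),
        ("drug_A", "nausea"), ("drug_B", "rash"), ("drug_B", "headache"),
        ("drug_B", "headache"), ("drug_C", "nausea"), ("drug_C", "rash"),
        ("drug_C", "headache"),
    ]

    n = len(reports)
    pair_counts = Counter(reports)
    drug_counts = Counter(drug for drug, _ in reports)
    ade_counts = Counter(ade for _, ade in reports)

    min_support, min_lift = 3, 1.2   # thresholds quoted in the study
    for (drug, ade), support in pair_counts.items():
        lift = (support / n) / ((drug_counts[drug] / n) * (ade_counts[ade] / n))
        if support >= min_support and lift >= min_lift:
            print(f"signal: {drug} -> {ade} (support={support}, lift={lift:.2f})")
    ```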

  8. The ICR96 exon CNV validation series: a resource for orthogonal assessment of exon CNV calling in NGS data.

    PubMed

    Mahamdallie, Shazia; Ruark, Elise; Yost, Shawn; Ramsay, Emma; Uddin, Imran; Wylie, Harriett; Elliott, Anna; Strydom, Ann; Renwick, Anthony; Seal, Sheila; Rahman, Nazneen

    2017-01-01

    Detection of deletions and duplications of whole exons (exon CNVs) is a key requirement of genetic testing. Accurate detection of this variant type has proved very challenging in targeted next-generation sequencing (NGS) data, particularly if only a single exon is involved. Many different NGS exon CNV calling methods have been developed over the last five years. Such methods are usually evaluated using simulated and/or in-house data due to a lack of publicly-available datasets with orthogonally generated results. This hinders tool comparisons, transparency and reproducibility. To provide a community resource for assessment of exon CNV calling methods in targeted NGS data, we here present the ICR96 exon CNV validation series. The dataset includes high-quality sequencing data from a targeted NGS assay (the TruSight Cancer Panel) together with Multiplex Ligation-dependent Probe Amplification (MLPA) results for 96 independent samples. 66 samples contain at least one validated exon CNV and 30 samples have validated negative results for exon CNVs in 26 genes. The dataset includes 46 exon CNVs in BRCA1 , BRCA2 , TP53 , MLH1 , MSH2 , MSH6 , PMS2 , EPCAM or PTEN , giving excellent representation of the cancer predisposition genes most frequently tested in clinical practice. Moreover, the validated exon CNVs include 25 single exon CNVs, the most difficult type of exon CNV to detect. The FASTQ files for the ICR96 exon CNV validation series can be accessed through the European-Genome phenome Archive (EGA) under the accession number EGAS00001002428.

  9. Predicting Response to Histone Deacetylase Inhibitors Using High-Throughput Genomics.

    PubMed

    Geeleher, Paul; Loboda, Andrey; Lenkala, Divya; Wang, Fan; LaCroix, Bonnie; Karovic, Sanja; Wang, Jacqueline; Nebozhyn, Michael; Chisamore, Michael; Hardwick, James; Maitland, Michael L; Huang, R Stephanie

    2015-11-01

    Many disparate biomarkers have been proposed as predictors of response to histone deacetylase inhibitors (HDI); however, all have failed when applied clinically. Rather than this being entirely an issue of reproducibility, response to the HDI vorinostat may be determined by the additive effect of multiple molecular factors, many of which have previously been demonstrated. We conducted a large-scale gene expression analysis using the Cancer Genome Project for discovery and generated another large independent cancer cell line dataset across different cancers for validation. We compared different approaches in terms of how accurately vorinostat response can be predicted on an independent out-of-batch set of samples and applied the polygenic marker prediction principles in a clinical trial. Using machine learning, the small effects that aggregate, resulting in sensitivity or resistance, can be recovered from gene expression data in a large panel of cancer cell lines. This approach can predict vorinostat response accurately, whereas single gene or pathway markers cannot. Our analyses recapitulated and contextualized many previous findings and suggest an important role for processes such as chromatin remodeling, autophagy, and apoptosis. As a proof of concept, we also discovered a novel causative role for CHD4, a helicase involved in the histone deacetylase complex that is associated with poor clinical outcome. As a clinical validation, we demonstrated that a common dose-limiting toxicity of vorinostat, thrombocytopenia, can be predicted (r = 0.55, P = .004) several days before it is detected clinically. Our work suggests a paradigm shift from single-gene/pathway evaluation to simultaneously evaluating multiple independent high-throughput gene expression datasets, which can be easily extended to other investigational compounds where similar issues are hampering clinical adoption. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
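
    The abstract does not name the specific learner, so the sketch below uses ridge regression from scikit-learn as a stand-in for the polygenic, many-small-effects prediction idea; the expression matrix and response values are simulated, not the authors' data or model.

    ```python
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from scipy.stats import pearsonr

    rng = np.random.default_rng(11)
    n_lines, n_genes = 600, 2000
    expression = rng.normal(size=(n_lines, n_genes))
    true_weights = rng.normal(0, 0.05, n_genes)                          # many small additive effects
    response = expression @ true_weights + rng.normal(0, 1.0, n_lines)   # e.g. a drug-sensitivity measure

    X_train, X_test, y_train, y_test = train_test_split(expression, response, test_size=0.3, random_state=0)
    model = Ridge(alpha=100.0).fit(X_train, y_train)
    r, p = pearsonr(model.predict(X_test), y_test)
    print(f"out-of-sample correlation: r = {r:.2f} (p = {p:.1e})")
    ```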

  10. A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources.

    PubMed

    Moon, Sungrim; Pakhomov, Serguei; Liu, Nathan; Ryan, James O; Melton, Genevieve B

    2014-01-01

    To create a sense inventory of abbreviations and acronyms from clinical texts. The most frequently occurring abbreviations and acronyms from 352,267 dictated clinical notes were used to create a clinical sense inventory. Senses of each abbreviation and acronym were manually annotated from 500 random instances and lexically matched with long forms within the Unified Medical Language System (UMLS V.2011AB), Another Database of Abbreviations in Medline (ADAM), and Stedman's Dictionary, Medical Abbreviations, Acronyms & Symbols, 4th edition (Stedman's). Redundant long forms were merged after they were lexically normalized using Lexical Variant Generation (LVG). The clinical sense inventory was found to have skewed sense distributions, practice-specific senses, and incorrect uses. Of 440 abbreviations and acronyms analyzed in this study, 949 long forms were identified in clinical notes. This set was mapped to 17,359, 5233, and 4879 long forms in UMLS, ADAM, and Stedman's, respectively. After merging long forms, only 2.3% matched across all medical resources. The UMLS, ADAM, and Stedman's covered 5.7%, 8.4%, and 11% of the merged clinical long forms, respectively. The sense inventory of clinical abbreviations and acronyms and anonymized datasets generated from this study are available for public use at http://www.bmhi.umn.edu/ihi/research/nlpie/resources/index.htm ('Sense Inventories', website). Clinical sense inventories of abbreviations and acronyms created using clinical notes and medical dictionary resources demonstrate challenges with term coverage and resource integration. Further work is needed to help with standardizing abbreviations and acronyms in clinical care and biomedicine to facilitate automated processes such as text-mining and information extraction.

  11. A global distributed basin morphometric dataset

    NASA Astrophysics Data System (ADS)

    Shen, Xinyi; Anagnostou, Emmanouil N.; Mei, Yiwen; Hong, Yang

    2017-01-01

    Basin morphometry is vital information for relating storms to hydrologic hazards, such as landslides and floods. In this paper we present the first comprehensive global dataset of distributed basin morphometry at 30 arc seconds resolution. The dataset includes nine prime morphometric variables; in addition we present formulas for generating twenty-one additional morphometric variables based on combinations of the prime variables. The dataset can aid different applications including studies of land-atmosphere interaction, and modelling of floods and droughts for sustainable water management. The validity of the dataset has been confirmed by successfully reproducing Hack's law.
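
    Hack's law relates mainstream length L to drainage area A as L = c·A^h, so reproducing it from the dataset amounts to a log-log regression. A minimal sketch on synthetic basins (hypothetical values; the exponent is typically reported near 0.6):

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    area_km2 = 10 ** rng.uniform(1, 5, 200)                               # basin drainage areas, 10 to 1e5 km2
    length_km = 1.4 * area_km2 ** 0.6 * np.exp(rng.normal(0, 0.1, 200))   # synthetic Hack's-law data

    # Fit log10(L) = log10(c) + h * log10(A) by least squares.
    h, log_c = np.polyfit(np.log10(area_km2), np.log10(length_km), 1)
    print(f"Hack exponent h = {h:.2f}, coefficient c = {10 ** log_c:.2f}")
    ```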

  12. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks.

    PubMed

    Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.
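
    To make the off-line phase concrete, the sketch below mines single-step sequence rules (cell A followed by cell B) from toy anonymized trajectories and flags as privacy-sensitive those rules whose consequent falls in a sensitive grid cell. Cell names, thresholds and the rule form are invented simplifications of the method described.

    ```python
    from collections import Counter

    # Toy anonymized trajectories: each is a sequence of spatial grid-cell IDs.
    trajectories = [
        ["g1", "g2", "g7"], ["g1", "g2", "g7"], ["g3", "g2", "g7"],
        ["g1", "g4", "g5"], ["g3", "g4", "g5"], ["g1", "g2", "g5"],
    ]
    sensitive_cells = {"g7"}          # grid cells overlapping privacy-sensitive regions (hypothetical)
    min_support, min_confidence = 2, 0.6

    pair_counts = Counter()
    antecedent_counts = Counter()
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):   # consecutive (A -> B) steps
            pair_counts[(a, b)] += 1
            antecedent_counts[a] += 1

    for (a, b), support in pair_counts.items():
        confidence = support / antecedent_counts[a]
        if support >= min_support and confidence >= min_confidence:
            tag = "PRIVACY-SENSITIVE" if b in sensitive_cells else "ordinary"
            print(f"rule {a} -> {b}: support={support}, confidence={confidence:.2f} [{tag}]")
    ```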

  13. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks

    PubMed Central

    Wu, Chenxue; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules. PMID:28767687

  14. Multi-Fidelity Framework for Modeling Combustion Instability

    DTIC Science & Technology

    2016-07-27

    ... generated from the reduced-domain dataset. Evaluations of the framework are performed based on simplified test problems for a model rocket combustor showing ...

  15. Adaptive template generation for amyloid PET using a deep learning approach.

    PubMed

    Kang, Seung Kwan; Seo, Seongho; Shin, Seong A; Byun, Min Soo; Lee, Dong Young; Kim, Yu Kyeong; Lee, Dong Soo; Lee, Jae Sung

    2018-05-11

    Accurate spatial normalization (SN) of amyloid positron emission tomography (PET) images for Alzheimer's disease assessment without coregistered anatomical magnetic resonance imaging (MRI) of the same individual is technically challenging. In this study, we applied deep neural networks to generate individually adaptive PET templates for robust and accurate SN of amyloid PET without using matched 3D MR images. Using 681 pairs of simultaneously acquired 11C-PIB PET and T1-weighted 3D MRI scans of AD, MCI, and cognitively normal subjects, we trained and tested two deep neural networks [convolutional auto-encoder (CAE) and generative adversarial network (GAN)] that produce adaptive PET templates. More specifically, the networks were trained using 685,100 pieces of augmented data generated by rotating 527 randomly selected datasets and validated using 154 datasets. The input to the supervised neural networks was the 3D PET volume in native space and the label was the spatially normalized 3D PET image using the transformation parameters obtained from MRI-based SN. The proposed deep learning approach significantly enhanced the quantitative accuracy of MRI-less amyloid PET assessment by reducing the SN error observed when an average amyloid PET template is used. Given an input image, the trained deep neural networks rapidly provide individually adaptive 3D PET templates without any discontinuity between the slices (in 0.02 s). As the proposed method does not require 3D MRI for the SN of PET images, it has great potential for use in routine analysis of amyloid PET images in clinical practice and research. © 2018 Wiley Periodicals, Inc.
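
    A minimal PyTorch sketch of the convolutional auto-encoder idea (native-space PET volume in, MRI-derived spatially normalized PET volume as the training label) is given below; the layer sizes, volume shapes and training loop are illustrative stand-ins, not the architecture used in the paper.

    ```python
    import torch
    import torch.nn as nn

    class PETTemplateCAE(nn.Module):
        """Toy 3D convolutional auto-encoder: native-space PET volume in,
        template-space-like PET volume out. Layer sizes are illustrative only."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose3d(16, 1, kernel_size=4, stride=2, padding=1),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # One training step on random stand-ins for (native PET, MRI-based spatially normalized PET).
    model = PETTemplateCAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    native_pet = torch.randn(2, 1, 64, 64, 64)          # batch of native-space volumes (simulated)
    normalized_pet = torch.randn(2, 1, 64, 64, 64)      # labels from MRI-based SN (simulated)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(native_pet), normalized_pet)
    loss.backward()
    optimizer.step()
    print(f"training loss: {loss.item():.4f}")
    ```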

  16. Hierarchical storage of large volume of multidetector CT data using distributed servers

    NASA Astrophysics Data System (ADS)

    Ratib, Osman; Rosset, Antoine; Heuberger, Joris; Bandon, David

    2006-03-01

    Multidetector scanners and hybrid multimodality scanners can generate large numbers of high-resolution images, resulting in very large datasets. In most cases, these datasets are generated for the sole purpose of generating secondary processed images and 3D rendered images as well as oblique and curved multiplanar reformatted images. It is therefore not essential to archive the original images after they have been processed. We have developed an architecture of distributed archive servers for temporary storage of large image datasets for 3D rendering and image processing without the need for long term storage in a PACS archive. With the relatively low cost of storage devices it is possible to configure these servers to hold several months or even years of data, long enough to allow subsequent re-processing if required by specific clinical situations. We tested the latest generation of RAID servers provided by Apple computers with a capacity of 5 TBytes. We implemented peer-to-peer data access software based on our open-source image management software, OsiriX, allowing remote workstations to directly access DICOM image files located on the server through a new technology called "Bonjour". This architecture offers a seamless integration of multiple servers and workstations without the need for a central database or complex workflow management tools. It allows efficient access to image data from multiple workstations for image analysis and visualization without the need for image data transfer. It provides a convenient alternative to centralized PACS architecture while avoiding complex and time-consuming data transfer and storage.

  17. Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

    PubMed Central

    Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.

    2016-01-01

    An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777
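
    A toy version of the core read-simulation idea, sampling reads from a reference and injecting substitution errors at a fixed rate, is sketched below; NEAT itself uses empirically derived mutation and sequencing-error models and many more parameters.

    ```python
    import random

    def simulate_reads(reference, n_reads, read_length=100, error_rate=0.01, seed=0):
        """Sample reads uniformly from `reference` and add substitution errors.
        A deliberately simplistic stand-in for NEAT's empirical error models."""
        rng = random.Random(seed)
        bases = "ACGT"
        reads = []
        for _ in range(n_reads):
            start = rng.randrange(len(reference) - read_length + 1)
            read = list(reference[start:start + read_length])
            for i in range(read_length):
                if rng.random() < error_rate:
                    read[i] = rng.choice([b for b in bases if b != read[i]])
            reads.append((start, "".join(read)))
        return reads

    # Hypothetical reference sequence.
    random.seed(1)
    reference = "".join(random.choice("ACGT") for _ in range(10_000))
    for start, read in simulate_reads(reference, n_reads=3, read_length=60):
        print(start, read)
    ```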

  18. CrossCheck: an open-source web tool for high-throughput screen data analysis.

    PubMed

    Najafov, Jamil; Najafov, Ayaz

    2017-07-19

    Modern high-throughput screening methods allow researchers to generate large datasets that potentially contain important biological information. However, picking relevant hits from such screens and generating testable hypotheses often requires training in bioinformatics and the skills to efficiently perform database mining. There are currently no tools available to the general public that allow users to cross-reference their screen datasets with published screen datasets. To this end, we developed CrossCheck, an online platform for high-throughput screen data analysis. CrossCheck is a centralized database that allows effortless comparison of the user-entered list of gene symbols with 16,231 published datasets. These datasets include published data from genome-wide RNAi and CRISPR screens, interactome proteomics and phosphoproteomics screens, cancer mutation databases, low-throughput studies of major cell signaling mediators, such as kinases, E3 ubiquitin ligases and phosphatases, and gene ontological information. Moreover, CrossCheck includes a novel database of predicted protein kinase substrates, which was developed using proteome-wide consensus motif searches. CrossCheck dramatically simplifies high-throughput screen data analysis and enables researchers to dig deep into the published literature and streamline data-driven hypothesis generation. CrossCheck is freely accessible as a web-based application at http://proteinguru.com/crosscheck.

  19. Automated generation of massive image knowledge collections using Microsoft Live Labs Pivot to promote neuroimaging and translational research.

    PubMed

    Viangteeravat, Teeradache; Anyanwu, Matthew N; Ra Nagisetty, Venkateswara; Kuscu, Emin

    2011-07-15

    Massive datasets comprising high-resolution images, generated in neuro-imaging studies and in clinical imaging research, increasingly challenge our ability to analyze, share, and filter images in clinical and basic translational research. Pivot collection exploratory analysis lets each user interact with massive amounts of visual data, providing the sorting flexibility and speed needed to fluidly access, explore and analyze large sets of high-resolution images and their associated metadata, such as the neuro-imaging databases of the Allen Brain Atlas. It is used to cluster, filter, share and classify the visual data into deep-zoom levels and metadata categories, revealing hidden patterns within the dataset. We deployed prototype Pivot collections on CentOS Linux running the Apache web server, and also tested them on other operating systems such as Windows and UNIX. The approach yields very good results compared with other approaches used for the generation and clustering of massive image collections, such as the coronal and horizontal sections of the mouse brain from the Allen Brain Atlas. Pivot visual analytics was used to analyze a prototype dataset of Dab2 co-expressed genes from the Allen Brain Atlas. The metadata and high-resolution images were automatically extracted using the Allen Brain Atlas API and then used to identify hidden information based on the categories and conditions applied through options generated from the automated collection. A metadata category such as chromosome, together with case-level data such as sex, age, and plane attributes of a particular gene, is used to filter and sort the collection and to determine whether other genes share characteristics similar to Dab2. Online access to the mouse brain Pivot collection is available at http://edtech-dev.uthsc.edu/CTSI/teeDev1/unittest/PaPa/collection.html (user name: tviangte, password: demome). Our proposed algorithm automates the creation of large image Pivot collections, enabling investigators in clinical research projects to easily and quickly analyse image collections from a perspective that is useful for making critical decisions about the image patterns discovered.

  20. Using two on-going HIV studies to obtain clinical data from before, during and after pregnancy for HIV-positive women.

    PubMed

    Huntington, Susie E; Bansi, Loveleen K; Thorne, Claire; Anderson, Jane; Newell, Marie-Louise; Taylor, Graham P; Pillay, Deenan; Hill, Teresa; Tookey, Pat A; Sabin, Caroline A

    2012-07-28

    The UK Collaborative HIV Cohort (UK CHIC) is an observational study that collates data on HIV-positive adults accessing HIV clinical care at (currently) 13 large clinics in the UK but does not collect pregnancy-specific data. The National Study of HIV in Pregnancy and Childhood (NSHPC) collates data on HIV-positive women receiving antenatal care from every maternity unit in the UK and Ireland. Both studies collate pseudonymised data and neither dataset contains unique patient identifiers. A methodology was developed to find and match records for women reported to both studies, thereby obtaining clinical and treatment data on pregnant HIV-positive women not available from either dataset alone. Women in UK CHIC receiving HIV clinical care in 1996-2009 were found in the NSHPC dataset by initially 'linking' records with identical date of birth; linked records were then accepted as a genuine 'match' if they had further matching fields, including CD4 test date. In total, 2063 women were found in both datasets, representing 23.1% of HIV-positive women with a pregnancy in the UK (n = 8932). Clinical data were available in UK CHIC following most pregnancies (92.0%, 2471/2685 pregnancies starting before 2009). There was bias towards matching women with repeat pregnancies (35.9% (741/2063) of women found in both datasets had a repeat pregnancy compared to 21.9% (1502/6869) of women in NSHPC only) and towards women diagnosed with HIV before their first reported pregnancy (54.8% (1131/2063) compared to 47.7% (3278/6869)). Through the use of demographic data and clinical dates, records from two independent studies were successfully matched, providing data not available from either study alone.

  1. KernelADASYN: Kernel Based Adaptive Synthetic Data Generation for Imbalanced Learning

    DTIC Science & Technology

    2015-08-17

    [Fragmentary indexed excerpt] The available text indicates that the method is evaluated on clinical benchmark datasets including the Indian liver patient dataset (ILPD), the Parkinsons dataset, the Vertebral Column dataset, the breast cancer dataset and the breast tissue dataset (both used to predict whether a patient is normal or abnormal from the measurements), and the SPECT dataset.

  2. Human neutral genetic variation and forensic STR data.

    PubMed

    Silva, Nuno M; Pereira, Luísa; Poloni, Estella S; Currat, Mathias

    2012-01-01

    The forensic genetics field is generating extensive population data on the polymorphism of short tandem repeat (STR) markers in globally distributed samples. In this study we explored and quantified the informative power of these datasets to address issues related to human evolution and diversity, using two online resources: an allele frequency dataset representing 141 populations and almost 26 thousand individuals, and a genotype dataset consisting of 42 populations and more than 11 thousand individuals. We show that the genetic relationships between populations based on forensic STRs are best explained by geography, as observed when analysing other worldwide datasets generated specifically to study human diversity. However, the global level of genetic differentiation between populations (as measured by a fixation index) is about half the value estimated with those other datasets, which contain a much higher number of markers but far fewer individuals. We suggest that the main factor explaining this difference is an ascertainment bias in forensic data resulting from the choice of markers for individual identification. We show that this choice results in a low average variance of heterozygosity across world regions, and hence in low differentiation among populations. Thus, the forensic genetic markers currently produced for the purpose of individual assignment and identification allow the detection of the patterns of neutral genetic structure that characterize the human population, but they underestimate the levels of this genetic structure compared to the datasets of STRs (or other kinds of markers) generated specifically to study the diversity of human populations.

  3. A nonparametric statistical technique for combining global precipitation datasets: development and hydrological evaluation over the Iberian Peninsula

    NASA Astrophysics Data System (ADS)

    Abul Ehsan Bhuiyan, Md; Nikolopoulos, Efthymios I.; Anagnostou, Emmanouil N.; Quintana-Seguí, Pere; Barella-Ortiz, Anaïs

    2018-02-01

    This study investigates the use of a nonparametric, tree-based model, quantile regression forests (QRF), for combining multiple global precipitation datasets and characterizing the uncertainty of the combined product. We used the Iberian Peninsula as the study area, with a study period spanning 11 years (2000-2010). Inputs to the QRF model included three satellite precipitation products, CMORPH, PERSIANN, and 3B42 (V7); an atmospheric reanalysis precipitation and air temperature dataset; satellite-derived near-surface daily soil moisture data; and a terrain elevation dataset. We calibrated the QRF model for two seasons and two terrain elevation categories and used it to generate ensemble precipitation realizations for these conditions. Evaluation of the combined product was based on a high-resolution, ground-reference precipitation dataset (SAFRAN) available at 5 km / 1 h resolution. Furthermore, to evaluate relative improvements and the overall impact of the combined product on hydrological response, we used the generated ensemble to force a distributed hydrological model (the SURFEX land surface model and the RAPID river routing scheme) and compared its streamflow simulation results with the corresponding simulations from the individual global precipitation and reference datasets. We concluded that the proposed technique can generate realizations that successfully encapsulate the reference precipitation and provide significant improvement in streamflow simulations, with reductions in systematic and random error on the order of 20-99% and 44-88%, respectively, when considering the ensemble mean.
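
    As a rough illustration of the quantile-regression-forest idea (an approximation, not the exact QRF algorithm or the paper's data), the sketch below fits a scikit-learn random forest on a synthetic stack of predictors and derives prediction quantiles from the spread of the individual trees; the feature names in the comment are assumptions mirroring the inputs listed above.

        # Sketch of the quantile-regression-forest idea using scikit-learn (an
        # approximation: quantiles are taken over per-tree predictions rather than
        # the weighted training distribution). Predictors and data are synthetic
        # stand-ins for the satellite, reanalysis, soil moisture and elevation inputs.
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.random((500, 6))      # e.g. CMORPH, PERSIANN, 3B42, reanalysis P, soil moisture, elevation
        y = X[:, :3].mean(axis=1) + 0.1 * rng.standard_normal(500)   # "reference" precipitation

        forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

        X_new = rng.random((3, 6))
        per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
        q10, q50, q90 = np.percentile(per_tree, [10, 50, 90], axis=0)
        print(q10, q50, q90)          # ensemble spread characterizes the uncertainty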

  4. A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge

    PubMed Central

    Gururaj, Anupama E.; Chen, Xiaoling; Pournejati, Saeid; Alter, George; Hersh, William R.; Demner-Fushman, Dina; Ohno-Machado, Lucila

    2017-01-01

    Abstract The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval. Database URL: https://biocaddie.org/benchmark-data PMID:29220453

  5. CPTAC Releases Largest-Ever Breast Cancer Proteome Dataset from Previously Genome Characterized Tumors | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) scientists have released a dataset of proteins and  phosphopeptides identified through deep proteomic and phosphoproteomic analysis of breast tumor samples, previously genomically analyzed by The Cancer Genome Atlas (TCGA).

  6. Inference for multivariate regression model based on multiply imputed synthetic data generated via posterior predictive sampling

    NASA Astrophysics Data System (ADS)

    Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.

    2017-06-01

    The recent popularity of the use of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods for generating and analyzing such data, but these almost always rely on asymptotic distributions and are consequently not adequate for small-sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based on exact distributions, this procedure may be used even with small-sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with the results obtained from an adaptation of Reiter's combination rule to multiply imputed synthetic datasets, and an application to the 2000 Current Population Survey is discussed.
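
    A toy numpy sketch of the synthetic-data generation step (posterior predictive sampling for a linear regression with a flat prior and known noise variance) is given below; it is only meant to illustrate how multiply imputed synthetic responses arise, not the exact likelihood-based inference procedure derived in the paper.

        # Toy posterior predictive sampling for a linear regression with a flat prior
        # and known noise variance (not the exact inference procedure of the paper):
        # each pass draws a coefficient vector from the posterior and then a synthetic
        # response vector from the predictive distribution.
        import numpy as np

        rng = np.random.default_rng(1)
        n, p, sigma2, m = 100, 3, 1.0, 5       # m = number of synthetic datasets
        X = rng.standard_normal((n, p))
        beta_true = np.array([1.0, -2.0, 0.5])
        y = X @ beta_true + np.sqrt(sigma2) * rng.standard_normal(n)

        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y           # posterior mean under a flat prior

        synthetic = []
        for _ in range(m):
            beta_draw = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
            synthetic.append(X @ beta_draw + np.sqrt(sigma2) * rng.standard_normal(n))

        # Naive combined estimate across the m synthetic datasets.
        print(np.mean([np.linalg.lstsq(X, ys, rcond=None)[0] for ys in synthetic], axis=0))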

  7. Device and methods for "gold standard" registration of clinical 3D and 2D cerebral angiograms

    NASA Astrophysics Data System (ADS)

    Madan, Hennadii; Likar, Boštjan; Pernuš, Franjo; Å piclin, Žiga

    2015-03-01

    Translation of novel and existing 3D-2D image registration methods into clinical image-guidance systems is limited by the lack of objective validation on clinical image datasets. The main reason is that, besides the calibration of the 2D imaging system, a reference or "gold standard" registration is very difficult to obtain on clinical image datasets. In the context of cerebral endovascular image-guided interventions (EIGIs), we present a calibration device in the form of a headband with integrated fiducial markers and, secondly, propose an automated pipeline comprising 3D and 2D image processing, analysis and annotation steps, the result of which is a retrospective calibration of the 2D imaging system and an optimal, i.e., "gold standard", registration of 3D and 2D images. The device and methods were used to create the "gold standard" on 15 datasets of 3D and 2D cerebral angiograms, where each dataset was acquired from a patient undergoing EIGI for either aneurysm coiling or embolization of an arteriovenous malformation. The device integrated seamlessly into the clinical workflow of EIGI, while the automated pipeline eliminated all manual input and interactive image processing, analysis or annotation. In this way, the time to obtain the "gold standard" was reduced from 30 minutes to less than one minute, and the "gold standard" 3D-2D registration on all 15 datasets of cerebral angiograms was obtained with sub-0.1 mm accuracy.

  8. Evaluation of experimental design and computational parameter choices affecting analyses of ChIP-seq and RNA-seq data in undomesticated poplar trees.

    Treesearch

    Lijun Liu; V. Missirian; Matthew S. Zinkgraf; Andrew Groover; V. Filkov

    2014-01-01

    Background: One of the great advantages of next generation sequencing is the ability to generate large genomic datasets for virtually all species, including non-model organisms. It should be possible, in turn, to apply advanced computational approaches to these datasets to develop models of biological processes. In a practical sense, working with non-model organisms...

  9. Molecular digital pathology: progress and potential of exchanging molecular data.

    PubMed

    Roy, Somak; Pfeifer, John D; LaFramboise, William A; Pantanowitz, Liron

    2016-09-01

    Many of the demands to perform next generation sequencing (NGS) in the clinical laboratory can be resolved using the principles of telepathology. Molecular telepathology can allow facilities to outsource all or a portion of their NGS operation such as cloud computing, bioinformatics pipelines, variant data management, and knowledge curation. Clinical pathology laboratories can electronically share diverse types of molecular data with reference laboratories, technology service providers, and/or regulatory agencies. Exchange of electronic molecular data allows laboratories to perform validation of rare diseases using foreign data, check the accuracy of their test results against benchmarks, and leverage in silico proficiency testing. This review covers the emerging subject of molecular telepathology, describes clinical use cases for the appropriate exchange of molecular data, and highlights key issues such as data integrity, interoperable formats for massive genomic datasets, security, malpractice and emerging regulations involved with this novel practice.

  10. Active contour based segmentation of resected livers in CT images

    NASA Astrophysics Data System (ADS)

    Oelmann, Simon; Oyarzun Laura, Cristina; Drechsler, Klaus; Wesarg, Stefan

    2015-03-01

    The majority of state-of-the-art segmentation algorithms are able to give proper results in healthy organs but not in pathological ones. However, many clinical applications require an accurate segmentation of pathological organs; the determination of target boundaries for radiotherapy and liver volumetry calculations are examples of this. Volumetry measurements are of special interest after tumor resection for follow-up of liver regrowth. The segmentation of resected livers presents additional challenges that were not addressed by state-of-the-art algorithms. This paper presents a snakes-based algorithm specially developed for the segmentation of resected livers. The algorithm is enhanced with a novel dynamic smoothing technique that allows the active contour to propagate with different speeds depending on the intensities visible in its neighborhood. The algorithm is evaluated on 6 clinical CT images as well as 18 artificial datasets generated from additional clinical CT images.
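
    For orientation, the following is a 2D toy of a snakes-style active contour using scikit-image; the paper's dynamic smoothing technique for resected livers is not implemented here, and the synthetic image, initialization and parameter values are assumptions.

        # 2D toy of a snakes-style active contour with scikit-image (the dynamic
        # smoothing variant for resected livers is not implemented): a circular
        # initial contour converges to the boundary of a bright synthetic disk.
        import numpy as np
        from skimage.filters import gaussian
        from skimage.segmentation import active_contour

        img = np.zeros((200, 200))
        rr, cc = np.ogrid[:200, :200]
        img[(rr - 100) ** 2 + (cc - 100) ** 2 < 60 ** 2] = 1.0   # bright "organ"

        s = np.linspace(0, 2 * np.pi, 200)
        init = np.column_stack([100 + 85 * np.sin(s), 100 + 85 * np.cos(s)])  # initial circle

        snake = active_contour(gaussian(img, 3), init, alpha=0.015, beta=10, gamma=0.001)
        print(snake.shape)   # (200, 2): the converged contour points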

  11. BioStar models of clinical and genomic data for biomedical data warehouse design

    PubMed Central

    Wang, Liangjiang; Ramanathan, Murali

    2008-01-01

    Biomedical research is now generating large amounts of data, ranging from clinical test results to microarray gene expression profiles. The scale and complexity of these datasets give rise to substantial challenges in data management and analysis. It is highly desirable that data warehousing and online analytical processing technologies can be applied to biomedical data integration and mining. The major difficulty probably lies in the task of capturing and modelling diverse biological objects and their complex relationships. This paper describes multidimensional data modelling for biomedical data warehouse design. Since the conventional models such as star schema appear to be insufficient for modelling clinical and genomic data, we develop a new model called BioStar schema. The new model can capture the rich semantics of biomedical data and provide greater extensibility for the fast evolution of biological research methodologies. PMID:18048122

  12. Level 2 Ancillary Products and Datasets Algorithm Theoretical Basis

    NASA Technical Reports Server (NTRS)

    Diner, D.; Abdou, W.; Gordon, H.; Kahn, R.; Knyazikhin, Y.; Martonchik, J.; McDonald, D.; McMuldroch, S.; Myneni, R.; West, R.

    1999-01-01

    This Algorithm Theoretical Basis (ATB) document describes the algorithms used to generate the parameters of certain ancillary products and datasets used during Level 2 processing of Multi-angle Imaging SpectroRadiometer (MISR) data.

  13. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

    PubMed Central

    Wernisch, Lorenz

    2017-01-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190

  14. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

    PubMed

    Gabasova, Evelina; Reid, John; Wernisch, Lorenz

    2017-10-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.

  15. Developing a semantic web model for medical differential diagnosis recommendation.

    PubMed

    Mohammed, Osama; Benlamri, Rachid

    2014-10-01

    In this paper we describe a novel model for differential diagnosis designed to make recommendations by utilizing semantic web technologies. The model is a response to a number of requirements, ranging from incorporating essential clinical diagnostic semantics to the integration of data mining for the process of identifying candidate diseases that best explain a set of clinical features. We introduce two major components, which we find essential to the construction of an integral differential diagnosis recommendation model: the evidence-based recommender component and the proximity-based recommender component. Both approaches are driven by disease diagnosis ontologies designed specifically to enable the process of generating diagnostic recommendations. These ontologies are the disease symptom ontology and the patient ontology. The evidence-based diagnosis process develops dynamic rules based on standardized clinical pathways. The proximity-based component employs data mining to provide clinicians with diagnosis predictions, as well as generates new diagnosis rules from provided training datasets. This article describes the integration between these two components along with the developed diagnosis ontologies to form a novel medical differential diagnosis recommendation model. This article also provides test cases from the implementation of the overall model, which shows quite promising diagnostic recommendation results.

  16. Generation of a precise DEM by interactive synthesis of multi-temporal elevation datasets: a case study of Schirmacher Oasis, East Antarctica

    NASA Astrophysics Data System (ADS)

    Jawak, Shridhar D.; Luis, Alvarinho J.

    2016-05-01

    A digital elevation model (DEM) is indispensable for analyses such as topographic feature extraction, ice-sheet melting, slope stability analysis and landscape analysis. Such analyses require a highly accurate DEM. Available DEMs of the Antarctic region, compiled using radar altimetry and the Antarctic digital database, show elevation variations of up to hundreds of meters, which necessitates the generation of an improved local DEM. An improved DEM of the Schirmacher Oasis, East Antarctica, has been generated by synergistically fusing satellite-derived laser altimetry data from the Geoscience Laser Altimeter System (GLAS), Radarsat Antarctic Mapping Project (RAMP) elevation data and Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) global elevation data (GDEM). This is a distinctive attempt to generate a DEM of a part of Antarctica by fusing multiple elevation datasets, which is essential for modelling ice elevation change and addressing ice mass balance. We analyzed a suite of interpolation techniques for constructing a DEM from GLAS, RAMP and ASTER DEM-based point elevation datasets, in order to determine the level of confidence with which the interpolation techniques can generate a better interpolated continuous surface, and eventually to improve the elevation accuracy of the DEM obtained from the synergistically fused RAMP, GLAS and ASTER point elevation datasets. The DEM presented in this work has a vertical accuracy (≈ 23 m) better than the RAMP DEM (≈ 57 m) and the ASTER DEM (≈ 64 m) individually. The RAMP DEM and ASTER DEM elevations were corrected using differential GPS elevations as ground reference data, and the accuracy obtained after fusing the multitemporal datasets is found to be better than that of existing DEMs constructed using RAMP or ASTER alone. This is our second attempt at fusing multitemporal, multisensor and multisource elevation data to generate a DEM of Antarctica in order to address ice elevation change and ice mass balance. Our approach focuses on the strengths of each elevation data source to produce an accurate elevation model.

  17. Atlas-based segmentation technique incorporating inter-observer delineation uncertainty for whole breast

    NASA Astrophysics Data System (ADS)

    Bell, L. R.; Dowling, J. A.; Pogson, E. M.; Metcalfe, P.; Holloway, L.

    2017-01-01

    Accurate, efficient auto-segmentation methods are essential for the clinical efficacy of adaptive radiotherapy delivered with highly conformal techniques. Current atlas-based auto-segmentation techniques are adequate in this respect but fail to account for inter-observer variation. An atlas-based segmentation method that incorporates inter-observer variation is proposed. This method is validated for a whole breast radiotherapy cohort containing 28 CT datasets with CTVs delineated by eight observers. To optimise atlas accuracy, the cohort was divided into categories by mean body mass index and laterality, with atlases generated for each category in a leave-one-out approach. Observer CTVs were merged and thresholded to generate an auto-segmentation model representing both inter-observer and inter-patient differences. For each category, the atlas was registered to the left-out dataset to enable propagation of the auto-segmentation from atlas space. Auto-segmentation time was recorded. The segmentation was compared to the gold-standard contour using the Dice similarity coefficient (DSC) and mean absolute surface distance (MASD). Comparison with the smallest and largest CTV was also made. This atlas-based auto-segmentation method incorporating inter-observer variation was shown to be efficient (< 4 min) and accurate for whole breast radiotherapy, with good agreement (DSC > 0.7, MASD < 9.3 mm) between the auto-segmented contours and CTV volumes.
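
    The Dice similarity coefficient used above is simple to compute; a minimal numpy sketch with illustrative binary masks (not the study's contours) is shown below.

        # Minimal Dice similarity coefficient (DSC) between a binary auto-segmentation
        # and a reference contour; the masks below are illustrative, not study data.
        import numpy as np

        def dice(a, b):
            a, b = a.astype(bool), b.astype(bool)
            denom = a.sum() + b.sum()
            return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

        auto_seg = np.zeros((64, 64), dtype=bool); auto_seg[10:40, 10:40] = True
        reference = np.zeros((64, 64), dtype=bool); reference[12:42, 12:42] = True
        print(round(dice(auto_seg, reference), 3))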

  18. A machine learning approach as a surrogate of finite element analysis-based inverse method to estimate the zero-pressure geometry of human thoracic aorta.

    PubMed

    Liang, Liang; Liu, Minliang; Martin, Caitlin; Sun, Wei

    2018-05-09

    Advances in structural finite element analysis (FEA) and medical imaging have made it possible to investigate the in vivo biomechanics of human organs such as blood vessels, for which organ geometries at the zero-pressure level need to be recovered. Although FEA-based inverse methods are available for zero-pressure geometry estimation, these methods typically require iterative computation, which is time-consuming and may not be suitable for time-sensitive clinical applications. In this study, using machine learning (ML) techniques, we developed an ML model to estimate the zero-pressure geometry of the human thoracic aorta given two pressurized geometries of the same patient at two different blood pressure levels. For the ML model development, an FEA-based method was used to generate a dataset of aorta geometries of 3125 virtual patients. The ML model, which was trained and tested on this dataset, is capable of recovering zero-pressure geometries consistent with those generated by the FEA-based method. Thus, this study demonstrates the feasibility and great potential of using ML techniques as a fast surrogate of FEA-based inverse methods to recover zero-pressure geometries of human organs. Copyright © 2018 John Wiley & Sons, Ltd.
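
    A hedged sketch of the surrogate idea, under the assumption that geometries are represented as flattened node-coordinate vectors, is given below: a standard regressor is trained to map two pressurized geometries to the zero-pressure geometry. The data are synthetic and the regressor is a plain ridge model, not the ML model or FEA-generated dataset used in the paper.

        # Hedged sketch of the surrogate idea on synthetic data: learn a mapping from
        # two pressurized node-coordinate vectors to the zero-pressure coordinates.
        # The geometry representation, regressor and data are assumptions.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n_patients, n_coords = 3000, 300      # e.g. 100 surface nodes x 3 coordinates
        geom_zero = rng.standard_normal((n_patients, n_coords))
        geom_p1 = 1.05 * geom_zero + 0.01 * rng.standard_normal((n_patients, n_coords))  # lower pressure
        geom_p2 = 1.10 * geom_zero + 0.01 * rng.standard_normal((n_patients, n_coords))  # higher pressure

        X = np.hstack([geom_p1, geom_p2])     # input: the two pressurized geometries
        X_tr, X_te, y_tr, y_te = train_test_split(X, geom_zero, random_state=0)
        model = Ridge(alpha=1.0).fit(X_tr, y_tr)
        print("R^2 on held-out virtual patients:", round(model.score(X_te, y_te), 3))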

  19. MicroRNA array normalization: an evaluation using a randomized dataset as the benchmark.

    PubMed

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays.
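
    As an illustration of one standard normalization method of the kind evaluated here, the sketch below implements quantile normalization in numpy on a synthetic probe-by-array matrix; a batch/array-effect adjustment, as discussed above, would be applied to the matrix before this step. The data and dimensions are assumptions.

        # Sketch of quantile normalization for microRNA arrays (one standard method
        # of the kind evaluated in the study); a batch/array-effect adjustment, if
        # used, would be applied to `expr` before this call. Data are synthetic.
        import numpy as np

        def quantile_normalize(expr):
            """expr: probes x arrays matrix; returns a quantile-normalized copy."""
            ranks = expr.argsort(axis=0).argsort(axis=0)      # rank of each probe per array
            mean_sorted = np.sort(expr, axis=0).mean(axis=1)  # average distribution
            return mean_sorted[ranks]

        rng = np.random.default_rng(0)
        expr = rng.lognormal(mean=2.0, sigma=1.0, size=(1000, 8))  # 1000 miRNAs, 8 arrays
        norm = quantile_normalize(expr)
        # After normalization every array shares the same empirical distribution.
        print(np.allclose(np.sort(norm, axis=0)[:, 0], np.sort(norm, axis=0)[:, 1]))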

  20. MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark

    PubMed Central

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays. PMID:24905456

  1. Incorporating Topic Assignment Constraint and Topic Correlation Limitation into Clinical Goal Discovering for Clinical Pathway Mining.

    PubMed

    Xu, Xiao; Jin, Tao; Wei, Zhijie; Wang, Jianmin

    2017-01-01

    Clinical pathways are widely used around the world for providing quality medical treatment and controlling healthcare cost. However, expert-designed clinical pathways can hardly deal with the variance among hospitals and patients, which calls for a more dynamic and adaptive process derived from diverse clinical data. Topic-based clinical pathway mining is an effective approach to discover a concise process model. In this approach, the latent topics found by latent Dirichlet allocation (LDA) represent the clinical goals, and process mining methods are used to extract the temporal relations between these topics. However, topic quality is often unsatisfactory because of the low performance of LDA on clinical data. In this paper, we incorporate a topic assignment constraint and a topic correlation limitation into the LDA to enhance its ability to discover high-quality topics. Two real-world datasets are used to evaluate the proposed method. The results show that the topics discovered by our method have higher coherence, informativeness, and coverage than those of the original LDA. These quality topics are suitable to represent the clinical goals. We also illustrate that our method is effective in generating a comprehensive topic-based clinical pathway model.
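
    To make the topic-modelling step concrete, the following is a plain-LDA sketch with scikit-learn in which each patient's sequence of activity codes is treated as a document and the learned topics stand in for candidate clinical goals; the constrained LDA variant proposed in the paper is not implemented, and the activity codes are invented for illustration.

        # Plain-LDA sketch of the topic-based pathway idea (the constrained variant
        # proposed in the paper is not implemented): each patient's activity codes
        # form a document and the learned topics stand in for clinical goals.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        traces = [                                   # invented patient traces
            "admission blood_test xray antibiotics discharge",
            "admission blood_test antibiotics antibiotics discharge",
            "admission mri surgery icu_monitoring discharge",
            "admission xray surgery icu_monitoring discharge",
        ]
        vectorizer = CountVectorizer().fit(traces)
        X = vectorizer.transform(traces)

        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
        vocab = vectorizer.get_feature_names_out()
        for k, topic in enumerate(lda.components_):
            top = [vocab[i] for i in topic.argsort()[-3:][::-1]]
            print(f"topic {k} (candidate clinical goal): {top}")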

  2. Incorporating Topic Assignment Constraint and Topic Correlation Limitation into Clinical Goal Discovering for Clinical Pathway Mining

    PubMed Central

    Xu, Xiao; Wei, Zhijie

    2017-01-01

    Clinical pathways are widely used around the world for providing quality medical treatment and controlling healthcare cost. However, expert-designed clinical pathways can hardly deal with the variance among hospitals and patients, which calls for a more dynamic and adaptive process derived from diverse clinical data. Topic-based clinical pathway mining is an effective approach to discover a concise process model. In this approach, the latent topics found by latent Dirichlet allocation (LDA) represent the clinical goals, and process mining methods are used to extract the temporal relations between these topics. However, topic quality is often unsatisfactory because of the low performance of LDA on clinical data. In this paper, we incorporate a topic assignment constraint and a topic correlation limitation into the LDA to enhance its ability to discover high-quality topics. Two real-world datasets are used to evaluate the proposed method. The results show that the topics discovered by our method have higher coherence, informativeness, and coverage than those of the original LDA. These quality topics are suitable to represent the clinical goals. We also illustrate that our method is effective in generating a comprehensive topic-based clinical pathway model. PMID:29065617

  3. Optimising the clinical strategy for autoimmune liver diseases: Principles of value-based medicine.

    PubMed

    Carbone, Marco; Cristoferi, Laura; Cortesi, Paolo Angelo; Rota, Matteo; Ciaccio, Antonio; Okolicsanyi, Stefano; Gemma, Marta; Scalone, Luciana; Cesana, Giancarlo; Fabris, Luca; Colledan, Michele; Fagiuoli, Stefano; Ideo, Gaetano; Belli, Luca Saverio; Munari, Luca Maria; Mantovani, Lorenzo; Strazzabosco, Mario

    2018-04-01

    Autoimmune hepatitis, primary biliary cholangitis, and primary sclerosing cholangitis represent the three major autoimmune liver diseases (AILDs). Their management is highly specialized, requires a multidisciplinary approach and often relies on expensive, orphan drugs. Unfortunately, their treatment is often unsatisfactory, and the care pathway is heterogeneous across different centers. Disease-specific clinical outcome indicators (COIs) able to evaluate the whole cycle of care are needed to assist both clinicians and administrators in improving the quality and value of care. The aim of our study was to generate a set of COIs for the three AILDs. We then prospectively validated these indicators based on a series of consecutive patients recruited at three tertiary clinical centers in Lombardy, Italy. In phase I, a set of COIs was generated using a Delphi method and a RAND 9-point appropriateness scale. In phase II the indicators were applied to a real-life dataset: 214 patients were enrolled and followed up for a median time of 54 months, and the above COIs were recorded using a web-based electronic medical record program. The COIs were easy to collect in the clinical practice environment and their values compared well with the available natural history studies. We have generated a comprehensive set of COIs which sequentially capture different clinical outcomes of the three AILDs explored. These indicators represent a critical tool to implement a value-based approach to patients with these conditions, to monitor, compare and improve quality through benchmarking of clinical performance, and to assess the significance of novel drugs and technologies. This article is part of a Special Issue entitled: Cholangiocytes in Health and Disease, edited by Jesus Banales, Marco Marzioni, Nicholas LaRusso and Peter Jansen. Copyright © 2017. Published by Elsevier B.V.

  4. A Comparison of Lung Nodule Segmentation Algorithms: Methods and Results from a Multi-institutional Study.

    PubMed

    Kalpathy-Cramer, Jayashree; Zhao, Binsheng; Goldgof, Dmitry; Gu, Yuhua; Wang, Xingwei; Yang, Hao; Tan, Yongqiang; Gillies, Robert; Napel, Sandy

    2016-08-01

    Tumor volume estimation, as well as accurate and reproducible border segmentation in medical images, is important in the diagnosis, staging, and assessment of response to cancer therapy. The goal of this study was to demonstrate the feasibility of a multi-institutional effort to assess the repeatability and reproducibility of nodule borders and volume estimate bias of computerized segmentation algorithms in CT images of lung cancer, and to provide results from such a study. The dataset used for this evaluation consisted of 52 tumors in 41 CT volumes (40 patient datasets and 1 dataset containing scans of 12 phantom nodules of known volume) from five collections available in The Cancer Imaging Archive. Three academic institutions developing lung nodule segmentation algorithms submitted results for three repeat runs for each of the nodules. We compared the performance of the lung nodule segmentation algorithms by assessing several measures of spatial overlap and volume agreement. Nodule sizes varied from 29 μl to 66 ml and demonstrated a diversity of shapes. Agreement in spatial overlap of segmentations was significantly higher for multiple runs of the same algorithm than between segmentations generated by different algorithms (p < 0.05) and was significantly higher on the phantom dataset compared to the other datasets (p < 0.05). Algorithms differed significantly in the bias of the measured volumes of the phantom nodules (p < 0.05), underscoring the need for assessing performance on clinical data in addition to phantoms. Algorithms that most accurately estimated nodule volumes were not the most repeatable, emphasizing the need to evaluate both their accuracy and precision. There were considerable differences between algorithms, especially in a subset of heterogeneous nodules, underscoring the recommendation that the same software be used at all time points in longitudinal studies.

  5. P-MartCancer–Interactive Online Software to Enable Analysis of Shotgun Cancer Proteomic Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Webb-Robertson, Bobbie-Jo M.; Bramer, Lisa M.; Jensen, Jeffrey L.

    P-MartCancer is a new interactive web-based software environment that enables biomedical and biological scientists to perform in-depth analyses of global proteomics data without requiring direct interaction with the data or with statistical software. P-MartCancer offers a series of statistical modules associated with quality assessment, peptide and protein statistics, protein quantification and exploratory data analyses, driven by the user via customized workflows and interactive visualization. Currently, P-MartCancer offers access to multiple cancer proteomic datasets generated through the Clinical Proteomics Tumor Analysis Consortium (CPTAC) at the peptide, gene and protein levels. P-MartCancer is deployed using Azure technologies (http://pmart.labworks.org/cptac.html), the web service is alternatively available via Docker Hub (https://hub.docker.com/r/pnnl/pmart-web/), and many statistical functions can be utilized directly from an R package available on GitHub (https://github.com/pmartR).

  6. Scalable multi-sample single-cell data analysis by Partition-Assisted Clustering and Multiple Alignments of Networks

    PubMed Central

    Samusik, Nikolay; Wang, Xiaowei; Guan, Leying; Nolan, Garry P.

    2017-01-01

    Mass cytometry (CyTOF) has greatly expanded the capability of cytometry. It is now easy to generate multiple CyTOF samples in a single study, with each sample containing single-cell measurement on 50 markers for more than hundreds of thousands of cells. Current methods do not adequately address the issues concerning combining multiple samples for subpopulation discovery, and these issues can be quickly and dramatically amplified with increasing number of samples. To overcome this limitation, we developed Partition-Assisted Clustering and Multiple Alignments of Networks (PAC-MAN) for the fast automatic identification of cell populations in CyTOF data closely matching that of expert manual-discovery, and for alignments between subpopulations across samples to define dataset-level cellular states. PAC-MAN is computationally efficient, allowing the management of very large CyTOF datasets, which are increasingly common in clinical studies and cancer studies that monitor various tissue samples for each subject. PMID:29281633

  7. Protecting patient privacy when sharing patient-level data from clinical trials.

    PubMed

    Tucker, Katherine; Branson, Janice; Dilleen, Maria; Hollis, Sally; Loughlin, Paul; Nixon, Mark J; Williams, Zoë

    2016-07-08

    Greater transparency and, in particular, sharing of patient-level data for further scientific research is an increasingly important topic for the pharmaceutical industry and other organisations who sponsor and conduct clinical trials as well as generally in the interests of patients participating in studies. A concern remains, however, over how to appropriately prepare and share clinical trial data with third party researchers, whilst maintaining patient confidentiality. Clinical trial datasets contain very detailed information on each participant. Risk to patient privacy can be mitigated by data reduction techniques. However, retention of data utility is important in order to allow meaningful scientific research. In addition, for clinical trial data, an excessive application of such techniques may pose a public health risk if misleading results are produced. After considering existing guidance, this article makes recommendations with the aim of promoting an approach that balances data utility and privacy risk and is applicable across clinical trial data holders. Our key recommendations are as follows: 1. Data anonymisation/de-identification: Data holders are responsible for generating de-identified datasets which are intended to offer increased protection for patient privacy through masking or generalisation of direct and some indirect identifiers. 2. Controlled access to data, including use of a data sharing agreement: A legally binding data sharing agreement should be in place, including agreements not to download or further share data and not to attempt to seek to identify patients. Appropriate levels of security should be used for transferring data or providing access; one solution is use of a secure 'locked box' system which provides additional safeguards. This article provides recommendations on best practices to de-identify/anonymise clinical trial data for sharing with third-party researchers, as well as controlled access to data and data sharing agreements. The recommendations are applicable to all clinical trial data holders. Further work will be needed to identify and evaluate competing possibilities as regulations, attitudes to risk and technologies evolve.

  8. Discovery and Validation of Novel Expression Signature for Postcystectomy Recurrence in High-Risk Bladder Cancer

    PubMed Central

    Lam, Lucia L.; Ghadessi, Mercedeh; Erho, Nicholas; Vergara, Ismael A.; Alshalalfa, Mohammed; Buerki, Christine; Haddad, Zaid; Sierocinski, Thomas; Triche, Timothy J.; Skinner, Eila C.; Davicioni, Elai; Daneshmand, Siamak; Black, Peter C.

    2014-01-01

    Background Nearly half of muscle-invasive bladder cancer patients succumb to their disease following cystectomy. Selecting candidates for adjuvant therapy is currently based on clinical parameters with limited predictive power. This study aimed to develop and validate genomic-based signatures that can better identify patients at risk for recurrence than clinical models alone. Methods Transcriptome-wide expression profiles were generated using 1.4 million feature-arrays on archival tumors from 225 patients who underwent radical cystectomy and had muscle-invasive and/or node-positive bladder cancer. Genomic (GC) and clinical (CC) classifiers for predicting recurrence were developed on a discovery set (n = 133). Performances of GC, CC, an independent clinical nomogram (IBCNC), and genomic-clinicopathologic classifiers (G-CC, G-IBCNC) were assessed in the discovery and independent validation (n = 66) sets. GC was further validated on four external datasets (n = 341). Discrimination and prognostic abilities of classifiers were compared using area under receiver-operating characteristic curves (AUCs). All statistical tests were two-sided. Results A 15-feature GC was developed on the discovery set with area under curve (AUC) of 0.77 in the validation set. This was higher than individual clinical variables, IBCNC (AUC = 0.73), and comparable to CC (AUC = 0.78). Performance was improved upon combining GC with clinical nomograms (G-IBCNC, AUC = 0.82; G-CC, AUC = 0.86). G-CC high-risk patients had elevated recurrence probabilities (P < .001), with GC being the best predictor by multivariable analysis (P = .005). Genomic-clinicopathologic classifiers outperformed clinical nomograms by decision curve and reclassification analyses. GC performed the best in validation compared with seven prior signatures. GC markers remained prognostic across four independent datasets. Conclusions The validated genomic-based classifiers outperform clinical models for predicting postcystectomy bladder cancer recurrence. This may be used to better identify patients who need more aggressive management. PMID:25344601

  9. Using Clinical Data, Hypothesis Generation Tools and PubMed Trends to Discover the Association between Diabetic Retinopathy and Antihypertensive Drugs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Senter, Katherine G; Sukumar, Sreenivas R; Patton, Robert M

    Diabetic retinopathy (DR) is a leading cause of blindness and a common complication of diabetes. Many diabetic patients take antihypertensive drugs to prevent cardiovascular problems, but these drugs may have unintended consequences on eyesight. Six common classes of antihypertensive drug are angiotensin converting enzyme (ACE) inhibitors, alpha blockers, angiotensin receptor blockers (ARBs), β-blockers, calcium channel blockers, and diuretics. Analysis of medical history data might indicate which of these drugs provide safe blood pressure control, and a literature review is often used to guide such analyses. Beyond manual reading of relevant publications, we sought to identify quantitative trends in literature from the biomedical database PubMed to compare with quantitative trends in the clinical data. By recording and analyzing PubMed search results, we found wide variation in the prevalence of each antihypertensive drug in DR literature. Drug classes developed more recently, such as ACE inhibitors and ARBs, were most prevalent. We also identified instances of change over time in publication patterns. We then compared these literature trends to a dataset of 500 diabetic patients from the UT Hamilton Eye Institute. Data for each patient included the class of antihypertensive drug and the presence and severity of DR. Graphical comparison revealed that older drug classes such as diuretics, calcium channel blockers, and β-blockers were much more prevalent in the clinical data than in the DR and antihypertensive literature. Finally, quantitative analysis of the dataset revealed that patients taking β-blockers were statistically more likely to have DR than patients taking other medications, controlling for presence of hypertension and year of diabetes onset. This finding was concerning given the prevalence of β-blockers in the clinical data. We determined that clinical use of β-blockers should be minimized in diabetic patients to prevent retinal damage.
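
    The kind of adjusted comparison described in the last result can be sketched as a logistic regression of retinopathy status on β-blocker use, controlling for hypertension and year of diabetes onset; the statsmodels example below uses synthetic data with made-up coefficients, not the UT Hamilton Eye Institute dataset.

        # Sketch of the adjusted analysis described above (synthetic data, not the
        # clinical dataset): logistic regression of diabetic retinopathy status on
        # beta-blocker use, controlling for hypertension and year of diabetes onset.
        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(0)
        n = 500
        df = pd.DataFrame({
            "beta_blocker": rng.integers(0, 2, n),
            "hypertension": rng.integers(0, 2, n),
            "onset_year": rng.integers(1985, 2010, n),
        })
        # Made-up true effects used only to generate an outcome for the example.
        logit_p = -1.0 + 0.8 * df.beta_blocker + 0.4 * df.hypertension - 0.02 * (df.onset_year - 2000)
        df["dr"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

        model = smf.logit("dr ~ beta_blocker + hypertension + onset_year", data=df).fit(disp=0)
        print(model.params["beta_blocker"], model.pvalues["beta_blocker"])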

  10. Improving surveillance of sexually transmitted infections using mandatory electronic clinical reporting: the genitourinary medicine clinic activity dataset, England, 2009 to 2013.

    PubMed

    Savage, E J; Mohammed, H; Leong, G; Duffell, S; Hughes, G

    2014-12-04

    A new electronic surveillance system for sexually transmitted infections (STIs) was introduced in England in 2009. The genitourinary medicine clinic activity dataset (GUMCAD) is a mandatory, disaggregated, pseudo-anonymised data return submitted by all STI clinics across England. The dataset includes information on all STI diagnoses made and services provided alongside demographic characteristics for every patient attendance at a clinic. The new system enables the timely analysis and publication of routine STI data, detailed analyses of risk groups and longitudinal analyses of clinic attendees. The system offers flexibility so new codes can be introduced to help monitor outbreaks or unusual STI activity. From January 2009 to December 2013 inclusive, over twenty-five million records from a total of 6,668,648 patients of STI clinics have been submitted. This article describes the successful implementation of this new surveillance system and the types of epidemiological outputs and analyses that GUMCAD enables. The challenges faced are discussed and forthcoming developments in STI surveillance in England are described.

  11. Determining Cutoff Point of Ensemble Trees Based on Sample Size in Predicting Clinical Dose with DNA Microarray Data.

    PubMed

    Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha

    2016-01-01

    Background/Aim. Evaluation of the success of dose prediction based on genetic or clinical data has advanced substantially in recent years. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods. Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results. The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from the raw datasets. Overall, the best prediction performance in nine of 11 datasets was achieved using SVR; the second most accurate performance was provided by a gradient-boosting machine (GBM). Conclusion. Analysis of various dose values based on microarray gene expression data identified common genes found in our study and the referenced studies. According to our findings, SVR and GBM can be good predictors of dose-gene datasets. Another result of the study was the identification of a sample size of n = 25 as a cutoff point for RT bagging to outperform a single RT.
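
    A hedged sketch of such a pipeline is shown below: a simple correlation-based gene screen (standing in for iterative sure independence screening) on the training split, followed by support vector regression of the dose value; the data are synthetic and the screening rule and parameter choices are assumptions.

        # Sketch of a dose-prediction pipeline on synthetic data: correlation-based
        # gene screening (a stand-in for iterative sure independence screening)
        # followed by support vector regression.
        import numpy as np
        from sklearn.svm import SVR
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import r2_score

        rng = np.random.default_rng(0)
        n_samples, n_genes = 120, 2000
        X = rng.standard_normal((n_samples, n_genes))
        dose = X[:, :5].sum(axis=1) + 0.5 * rng.standard_normal(n_samples)   # 5 informative genes

        X_tr, X_te, y_tr, y_te = train_test_split(X, dose, random_state=0)

        # Screening: keep the genes most correlated with the dose in the training set.
        corr = np.abs([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(n_genes)])
        keep = np.argsort(corr)[-25:]

        svr = SVR(kernel="rbf", C=10.0).fit(X_tr[:, keep], y_tr)
        print("held-out R^2:", round(r2_score(y_te, svr.predict(X_te[:, keep])), 3))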

  12. Piloting the European Unified Patient Identity Management (EUPID) Concept to Facilitate Secondary Use of Neuroblastoma Data from Clinical Trials and Biobanking.

    PubMed

    Ebner, Hubert; Hayn, Dieter; Falgenhauer, Markus; Nitzlnader, Michael; Schleiermacher, Gudrun; Haupt, Riccardo; Erminio, Giovanni; Defferrari, Raffaella; Mazzocco, Katia; Kohler, Jan; Tonini, Gian Paolo; Ladenstein, Ruth; Schreier, Guenter

    2016-01-01

    Data from two contexts, i.e. the European Unresectable Neuroblastoma (EUNB) clinical trial and results from comparative genomic hybridisation (CGH) analyses from corresponding tumour samples shall be provided to existing repositories for secondary use. Utilizing the European Unified Patient IDentity Management (EUPID) as developed in the course of the ENCCA project, the following processes were applied to the data: standardization (providing interoperability), pseudonymization (generating distinct but linkable pseudonyms for both contexts), and linking both data sources. The applied procedures resulted in a joined dataset that did not contain any identifiers that would allow to backtrack the records to either data sources. This provided a high degree of privacy to the involved patients as required by data protection regulations, without preventing proper analysis.

  13. Data Mining Techniques Applied to Hydrogen Lactose Breath Test.

    PubMed

    Rubio-Escudero, Cristina; Valverde-Fernández, Justo; Nepomuceno-Chamorro, Isabel; Pontes-Balanza, Beatriz; Hernández-Mendoza, Yoedusvany; Rodríguez-Herrera, Alfonso

    2017-01-01

    The aim was to analyze hydrogen breath test data with data mining tools and to identify new patterns of H2 production. K-means clustering was applied as the data mining technique to a dataset of hydrogen breath tests from 2571 patients. Six different patterns were extracted upon analysis of the hydrogen breath test data. We have also shown the relevance of each of the samples taken throughout the test. Analysis of the hydrogen breath test data sets using data mining techniques has identified new patterns of hydrogen generation upon lactose absorption, and it illustrates the potential of applying data mining techniques to clinical data sets. These results offer promising data for future research on the relation between gut-microbiota-produced hydrogen and clinical symptoms.
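
    A minimal sketch of the clustering step, assuming scikit-learn and a matrix in which each row holds one patient's H2 readings over the test's sampling times (the values and the number of time points are synthetic placeholders):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    h2 = rng.gamma(shape=2.0, scale=10.0, size=(2571, 7))   # 2571 patients x 7 breath samples

    X = StandardScaler().fit_transform(h2)                  # put time points on a common scale
    km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)

    # One candidate "pattern" per cluster: the mean H2 curve of its members.
    for k in range(km.n_clusters):
        print("pattern", k, np.round(h2[km.labels_ == k].mean(axis=0), 1))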

  14. SPICE for ESA Planetary Missions

    NASA Astrophysics Data System (ADS)

    Costa, M.

    2018-04-01

    The ESA SPICE Service leads the SPICE operations for ESA missions and is responsible for the generation of the SPICE Kernel Dataset for ESA missions. This contribution will describe the status of these datasets and outline the future developments.

  15. Boosting association rule mining in large datasets via Gibbs sampling.

    PubMed

    Qian, Guoqi; Rao, Calyampudi Radhakrishna; Sun, Xiaoying; Wu, Yuehua

    2016-05-03

    Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and we perform rule mining from the reduced transaction dataset generated by the sample. A general rule importance measure is also proposed to direct the stochastic search so that, because the randomly generated association rules constitute an ergodic Markov chain, the most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In a simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.

  16. CometQuest: A Rosetta Adventure

    NASA Technical Reports Server (NTRS)

    Leon, Nancy J.; Fisher, Diane K.; Novati, Alexander; Chmielewski, Artur B.; Fitzpatrick, Austin J.; Angrum, Andrea

    2012-01-01

  17. Tiled WMS/KML Server V2

    NASA Technical Reports Server (NTRS)

    Plesea, Lucian

    2012-01-01

    This software is a higher-performance implementation of tiled WMS, with integral support for KML and time-varying data. It is compliant with the Open Geospatial WMS standard and supports KML natively as a WMS return type, including support for the time attribute. Regionated KML wrappers are generated that match the existing tiled WMS dataset. PNG and JPG formats are supported, and the software is implemented as an Apache 2.0 module that supports a threading execution model capable of very high request rates. The module intercepts and responds to WMS requests that match certain patterns and returns the existing tiles. If a KML format that matches an existing pyramid and tile dataset is requested, regionated KML is generated and returned to the requesting application. In addition, KML requests that do not match the existing tile datasets generate a KML response that includes the corresponding JPG WMS request, effectively adding KML support to a backing WMS server.

  18. Parsing clinical text: how good are the state-of-the-art parsers?

    PubMed

    Jiang, Min; Huang, Yang; Fan, Jung-wei; Tang, Buzhou; Denny, Josh; Xu, Hua

    2015-01-01

    Parsing, which generates a syntactic structure of a sentence (a parse tree), is a critical component of natural language processing (NLP) research in any domain, including medicine. Although parsers developed in the general English domain, such as the Stanford parser, have been applied to clinical text, there are no formal evaluations and comparisons of their performance in the medical domain. In this study, we investigated the performance of three state-of-the-art parsers: the Stanford parser, the Bikel parser, and the Charniak parser, using the following two datasets: (1) a Treebank containing 1,100 sentences that were randomly selected from progress notes used in the 2010 i2b2 NLP challenge and manually annotated according to a Penn Treebank-based guideline; and (2) the MiPACQ Treebank, which was developed from pathology and clinical notes and contains 13,091 sentences. We conducted three experiments on both datasets. First, we measured the performance of the three state-of-the-art parsers on the clinical Treebanks with their default settings. Then we re-trained the parsers using the clinical Treebanks and evaluated their performance using 10-fold cross-validation. Finally, we re-trained the parsers by combining the clinical Treebanks with the Penn Treebank. Our results showed that the original parsers achieved lower performance on clinical text (Bracketing F-measure in the range of 66.6%-70.3%) compared to general English text. After retraining on the clinical Treebanks, all parsers achieved better performance, with the Stanford parser performing best, reaching a Bracketing F-measure of 73.68% on progress notes and 83.72% on the MiPACQ corpus using 10-fold cross-validation. When the combined clinical Treebanks and Penn Treebank were used, the Charniak parser achieved the highest Bracketing F-measure of 73.53% on progress notes and the Stanford parser reached the highest F-measure of 84.15% on the MiPACQ corpus. Our study demonstrates that re-training using clinical Treebanks is critical for improving general English parsers' performance on clinical text, and that combining clinical and open-domain corpora might achieve optimal performance for parsing clinical text.
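
    For reference, the Bracketing F-measure used above compares the labelled constituent spans of a predicted parse against the gold tree; a minimal sketch with illustrative spans follows.

    def bracketing_f1(gold_spans, pred_spans):
        """gold_spans / pred_spans: collections of (label, start, end) constituents for one sentence."""
        gold, pred = set(gold_spans), set(pred_spans)
        matched = len(gold & pred)
        precision = matched / len(pred) if pred else 0.0
        recall = matched / len(gold) if gold else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)]
    pred = [("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)]
    print(round(bracketing_f1(gold, pred), 3))   # 0.667: two of three brackets match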

  19. Publishing descriptions of non-public clinical datasets: proposed guidance for researchers, repositories, editors and funding organisations.

    PubMed

    Hrynaszkiewicz, Iain; Khodiyar, Varsha; Hufton, Andrew L; Sansone, Susanna-Assunta

    2016-01-01

    Sharing of experimental clinical research data usually happens between individuals or research groups rather than via public repositories, in part due to the need to protect research participant privacy. This approach to data sharing makes it difficult to connect journal articles with their underlying datasets and is often insufficient for ensuring access to data in the long term. Voluntary data sharing services such as the Yale Open Data Access (YODA) and Clinical Study Data Request (CSDR) projects have increased the accessibility of clinical datasets for secondary uses while protecting patient privacy and the legitimacy of secondary analyses, but these resources are generally disconnected from journal articles, where researchers typically search for reliable information to inform future research. New scholarly journal and article types dedicated to increasing the accessibility of research data have emerged in recent years and, in general, journals are developing stronger links with data repositories. There is a need for increased collaboration between journals, data repositories, researchers, funders, and voluntary data sharing services to increase the visibility and reliability of clinical research. Using the journal Scientific Data as a case study, we propose and show examples of changes to the format and peer-review process for journal articles to more robustly link them to data that are only available on request. We also propose additional features for data repositories to better accommodate non-public clinical datasets, including Data Use Agreements (DUAs).

  20. A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources

    PubMed Central

    Moon, Sungrim; Pakhomov, Serguei; Liu, Nathan; Ryan, James O; Melton, Genevieve B

    2014-01-01

    Objective To create a sense inventory of abbreviations and acronyms from clinical texts. Methods The most frequently occurring abbreviations and acronyms from 352 267 dictated clinical notes were used to create a clinical sense inventory. Senses of each abbreviation and acronym were manually annotated from 500 random instances and lexically matched with long forms within the Unified Medical Language System (UMLS V.2011AB), Another Database of Abbreviations in Medline (ADAM), and Stedman's Dictionary, Medical Abbreviations, Acronyms & Symbols, 4th edition (Stedman's). Redundant long forms were merged after they were lexically normalized using Lexical Variant Generation (LVG). Results The clinical sense inventory was found to have skewed sense distributions, practice-specific senses, and incorrect uses. Of 440 abbreviations and acronyms analyzed in this study, 949 long forms were identified in clinical notes. This set was mapped to 17 359, 5233, and 4879 long forms in UMLS, ADAM, and Stedman's, respectively. After merging long forms, only 2.3% matched across all medical resources. The UMLS, ADAM, and Stedman's covered 5.7%, 8.4%, and 11% of the merged clinical long forms, respectively. The sense inventory of clinical abbreviations and acronyms and anonymized datasets generated from this study are available for public use at http://www.bmhi.umn.edu/ihi/research/nlpie/resources/index.htm (‘Sense Inventories’, website). Conclusions Clinical sense inventories of abbreviations and acronyms created using clinical notes and medical dictionary resources demonstrate challenges with term coverage and resource integration. Further work is needed to help with standardizing abbreviations and acronyms in clinical care and biomedicine to facilitate automated processes such as text-mining and information extraction. PMID:23813539

  1. NOTE: A feasibility study of markerless fluoroscopic gating for lung cancer radiotherapy using 4DCT templates

    NASA Astrophysics Data System (ADS)

    Li, Ruijiang; Lewis, John H.; Cerviño, Laura I.; Jiang, Steve B.

    2009-10-01

    A major difficulty in conformal lung cancer radiotherapy is respiratory organ motion, which may cause clinically significant targeting errors. Respiratory-gated radiotherapy allows for more precise delivery of the prescribed radiation dose to the tumor, while minimizing normal tissue complications. Gating based on external surrogates is limited by its lack of accuracy, while gating based on implanted fiducial markers is limited primarily by the risk of pneumothorax due to marker implantation. Techniques for fluoroscopic gating without implanted fiducial markers (markerless gating) have been developed. These techniques usually require a training fluoroscopic image dataset with marked tumor positions in the images, which limits their clinical implementation. To remove this requirement, this study presents a markerless fluoroscopic gating algorithm based on 4DCT templates. To generate gating signals, we explored the application of three similarity measures or scores between fluoroscopic images and the reference 4DCT template: un-normalized cross-correlation (CC), normalized cross-correlation (NCC) and normalized mutual information (NMI), as well as the average intensity (AI) of the region of interest (ROI) in the fluoroscopic images. Performance was evaluated using fluoroscopic and 4DCT data from three lung cancer patients. On average, gating based on CC achieved the highest treatment accuracy for a given efficiency, with high target coverage (average between 91.9% and 98.6%) over a wide range of nominal duty cycles (20-50%). AI worked well for two of the three patients but failed for the third patient due to interference from the heart. Gating based on NCC and NMI usually failed below a 50% nominal duty cycle. Based on this preliminary study with three patients, we found that the proposed CC-based gating algorithm can generate accurate and robust gating signals when using a 4DCT reference template. However, this observation is based on results obtained from a very limited dataset, and further investigation in a larger patient population is needed before clinical implementation.
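
    One of the similarity scores explored above, normalized cross-correlation (NCC) over the ROI, can be sketched as follows; the arrays and the threshold are synthetic stand-ins, not clinical data.

    import numpy as np

    def ncc(frame_roi, template):
        """Normalized cross-correlation between a fluoroscopic ROI and the 4DCT reference template."""
        a = frame_roi - frame_roi.mean()
        b = template - template.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return float((a * b).sum() / denom) if denom > 0 else 0.0

    rng = np.random.default_rng(2)
    template = rng.normal(size=(64, 64))                    # reference-phase template image
    frame = template + 0.3 * rng.normal(size=(64, 64))      # noisy frame near the reference phase

    score = ncc(frame, template)
    beam_on = score > 0.8                                   # the threshold sets the nominal duty cycle
    print(round(score, 3), beam_on)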

  2. Harvard Aging Brain Study: Dataset and accessibility.

    PubMed

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G; Chatwal, Jasmeer P; Papp, Kathryn V; Amariglio, Rebecca E; Blacker, Deborah; Rentz, Dorene M; Johnson, Keith A; Sperling, Reisa A; Schultz, Aaron P

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging. To promote more extensive analyses, imaging data was designed to be compatible with other publicly available datasets. A cloud-based system enables access to interested researchers with blinded data available contingent upon completion of a data usage agreement and administrative approval. Data collection is ongoing and currently in its fifth year. Copyright © 2015 Elsevier Inc. All rights reserved.

  3. Brain-CODE: A Secure Neuroinformatics Platform for Management, Federation, Sharing and Analysis of Multi-Dimensional Neuroscience Data.

    PubMed

    Vaccarino, Anthony L; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M; Stuss, Donald T; Theriault, Elizabeth; Evans, Kenneth R

    2018-01-01

    Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute's "Brain-CODE" is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care.

  4. Brain-CODE: A Secure Neuroinformatics Platform for Management, Federation, Sharing and Analysis of Multi-Dimensional Neuroscience Data

    PubMed Central

    Vaccarino, Anthony L.; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R.; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G.; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F. Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M.; Stuss, Donald T.; Theriault, Elizabeth; Evans, Kenneth R.

    2018-01-01

    Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute’s “Brain-CODE” is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care. PMID:29875648

  5. Consolidating drug data on a global scale using Linked Data.

    PubMed

    Jovanovik, Milos; Trajanov, Dimitar

    2017-01-21

    Drug product data is available on the Web in a distributed fashion. The reasons lie within the regulatory domains, which exist on a national level. As a consequence, the drug data available on the Web are independently curated by national institutions from each country, leaving the data in varying languages, with a varying structure, granularity level and format, on different locations on the Web. Therefore, one of the main challenges in the realm of drug data is the consolidation and integration of large amounts of heterogeneous data into a comprehensive dataspace, for the purpose of developing data-driven applications. In recent years, the adoption of the Linked Data principles has enabled data publishers to provide structured data on the Web and contextually interlink them with other public datasets, effectively de-siloing them. Defining methodological guidelines and specialized tools for generating Linked Data in the drug domain, applicable on a global scale, is a crucial step to achieving the necessary levels of data consolidation and alignment needed for the development of a global dataset of drug product data. This dataset would then enable a myriad of new usage scenarios, which can, for instance, provide insight into the global availability of different drug categories in different parts of the world. We developed a methodology and a set of tools which support the process of generating Linked Data in the drug domain. Using them, we generated the LinkedDrugs dataset by seamlessly transforming, consolidating and publishing high-quality, 5-star Linked Drug Data from twenty-three countries, containing over 248,000 drug products, over 99,000,000 RDF triples and over 278,000 links to generic drugs from the LOD Cloud. Using the linked nature of the dataset, we demonstrate its ability to support advanced usage scenarios in the drug domain. The process of generating the LinkedDrugs dataset demonstrates the applicability of the methodological guidelines and the supporting tools in transforming drug product data from various, independent and distributed sources, into a comprehensive Linked Drug Data dataset. The presented user-centric and analytical usage scenarios over the dataset show the advantages of having a de-siloed, consolidated and comprehensive dataspace of drug data available via the existing infrastructure of the Web.

  6. In vitro downregulated hypoxia transcriptome is associated with poor prognosis in breast cancer.

    PubMed

    Abu-Jamous, Basel; Buffa, Francesca M; Harris, Adrian L; Nandi, Asoke K

    2017-06-15

    Hypoxia is a characteristic of breast tumours indicating poor prognosis. Based on the assumption that genes which are up-regulated under hypoxia in cell lines are expected to be predictors of poor prognosis in clinical data, many signatures of poor prognosis have been identified. However, it has been observed that cell line data do not always concur with clinical data, and therefore conclusions from cell line analysis should be considered with caution. As many transcriptomic cell-line datasets from hypoxia-related contexts are available, integrative approaches which investigate these datasets collectively, while not ignoring clinical data, are required. We analyse sixteen heterogeneous breast cancer cell-line transcriptomic datasets in hypoxia-related conditions collectively by employing the unique capabilities of the method UNCLES, which integrates clustering results from multiple datasets and can address questions that cannot be answered by existing methods; this has been demonstrated by comparison with the state-of-the-art iCluster method. From this collection of genome-wide datasets, which includes 15,588 genes, UNCLES identified a relatively high number of genes (>1000 overall) which are consistently co-regulated across all of the datasets, some of which are still poorly understood and represent new potential HIF targets, such as RSBN1 and KIAA0195. Two main, anti-correlated, clusters were identified; the first is enriched with MYC targets participating in growth and proliferation, while the other is enriched with HIF targets directly participating in the hypoxia response. Surprisingly, in six clinical datasets, some sub-clusters of growth genes are found to be consistently positively correlated with hypoxia response genes, unlike the observation in cell lines. Moreover, the ability to predict poor prognosis by a combined signature of one sub-cluster of growth genes and one sub-cluster of hypoxia-induced genes appears to be comparable to, and perhaps greater than, that of known hypoxia signatures. We present a clustering approach suitable for integrating data from diverse experimental set-ups. Its application to breast cancer cell line datasets reveals new hypoxia-regulated signatures of genes which behave differently when in vitro (cell-line) data are compared with in vivo (clinical) data, and which are of a prognostic value comparable to or exceeding that of the state-of-the-art hypoxia signatures.

  7. 3D/2D model-to-image registration by imitation learning for cardiac procedures.

    PubMed

    Toth, Daniel; Miao, Shun; Kurzendorfer, Tanja; Rinaldi, Christopher A; Liao, Rui; Mansi, Tommaso; Rhode, Kawal; Mountney, Peter

    2018-05-12

    In cardiac interventions, such as cardiac resynchronization therapy (CRT), image guidance can be enhanced by involving preoperative models. Multimodality 3D/2D registration for image guidance, however, remains a significant research challenge for fundamentally different image data, i.e., MR to X-ray. Registration methods must account for differences in intensity, contrast levels, resolution, dimensionality, and field of view. Furthermore, the same anatomical structures may not be visible in both modalities. Current approaches have focused on developing modality-specific solutions for individual clinical use cases, by introducing constraints, or by identifying cross-modality information manually. Machine learning approaches have the potential to create more general registration platforms. However, training image-to-image methods would require large multimodal datasets and ground truth for each target application. This paper proposes a model-to-image registration approach instead, because it is common in image-guided interventions to create anatomical models for diagnosis, planning or guidance prior to procedures. An imitation learning-based method, trained on 702 datasets, is used to register preoperative models to intraoperative X-ray images. Accuracy is demonstrated on cardiac models and artificial X-rays generated from CTs. The registration error was [Formula: see text] on 1000 test cases, superior to that of manual ([Formula: see text]) and gradient-based ([Formula: see text]) registration. High robustness is shown in 19 clinical CRT cases. Besides the proposed method's feasibility in a clinical environment, the evaluation has shown good accuracy and high robustness, indicating that it could be applied in image-guided interventions.

  8. Design of an Evolutionary Approach for Intrusion Detection

    PubMed Central

    2013-01-01

    A novel evolutionary approach is proposed for effective intrusion detection based on benchmark datasets. The proposed approach can generate a pool of noninferior individual solutions and ensemble solutions thereof, and the generated ensembles can be used to detect intrusions accurately. For the intrusion detection problem, the proposed approach can consider conflicting objectives simultaneously, such as the detection rate of each attack class, error rate, accuracy, and diversity, and it generates a pool of noninferior solutions and ensembles thereof with optimized trade-off values of the multiple conflicting objectives. A three-phase approach is proposed. In the first phase, solutions based on a simple chromosome design are generated and a Pareto front of noninferior individual solutions is approximated. In the second phase, the entire solution set is further refined to determine effective ensemble solutions that account for solution interaction; in this phase, an improved Pareto front of ensemble solutions, beyond that of the individual solutions, is approximated. The ensemble solutions in the improved Pareto front yielded improved detection results on the benchmark intrusion detection datasets. In the third phase, a combination method such as majority voting is used to fuse the predictions of the individual solutions to determine the prediction of an ensemble solution. Benchmark datasets, namely the KDD Cup 1999 and ISCX 2012 datasets, are used to demonstrate and validate the performance of the proposed approach for intrusion detection. The proposed approach can discover individual solutions and ensemble solutions thereof with good support and detection rates on the benchmark datasets (in comparison with well-known ensemble methods such as bagging and boosting). In addition, the proposed approach is a generalized classification approach applicable to problems in any field that involve multiple conflicting objectives and whose datasets can be represented as labelled instances in terms of their features. PMID:24376390

  9. Using Copula Distributions to Support More Accurate Imaging-Based Diagnostic Classifiers for Neuropsychiatric Disorders

    PubMed Central

    Bansal, Ravi; Hao, Xuejun; Liu, Jun; Peterson, Bradley S.

    2014-01-01

    Many investigators have tried to apply machine learning techniques to magnetic resonance images (MRIs) of the brain in order to diagnose neuropsychiatric disorders. Usually the number of brain imaging measures (such as measures of cortical thickness and measures of local surface morphology) derived from the MRIs (i.e., their dimensionality) has been large (e.g. >10) relative to the number of participants who provide the MRI data (<100). Sparse data in a high dimensional space increases the variability of the classification rules that machine learning algorithms generate, thereby limiting the validity, reproducibility, and generalizability of those classifiers. The accuracy and stability of the classifiers can improve significantly if the multivariate distributions of the imaging measures can be estimated accurately. To accurately estimate the multivariate distributions using sparse data, we propose to estimate first the univariate distributions of imaging data and then combine them using a Copula to generate more accurate estimates of their multivariate distributions. We then sample the estimated Copula distributions to generate dense sets of imaging measures and use those measures to train classifiers. We hypothesize that the dense sets of brain imaging measures will generate classifiers that are stable to variations in brain imaging measures, thereby improving the reproducibility, validity, and generalizability of diagnostic classification algorithms in imaging datasets from clinical populations. In our experiments, we used both computer-generated and real-world brain imaging datasets to assess the accuracy of multivariate Copula distributions in estimating the corresponding multivariate distributions of real-world imaging data. Our experiments showed that diagnostic classifiers generated using imaging measures sampled from the Copula were significantly more accurate and more reproducible than were the classifiers generated using either the real-world imaging measures or their multivariate Gaussian distributions. Thus, our findings demonstrate that estimated multivariate Copula distributions can generate dense sets of brain imaging measures that can in turn be used to train classifiers, and those classifiers are significantly more accurate and more reproducible than are those generated using real-world imaging measures alone. PMID:25093634
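
    A minimal sketch of the copula idea, assuming SciPy/NumPy, normal marginal fits and a Gaussian copula; the marginals, sample sizes and correlation structure are illustrative assumptions rather than the authors' choices.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, d = 40, 3
    cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
    data = rng.multivariate_normal(np.zeros(d), cov, size=n)      # sparse "imaging measures"

    # 1) Univariate marginals (here: a normal fit per measure).
    marginals = [stats.norm(*stats.norm.fit(data[:, j])) for j in range(d)]

    # 2) Copula correlation estimated from the normal scores of the empirical ranks.
    u = (stats.rankdata(data, axis=0) - 0.5) / n
    R = np.corrcoef(stats.norm.ppf(u), rowvar=False)

    # 3) Sample densely from the Gaussian copula and map back through the marginal quantiles.
    z_new = rng.multivariate_normal(np.zeros(d), R, size=5000)
    u_new = stats.norm.cdf(z_new)
    dense = np.column_stack([marginals[j].ppf(u_new[:, j]) for j in range(d)])
    print(dense.shape)    # (5000, 3) synthetic measures available for classifier training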

  10. A multi-strategy approach to informative gene identification from gene expression data.

    PubMed

    Liu, Ziying; Phan, Sieu; Famili, Fazel; Pan, Youlian; Lenferink, Anne E G; Cantin, Christiane; Collins, Catherine; O'Connor-McCourt, Maureen D

    2010-02-01

    An unsupervised multi-strategy approach has been developed to identify informative genes from high-throughput genomic data. Several statistical methods have been used in the field to identify differentially expressed genes. Since different methods generate different lists of genes, it is very challenging to determine the most reliable gene list and the appropriate method. This paper presents a multi-strategy method, in which a combination of several data analysis techniques is applied to a given dataset and a confidence measure is established to select genes from the gene lists generated by these techniques to form the core of our final selection. The remaining genes, which form the peripheral region, are subject to exclusion from or inclusion into the final selection. This paper demonstrates this methodology through its application to an in-house cancer genomics dataset and a public dataset. The results indicate that our method provides a more reliable list of genes, which are validated using biological knowledge, biological experiments, and a literature search. We further evaluated our multi-strategy method by consolidating two pairs of independent datasets, each pair covering the same disease but generated by different labs using different platforms. The results showed that our method produced far better results.

  11. Integrative sparse principal component analysis of gene expression data.

    PubMed

    Liu, Mengque; Fan, Xinyan; Fang, Kuangnan; Zhang, Qingzhao; Ma, Shuangge

    2017-12-01

    In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance. © 2017 WILEY PERIODICALS, INC.

  12. A dataset of multiresolution functional brain parcellations in an elderly population with no or mild cognitive impairment.

    PubMed

    Tam, Angela; Dansereau, Christian; Badhwar, AmanPreet; Orban, Pierre; Belleville, Sylvie; Chertkow, Howard; Dagher, Alain; Hanganu, Alexandru; Monchi, Oury; Rosa-Neto, Pedro; Shmuel, Amir; Breitner, John; Bellec, Pierre

    2016-12-01

    We present group brain parcellations at eight resolutions, with clusters generated from resting-state functional magnetic resonance images of 99 cognitively normal elderly persons and 129 patients with mild cognitive impairment, pooled from four independent datasets. This dataset was generated as part of the following study: Common Effects of Amnestic Mild Cognitive Impairment on Resting-State Connectivity Across Four Independent Studies (Tam et al., 2015) [1]. The brain parcellations have been registered to both symmetric and asymmetric MNI brain templates and were generated using a method called bootstrap analysis of stable clusters (BASC) (Bellec et al., 2010) [2]. We present two variants of these parcellations. One variant contains bihemispheric parcels (4, 6, 12, 22, 33, 65, 111, and 208 total parcels across eight resolutions). The second variant contains spatially connected regions of interest (ROIs) that span only one hemisphere (10, 17, 30, 51, 77, 199, and 322 total ROIs across eight resolutions). We also present maps illustrating functional connectivity differences between patients and controls for four regions of interest (striatum, dorsal prefrontal cortex, middle temporal lobe, and medial frontal cortex). The brain parcels and associated statistical maps have been publicly released as 3D volumes, available in .mnc and .nii file formats on figshare and on Neurovault. Finally, the code used to generate this dataset is available on Github.

  13. Development and application of GIS-based PRISM integration through a plugin approach

    NASA Astrophysics Data System (ADS)

    Lee, Woo-Seop; Chun, Jong Ahn; Kang, Kwangmin

    2014-05-01

    A PRISM (Parameter-elevation Regressions on Independent Slopes Model) QGIS-plugin was developed on the Quantum GIS platform in this study. The plugin provides user-friendly graphical user interfaces (GUIs) so that users can obtain gridded meteorological data at high resolution (1 km × 1 km). The software is also designed to run on a personal computer, so it does not require internet access or a sophisticated computer system, and it lets a user generate PRISM data with ease. The proposed PRISM QGIS-plugin is a hybrid statistical-geographic model system that uses coarse-resolution datasets (APHRODITE datasets in this study) together with digital elevation data to generate fine-resolution gridded precipitation. To validate the performance of the software, the Prek Thnot River Basin in Kandal, Cambodia, was selected for application. Overall, the statistical analysis shows promising outputs generated by the proposed plugin. Error measures such as RMSE (Root Mean Square Error) and MAPE (Mean Absolute Percentage Error) were used to evaluate the performance of the developed PRISM QGIS-plugin; the evaluation results were 2.76 mm for RMSE and 4.2% for MAPE. This study suggests that the plugin can be used to generate high-resolution precipitation datasets for hydrological and climatological studies in watersheds where observed weather data are limited.
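
    For reference, the two error measures quoted above can be computed as follows; the observed and downscaled precipitation values are made up for illustration.

    import numpy as np

    def rmse(obs, pred):
        obs, pred = np.asarray(obs, float), np.asarray(pred, float)
        return float(np.sqrt(np.mean((obs - pred) ** 2)))

    def mape(obs, pred):
        obs, pred = np.asarray(obs, float), np.asarray(pred, float)
        return float(100.0 * np.mean(np.abs((obs - pred) / obs)))

    obs = [12.0, 30.5, 8.2, 45.1]       # observed precipitation (mm), illustrative values
    pred = [10.8, 32.0, 9.0, 43.0]      # downscaled estimates (mm), illustrative values
    print(round(rmse(obs, pred), 2), "mm RMSE;", round(mape(obs, pred), 1), "% MAPE")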

  14. Progeny Clustering: A Method to Identify Biological Phenotypes

    PubMed Central

    Hu, Chenyue W.; Kornblau, Steven M.; Slater, John H.; Qutub, Amina A.

    2015-01-01

    Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient computationally, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown to be successful and robust when applied to two synthetic datasets (a two-dimensional dataset and a ten-dimensional dataset containing eight dimensions of pure noise), two standard biological datasets (the Iris and Rat CNS datasets) and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476

  15. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses

    PubMed Central

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi

    2018-01-01

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools, index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625

  16. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses.

    PubMed

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V; Ma'ayan, Avi

    2018-02-27

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools, index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated 'canned' analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools.

  17. Comparing species tree estimation with large anchored phylogenomic and small Sanger-sequenced molecular datasets: an empirical study on Malagasy pseudoxyrhophiine snakes.

    PubMed

    Ruane, Sara; Raxworthy, Christopher J; Lemmon, Alan R; Lemmon, Emily Moriarty; Burbrink, Frank T

    2015-10-12

    Using molecular data generated by high-throughput next-generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86% of Madagascar's serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of the resulting tree topologies from the different datasets using Robinson-Foulds distances. In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST, to determine whether a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species-tree approaches. We also examined the individual gene trees in comparison to the 377-locus species tree using the program MetaTree. Using the full anchored dataset under a variety of methods gave us the same, well-supported phylogeny for pseudoxyrhophiines. The African pseudoxyrhophiine Duberria is the sister taxon to the Malagasy pseudoxyrhophiine genera, providing evidence for a monophyletic radiation in Madagascar. In addition, within Madagascar, the two major clades inferred correspond largely to the aglyphous and opisthoglyphous genera, suggesting that feeding specializations associated with tooth venom delivery may have played a major role in the early diversification of this radiation. The comparison of tree topologies from the concatenated and species-tree methods using different datasets indicated that the 5-locus dataset cannot be used to infer a correct phylogeny for the pseudoxyrhophiines under any method tested here and that summary statistics methods require 50 or more loci to consistently recover the species tree inferred using the complete anchored dataset. However, as few as 15 loci may infer the correct topology when using the full coalescent species tree method *BEAST. MetaTree analyses of each gene tree from the Sanger and anchored datasets found that none of the individual gene trees matched the 377-locus species tree, and that no gene trees were identical with respect to topology. Our results suggest that ≥50 loci may be necessary to confidently infer phylogenies when using summary species-tree methods, but that the coalescent-based method *BEAST consistently recovers the same topology using only 15 loci. These results reinforce that datasets with small numbers of markers may result in misleading topologies, and further, that the method of inference used to generate a phylogeny also has a major influence on the number of loci necessary to infer robust species trees.

  18. Comparison of classification methods for voxel-based prediction of acute ischemic stroke outcome following intra-arterial intervention

    NASA Astrophysics Data System (ADS)

    Winder, Anthony J.; Siemonsen, Susanne; Flottmann, Fabian; Fiehler, Jens; Forkert, Nils D.

    2017-03-01

    Voxel-based tissue outcome prediction in acute ischemic stroke patients is highly relevant for both clinical routine and research. Previous research has shown that features extracted from baseline multi-parametric MRI datasets have a high predictive value and can be used for the training of classifiers, which can generate tissue outcome predictions for both intravenous and conservative treatments. However, with the recent advent and popularization of intra-arterial thrombectomy treatment, novel research specifically addressing the utility of predictive classifiers for thrombectomy intervention is necessary for a holistic understanding of current stroke treatment options. The aim of this work was to develop three clinically viable tissue outcome prediction models using approximate nearest-neighbor, generalized linear model, and random decision forest approaches and to evaluate the accuracy of predicting tissue outcome after intra-arterial treatment. Therefore, the three machine learning models were trained, evaluated, and compared using datasets of 42 acute ischemic stroke patients treated with intra-arterial thrombectomy. Classifier training utilized eight voxel-based features extracted from baseline MRI datasets and five global features. Evaluation of classifier-based predictions was performed via comparison to the known tissue outcome, which was determined in follow-up imaging, using the Dice coefficient and leave-one-patient-out cross-validation. The random decision forest prediction model led to the best tissue outcome predictions with a mean Dice coefficient of 0.37. The approximate nearest-neighbor and generalized linear model performed equally suboptimally, with average Dice coefficients of 0.28 and 0.27, respectively, suggesting that both non-linearity and machine learning are desirable properties of a classifier well-suited to the intra-arterial tissue outcome prediction problem.
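
    A minimal sketch of the evaluation scheme described above (Dice coefficient with leave-one-patient-out cross-validation around a random decision forest), assuming scikit-learn; the voxel features and labels are synthetic placeholders, not the study's MRI-derived features.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def dice(pred_mask, true_mask):
        inter = np.logical_and(pred_mask, true_mask).sum()
        denom = pred_mask.sum() + true_mask.sum()
        return 2.0 * inter / denom if denom > 0 else 1.0

    rng = np.random.default_rng(4)
    patients = []
    for _ in range(6):                                 # six toy "patients"
        X = rng.normal(size=(500, 8))                  # 500 voxels x 8 features each
        y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)) > 1.0
        patients.append((X, y))

    scores = []
    for i, (X_test, y_test) in enumerate(patients):    # leave one patient out
        X_train = np.vstack([p[0] for j, p in enumerate(patients) if j != i])
        y_train = np.concatenate([p[1] for j, p in enumerate(patients) if j != i])
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
        scores.append(dice(clf.predict(X_test), y_test))
    print("mean Dice:", round(float(np.mean(scores)), 3))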

  19. RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets.

    PubMed

    Scheuch, Matthias; Höper, Dirk; Beer, Martin

    2015-03-03

    Fuelled by the advent and subsequent development of next generation sequencing technologies, metagenomics became a powerful tool for the analysis of microbial communities both scientifically and diagnostically. The biggest challenge is the extraction of relevant information from the huge sequence datasets generated for metagenomics studies. Although a plethora of tools are available, data analysis is still a bottleneck. To overcome the bottleneck of data analysis, we developed an automated computational workflow called RIEMS - Reliable Information Extraction from Metagenomic Sequence datasets. RIEMS assigns every individual read sequence within a dataset taxonomically by cascading different sequence analyses with decreasing stringency of the assignments using various software applications. After completion of the analyses, the results are summarised in a clearly structured result protocol organised taxonomically. The high accuracy and performance of RIEMS analyses were proven in comparison with other tools for metagenomics data analysis using simulated sequencing read datasets. RIEMS has the potential to fill the gap that still exists with regard to data analysis for metagenomics studies. The usefulness and power of RIEMS for the analysis of genuine sequencing datasets was demonstrated with an early version of RIEMS in 2011 when it was used to detect the orthobunyavirus sequences leading to the discovery of Schmallenberg virus.

  20. Data Sharing and the Development of the Cleveland Clinic Statistical Education Dataset Repository

    ERIC Educational Resources Information Center

    Nowacki, Amy S.

    2013-01-01

    Examples are highly sought by both students and teachers. This is particularly true as many statistical instructors aim to engage their students and increase active participation. While simulated datasets are functional, they lack real perspective and the intricacies of actual data. In order to obtain real datasets, the principal investigator of a…

  1. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

    PubMed Central

    Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila

    2018-01-01

    The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374
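
    For reference, the (non-inferred) Normalized Discounted Cumulative Gain underlying the ranking metric above can be sketched as follows, with illustrative relevance grades.

    import numpy as np

    def dcg(relevances):
        relevances = np.asarray(relevances, dtype=float)
        ranks = np.arange(1, len(relevances) + 1)
        return float(np.sum((2 ** relevances - 1) / np.log2(ranks + 1)))

    def ndcg(relevances):
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    retrieved = [2, 0, 1, 2, 0]      # graded relevance of the datasets returned at ranks 1..5
    print(round(ndcg(retrieved), 3))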

  2. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures.

    PubMed

    Urbanowicz, Ryan J; Kiralis, Jeff; Sinnott-Armstrong, Nicholas A; Heberling, Tamra; Fisher, Jonathan M; Moore, Jason H

    2012-10-01

    Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. However, such methods are computationally expensive, difficult to adapt to multiple objectives, and unlikely to yield models with a precise form of epistasis which we refer to as pure and strict. Purely and strictly epistatic models constitute the worst-case in terms of detecting disease associations, since such associations may only be observed if all n-loci are included in the disease model. This makes them an attractive gold standard for simulation studies considering complex multi-locus effects. We introduce GAMETES, a user-friendly software package and algorithm which generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. GAMETES rapidly and precisely generates random, pure, strict n-locus models with specified genetic constraints. These constraints include heritability, minor allele frequencies of the SNPs, and population prevalence. GAMETES also includes a simple dataset simulation strategy which may be utilized to rapidly generate an archive of simulated datasets for given genetic models. We highlight the utility and limitations of GAMETES with an example simulation study using MDR, an algorithm designed to detect epistasis. GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures. While GAMETES has a limited ability to generate models with higher heritabilities, it is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms. In addition, the GAMETES modeling strategy may be flexibly combined with any dataset simulation strategy. Beyond dataset simulation, GAMETES could be employed to pursue theoretical characterization of genetic models and epistasis.
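
    As a generic illustration only (not the GAMETES algorithm itself), the sketch below simulates biallelic SNP genotypes with specified minor allele frequencies under Hardy-Weinberg proportions and assigns case/control status from a toy two-locus penetrance table.

    import numpy as np

    rng = np.random.default_rng(5)

    def simulate_genotypes(n_samples, mafs):
        """Genotypes coded 0/1/2 = copies of the minor allele, sampled under Hardy-Weinberg proportions."""
        return rng.binomial(2, np.asarray(mafs), size=(n_samples, len(mafs)))

    G = simulate_genotypes(1000, mafs=[0.2, 0.4])

    # Toy penetrance table indexed by (genotype of SNP1, genotype of SNP2); purely illustrative.
    penetrance = np.array([[0.05, 0.05, 0.30],
                           [0.05, 0.30, 0.05],
                           [0.30, 0.05, 0.05]])
    p_disease = penetrance[G[:, 0], G[:, 1]]
    status = rng.binomial(1, p_disease)
    print(G.shape, "simulated prevalence:", round(float(status.mean()), 3))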

  3. STILTS Plotting Tools

    NASA Astrophysics Data System (ADS)

    Taylor, M. B.

    2009-09-01

    The new plotting functionality in version 2.0 of STILTS is described. STILTS is a mature and powerful package for all kinds of table manipulation, and this version adds facilities for generating plots from one or more tables to its existing wide range of non-graphical capabilities. 2- and 3-dimensional scatter plots and 1-dimensional histograms may be generated using highly configurable style parameters. Features include multiple dataset overplotting, variable transparency, 1-, 2- or 3-dimensional symmetric or asymmetric error bars, higher-dimensional visualization using color, and textual point labeling. Vector and bitmapped output formats are supported. The plotting options provide enough flexibility to perform meaningful visualization on datasets from a few points up to tens of millions. Arbitrarily large datasets can be plotted without heavy memory usage.

  4. Datasets related to in-land water for limnology and remote sensing applications: distance-to-land, distance-to-water, water-body identifier and lake-centre co-ordinates.

    PubMed

    Carrea, Laura; Embury, Owen; Merchant, Christopher J

    2015-11-01

    Datasets containing information to locate and identify water bodies have been generated from data locating static-water-bodies with resolution of about 300 m (1/360°) recently released by the Land Cover Climate Change Initiative (LC CCI) of the European Space Agency. The LC CCI water-bodies dataset has been obtained from multi-temporal metrics based on time series of the backscattered intensity recorded by ASAR on Envisat between 2005 and 2010. The newly derived datasets coherently provide distance to land, distance to water, water-body identifiers and lake-centre locations. The water-body identifier dataset locates the water bodies by assigning the identifiers of the Global Lakes and Wetlands Database (GLWD), and lake centres are defined for in-land waters for which GLWD IDs were determined. The new datasets therefore link recent lake/reservoir/wetlands extent to the GLWD, together with a set of coordinates which unambiguously locates the water bodies in the database. Information on distance-to-land for each water cell and distance-to-water for each land cell has many potential applications in remote sensing, where the applicability of geophysical retrieval algorithms may be affected by the presence of water or land within a satellite field of view (image pixel). During the generation and validation of the datasets some limitations of the GLWD database and of the LC CCI water-bodies mask were found. Some examples of the inaccuracies/limitations are presented and discussed. Temporal change in water-body extent is common. Future versions of the LC CCI dataset are planned to represent temporal variation, and this will permit these derived datasets to be updated.
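
    The distance-to-land and distance-to-water layers described above can be approximated on a regular grid with a Euclidean distance transform; the sketch below uses a toy binary water mask and an assumed 300 m cell size (a simplification that ignores the geographic projection of the real LC CCI grid).

      import numpy as np
      from scipy.ndimage import distance_transform_edt

      # Toy binary mask: True = water, False = land (stand-in for the LC CCI mask).
      water = np.zeros((100, 100), dtype=bool)
      water[40:60, 30:80] = True          # a rectangular "lake"

      cell_km = 0.3                       # assumed ~300 m cell size

      # Distance-to-land for each water cell (distance to the nearest land cell).
      dist_to_land = distance_transform_edt(water, sampling=cell_km)
      # Distance-to-water for each land cell (distance to the nearest water cell).
      dist_to_water = distance_transform_edt(~water, sampling=cell_km)

      print(dist_to_land.max(), dist_to_water.max())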

  5. WE-DE-207B-10: Library-Based X-Ray Scatter Correction for Dedicated Cone-Beam Breast CT: Clinical Validation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shi, L; Zhu, L; Vedantham, S

    Purpose: Scatter contamination is detrimental to image quality in dedicated cone-beam breast CT (CBBCT), resulting in cupping artifacts and loss of contrast in reconstructed images. Such effects impede visualization of breast lesions and degrade quantitative accuracy. Previously, we proposed a library-based software approach to suppress scatter on CBBCT images. In this work, we quantify the efficacy and stability of this approach using datasets from 15 human subjects. Methods: A pre-computed scatter library is generated using Monte Carlo simulations for semi-ellipsoid breast models and a homogeneous fibroglandular/adipose tissue mixture encompassing the range reported in the literature. Projection datasets from 15 human subjects that cover the 95th percentile of breast dimensions and fibroglandular volume fraction were included in the analysis. Our investigations indicate that it is sufficient to consider the breast dimensions alone, as variation in fibroglandular fraction does not significantly affect the scatter-to-primary ratio. The breast diameter is measured from a first-pass reconstruction; the appropriate scatter distribution is selected from the library and deformed by accounting for the discrepancy in total projection intensity between the clinical dataset and the simulated semi-ellipsoidal breast. The deformed scatter distribution is subtracted from the measured projections for scatter correction. Spatial non-uniformity (SNU) and contrast-to-noise ratio (CNR) were used as quantitative metrics to evaluate the results. Results: On the 15 patient cases, our method reduced the overall image spatial non-uniformity (SNU) from 7.14%±2.94% (mean ± standard deviation) to 2.47%±0.68% in the coronal view and from 10.14%±4.1% to 3.02%±1.26% in the sagittal view. The average contrast-to-noise ratio (CNR) improved by a factor of 1.49±0.40 in the coronal view and by 2.12±1.54 in the sagittal view. Conclusion: We demonstrate the robustness and effectiveness of a library-based scatter correction method using patient datasets with large variability in breast dimensions and composition. The high computational efficiency and simplicity of implementation make this approach attractive for clinical use. Supported partly by NIH R21EB019597, R21CA134128 and R01CA195512. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
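
    The correction step described in the abstract (select the library scatter estimate for the closest breast diameter, rescale it to the measured projection intensity, then subtract) can be sketched as follows; the function and data structures are hypothetical stand-ins, not the authors' implementation, and the spatial "deformation" is reduced here to a single global scale factor.

      import numpy as np

      def correct_projection(measured, scatter_library, breast_diameter_cm):
          """Subtract a library-based scatter estimate from one measured projection.

          measured        : 2-D projection from the clinical CBBCT scan
          scatter_library : dict mapping semi-ellipsoid breast diameter (cm) to a
                            (simulated_scatter, simulated_total) projection pair
          """
          # Library entry whose breast diameter is closest to the first-pass estimate.
          key = min(scatter_library, key=lambda d: abs(d - breast_diameter_cm))
          scatter_sim, total_sim = scatter_library[key]

          # Rescale the simulated scatter by the discrepancy in total projection
          # intensity between the patient data and the semi-ellipsoid model.
          scatter_est = scatter_sim * (measured.sum() / total_sim.sum())

          # Scatter-corrected projection, clipped to stay non-negative.
          return np.clip(measured - scatter_est, 0.0, None)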

  6. SU-E-T-333: Dosimetric Impact of Rotational Error On the Target Coverage in IMPT Lung Cancer Plans

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rana, S; Zheng, Y

    2015-06-15

    Purpose: The main purpose of this study was to investigate the impact of rotational (yaw, roll, and pitch) error on the planning target volume (PTV) coverage in lung cancer plans generated by intensity modulated proton therapy (IMPT). Methods: In this retrospective study, a computed tomography (CT) dataset of a previously treated lung case was used. An IMPT plan was generated on the original CT dataset using left-lateral (LL) and posterior-anterior (PA) beams for a total dose of 74 Gy[RBE] with 2 Gy[RBE] per fraction. In order to investigate the dosimetric impact of rotational error, 12 new CT datasets were generated by re-sampling the original CT dataset for rotational (roll, yaw, and pitch) angles ranging from −5° to +5°, with an increment of 2.5°. A total of 12 new IMPT plans were generated based on the re-sampled CT datasets using beam parameters identical to the ones in the original IMPT plan. All treatment plans were generated in the XiO treatment planning system. The PTV coverage (i.e., dose received by 95% of the PTV volume, D95) in the new IMPT plans was then compared with the PTV coverage in the original IMPT plan. Results: Rotational errors caused a reduction in the PTV coverage in all 12 new IMPT plans when compared to the original IMPT lung plan. Specifically, the PTV coverage was reduced by 4.94% to 50.51% for yaw, by 4.04% to 23.74% for roll, and by 5.21% to 46.88% for pitch errors. Conclusion: Unacceptable dosimetric results were observed in the new IMPT plans, as the PTV coverage was reduced by up to 26.87% and 50.51% for rotational errors of 2.5° and 5°, respectively. Further investigation is underway to evaluate the PTV coverage loss in IMPT lung cancer plans for smaller rotational angle changes.
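
    A minimal sketch of the coverage metric used above (D95, the dose received by at least 95% of the PTV) is given below with synthetic dose grids; the arrays and numbers are placeholders, not the XiO plan data.

      import numpy as np

      def d95(dose, ptv_mask):
          """Dose received by at least 95% of the PTV volume (Gy[RBE]).

          dose     : 3-D dose grid
          ptv_mask : boolean mask of the same shape selecting PTV voxels
          """
          ptv_dose = dose[ptv_mask]
          # 95% of PTV voxels receive at least this dose, i.e. the 5th percentile.
          return np.percentile(ptv_dose, 5)

      # Toy comparison of a nominal plan with a "rotated" plan (synthetic arrays).
      rng = np.random.default_rng(1)
      dose_nominal = rng.normal(74.0, 1.0, size=(50, 50, 50))
      dose_rotated = dose_nominal - rng.uniform(0.0, 8.0, size=dose_nominal.shape)
      ptv = np.zeros_like(dose_nominal, dtype=bool)
      ptv[20:30, 20:30, 20:30] = True

      loss = 100.0 * (d95(dose_nominal, ptv) - d95(dose_rotated, ptv)) / d95(dose_nominal, ptv)
      print(f"D95 reduction: {loss:.1f}%")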

  7. Variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics.

    PubMed

    Mahmood, Khalid; Jung, Chol-Hee; Philip, Gayle; Georgeson, Peter; Chung, Jessica; Pope, Bernard J; Park, Daniel J

    2017-05-16

    Genetic variant effect prediction algorithms are used extensively in clinical genomics and research to determine the likely consequences of amino acid substitutions on protein function. It is vital that we better understand their accuracies and limitations because published performance metrics are confounded by serious problems of circularity and error propagation. Here, we derive three independent, functionally determined human mutation datasets, UniFun, BRCA1-DMS and TP53-TA, and employ them, alongside previously described datasets, to assess the pre-eminent variant effect prediction tools. Apparent accuracies of variant effect prediction tools were influenced significantly by the benchmarking dataset. Benchmarking with the assay-determined datasets UniFun and BRCA1-DMS yielded areas under the receiver operating characteristic curves in the modest ranges of 0.52 to 0.63 and 0.54 to 0.75, respectively, considerably lower than observed for other, potentially more conflicted datasets. These results raise concerns about how such algorithms should be employed, particularly in a clinical setting. Contemporary variant effect prediction tools are unlikely to be as accurate at the general prediction of functional impacts on proteins as previously reported. Use of functional assay-based datasets that avoid prior dependencies promises to be valuable for the ongoing development and accurate benchmarking of such tools.
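
    Benchmarking a predictor against an assay-determined truth set reduces to scoring continuous predictions against binary functional labels, for example with the area under the ROC curve; the ten variants below are invented placeholders, not data from UniFun or BRCA1-DMS.

      import numpy as np
      from sklearn.metrics import roc_auc_score

      # Hypothetical benchmark: functional-assay labels (1 = damaging, 0 = tolerated)
      # and continuous scores from a variant effect predictor for the same variants.
      labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
      scores = np.array([0.91, 0.40, 0.55, 0.80, 0.35, 0.62, 0.70, 0.20, 0.45, 0.50])

      auc = roc_auc_score(labels, scores)
      print(f"AUC on the assay-based benchmark: {auc:.2f}")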

  8. Accounting For Uncertainty in The Application Of High Throughput Datasets

    EPA Science Inventory

    The use of high throughput screening (HTS) datasets will need to adequately account for uncertainties in the data generation process and propagate these uncertainties through to ultimate use. Uncertainty arises at multiple levels in the construction of predictors using in vitro ...

  9. Deep learning in breast cancer risk assessment: evaluation of convolutional neural networks on a clinical dataset of full-field digital mammograms.

    PubMed

    Li, Hui; Giger, Maryellen L; Huynh, Benjamin Q; Antropova, Natalia O

    2017-10-01

    To evaluate deep learning in the assessment of breast cancer risk in which convolutional neural networks (CNNs) with transfer learning are used to extract parenchymal characteristics directly from full-field digital mammographic (FFDM) images instead of using computerized radiographic texture analysis (RTA), 456 clinical FFDM cases were included: a "high-risk" BRCA1/2 gene-mutation carriers dataset (53 cases), a "high-risk" unilateral cancer patients dataset (75 cases), and a "low-risk dataset" (328 cases). Deep learning was compared to the use of features from RTA, as well as to a combination of both in the task of distinguishing between high- and low-risk subjects. Similar classification performances were obtained using CNN [area under the curve [Formula: see text]; standard error [Formula: see text
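
    Although this record is truncated, the transfer-learning workflow it describes ends in a conventional classifier trained on CNN-derived parenchymal features; a minimal sketch with randomly generated stand-in features and labels (not the study's FFDM data) looks like this.

      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(0)

      # Hypothetical stand-ins: rows are mammographic ROIs, columns are features
      # pooled from a pretrained CNN (transfer learning); 1 = high-risk, 0 = low-risk.
      features = rng.normal(size=(456, 512))
      risk_label = rng.integers(0, 2, size=456)

      clf = LogisticRegression(max_iter=1000)
      auc = cross_val_score(clf, features, risk_label, cv=5, scoring="roc_auc")
      print("cross-validated AUC:", auc.mean())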

  10. Blockmodels for connectome analysis

    NASA Astrophysics Data System (ADS)

    Moyer, Daniel; Gutman, Boris; Prasad, Gautam; Faskowitz, Joshua; Ver Steeg, Greg; Thompson, Paul

    2015-12-01

    In the present work we study a family of generative network models and their applications for modeling the human connectome. We introduce a minor but novel variant of the Mixed Membership Stochastic Blockmodel and apply it and two other related models to two human connectome datasets (ADNI and a Bipolar Disorder dataset) with both control and diseased subjects. We further provide a simple generative classifier that, alongside more discriminating methods, provides evidence that blockmodels accurately summarize tractography count networks with respect to a disease classification task.

  11. Improving the discoverability, accessibility, and citability of omics datasets: a case report.

    PubMed

    Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J

    2017-03-01

    Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  12. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  13. Datasets for supplier selection and order allocation with green criteria, all-unit quantity discounts and varying number of suppliers.

    PubMed

    Hamdan, Sadeque; Cheaitou, Ali

    2017-08-01

    This data article provides detailed optimization input and output datasets and optimization code for the published research work titled "Dynamic green supplier selection and order allocation with quantity discounts and varying supplier availability" (Hamdan and Cheaitou, 2017, In press) [1]. Researchers may use these datasets as a baseline for future comparison and extensive analysis of the green supplier selection and order allocation problem with all-unit quantity discounts and a varying number of suppliers. More particularly, the datasets presented in this article allow researchers to generate the exact optimization outputs obtained by the authors of Hamdan and Cheaitou (2017, In press) [1] using the provided optimization code and then to use them for comparison with the outputs of other techniques or methodologies such as heuristic approaches. Moreover, this article includes the randomly generated optimization input data and the related outputs that are used as input data for the statistical analysis presented in Hamdan and Cheaitou (2017, In press) [1], in which two different approaches for ranking potential suppliers are compared. This article also provides the time analysis data used in Hamdan and Cheaitou (2017, In press) [1] to study the effect of the problem size on the computation time, as well as an additional time analysis dataset. The input data for the time study are generated randomly, in which the problem size is changed, and then are used by the optimization problem to obtain the corresponding optimal outputs as well as the corresponding computation time.

  14. An exploration of the properties of the CORE problem list subset and how it facilitates the implementation of SNOMED CT

    PubMed Central

    Xu, Julia

    2015-01-01

    Objective Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is the emergent international health terminology standard for encoding clinical information in electronic health records. The CORE Problem List Subset was created to facilitate the terminology’s implementation. This study evaluates the CORE Subset’s coverage and examines its growth pattern as source datasets are being incorporated. Methods Coverage of frequently used terms and the corresponding usage of the covered terms were assessed by “leave-one-out” analysis of the eight datasets constituting the current CORE Subset. The growth pattern was studied using a retrospective experiment, growing the Subset one dataset at a time and examining the relationship between the size of the starting subset and the coverage of frequently used terms in the incoming dataset. Linear regression was used to model that relationship. Results On average, the CORE Subset covered 80.3% of the frequently used terms of the left-out dataset, and the covered terms accounted for 83.7% of term usage. There was a significant positive correlation between the CORE Subset’s size and the coverage of the frequently used terms in an incoming dataset. This implies that the CORE Subset will grow at a progressively slower pace as it gets bigger. Conclusion The CORE Problem List Subset is a useful resource for the implementation of Systematized Nomenclature of Medicine Clinical Terms in electronic health records. It offers good coverage of frequently used terms, which account for a high proportion of term usage. If future datasets are incorporated into the CORE Subset, it is likely that its size will remain small and manageable. PMID:25725003
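
    The leave-one-out coverage analysis described above can be sketched with plain Python sets; the toy problem lists below use term strings and made-up usage counts in place of SNOMED CT-coded datasets.

      # Minimal sketch (assumed data structures): each source dataset is a dict
      # mapping a frequently used term to its usage count.
      def leave_one_out_coverage(datasets):
          results = []
          for i, held_out in enumerate(datasets):
              core = set().union(*(d.keys() for j, d in enumerate(datasets) if j != i))
              terms = set(held_out)
              covered = terms & core
              term_coverage = len(covered) / len(terms)
              usage_coverage = sum(held_out[t] for t in covered) / sum(held_out.values())
              results.append((term_coverage, usage_coverage))
          return results

      sites = [
          {"hypertension": 120, "type 2 diabetes": 80, "asthma": 60},
          {"hypertension": 90, "asthma": 70, "migraine": 20},
          {"type 2 diabetes": 50, "hypertension": 40, "low back pain": 10},
      ]
      for term_cov, usage_cov in leave_one_out_coverage(sites):
          print(f"term coverage {term_cov:.0%}, usage coverage {usage_cov:.0%}")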

  15. Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation

    NASA Astrophysics Data System (ADS)

    Tobon-Gomez, Catalina; Sukno, Federico M.; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F.

    2012-07-01

    Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be imaging modality and protocol specific. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from MRI simulated datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18%; LV MASS errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. Both for real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.

  16. Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation.

    PubMed

    Tobon-Gomez, Catalina; Sukno, Federico M; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F

    2012-07-07

    Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be imaging modality and protocol specific. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from MRI simulated datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18%; LV MASS errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. Both for real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.

  17. ImmPort, toward repurposing of open access immunological assay data for translational and clinical research.

    PubMed

    Bhattacharya, Sanchita; Dunn, Patrick; Thomas, Cristel G; Smith, Barry; Schaefer, Henry; Chen, Jieming; Hu, Zicheng; Zalocusky, Kelly A; Shankar, Ravi D; Shen-Orr, Shai S; Thomson, Elizabeth; Wiser, Jeffrey; Butte, Atul J

    2018-02-27

    Immunology researchers are beginning to explore the possibilities of reproducibility, reuse and secondary analyses of immunology data. Open-access datasets are being applied in the validation of the methods used in the original studies, leveraging studies for meta-analysis, or generating new hypotheses. To promote these goals, the ImmPort data repository was created for the broader research community to explore the wide spectrum of clinical and basic research data and associated findings. The ImmPort ecosystem consists of four components (Private Data, Shared Data, Data Analysis, and Resources) for data archiving, dissemination, analyses, and reuse. To date, more than 300 studies have been made freely available through the Shared Data portal (www.immport.org/immport-open), which allows research data to be repurposed to accelerate the translation of new insights into discoveries.

  18. Toward an integrated knowledge environment to support modern oncology.

    PubMed

    Blake, Patrick M; Decker, David A; Glennon, Timothy M; Liang, Yong Michael; Losko, Sascha; Navin, Nicholas; Suh, K Stephen

    2011-01-01

    Around the world, teams of researchers continue to develop a wide range of systems to capture, store, and analyze data including treatment, patient outcomes, tumor registries, next-generation sequencing, single-nucleotide polymorphism, copy number, gene expression, drug chemistry, drug safety, and toxicity. Scientists mine, curate, and manually annotate growing mountains of data to produce high-quality databases, while clinical information is aggregated in distant systems. Databases are currently scattered, and relationships between variables coded in disparate datasets are frequently invisible. The challenge is to evolve oncology informatics from a "systems" orientation of standalone platforms and silos into an "integrated knowledge environment" that will connect "knowable" research data with patient clinical information. The aim of this article is to review progress toward an integrated knowledge environment to support modern oncology with a focus on supporting scientific discovery and improving cancer care.

  19. TCGA4U: A Web-Based Genomic Analysis Platform To Explore And Mine TCGA Genomic Data For Translational Research.

    PubMed

    Huang, Zhenzhen; Duan, Huilong; Li, Haomin

    2015-01-01

    Large-scale human cancer genomics projects, such as TCGA, have generated large genomic datasets for further study. Exploring and mining these data to obtain meaningful analysis results can help researchers find potential genomic alterations involved in the development and metastasis of tumors. We developed a web-based gene analysis platform, named TCGA4U, which uses statistical methods and models to help translational investigators explore, mine and visualize human cancer genomic characteristic information from the TCGA datasets. Furthermore, through Gene Ontology (GO) annotation and clinical data integration, the genomic data were transformed into biological process, molecular function, cellular component and survival curves to help researchers identify potential driver genes. Clinical researchers without expertise in data analysis will benefit from such a user-friendly genomic analysis platform.

  20. Plant databases and data analysis tools

    USDA-ARS?s Scientific Manuscript database

    It is anticipated that the coming years will see the generation of large datasets including diagnostic markers in several plant species with emphasis on crop plants. To use these datasets effectively in any plant breeding program, it is essential to have the information available via public database...

  1. A conceptual prototype for the next-generation national elevation dataset

    USGS Publications Warehouse

    Stoker, Jason M.; Heidemann, Hans Karl; Evans, Gayla A.; Greenlee, Susan K.

    2013-01-01

    In 2012 the U.S. Geological Survey's (USGS) National Geospatial Program (NGP) funded a study to develop a conceptual prototype for a new National Elevation Dataset (NED) design with expanded capabilities to generate and deliver a suite of bare earth and above ground feature information over the United States. This report details the research on identifying operational requirements based on prior research, evaluation of what is needed for the USGS to meet these requirements, and development of a possible conceptual framework that could potentially deliver the kinds of information that are needed to support NGP's partners and constituents. This report provides an initial proof-of-concept demonstration using an existing dataset, and recommendations for the future, to inform NGP's ongoing and future elevation program planning and management decisions. The demonstration shows that this type of functional process can robustly create derivatives from lidar point cloud data; however, more research needs to be done to see how well it extends to multiple datasets.

  2. Stochastic models to demonstrate the effect of motivated testing on HIV incidence estimates using the serological testing algorithm for recent HIV seroconversion (STARHS).

    PubMed

    White, Edward W; Lumley, Thomas; Goodreau, Steven M; Goldbaum, Gary; Hawes, Stephen E

    2010-12-01

    To produce valid seroincidence estimates, the serological testing algorithm for recent HIV seroconversion (STARHS) assumes independence between infection and testing, which may be absent in clinical data. STARHS estimates are generally greater than cohort-based estimates of incidence from observable person-time and diagnosis dates. The authors constructed a series of partial stochastic models to examine whether testing motivated by suspicion of infection could bias STARHS. One thousand Monte Carlo simulations of 10,000 men who have sex with men were generated using parameters for HIV incidence and testing frequency from data from a clinical testing population in Seattle. In one set of simulations, infection and testing dates were independent. In another set, some intertest intervals were abbreviated to reflect the distribution of intervals between suspected HIV exposure and testing in a group of Seattle men who have sex with men recently diagnosed as having HIV. Both estimation methods were applied to the simulated datasets. Both cohort-based and STARHS incidence estimates were calculated using the simulated data and compared with previously calculated, empirical cohort-based and STARHS seroincidence estimates from the clinical testing population. Under simulated independence between infection and testing, cohort-based and STARHS incidence estimates resembled cohort estimates from the clinical dataset. Under simulated motivated testing, cohort-based estimates remained unchanged, but STARHS estimates were inflated similar to empirical STARHS estimates. Varying motivation parameters appreciably affected STARHS incidence estimates, but not cohort-based estimates. Cohort-based incidence estimates are robust against dependence between testing and acquisition of infection, whereas STARHS incidence estimates are not.
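
    A toy Monte Carlo sketch of the motivated-testing effect is shown below: shortening the interval between infection and the next test pushes more diagnoses inside the recency window, which inflates a STARHS-type estimate even when true incidence is unchanged. The window length, test-cycle length and motivated fractions are assumptions for illustration, not the authors' model parameters.

      import numpy as np

      rng = np.random.default_rng(2)
      n = 10_000          # simulated seroconverters
      window = 170        # assumed STARHS "recent infection" window in days

      def delays(motivated_fraction):
          """Days from infection to the first HIV-positive test."""
          # Routine testing: the next test falls uniformly within a ~1-year cycle.
          routine = rng.uniform(0, 365, size=n)
          if motivated_fraction == 0:
              return routine
          # Motivated testers seek a test soon after a suspected exposure.
          motivated = rng.uniform(0, 60, size=n)
          pick = rng.random(n) < motivated_fraction
          return np.where(pick, motivated, routine)

      for frac in (0.0, 0.3, 0.6):
          recent = np.mean(delays(frac) < window)
          print(f"motivated fraction {frac:.1f}: {recent:.0%} classified 'recent'")
      # A STARHS-type estimator scales with the 'recent' proportion, so motivated
      # testing inflates the incidence estimate even though true incidence is fixed.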

  3. A spline-based regression parameter set for creating customized DARTEL MRI brain templates from infancy to old age.

    PubMed

    Wilke, Marko

    2018-02-01

    This dataset contains the regression parameters derived by analyzing segmented brain MRI images (gray matter and white matter) from a large population of healthy subjects, using a multivariate adaptive regression splines approach. A total of 1919 MRI datasets ranging in age from 1-75 years from four publicly available datasets (NIH, C-MIND, fCONN, and IXI) were segmented using the CAT12 segmentation framework, writing out gray matter and white matter images normalized using an affine-only spatial normalization approach. These images were then subjected to a six-step DARTEL procedure, employing an iterative non-linear registration approach and yielding increasingly crisp intermediate images. The resulting six datasets per tissue class were then analyzed using multivariate adaptive regression splines, using the CerebroMatic toolbox. This approach allows for flexibly modelling smoothly varying trajectories while taking into account demographic (age, gender) as well as technical (field strength, data quality) predictors. The resulting regression parameters described here can be used to generate matched DARTEL or SHOOT templates for a given population under study, from infancy to old age. The dataset and the algorithm used to generate it are publicly available at https://irc.cchmc.org/software/cerebromatic.php.

  4. Chest x-ray generation and data augmentation for cardiovascular abnormality classification

    NASA Astrophysics Data System (ADS)

    Madani, Ali; Moradi, Mehdi; Karargyris, Alexandros; Syeda-Mahmood, Tanveer

    2018-03-01

    Medical imaging datasets are limited in size due to privacy issues and the high cost of obtaining annotations. Augmentation is a widely used practice in deep learning to enrich the data in data-limited scenarios and to avoid overfitting. However, standard augmentation methods that produce new examples of data by varying lighting, field of view, and spatial rigid transformations do not capture the biological variance of medical imaging data and could result in unrealistic images. Generative adversarial networks (GANs) provide an avenue to understand the underlying structure of image data which can then be utilized to generate new realistic samples. In this work, we investigate the use of GANs for producing chest X-ray images to augment a dataset. This dataset is then used to train a convolutional neural network to classify images for cardiovascular abnormalities. We compare our augmentation strategy with traditional data augmentation and show higher accuracy for normal vs abnormal classification in chest X-rays.
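
    The augmentation step itself (mixing GAN-generated images into the real training set) is simple to sketch; here the trained generator is replaced by a placeholder that emits noise images so the example runs end to end, and all array shapes and labels are assumptions.

      import numpy as np

      rng = np.random.default_rng(0)

      def synth_generator(n, shape=(64, 64)):
          """Placeholder for a trained GAN generator: it returns noise images so
          the sketch runs; in practice latent vectors would be mapped through the
          trained generator network."""
          return rng.normal(size=(n,) + shape).astype(np.float32)

      def augment(real_images, real_labels, n_synthetic, synthetic_label):
          """Append GAN-generated examples of one class to the training set."""
          fake = synth_generator(n_synthetic, real_images.shape[1:])
          images = np.concatenate([real_images, fake], axis=0)
          labels = np.concatenate([real_labels, np.full(n_synthetic, synthetic_label)])
          perm = rng.permutation(len(labels))          # shuffle before training
          return images[perm], labels[perm]

      real_x = rng.normal(size=(200, 64, 64)).astype(np.float32)
      real_y = rng.integers(0, 2, size=200)            # 1 = cardiovascular abnormality
      aug_x, aug_y = augment(real_x, real_y, n_synthetic=100, synthetic_label=1)
      print(aug_x.shape, np.bincount(aug_y))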

  5. Improving clinical models based on knowledge extracted from current datasets: a new approach.

    PubMed

    Mendes, D; Paredes, S; Rocha, T; Carvalho, P; Henriques, J; Morais, J

    2016-08-01

    Cardiovascular diseases (CVD) are the leading cause of death in the world, and prevention is recognized as a key intervention able to counter this reality. In this context, although there are several models and scores currently used in clinical practice to assess the risk of a new cardiovascular event, they present some limitations. The goal of this paper is to improve CVD risk prediction by taking into account the current models as well as information extracted from real and recent datasets. This approach is based on a decision tree scheme in order to assure the clinical interpretability of the model. An innovative optimization strategy is developed in order to adjust the decision tree thresholds (the rule structure is fixed) based on recent clinical datasets. A real dataset collected within the National Registry on Acute Coronary Syndromes of the Portuguese Society of Cardiology is used to validate this work. In order to assess the performance of the new approach, the metrics sensitivity, specificity and accuracy are used. This new approach achieves sensitivity, specificity and accuracy values of 80.52%, 74.19% and 77.27% respectively, which represents an improvement of about 26% in relation to the accuracy of the original score.
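
    The optimization idea above (keep the rule structure of a clinical decision tree fixed and re-tune only its thresholds on a recent dataset) can be illustrated with a two-threshold grid search; the cohort, rule and threshold ranges below are entirely synthetic.

      import numpy as np
      from itertools import product

      rng = np.random.default_rng(3)

      # Hypothetical cohort: age (years), systolic blood pressure (mmHg), event label.
      age = rng.integers(35, 85, size=500)
      sbp = rng.integers(100, 200, size=500)
      event = ((age > 62) & (sbp > 145)).astype(int) ^ (rng.random(500) < 0.1)

      def rule(age, sbp, t_age, t_sbp):
          """Fixed rule structure; only the thresholds are tuned."""
          return ((age > t_age) & (sbp > t_sbp)).astype(int)

      best = max(
          product(range(45, 76), range(120, 171, 5)),
          key=lambda t: np.mean(rule(age, sbp, *t) == event),
      )
      print("best thresholds:", best,
            "accuracy:", np.mean(rule(age, sbp, *best) == event))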

  6. Automated generation of massive image knowledge collections using Microsoft Live Labs Pivot to promote neuroimaging and translational research

    PubMed Central

    2011-01-01

    Background Massive datasets comprising high-resolution images, generated in neuro-imaging studies and in clinical imaging research, are increasingly challenging our ability to analyze, share, and filter such images in clinical and basic translational research. Pivot collection exploratory analysis gives each user the ability to interact fully with massive amounts of visual data, providing the sorting, flexibility and speed needed to fluidly access, explore or analyze large sets of high-resolution images and their associated meta information, such as neuro-imaging databases from the Allen Brain Atlas. It is used for clustering, filtering, data sharing and classifying of the visual data into various deep-zoom levels and meta-information categories to detect the underlying hidden patterns within the dataset. Method We deployed prototype Pivot collections on Linux CentOS running the Apache web server. We also tested the prototype Pivot collections on other operating systems, such as the most common variants of Windows and UNIX. It is demonstrated that the approach yields very good results when compared with other approaches used by some researchers for the generation, creation, and clustering of massive image collections such as the coronal and horizontal sections of the mouse brain from the Allen Brain Atlas. Results Pivot visual analytics was used to analyze a prototype dataset of Dab2 co-expressed genes from the Allen Brain Atlas. The metadata along with high-resolution images were automatically extracted using the Allen Brain Atlas API. These were then used to identify hidden information based on the categories and conditions applied through options generated from the automated collection. A metadata category like chromosome, as well as data for individual cases like sex, age, and plan attributes of a particular gene, is used to filter, sort and determine whether other genes have characteristics similar to Dab2. Online access to the mouse brain Pivot collection is available at http://edtech-dev.uthsc.edu/CTSI/teeDev1/unittest/PaPa/collection.html (user name: tviangte and password: demome). Conclusions Our proposed algorithm has automated the creation of large image Pivot collections; this will enable investigators of clinical research projects to easily and quickly analyse the image collections through a perspective that is useful for making critical decisions about the image patterns discovered. PMID:21884637

  7. A New Decision Tree to Solve the Puzzle of Alzheimer's Disease Pathogenesis Through Standard Diagnosis Scoring System.

    PubMed

    Kumar, Ashwani; Singh, Tiratha Raj

    2017-03-01

    Alzheimer's disease (AD) is a progressive, incurable and terminal neurodegenerative disorder of the brain and is associated with mutations in amyloid precursor protein, presenilin 1, presenilin 2 or apolipoprotein E, but its underlying mechanisms are still not fully understood. The healthcare sector is generating a large amount of information corresponding to diagnosis, disease identification and treatment of an individual. Mining knowledge and providing scientific decision-making for the diagnosis and treatment of disease from the clinical dataset are therefore increasingly becoming necessary. The current study deals with the construction of classifiers that can be human readable as well as robust in performance for an AD gene dataset using a decision tree. Models of classification for different AD genes were generated according to Mini-Mental State Examination scores and all other vital parameters to achieve the identification of the expression level of different proteins of the disorder that may possibly determine the involvement of genes in various AD pathogenesis pathways. The effectiveness of the decision tree in AD diagnosis is determined by information gain with confidence value (0.96), specificity (92 %), sensitivity (98 %) and accuracy (77 %). Besides this functional gene classification using different parameters and enrichment analysis, our finding indicates that the measures of all the genes assessed in single cohorts are sufficient to diagnose AD and will help in the prediction of important parameters for other relevant assessments.

  8. Computational Tools for Parsimony Phylogenetic Analysis of Omics Data

    PubMed Central

    Salazar, Jose; Amri, Hakima; Noursi, David

    2015-01-01

    High-throughput assays from genomics, proteomics, metabolomics, and next generation sequencing produce massive omics datasets that are challenging to analyze in biological or clinical contexts. Thus far, there is no publicly available program for converting quantitative omics data into input formats to be used in off-the-shelf robust phylogenetic programs. To the best of our knowledge, this is the first report on creation of two Windows-based programs, OmicsTract and SynpExtractor, to address this gap. We note, as a way of introduction and development of these programs, that one particularly useful bioinformatics inferential modeling approach is the phylogenetic cladogram. Cladograms are multidimensional tools that show the relatedness between subgroups of healthy and diseased individuals and the latter's shared aberrations; they also reveal some characteristics of a disease that would not otherwise be apparent by other analytical methods. The OmicsTract and SynpExtractor were written for the respective tasks of (1) accommodating advanced phylogenetic parsimony analysis (through standard programs of MIX [from PHYLIP] and TNT), and (2) extracting shared aberrations at the cladogram nodes. OmicsTract converts comma-delimited data tables by assigning each data point a binary value (“0” for normal states and “1” for abnormal states) then outputs the converted data tables into the proper input file formats for MIX or with embedded commands for TNT. SynpExtractor uses outfiles from MIX and TNT to extract the shared aberrations of each node of the cladogram, matching them with identifying labels from the dataset and exporting them into a comma-delimited file. Labels may be gene identifiers in gene-expression datasets or m/z values in mass spectrometry datasets. By automating these steps, OmicsTract and SynpExtractor offer a veritable opportunity for rapid and standardized phylogenetic analyses of omics data; their model can also be extended to next generation sequencing (NGS) data. We make OmicsTract and SynpExtractor publicly and freely available for non-commercial use in order to strengthen and build capacity for the phylogenetic paradigm of omics analysis. PMID:26230532
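
    The conversion step described for OmicsTract (binary recoding of quantitative omics values followed by export to a parsimony-program input matrix) might look roughly like the sketch below; the reference range, sample profiles and simplified PHYLIP-style output are assumptions, not the tool's actual code or exact file format.

      # Minimal sketch of the binarization step (not the OmicsTract code itself):
      # "0" for values inside a reference (normal) range, "1" for aberrant values.
      samples = {
          "healthy01": {"GENE_A": 1.1, "GENE_B": 0.9, "GENE_C": 1.0},
          "tumor01":   {"GENE_A": 3.8, "GENE_B": 0.2, "GENE_C": 1.1},
          "tumor02":   {"GENE_A": 4.2, "GENE_B": 1.0, "GENE_C": 0.1},
      }
      normal_range = (0.5, 2.0)   # assumed reference interval for every feature
      genes = sorted({g for profile in samples.values() for g in profile})

      def binarize(value, lo=normal_range[0], hi=normal_range[1]):
          return "0" if lo <= value <= hi else "1"

      lines = [f"{len(samples)} {len(genes)}"]          # header: taxa, characters
      for name, profile in samples.items():
          states = "".join(binarize(profile[g]) for g in genes)
          lines.append(f"{name:<10}{states}")           # 10-character name field
      print("\n".join(lines))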

  9. Shape-intensity prior level set combining probabilistic atlas and probability map constrains for automatic liver segmentation from abdominal CT images.

    PubMed

    Wang, Jinke; Cheng, Yuanzhi; Guo, Changyong; Wang, Yadong; Tamura, Shinichi

    2016-05-01

    We propose a fully automatic 3D segmentation framework to segment the liver from abdominal CT images in challenging cases with low contrast between adjacent organs and the presence of pathologies. First, all of the atlases are weighted in the selected training datasets by calculating the similarities between the atlases and the test image to dynamically generate a subject-specific probabilistic atlas for the test image. The most likely liver region of the test image is further determined based on the generated atlas. A rough segmentation is obtained by a maximum a posteriori classification of the probability map, and the final liver segmentation is produced by a shape-intensity prior level set in the most likely liver region. Our method is evaluated and demonstrated on 25 test CT datasets from our partner site, and its results are compared with two state-of-the-art liver segmentation methods. Moreover, our performance results on 10 MICCAI test datasets were submitted to the organizers for comparison with the other automatic algorithms. Using the 25 test CT datasets, the average symmetric surface distance is [Formula: see text] mm (range 0.62-2.12 mm), the root mean square symmetric surface distance error is [Formula: see text] mm (range 0.97-3.01 mm), and the maximum symmetric surface distance error is [Formula: see text] mm (range 12.73-26.67 mm) for our method. On the 10 MICCAI test datasets, our method ranks 10th among all 47 automatic algorithms on the site as of July 2015. Quantitative results, as well as qualitative comparisons of segmentations, indicate that our method is a promising tool to improve the efficiency of both techniques. The applicability of the proposed method to some challenging clinical problems and the segmentation of the liver are demonstrated with good results in both quantitative and qualitative experiments. This study suggests that the proposed framework can be good enough to replace the time-consuming and tedious slice-by-slice manual segmentation approach.

  10. Querying Large Biological Network Datasets

    ERIC Educational Resources Information Center

    Gulsoy, Gunhan

    2013-01-01

    New experimental methods have resulted in an increasing amount of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of available data requires fast, large-scale analysis methods. Therefore, we address the problem of querying large biological network datasets.…

  11. Statistical analysis of large simulated yield datasets for studying climate effects

    USDA-ARS?s Scientific Manuscript database

    Ensembles of process-based crop models are now commonly used to simulate crop growth and development for climate scenarios of temperature and/or precipitation changes corresponding to different projections of atmospheric CO2 concentrations. This approach generates large datasets with thousands of de...

  12. Finding the Maine Story in Huge Cumbersome National Monitoring Datasets

    EPA Science Inventory

    What’s a manager, analyst, or concerned citizen to do with the complex datasets generated by State and Federal monitoring efforts? Is it possible to use such information to address Maine’s environmental issues without having a degree in informatics and statistics? This presentati...

  13. Rare itemsets mining algorithm based on RP-Tree and spark framework

    NASA Astrophysics Data System (ADS)

    Liu, Sainan; Pan, Haoan

    2018-05-01

    To address rare itemset mining in big data, this paper proposes a rare itemset mining algorithm based on RP-Tree and the Spark framework. First, the data are arranged vertically according to the transaction identifier; to avoid scanning the entire dataset, the vertical datasets are divided into frequent vertical datasets and rare vertical datasets. Then, the RP-Tree algorithm is adopted to construct a frequent pattern tree that contains rare items and to generate rare 1-itemsets. After that, the support of the itemsets is calculated by scanning the two vertical datasets; finally, an iterative process is used to generate rare itemsets. The experiments show that the algorithm can effectively mine rare itemsets and has great superiority in execution time.
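
    The vertical-layout idea at the heart of the algorithm, where each item keeps a list of transaction identifiers so support comes from set intersections rather than rescans, is illustrated below on a toy transaction set; the RP-Tree construction and Spark distribution are omitted.

      from itertools import combinations

      # Horizontal transactions (transaction id -> items).
      transactions = {
          1: {"a", "b", "c"},
          2: {"a", "c"},
          3: {"a", "d"},
          4: {"b", "e"},
          5: {"a", "b", "c", "e"},
      }
      min_sup = 3   # items below this support are "rare" (assumed threshold)

      # Vertical layout: item -> set of transaction ids containing it.
      vertical = {}
      for tid, items in transactions.items():
          for item in items:
              vertical.setdefault(item, set()).add(tid)

      rare_1 = {i: tids for i, tids in vertical.items() if 0 < len(tids) < min_sup}

      # Support of a candidate 2-itemset is the size of the tid-list intersection,
      # so no rescan of the original transactions is needed.
      for x, y in combinations(sorted(vertical), 2):
          support = len(vertical[x] & vertical[y])
          if 0 < support < min_sup:
              print(f"rare itemset {{{x},{y}}} support={support}")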

  14. Basin Characteristics for Selected Streamflow-Gaging Stations In and Near West Virginia

    USGS Publications Warehouse

    Paybins, Katherine S.

    2008-01-01

    Basin characteristics have long been used to develop equations describing streamflow. In the past, flow equations used in West Virginia were based on a few hand-calculated basin characteristics. More recently, the use of a Geographic Information System (GIS) to generate basin characteristics from existing datasets has refined the process for developing equations to describe flow values in the Mountain State. These basin characteristics are described in this document for streamflow-gaging stations in and near West Virginia. The GIS program developed in ArcGIS Workstation by Environmental Systems Research Institute (ESRI) used data that included the National Elevation Dataset (NED) at 1:24,000 scale, climate data from the National Oceanic and Atmospheric Administration (NOAA), streamlines from the National Hydrography Dataset (NHD), and LandSat-based land-cover data (NLCD) for the period 1999-2003. Full automation of data generation was not achieved due to some inaccuracies in the elevation dataset, as well as inaccuracies in the streamflow-gage locations retrieved from the National Water Information System (NWIS). A Pearson's correlation examination of the data indicates that several of the basin characteristics are correlated with drainage area. However, the GIS-generated data provide a consistent and documented set of basin characteristics for resource managers and researchers to use.

  15. A pragmatic method for transforming clinical research data from the research electronic data capture "REDCap" to Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM): Development and evaluation of REDCap2SDTM.

    PubMed

    Yamamoto, Keiichi; Ota, Keiko; Akiya, Ippei; Shintani, Ayumi

    2017-06-01

    The Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) can be used for new drug application studies as well as secondarily for creating a clinical research data warehouse to leverage clinical research study data across studies conducted within the same disease area. However, currently not all clinical research uses Clinical Data Acquisition Standards Harmonization (CDASH) beginning in the set-up phase of the study. Once already initiated, clinical studies that have not utilized CDASH are difficult to map in the SDTM format. In addition, most electronic data capture (EDC) systems are not equipped to export data in SDTM format; therefore, in many cases, statistical software is used to generate SDTM datasets from accumulated clinical data. In order to facilitate efficient secondary use of accumulated clinical research data using SDTM, it is necessary to develop a new tool to enable mapping of information for SDTM, even during or after the clinical research. REDCap is an EDC system developed by Vanderbilt University and is used globally by over 2100 institutions across 108 countries. In this study, we developed a simulated clinical trial to evaluate a tool called REDCap2SDTM that maps information in the Field Annotation of REDCap to SDTM and executes data conversion, including when data must be pivoted to accommodate the SDTM format, dynamically, by parsing the mapping information using R. We confirmed that generating SDTM data and the define.xml file from REDCap using REDCap2SDTM was possible. Conventionally, generation of SDTM data and the define.xml file from EDC systems requires the creation of individual programs for each clinical study. However, our proposed method can be used to generate this data and file dynamically without programming because it only involves entering the mapping information into the Field Annotation, and additional data into specific files. Our proposed method is adaptable not only to new drug application studies but also to all types of research, including observational and public health studies. Our method is also adaptable to clinical data collected with CDASH at the beginning of a study in non-standard format. We believe that this tool will reduce the workload of new drug application studies and will support data sharing and reuse of clinical research data in academia. Copyright © 2017 Elsevier Inc. All rights reserved.

  16. Machine learning methods can replace 3D profile method in classification of amyloidogenic hexapeptides.

    PubMed

    Stanislawski, Jerzy; Kotulska, Malgorzata; Unold, Olgierd

    2013-01-17

    Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, such as Alzheimer's disease. The number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of amino acids, which transform the structure when exposed. A few hundred such peptides have been experimentally found. Experimental testing of all possible amino acid combinations is currently not feasible. Instead, they can be predicted by computational methods. 3D profile is a physicochemical-based method that has generated the most numerous dataset - ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods. We generated a new dataset of hexapeptides, using a more economical 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences with their binary classification, based on the energy of the segment, was applied for training machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both methods obtained an area under the ROC curve of 0.96, accuracy 91%, true positive rate ca. 78%, and true negative rate 95%. A few other machine learning methods also achieved good performance. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning). We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing the computational efficiency. Our new dataset proved representative enough to use simple statistical methods for testing amyloidogenicity based only on six-letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy-based classifier, with the advantage of very significantly reduced computational time and simplicity of analysis. Additionally, a decision tree provides a set of very easily interpretable rules.
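
    A reduced sketch of the machine-learning route is shown below: hexapeptides are one-hot encoded and fed to a decision tree classifier. The positive sequences are well-known amyloidogenic hexapeptides, but the negatives and the tiny sample size are placeholders; the study itself trained Alternating Decision Tree and Multilayer Perceptron classifiers on ZipperDB-derived energy labels.

      import numpy as np
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeClassifier

      AA = "ACDEFGHIKLMNPQRSTVWY"

      def one_hot(hexapeptide):
          """Encode a 6-residue peptide as a flattened 6 x 20 one-hot vector."""
          vec = np.zeros((6, len(AA)))
          for pos, residue in enumerate(hexapeptide):
              vec[pos, AA.index(residue)] = 1.0
          return vec.ravel()

      # Tiny hypothetical training set (sequence, label); 1 = amyloidogenic.
      data = [("VQIVYK", 1), ("NNQQNY", 1), ("STVIIE", 1),
              ("PESDEK", 0), ("RKDDEG", 0), ("QPGNGQ", 0)]
      X = np.array([one_hot(seq) for seq, _ in data])
      y = np.array([label for _, label in data])

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
      clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
      print("held-out accuracy:", clf.score(X_te, y_te))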

  17. Development of a Drug-Response Modeling Framework to Identify Cell Line Derived Translational Biomarkers That Can Predict Treatment Outcome to Erlotinib or Sorafenib

    PubMed Central

    Li, Bin; Shin, Hyunjin; Gulbekyan, Georgy; Pustovalova, Olga; Nikolsky, Yuri; Hope, Andrew; Bessarabova, Marina; Schu, Matthew; Kolpakova-Hart, Elona; Merberg, David; Dorner, Andrew; Trepicchio, William L.

    2015-01-01

    Development of drug responsive biomarkers from pre-clinical data is a critical step in drug discovery, as it enables patient stratification in clinical trial design. Such translational biomarkers can be validated in early clinical trial phases and utilized as a patient inclusion parameter in later stage trials. Here we present a study on building accurate and selective drug sensitivity models for Erlotinib or Sorafenib from pre-clinical in vitro data, followed by validation of individual models on corresponding treatment arms from patient data generated in the BATTLE clinical trial. A Partial Least Squares Regression (PLSR) based modeling framework was designed and implemented, using a special splitting strategy and canonical pathways to capture robust information for model building. Erlotinib and Sorafenib predictive models could be used to identify a sub-group of patients that respond better to the corresponding treatment, and these models are specific to the corresponding drugs. The model derived signature genes reflect each drug’s known mechanism of action. Also, the models predict each drug’s potential cancer indications consistent with clinical trial results from a selection of globally normalized GEO expression datasets. PMID:26107615

  18. Development of a Drug-Response Modeling Framework to Identify Cell Line Derived Translational Biomarkers That Can Predict Treatment Outcome to Erlotinib or Sorafenib.

    PubMed

    Li, Bin; Shin, Hyunjin; Gulbekyan, Georgy; Pustovalova, Olga; Nikolsky, Yuri; Hope, Andrew; Bessarabova, Marina; Schu, Matthew; Kolpakova-Hart, Elona; Merberg, David; Dorner, Andrew; Trepicchio, William L

    2015-01-01

    Development of drug responsive biomarkers from pre-clinical data is a critical step in drug discovery, as it enables patient stratification in clinical trial design. Such translational biomarkers can be validated in early clinical trial phases and utilized as a patient inclusion parameter in later stage trials. Here we present a study on building accurate and selective drug sensitivity models for Erlotinib or Sorafenib from pre-clinical in vitro data, followed by validation of individual models on corresponding treatment arms from patient data generated in the BATTLE clinical trial. A Partial Least Squares Regression (PLSR) based modeling framework was designed and implemented, using a special splitting strategy and canonical pathways to capture robust information for model building. Erlotinib and Sorafenib predictive models could be used to identify a sub-group of patients that respond better to the corresponding treatment, and these models are specific to the corresponding drugs. The model derived signature genes reflect each drug's known mechanism of action. Also, the models predict each drug's potential cancer indications consistent with clinical trial results from a selection of globally normalized GEO expression datasets.
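
    The core modeling step, Partial Least Squares Regression from cell-line expression features to measured drug sensitivity, can be sketched with scikit-learn on synthetic data; the special splitting strategy and canonical-pathway features of the published framework are not reproduced here.

      import numpy as np
      from sklearn.cross_decomposition import PLSRegression
      from sklearn.model_selection import cross_val_predict

      rng = np.random.default_rng(4)

      # Hypothetical pre-clinical data: rows are cell lines, columns are pathway-level
      # expression features; y is the measured drug sensitivity (e.g., -log IC50).
      X = rng.normal(size=(60, 40))
      y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=60)

      pls = PLSRegression(n_components=5)
      y_hat = cross_val_predict(pls, X, y, cv=5).ravel()
      print("correlation between predicted and measured sensitivity:",
            np.corrcoef(y, y_hat)[0, 1].round(2))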

  19. A longitudinal dataset of five years of public activity in the Scratch online community.

    PubMed

    Hill, Benjamin Mako; Monroy-Hernández, Andrés

    2017-01-31

    Scratch is a programming environment and an online community where young people can create, share, learn, and communicate. In collaboration with the Scratch Team at MIT, we created a longitudinal dataset of public activity in the Scratch online community during its first five years (2007-2012). The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more. To help researchers understand this dataset, and to establish the validity of the data, we also include the source code of every version of the software that operated the website, as well as the software used to generate this dataset. We believe this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.

  20. Wide-Open: Accelerating public data release by automating detection of overdue datasets

    PubMed Central

    Poon, Hoifung; Howe, Bill

    2017-01-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819

  1. Wide-Open: Accelerating public data release by automating detection of overdue datasets.

    PubMed

    Grechkin, Maxim; Poon, Hoifung; Howe, Bill

    2017-06-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
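
    The text-mining half of such a system starts with extracting repository accessions from article text, for example with simple regular expressions as below; the patterns are simplified relative to the actual Wide-Open pipeline, and the subsequent repository query is only indicated in a comment.

      import re

      # Hypothetical snippet of article text mentioning dataset accessions.
      text = ("Raw and processed data have been deposited in the Gene Expression "
              "Omnibus under accession GSE12345; sequencing reads are available "
              "from the Sequence Read Archive (SRP098765).")

      geo_ids = sorted(set(re.findall(r"\bGSE\d+\b", text)))
      sra_ids = sorted(set(re.findall(r"\bSR[APRSX]\d+\b", text)))
      print("GEO:", geo_ids, "SRA:", sra_ids)

      # A Wide-Open-style pipeline would next query each repository's public API
      # for every accession and flag those whose records are still private.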

  2. Microbial Community Profiling of Human Saliva Using Shotgun Metagenomic Sequencing

    PubMed Central

    Hasan, Nur A.; Young, Brian A.; Minard-Smith, Angela T.; Saeed, Kelly; Li, Huai; Heizer, Esley M.; McMillan, Nancy J.; Isom, Richard; Abdullah, Abdul Shakur; Bornman, Daniel M.; Faith, Seth A.; Choi, Seon Young; Dickens, Michael L.; Cebula, Thomas A.; Colwell, Rita R.

    2014-01-01

    Human saliva is clinically informative of both oral and general health. Since next generation shotgun sequencing (NGS) is now widely used to identify and quantify bacteria, we investigated the bacterial flora of saliva microbiomes of two healthy volunteers and five datasets from the Human Microbiome Project, along with a control dataset containing short NGS reads from bacterial species representative of the bacterial flora of human saliva. GENIUS, a system designed to identify and quantify bacterial species using unassembled short NGS reads, was used to identify the bacterial species comprising the microbiomes of the saliva samples and datasets. Results, achieved within minutes and at greater than 90% accuracy, showed that more than 175 bacterial species comprised the bacterial flora of human saliva, including bacteria known to be commensal human flora but also Haemophilus influenzae, Neisseria meningitidis, Streptococcus pneumoniae, and Gamma proteobacteria. Basic Local Alignment Search Tool (BLASTn) analysis, run in parallel, reported ca. five times more species than were actually present in the in silico sample. Both GENIUS and BLAST analyses of saliva samples identified the major genera comprising the bacterial flora of saliva, but GENIUS provided a more precise description of species composition, identifying to the strain level in most cases, and delivered results at least 10,000 times faster. Therefore, GENIUS offers a facile and accurate system for identification and quantification of bacterial species and/or strains in metagenomic samples. PMID:24846174

  3. Contouring variability of human- and deformable-generated contours in radiotherapy for prostate cancer

    NASA Astrophysics Data System (ADS)

    Gardner, Stephen J.; Wen, Ning; Kim, Jinkoo; Liu, Chang; Pradhan, Deepak; Aref, Ibrahim; Cattaneo, Richard, II; Vance, Sean; Movsas, Benjamin; Chetty, Indrin J.; Elshaikh, Mohamed A.

    2015-06-01

    This study was designed to evaluate contouring variability of human- and deformable-generated contours on planning CT (PCT) and CBCT for ten patients with low- or intermediate-risk prostate cancer. For each patient in this study, five radiation oncologists contoured the prostate, bladder, and rectum on one PCT dataset and five CBCT datasets. Consensus contours were generated using the STAPLE method in the CERR software package. Observer contours were compared to the consensus contour, and contour metrics (Dice coefficient, Hausdorff distance, Contour Distance, Center-of-Mass [COM] Deviation) were calculated. In addition, the first day CBCT was registered to subsequent CBCT fractions (CBCTn: CBCT2-CBCT5) via B-spline Deformable Image Registration (DIR). Contours were transferred from CBCT1 to CBCTn via the deformation field, and contour metrics were calculated through comparison with consensus contours generated from the human contour set. The average contour metrics for prostate contours on PCT and CBCT were as follows: Dice coefficient—0.892 (PCT), 0.872 (CBCT-Human), 0.824 (CBCT-Deformed); Hausdorff distance—4.75 mm (PCT), 5.22 mm (CBCT-Human), 5.94 mm (CBCT-Deformed); Contour Distance (overall contour)—1.41 mm (PCT), 1.66 mm (CBCT-Human), 2.30 mm (CBCT-Deformed); COM Deviation—2.01 mm (PCT), 2.78 mm (CBCT-Human), 3.45 mm (CBCT-Deformed). For human contours on PCT and CBCT, the difference in average Dice coefficient between PCT and CBCT (approx. 2%) and Hausdorff distance (approx. 0.5 mm) was small compared to the variation between observers for each patient (standard deviation in Dice coefficient of 5% and Hausdorff distance of 2.0 mm). However, additional contouring variation was found for the deformable-generated contours (approximately 5.0% decrease in Dice coefficient and 0.7 mm increase in Hausdorff distance relative to human-generated contours on CBCT). Though deformable contours provide a reasonable starting point for contouring on CBCT, we conclude that contours generated with B-spline DIR require physician review and editing if they are to be used in the clinic.
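
    A minimal sketch, on toy binary masks, of two of the contour metrics reported above (Dice coefficient and symmetric Hausdorff distance); this is illustrative code, not the study's analysis pipeline:

    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    def dice(mask_a, mask_b):
        """2*|A intersect B| / (|A| + |B|) for boolean masks."""
        inter = np.logical_and(mask_a, mask_b).sum()
        return 2.0 * inter / (mask_a.sum() + mask_b.sum())

    def hausdorff(points_a, points_b):
        """Symmetric Hausdorff distance between two (N, 2) point arrays."""
        return max(directed_hausdorff(points_a, points_b)[0],
                   directed_hausdorff(points_b, points_a)[0])

    a = np.zeros((100, 100), bool); a[20:60, 20:60] = True   # "consensus" mask (toy)
    b = np.zeros((100, 100), bool); b[25:65, 25:65] = True   # "observer" mask (toy)
    print("Dice:", round(dice(a, b), 3))
    print("Hausdorff (pixels):", round(hausdorff(np.argwhere(a), np.argwhere(b)), 2))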

  4. High-resolution MR imaging for dental impressions: a feasibility study.

    PubMed

    Boldt, Julian; Rottner, Kurt; Schmitter, Marc; Hopfgartner, Andreas; Jakob, Peter; Richter, Ernst-Jürgen; Tymofiyeva, Olga

    2018-04-01

    Magnetic resonance imaging is an emerging technology in dental medicine. While low-resolution MRI has especially provided means to examine the temporomandibular joint due to its anatomic inaccessibility, the goal of this study was to assess whether high-resolution MRI is capable of delivering a dataset sufficiently precise to serve as a digital impression of human teeth. An informed and consenting patient in need of dental restoration with fixed partial dentures was chosen as the subject. Two prepared teeth were measured using MRI and the dataset subjected to mathematical processing before Fourier transformation. After reconstruction, a 3D file was generated which was fed into an existing industry-standard CAD/CAM process. A framework for a fixed dental prosthesis was digitally modeled and manufactured by laser-sintering. The fit in situ was found to be acceptable by current clinical standards, which allowed permanent placement of the fixed prosthesis. Using a clinical whole-body MR scanner with the addition of custom add-on hardware, contrast enhancement, and data post-processing, sufficient resolution and signal-to-noise ratio were achieved to allow fabrication of a dental restoration in an acquisition time comparable to the setting time of common dental impression materials. Furthermore, the measurement was well tolerated. The method described here can be regarded as proof of principle that MRI is a promising option for digital impressions when fixed partial dentures are required.

  5. Data linkage between existing healthcare databases to support hospital epidemiology.

    PubMed

    García Álvarez, L; Aylin, P; Tian, J; King, C; Catchpole, M; Hassall, S; Whittaker-Axon, K; Holmes, A

    2011-11-01

    Enhancing the use of existing datasets within acute hospitals will greatly facilitate hospital epidemiology, surveillance, the monitoring of a variety of processes, outcomes and risk factors, and the provision of alert systems. Multiple overlapping data systems exist within National Health Service (NHS) hospitals in the UK, and many duplicate data recordings take place because of the lack of linkage and interfaces. This results in hospital-collected data not being used efficiently. The objective was to create an inventory of all existing systems, including administrative, management, human resources, microbiology, patient care and other platforms, to describe the data architecture that could contribute valuable information for a hospital epidemiology unit. These datasets were investigated as to how they could be used to generate surveillance data, key performance indicators and risk information that could be shared at board, clinical programme group, specialty and ward level. An example of an output of this integrated data platform and its application in influenza resilience planning and responsiveness is described. The development of metrics for staff absence and staffing levels may also be used as key indicators for risk-monitoring for infection prevention. This work demonstrates the value of such a data inventory and linkage and the importance of more sophisticated uses of existing NHS data, and innovative collaborative approaches to support clinical care, quality improvement, surveillance, emergency planning and research. Copyright © 2011 The Healthcare Infection Society. All rights reserved.

  6. High-performance 3D compressive sensing MRI reconstruction.

    PubMed

    Kim, Daehyun; Trzasko, Joshua D; Smelyanskiy, Mikhail; Haider, Clifton R; Manduca, Armando; Dubey, Pradeep

    2010-01-01

    Compressive Sensing (CS) is a nascent sampling and reconstruction paradigm that describes how sparse or compressible signals can be accurately approximated using many fewer samples than traditionally believed. In magnetic resonance imaging (MRI), where scan duration is directly proportional to the number of acquired samples, CS has the potential to dramatically decrease scan time. However, the computationally expensive nature of CS reconstructions has so far precluded their use in routine clinical practice - instead, more-easily generated but lower-quality images continue to be used. We investigate the development and optimization of a proven inexact quasi-Newton CS reconstruction algorithm on several modern parallel architectures, including CPUs, GPUs, and Intel's Many Integrated Core (MIC) architecture. Our (optimized) baseline implementation on a quad-core Core i7 is able to reconstruct a 256 × 160 × 80 volume of the neurovasculature from an 8-channel, 10× undersampled dataset within 56 seconds, which is already a significant improvement over existing implementations. The latest six-core Core i7 reduces the reconstruction time further to 32 seconds. Moreover, we show that the CS algorithm benefits from modern throughput-oriented architectures. Specifically, our CUDA-based implementation on an NVIDIA GTX480 reconstructs the same dataset in 16 seconds, while Intel's Knights Ferry (KNF) of the MIC architecture reduces the time further to 12 seconds. This level of performance allows the neurovascular dataset to be reconstructed within a clinically viable time.
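
    The paper's solver is an inexact quasi-Newton method; the sketch below instead uses plain iterative soft-thresholding (ISTA) on a 1-D undersampled-Fourier toy problem, only to illustrate the per-iteration FFT-plus-thresholding work that such CS reconstructions repeat (all sizes and parameters are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    n, k, m = 256, 10, 80                        # signal length, sparsity, samples kept
    x_true = np.zeros(n)
    x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)

    mask = np.zeros(n, bool)
    mask[rng.choice(n, m, replace=False)] = True             # retained "k-space" samples
    y = np.fft.fft(x_true)[mask] / np.sqrt(n)                # undersampled measurements

    def A(x):                                                # forward operator
        return np.fft.fft(x)[mask] / np.sqrt(n)

    def At(r):                                               # adjoint operator
        full = np.zeros(n, complex); full[mask] = r
        return np.fft.ifft(full).real * np.sqrt(n)

    lam, step, x = 0.05, 1.0, np.zeros(n)
    for _ in range(200):                                     # ISTA: gradient step + soft threshold
        x = x - step * At(A(x) - y)
        x = np.sign(x) * np.maximum(np.abs(x) - lam * step, 0.0)

    print("relative error:", round(np.linalg.norm(x - x_true) / np.linalg.norm(x_true), 3))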

  7. Role of Large Clinical Datasets From Physiologic Monitors in Improving the Safety of Clinical Alarm Systems and Methodological Considerations: A Case From Philips Monitors.

    PubMed

    Sowan, Azizeh Khaled; Reed, Charles Calhoun; Staggers, Nancy

    2016-09-30

    Large datasets from the audit logs of modern physiologic monitoring devices have rarely been used for predictive modeling, capturing unsafe practices, or guiding initiatives on alarm systems safety. This paper (1) describes a large clinical dataset built from the audit log of physiologic monitors, (2) discusses benefits and challenges of using the audit log to identify the most important alarm signals and improve the safety of clinical alarm systems, and (3) provides suggestions for presenting alarm data and improving the audit log of the physiologic monitors. At a 20-bed transplant cardiac intensive care unit, alarm data recorded via the audit log of bedside monitors were retrieved from the server of the central station monitor. Benefits of the audit log are many. They include easily retrievable data at no cost, complete alarm records, easy capture of inconsistent and unsafe practices, and easy identification of bedside monitors missed during a unit-wide change of alarm settings. Challenges in analyzing the audit log relate to the time-consuming processes of data cleaning and analysis, and the limited storage and retrieval capabilities of the monitors. The audit log is a function of the current capabilities of the physiologic monitoring systems, the monitors' configuration, and clinicians' alarm management practices. Despite current challenges in data retrieval and analysis, large digitalized clinical datasets hold great promise in performance, safety, and quality improvement. Vendors, clinicians, researchers, and professional organizations should work closely to identify the most useful format and type of clinical data to expand medical devices' log capacity.

  8. Major soybean maturity gene haplotypes revealed by SNPViz analysis of 72 sequenced soybean genomes

    USDA-ARS?s Scientific Manuscript database

    In this Genomics Era, vast amounts of next generation sequencing data have become publicly-available for multiple genomes across hundreds of species. Analysis of these large-scale datasets can become cumbersome, especially when comparing nucleotide polymorphisms across many samples within a dataset...

  9. Fast and robust curve skeletonization for real-world elongated objects

    USDA-ARS?s Scientific Manuscript database

    These datasets were generated for calibrating robot-camera systems. In an extension, we also considered the problem of calibrating robots with more than one camera. These datasets are provided as a companion to the paper, "Solving the Robot-World Hand-Eye(s) Calibration Problem with Iterative Meth...

  10. TH-E-BRF-05: Comparison of Survival-Time Prediction Models After Radiotherapy for High-Grade Glioma Patients Based On Clinical and DVH Features

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Magome, T; Haga, A; Igaki, H

    Purpose: Although many outcome prediction models based on dose-volume information have been proposed, it is well known that the prognosis may also be affected by multiple clinical factors. The purpose of this study is to predict the survival time after radiotherapy for high-grade glioma patients based on features including clinical and dose-volume histogram (DVH) information. Methods: A total of 35 patients with high-grade glioma (oligodendroglioma: 2, anaplastic astrocytoma: 3, glioblastoma: 30) were selected in this study. All patients were treated with a prescribed dose of 30–80 Gy after surgical resection or biopsy from 2006 to 2013 at The University of Tokyo Hospital. All cases were randomly separated into a training dataset (30 cases) and a test dataset (5 cases). The survival time after radiotherapy was predicted based on a multiple linear regression analysis and an artificial neural network (ANN) by using 204 candidate features. The candidate features included 12 clinical features (tumor location, extent of surgical resection, treatment duration of radiotherapy, etc.) and 192 DVH features (maximum dose, minimum dose, D95, V60, etc.). The effective features for the prediction were selected according to a step-wise method by using the 30 training cases. The prediction accuracy was evaluated by the coefficient of determination (R²) between the predicted and actual survival time for the training and test datasets. Results: In the multiple regression analysis, the value of R² between the predicted and actual survival time was 0.460 for the training dataset and 0.375 for the test dataset. On the other hand, in the ANN analysis, the value of R² was 0.806 for the training dataset and 0.811 for the test dataset. Conclusion: Although a large number of patients would be needed for more accurate and robust prediction, our preliminary results showed the potential to predict the outcome in patients with high-grade glioma. This work was partly supported by the JSPS Core-to-Core Program (No. 23003) and a Grant-in-Aid for JSPS Fellows.
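
    A minimal sketch (entirely synthetic data; the feature count, split, and network size are illustrative) of the two model types compared above, each scored by the coefficient of determination R² on a held-out split:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(35, 12))                                          # clinical + DVH features (toy)
    y = 10 + 3 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=1.0, size=35)   # survival time (toy)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=5, random_state=0)

    lin = LinearRegression().fit(X_tr, y_tr)
    ann = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X_tr, y_tr)

    print("multiple linear regression R^2:", round(r2_score(y_te, lin.predict(X_te)), 3))
    print("artificial neural network  R^2:", round(r2_score(y_te, ann.predict(X_te)), 3))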

  11. A Web Server and Mobile App for Computing Hemolytic Potency of Peptides.

    PubMed

    Chaudhary, Kumardeep; Kumar, Ritesh; Singh, Sandeep; Tuknait, Abhishek; Gautam, Ankur; Mathur, Deepika; Anand, Priya; Varshney, Grish C; Raghava, Gajendra P S

    2016-03-08

    Numerous therapeutic peptides do not enter clinical trials simply because of their high hemolytic activity. Recently, we developed a database, Hemolytik, for maintaining experimentally validated hemolytic and non-hemolytic peptides. The present study describes a web server and mobile app developed for predicting and screening peptides with hemolytic potency. First, we generated a dataset, HemoPI-1, that contains 552 hemolytic peptides extracted from the Hemolytik database and 552 random non-hemolytic peptides (from Swiss-Prot). The sequence analysis of these peptides revealed that certain residues (e.g., L, K, F, W) and motifs (e.g., "FKK", "LKL", "KKLL", "KWK", "VLK", "CYCR", "CRR", "RFC", "RRR", "LKKL") are more abundant in hemolytic peptides. Therefore, we developed models for discriminating hemolytic and non-hemolytic peptides using various machine learning techniques and achieved more than 95% accuracy. We also developed models for discriminating peptides having high and low hemolytic potential on different datasets called HemoPI-2 and HemoPI-3. In order to serve the scientific community, we developed a web server, mobile app and JAVA-based standalone software (http://crdd.osdd.net/raghava/hemopi/).
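
    A minimal sketch of sequence features of the kind described above (fractional residue composition plus counts of a few of the reported motifs); the feature set and any downstream classifier are illustrative assumptions, not the HemoPI models:

    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    MOTIFS = ["FKK", "LKL", "KKLL", "KWK", "VLK"]    # subset of the motifs reported above

    def peptide_features(seq):
        seq = seq.upper()
        counts = Counter(seq)
        composition = [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]
        motif_counts = [float(seq.count(m)) for m in MOTIFS]
        return composition + motif_counts

    # Featurize two toy peptides; in practice these vectors would be fed to a
    # classifier (e.g. SVM or random forest) trained on HemoPI-style labels.
    print(peptide_features("GLFDIVKKVVGALGSL"))
    print(peptide_features("FKKLKKLFKKLKKL"))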

  12. Scalable and cost-effective NGS genotyping in the cloud.

    PubMed

    Souilmi, Yassine; Lancaster, Alex K; Jung, Jae-Yoon; Rizzo, Ettore; Hawkins, Jared B; Powles, Ryan; Amzazi, Saaïd; Ghazal, Hassan; Tonellato, Peter J; Wall, Dennis P

    2015-10-15

    While next-generation sequencing (NGS) costs have plummeted in recent years, the cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at a cost in the tens of dollars. We take a step towards addressing this challenge by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Services (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Our systematic benchmarking reveals important new insights and considerations for achieving clinical turn-around of whole genome analysis, including workflow optimizations such as strategic batching of individual genomes and efficient cluster resource configuration.

  13. Enabling systematic interrogation of protein-protein interactions in live cells with a versatile ultra-high-throughput biosensor platform | Office of Cancer Genomics

    Cancer.gov

    The vast datasets generated by next generation gene sequencing and expression profiling have transformed biological and translational research. However, technologies to produce large-scale functional genomics datasets, such as high-throughput detection of protein-protein interactions (PPIs), are still in early development. While a number of powerful technologies have been employed to detect PPIs, a singular PPI biosensor platform featuring both high sensitivity and robustness in a mammalian cell environment remains to be established.

  14. Parsing clinical text: how good are the state-of-the-art parsers?

    PubMed Central

    2015-01-01

    Background Parsing, which generates a syntactic structure of a sentence (a parse tree), is a critical component of natural language processing (NLP) research in any domain, including medicine. Although parsers developed in the general English domain, such as the Stanford parser, have been applied to clinical text, there are no formal evaluations and comparisons of their performance in the medical domain. Methods In this study, we investigated the performance of three state-of-the-art parsers: the Stanford parser, the Bikel parser, and the Charniak parser, using the following two datasets: (1) a Treebank containing 1,100 sentences that were randomly selected from progress notes used in the 2010 i2b2 NLP challenge and manually annotated according to a Penn Treebank based guideline; and (2) the MiPACQ Treebank, developed from pathology notes and clinical notes and containing 13,091 sentences. We conducted three experiments on both datasets. First, we measured the performance of the three state-of-the-art parsers on the clinical Treebanks with their default settings. Then we re-trained the parsers using the clinical Treebanks and evaluated their performance using the 10-fold cross validation method. Finally we re-trained the parsers by combining the clinical Treebanks with the Penn Treebank. Results Our results showed that the original parsers achieved lower performance on clinical text (Bracketing F-measure in the range of 66.6%-70.3%) compared to general English text. After retraining on the clinical Treebanks, all parsers achieved better performance, with the best performance from the Stanford parser, which reached the highest Bracketing F-measure of 73.68% on progress notes and 83.72% on the MiPACQ corpus using 10-fold cross validation. When the combined clinical Treebanks and Penn Treebank were used, the Charniak parser achieved the highest Bracketing F-measure of 73.53% on progress notes and the Stanford parser reached the highest F-measure of 84.15% on the MiPACQ corpus. Conclusions Our study demonstrates that re-training using clinical Treebanks is critical for improving general English parsers' performance on clinical text, and that combining clinical and open domain corpora might achieve optimal performance for parsing clinical text. PMID:26045009
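
    A simplified sketch of the bracketing F-measure used above, treating parses as sets of labeled constituent spans; real scoring tools (e.g. evalb) apply additional conventions not reproduced here, and the example spans are hand-written toys:

    def bracketing_f1(gold_spans, pred_spans):
        gold, pred = set(gold_spans), set(pred_spans)
        matched = len(gold & pred)
        precision = matched / len(pred) if pred else 0.0
        recall = matched / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # (label, start, end) constituents for a five-token toy sentence.
    gold = [("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5), ("S", 0, 5)]
    pred = [("NP", 0, 2), ("VP", 2, 5), ("NP", 4, 5), ("S", 0, 5)]
    print(bracketing_f1(gold, pred))   # -> (0.75, 0.75, 0.75)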

  15. DNApod: DNA polymorphism annotation database from next-generation sequence read archives.

    PubMed

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.

  16. DNApod: DNA polymorphism annotation database from next-generation sequence read archives

    PubMed Central

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924

  17. Technique for fast and efficient hierarchical clustering

    DOEpatents

    Stork, Christopher

    2013-10-08

    A fast and efficient technique for hierarchical clustering of samples in a dataset includes compressing the dataset to reduce a number of variables within each of the samples of the dataset. A nearest neighbor matrix is generated to identify nearest neighbor pairs between the samples based on differences between the variables of the samples. The samples are arranged into a hierarchy that groups the samples based on the nearest neighbor matrix. The hierarchy is rendered to a display to graphically illustrate similarities or differences between the samples.
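
    A minimal sketch of the described pipeline (compress the samples' variables, compute the hierarchy, render it); PCA, Ward linkage, and the toy data are illustrative choices, not the patented method:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    samples = np.vstack([rng.normal(0, 1, (10, 50)),    # two toy groups of samples,
                         rng.normal(3, 1, (10, 50))])   # 50 variables each

    compressed = PCA(n_components=5).fit_transform(samples)   # reduce variables per sample
    tree = linkage(compressed, method="ward")                 # pairwise distances -> hierarchy

    dendrogram(tree)                                          # graphical rendering of the hierarchy
    plt.xlabel("sample index"); plt.ylabel("merge distance")
    plt.show()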

  18. Microarray Data Mining for Potential Selenium Targets in Chemoprevention of Prostate Cancer

    PubMed Central

    ZHANG, HAITAO; DONG, YAN; ZHAO, HONGJUAN; BROOKS, JAMES D.; HAWTHORN, LESLEYANN; NOWAK, NORMA; MARSHALL, JAMES R.; GAO, ALLEN C.; IP, CLEMENT

    2008-01-01

    Background A previous clinical trial showed that selenium supplementation significantly reduced the incidence of prostate cancer. We report here a bioinformatics approach to gain new insights into selenium molecular targets that might be relevant to prostate cancer chemoprevention. Materials and Methods We first performed data mining analysis to identify genes which are consistently dysregulated in prostate cancer, using published datasets from gene expression profiling of clinical prostate specimens. We then devised a method to systematically analyze three selenium microarray datasets from LNCaP human prostate cancer cells, and to match the analysis to the cohort of genes implicated in prostate carcinogenesis. Moreover, we compared the selenium datasets with two datasets obtained from expression profiling of androgen-stimulated LNCaP cells. Results We found that selenium reverses the expression of genes implicated in prostate carcinogenesis. In addition, we found that selenium could counteract the effect of androgen on the expression of a subset of androgen-regulated genes. Conclusions The above information provides a wealth of new clues for investigating the mechanism of selenium chemoprevention of prostate cancer. Furthermore, these selenium target genes could also serve as biomarkers in future clinical trials to gauge the efficacy of selenium intervention. PMID:18548127

  19. 1000 Norms Project: protocol of a cross-sectional study cataloging human variation.

    PubMed

    McKay, Marnee J; Baldwin, Jennifer N; Ferreira, Paulo; Simic, Milena; Vanicek, Natalie; Hiller, Claire E; Nightingale, Elizabeth J; Moloney, Niamh A; Quinlan, Kate G; Pourkazemi, Fereshteh; Sman, Amy D; Nicholson, Leslie L; Mousavi, Seyed J; Rose, Kristy; Raymond, Jacqueline; Mackey, Martin G; Chard, Angus; Hübscher, Markus; Wegener, Caleb; Fong Yan, Alycia; Refshauge, Kathryn M; Burns, Joshua

    2016-03-01

    Clinical decision-making regarding diagnosis and management largely depends on comparison with healthy or 'normal' values. Physiotherapists and researchers therefore need access to robust patient-centred outcome measures and appropriate reference values. However there is a lack of high-quality reference data for many clinical measures. The aim of the 1000 Norms Project is to generate a freely accessible database of musculoskeletal and neurological reference values representative of the healthy population across the lifespan. In 2012 the 1000 Norms Project Consortium defined the concept of 'normal', established a sampling strategy and selected measures based on clinical significance, psychometric properties and the need for reference data. Musculoskeletal and neurological items tapping the constructs of dexterity, balance, ambulation, joint range of motion, strength and power, endurance and motor planning will be collected in this cross-sectional study. Standardised questionnaires will evaluate quality of life, physical activity, and musculoskeletal health. Saliva DNA will be analysed for the ACTN3 genotype ('gene for speed'). A volunteer cohort of 1000 participants aged 3 to 100 years will be recruited according to a set of self-reported health criteria. Descriptive statistics will be generated, creating tables of mean values and standard deviations stratified for age and gender. Quantile regression equations will be used to generate age charts and age-specific centile values. This project will be a powerful resource to assist physiotherapists and clinicians across all areas of healthcare to diagnose pathology, track disease progression and evaluate treatment response. This reference dataset will also contribute to the development of robust patient-centred clinical trial outcome measures. Copyright © 2015 Chartered Society of Physiotherapy. Published by Elsevier Ltd. All rights reserved.
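
    A minimal sketch, on synthetic data with illustrative variable names, of how age-specific centile values can be generated with quantile regression as described above (assuming the statsmodels formula interface):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"age": rng.uniform(3, 100, 500)})
    df["measure"] = 20 + 0.3 * df["age"] - 0.004 * df["age"] ** 2 + rng.normal(0, 3, 500)

    grid = pd.DataFrame({"age": np.arange(5, 101, 5)})
    for q in (0.05, 0.50, 0.95):                                   # centiles of interest
        fit = smf.quantreg("measure ~ age + I(age ** 2)", df).fit(q=q)
        grid[f"p{int(q * 100)}"] = fit.predict(grid)

    print(grid.head())    # age chart with 5th, 50th and 95th centile values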

  20. Personalized-detailed clinical model for data interoperability among clinical standards.

    PubMed

    Khan, Wajahat Ali; Hussain, Maqbool; Afzal, Muhammad; Amin, Muhammad Bilal; Saleem, Muhammad Aamir; Lee, Sungyoung

    2013-08-01

    Data interoperability among health information exchange (HIE) systems is a major concern for healthcare practitioners to enable provisioning of telemedicine-related services. Heterogeneity exists in these systems not only at the data level but also among different heterogeneous healthcare standards with which these are compliant. The relationship between healthcare organization data and different heterogeneous standards is necessary to achieve the goal of data level interoperability. We propose a personalized-detailed clinical model (P-DCM) approach for the generation of customized mappings that creates the necessary linkage between organization-conformed healthcare standards concepts and clinical model concepts to ensure data interoperability among HIE systems. We consider electronic health record (EHR) standards, openEHR, and HL7 CDA instances transformation using P-DCM. P-DCM concepts associated with openEHR and HL7 CDA help in transformation of instances among these standards. We investigated two datasets: (1) data of 100 diabetic patients, including 50 each of type 1 and type 2, from a local hospital in Korea and (2) data of a single Alzheimer's disease patient. P-DCMs were created for both scenarios, which provided the basis for deriving instances for HL7 CDA and openEHR standards. For proof of concept, we present case studies of encounter information for type 2 diabetes mellitus patients and monitoring of daily routine activities of an Alzheimer's disease patient. These reflect P-DCM-based customized mappings generation with openEHR and HL7 CDA standards. Customized mappings are generated based on the relationship of P-DCM concepts with CDA and openEHR concepts. The objective of this work is to achieve semantic data interoperability among heterogeneous standards. This would lead to effective utilization of resources and allow timely information exchange among healthcare systems.

  1. Personalized-Detailed Clinical Model for Data Interoperability Among Clinical Standards

    PubMed Central

    Khan, Wajahat Ali; Hussain, Maqbool; Afzal, Muhammad; Amin, Muhammad Bilal; Saleem, Muhammad Aamir

    2013-01-01

    Abstract Objective: Data interoperability among health information exchange (HIE) systems is a major concern for healthcare practitioners to enable provisioning of telemedicine-related services. Heterogeneity exists in these systems not only at the data level but also among different heterogeneous healthcare standards with which these are compliant. The relationship between healthcare organization data and different heterogeneous standards is necessary to achieve the goal of data level interoperability. We propose a personalized-detailed clinical model (P-DCM) approach for the generation of customized mappings that creates the necessary linkage between organization-conformed healthcare standards concepts and clinical model concepts to ensure data interoperability among HIE systems. Materials and Methods: We consider electronic health record (EHR) standards, openEHR, and HL7 CDA instances transformation using P-DCM. P-DCM concepts associated with openEHR and HL7 CDA help in transformation of instances among these standards. We investigated two datasets: (1) data of 100 diabetic patients, including 50 each of type 1 and type 2, from a local hospital in Korea and (2) data of a single Alzheimer's disease patient. P-DCMs were created for both scenarios, which provided the basis for deriving instances for HL7 CDA and openEHR standards. Results: For proof of concept, we present case studies of encounter information for type 2 diabetes mellitus patients and monitoring of daily routine activities of an Alzheimer's disease patient. These reflect P-DCM-based customized mappings generation with openEHR and HL7 CDA standards. Customized mappings are generated based on the relationship of P-DCM concepts with CDA and openEHR concepts. Conclusions: The objective of this work is to achieve semantic data interoperability among heterogeneous standards. This would lead to effective utilization of resources and allow timely information exchange among healthcare systems. PMID:23875730

  2. The first MICCAI challenge on PET tumor segmentation.

    PubMed

    Hatt, Mathieu; Laurent, Baptiste; Ouahabi, Anouar; Fayad, Hadi; Tan, Shan; Li, Laquan; Lu, Wei; Jaouen, Vincent; Tauber, Clovis; Czakon, Jakub; Drapejkowski, Filip; Dyrka, Witold; Camarasu-Pop, Sorina; Cervenansky, Frédéric; Girard, Pascal; Glatard, Tristan; Kain, Michael; Yao, Yao; Barillot, Christian; Kirov, Assen; Visvikis, Dimitris

    2018-02-01

    Automatic functional volume segmentation in PET images is a challenge that has been addressed using a large array of methods. A major limitation for the field has been the lack of a benchmark dataset that would allow direct comparison of the results in the various publications. In the present work, we describe a comparison of recent methods on a large dataset following recommendations by the American Association of Physicists in Medicine (AAPM) task group (TG) 211, which was carried out within a MICCAI (Medical Image Computing and Computer Assisted Intervention) challenge. Organization and funding was provided by France Life Imaging (FLI). A dataset of 176 images combining simulated, phantom and clinical images was assembled. A website allowed the participants to register and download training data (n = 19). Challengers then submitted encapsulated pipelines on an online platform that autonomously ran the algorithms on the testing data (n = 157) and evaluated the results. The methods were ranked according to the arithmetic mean of sensitivity and positive predictive value. Sixteen teams registered but only four provided manuscripts and pipeline(s) for a total of 10 methods. In addition, results using two thresholds and the Fuzzy Locally Adaptive Bayesian (FLAB) were generated. All competing methods except one performed with median accuracy above 0.8. The method with the highest score was the convolutional neural network-based segmentation, which significantly outperformed 9 out of 12 of the other methods, but not the improved K-Means, Gaussian Model Mixture and Fuzzy C-Means methods. The most rigorous comparative study of PET segmentation algorithms to date was carried out using a dataset that is the largest used in such studies so far. The hierarchy amongst the methods in terms of accuracy did not depend strongly on the subset of datasets or the metrics (or combination of metrics). All the methods submitted by the challengers except one demonstrated good performance with median accuracy scores above 0.8. Copyright © 2017 Elsevier B.V. All rights reserved.
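
    A minimal sketch (toy masks, not challenge code) of the ranking score described above: the arithmetic mean of sensitivity and positive predictive value computed from a predicted and a ground-truth binary segmentation:

    import numpy as np

    def challenge_score(pred, truth):
        tp = np.logical_and(pred, truth).sum()
        fp = np.logical_and(pred, ~truth).sum()
        fn = np.logical_and(~pred, truth).sum()
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        ppv = tp / (tp + fp) if tp + fp else 0.0
        return 0.5 * (sensitivity + ppv)

    truth = np.zeros((64, 64), bool); truth[20:40, 20:40] = True
    pred = np.zeros((64, 64), bool); pred[22:42, 22:42] = True
    print("score:", round(challenge_score(pred, truth), 3))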

  3. QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles.

    PubMed

    Van der Borght, Koen; Thys, Kim; Wetzels, Yves; Clement, Lieven; Verbist, Bie; Reumers, Joke; van Vlijmen, Herman; Aerssens, Jeroen

    2015-11-10

    Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset. For default application of QQ-SNV, variants were called using a SNV probability cutoff of 0.5 (QQ-SNV(D)). To improve the sensitivity we used a SNV probability cutoff of 0.0001 (QQ-SNV(HS)). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNV(HS-P80)). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNV(D) performed similarly to the existing approaches. QQ-SNV(HS) was more sensitive on all test sets but with more false positives. QQ-SNV(HS-P80) was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5%, QQ-SNV(HS-P80) revealed a sensitivity of 100% (vs. 40-60% for the existing methods) and a specificity of 100% (vs. 98.0-99.7% for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5% were consistently detected by QQ-SNV(HS-P80) from different generations of Illumina sequencers. We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data.
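
    An illustrative sketch of the underlying idea only (not the QQ-SNV model itself): summarize the base-quality scores supporting a candidate variant by their quantiles and classify true SNV versus sequencing error with logistic regression; the training data below are synthetic, and only the default probability cutoff of 0.5 comes from the abstract:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def quality_quantile_features(qual_scores):
        return np.quantile(qual_scores, [0.1, 0.25, 0.5, 0.75, 0.9])

    # Synthetic training set: true variants tend to be supported by higher-quality bases.
    X, y = [], []
    for _ in range(300):
        is_true_snv = rng.random() < 0.5
        mean_q = 34 if is_true_snv else 22
        X.append(quality_quantile_features(rng.normal(mean_q, 5, size=50)))
        y.append(int(is_true_snv))

    clf = LogisticRegression().fit(np.array(X), np.array(y))

    candidate = quality_quantile_features(rng.normal(30, 5, size=50))
    p = clf.predict_proba(candidate.reshape(1, -1))[0, 1]
    print("P(true SNV):", round(p, 3), "-> call" if p >= 0.5 else "-> reject")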

  4. Multivariate longitudinal data analysis with mixed effects hidden Markov models.

    PubMed

    Raffa, Jesse D; Dubin, Joel A

    2015-09-01

    Multiple longitudinal responses are often collected as a means to capture relevant features of the true outcome of interest, which is often hidden and not directly measurable. We outline an approach which models these multivariate longitudinal responses as generated from a hidden disease process. We propose a class of models which uses a hidden Markov model with separate but correlated random effects between multiple longitudinal responses. This approach was motivated by a smoking cessation clinical trial, where a bivariate longitudinal response involving both a continuous and a binomial response was collected for each participant to monitor smoking behavior. A Bayesian method using Markov chain Monte Carlo is used. Comparison of separate univariate response models to the bivariate response models was undertaken. Our methods are demonstrated on the smoking cessation clinical trial dataset, and properties of our approach are examined through extensive simulation studies. © 2015, The International Biometric Society.

  5. Ground Truth Creation for Complex Clinical NLP Tasks - an Iterative Vetting Approach and Lessons Learned.

    PubMed

    Liang, Jennifer J; Tsou, Ching-Huei; Devarakonda, Murthy V

    2017-01-01

    Natural language processing (NLP) holds the promise of effectively analyzing patient record data to reduce cognitive load on physicians and clinicians in patient care, clinical research, and hospital operations management. A critical need in developing such methods is the "ground truth" dataset needed for training and testing the algorithms. Beyond localizable, relatively simple tasks, ground truth creation is a significant challenge because medical experts, just as physicians in patient care, have to assimilate vast amounts of data in EHR systems. To mitigate potential inaccuracies of the cognitive challenges, we present an iterative vetting approach for creating the ground truth for complex NLP tasks. In this paper, we present the methodology, and report on its use for an automated problem list generation task, its effect on the ground truth quality and system accuracy, and lessons learned from the effort.

  6. Integrating Hospital Administrative Data to Improve Health Care Efficiency and Outcomes: “The Socrates Story”

    PubMed Central

    Lawrence, Justin; Delaney, Conor P.

    2013-01-01

    Evaluation of health care outcomes has become increasingly important as we strive to improve quality and efficiency while controlling cost. Many groups feel that analysis of large datasets will be useful in optimizing resource utilization; however, the ideal blend of clinical and administrative data points has not been developed. Hospitals and health care systems have several tools to measure cost and resource utilization, but the data are often housed in disparate systems that are not integrated and do not permit multisystem analysis. Systems Outcomes and Clinical Resources AdministraTive Efficiency Software (SOCRATES) is a novel data merging, warehousing, analysis, and reporting technology, which brings together disparate hospital administrative systems generating automated or customizable risk-adjusted reports. Used in combination with standardized enhanced care pathways, SOCRATES offers a mechanism to improve the quality and efficiency of care, with the ability to measure real-time changes in outcomes. PMID:24436649

  7. Integrating hospital administrative data to improve health care efficiency and outcomes: "the socrates story".

    PubMed

    Lawrence, Justin; Delaney, Conor P

    2013-03-01

    Evaluation of health care outcomes has become increasingly important as we strive to improve quality and efficiency while controlling cost. Many groups feel that analysis of large datasets will be useful in optimizing resource utilization; however, the ideal blend of clinical and administrative data points has not been developed. Hospitals and health care systems have several tools to measure cost and resource utilization, but the data are often housed in disparate systems that are not integrated and do not permit multisystem analysis. Systems Outcomes and Clinical Resources AdministraTive Efficiency Software (SOCRATES) is a novel data merging, warehousing, analysis, and reporting technology, which brings together disparate hospital administrative systems generating automated or customizable risk-adjusted reports. Used in combination with standardized enhanced care pathways, SOCRATES offers a mechanism to improve the quality and efficiency of care, with the ability to measure real-time changes in outcomes.

  8. Squish: Near-Optimal Compression for Archival of Relational Datasets

    PubMed Central

    Gao, Yihan; Parameswaran, Aditya

    2017-01-01

    Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: Squish achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets. PMID:28180028

  9. De-identification of health records using Anonym: effectiveness and robustness across datasets.

    PubMed

    Zuccon, Guido; Kotzur, Daniel; Nguyen, Anthony; Bergheim, Anton

    2014-07-01

    Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness. The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in type of health records, source of data, and their quality, with one of the datasets containing optical character recognition errors. Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training. Findings show that Anonym compares to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations of training size, data type and quality in presence of sufficient training data. Crown Copyright © 2014. Published by Elsevier B.V. All rights reserved.
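
    A sketch of the pattern-matching layer only (the conditional random fields component of Anonym is not reproduced here); the regular expressions and tags are illustrative assumptions:

    import re

    PATTERNS = {
        "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
        "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
        "MRN": re.compile(r"\bMRN[:\s]*\d{6,}\b", re.IGNORECASE),
    }

    def redact(text):
        """Replace matched identifiers with their tag; a real system would instead
        emit these matches as features for the sequence classifier."""
        for tag, pattern in PATTERNS.items():
            text = pattern.sub(f"[{tag}]", text)
        return text

    note = "Seen on 03/12/2014, MRN: 0048213. Call 555-867-5309 to reschedule."
    print(redact(note))   # -> "Seen on [DATE], [MRN]. Call [PHONE] to reschedule."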

  10. Parallel Index and Query for Large Scale Data Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chou, Jerry; Wu, Kesheng; Ruebel, Oliver

    2011-07-18

    Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for processing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to processing of a massive 50TB dataset generated by a large scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.

  11. A survey of Australasian obstetric anaesthesia audit.

    PubMed

    Smith, S J; Cyna, A M; Simmons, S W

    1999-08-01

    In order to develop a minimal obstetric anaesthesia dataset based on current Australasian clinical audit best practice, we carried out a postal survey of 69 Australasian anaesthetic departments covering an obstetric service. We asked about data being collected, specifically concerning the high-risk obstetric patient, epidural analgesia and postoperative anaesthetic review. Examples of any data collection forms were requested. Of the 66 responses, 35 departments (53%) were not collecting any audit data. Twenty-six of the 31 departments (84%) performing obstetric anaesthesia audit responded to our follow-up telephone survey. Eighteen departments believed that there had been an improvement in patient care as a result of their audit and 13 felt that the benefits outweighed the costs involved. However, only six departments (9%) had performed an audit cycle. The importance of feedback to patients or hospital staff and the incidence of post dural puncture headache (PDPH) were cited by some as priorities for obstetric anaesthesia audit. There was, however, no consistency as to what data should be collected. Many responses suggested a perceived need to collect clinical data without knowing what to do with it. Our survey has highlighted confusion between three distinct objectives: a dataset for obstetric anaesthesia record keeping, data required for continuing patient management in hospital, and a specific minimal dataset for clinical audit purposes. We conclude that current Australasian obstetric anaesthesia audit strategies are inadequate to develop a minimal dataset for cost-effective clinical audit.

  12. InSilico DB genomic datasets hub: an efficient starting point for analyzing genome-wide studies in GenePattern, Integrative Genomics Viewer, and R/Bioconductor.

    PubMed

    Coletta, Alain; Molter, Colin; Duqué, Robin; Steenhoff, David; Taminau, Jonatan; de Schaetzen, Virginie; Meganck, Stijn; Lazar, Cosmin; Venet, David; Detours, Vincent; Nowé, Ann; Bersini, Hugues; Weiss Solís, David Y

    2012-11-18

    Genomics datasets are increasingly useful for gaining biomedical insights, with adoption in the clinic underway. However, multiple hurdles related to data management stand in the way of their efficient large-scale utilization. The solution proposed is a web-based data storage hub. Having clear focus, flexibility and adaptability, InSilico DB seamlessly connects genomics dataset repositories to state-of-the-art and free GUI and command-line data analysis tools. The InSilico DB platform is a powerful collaborative environment, with advanced capabilities for biocuration, dataset sharing, and dataset subsetting and combination. InSilico DB is available from https://insilicodb.org.

  13. Open Data as Open Educational Resources: Towards Transversal Skills and Global Citizenship

    ERIC Educational Resources Information Center

    Atenas, Javiera; Havemann, Leo; Priego, Ernesto

    2015-01-01

    Open Data is the name given to datasets which have been generated by international organisations, governments, NGOs and academic researchers, and made freely available online and openly-licensed. These datasets can be used by educators as Open Educational Resources (OER) to support different teaching and learning activities, allowing students to…

  14. ICLUS Tools and Datasets (Version 1.2) and User's Manual: Arcgis Tools and Datasets for Modeling US Housing Density (External Review Draft)

    EPA Science Inventory

    This draft Geographic Information System (GIS) tool can be used to generate scenarios of housing-density changes and calculate impervious surface cover for the conterminous United States. A draft User’s Guide accompanies the tool. This product distributes the population project...

  15. Viking Seismometer PDS Archive Dataset

    NASA Astrophysics Data System (ADS)

    Lorenz, R. D.

    2016-12-01

    The Viking Lander 2 seismometer operated successfully for over 500 Sols on the Martian surface, recording at least one likely candidate Marsquake. The Viking mission, in an era when data handling hardware (both on board and on the ground) was limited in capability, predated modern planetary data archiving, and ad-hoc repositories of the data, and the very low-level record at NSSDC, were neither convenient to process nor well-known. In an effort supported by the NASA Mars Data Analysis Program, we have converted the bulk of the Viking dataset (namely the 49,000 and 270,000 records made in High- and Event- modes at 20 and 1 Hz respectively) into a simple ASCII table format. Additionally, since wind-generated lander motion is a major component of the signal, contemporaneous meteorological data are included in summary records to facilitate correlation. These datasets are being archived at the PDS Geosciences Node. In addition to brief instrument and dataset descriptions, the archive includes code snippets in the freely-available language 'R' to demonstrate plotting and analysis. Further, we present examples of lander-generated noise, associated with the sampler arm, instrument dumps and other mechanical operations.

  16. Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset.

    PubMed

    Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert

    2018-02-13

    We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for 'target' versus 'non-target' (dataset A) and symbol 'O' versus symbol 'X' (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques.

  17. Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset

    PubMed Central

    Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert

    2018-01-01

    We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for ‘target’ versus ‘non-target’ (dataset A) and symbol ‘O’ versus symbol ‘X’ (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques. PMID:29437166

  18. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Van de Velde, Joris, E-mail: joris.vandevelde@ugent.be; Department of Radiotherapy, Ghent University, Ghent; Audenaert, Emmanuel

    Purpose: To develop contouring guidelines for the brachial plexus (BP) using anatomically validated cadaver datasets. Magnetic resonance imaging (MRI) and computed tomography (CT) were used to obtain detailed visualizations of the BP region, with the goal of achieving maximal inclusion of the actual BP in a small contoured volume while also accommodating for anatomic variations. Methods and Materials: CT and MRI were obtained for 8 cadavers positioned for intensity modulated radiation therapy. 3-dimensional reconstructions of soft tissue (from MRI) and bone (from CT) were combined to create 8 separate enhanced CT project files. Dissection of the corresponding cadavers anatomically validated the reconstructions created. Seven enhanced CT project files were then automatically fitted, separately in different regions, to obtain a single dataset of superimposed BP regions that incorporated anatomic variations. From this dataset, improved BP contouring guidelines were developed. These guidelines were then applied to the 7 original CT project files and also to 1 additional file, left out from the superimposing procedure. The percentage of BP inclusion was compared with the published guidelines. Results: The anatomic validation procedure showed a high level of conformity for the BP regions examined between the 3-dimensional reconstructions generated and the dissected counterparts. Accurate and detailed BP contouring guidelines were developed, which provided corresponding guidance for each level in a clinical dataset. An average margin of 4.7 mm around the anatomically validated BP contour is sufficient to accommodate for anatomic variations. Using the new guidelines, 100% inclusion of the BP was achieved, compared with a mean inclusion of 37.75% when published guidelines were applied. Conclusion: Improved guidelines for BP delineation were developed using combined MRI and CT imaging with validation by anatomic dissection.
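
    The reported inclusion percentages and the 4.7 mm margin amount to comparing a validated structure mask against an isotropically expanded contour. A minimal sketch of that computation on binary voxel masks is shown below; the arrays, voxel spacing, and the whole-voxel approximation of the margin are assumptions for illustration, not the authors' procedure.

```python
# Sketch: percentage of an anatomically validated structure captured by a
# contour after an isotropic margin expansion. Masks and spacing are toy values.
import numpy as np
from scipy.ndimage import binary_dilation, generate_binary_structure

def inclusion_percentage(validated_mask, contour_mask, margin_mm, voxel_mm):
    """Expand contour_mask by margin_mm (approximated in whole voxels) and
    report how much of validated_mask falls inside the expanded contour."""
    iterations = int(round(margin_mm / min(voxel_mm)))   # coarse isotropic margin
    struct = generate_binary_structure(3, 1)
    expanded = binary_dilation(contour_mask, structure=struct, iterations=iterations)
    covered = np.logical_and(validated_mask, expanded).sum()
    return 100.0 * covered / validated_mask.sum()

# Toy example: a 4.7 mm margin on 1 mm isotropic voxels.
validated = np.zeros((40, 40, 40), dtype=bool); validated[18:23, 18:23, 18:23] = True
contour = np.zeros_like(validated); contour[19:22, 19:22, 19:22] = True
print(inclusion_percentage(validated, contour, margin_mm=4.7, voxel_mm=(1.0, 1.0, 1.0)))
```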

  19. Revising the Belgian Nursing Minimum Dataset: from concept to implementation.

    PubMed

    Sermeus, Walter; van den Heede, Koen; Michiels, Dominik; Delesie, Lucas; Thonon, Olivier; Van Boven, Caroline; Codognotto, Jean; Gillet, Pierre

    2005-12-01

    The process of revising the Belgian Nursing Minimum Dataset (B-NMDS) started in 2000 and entailed four major phases. The first phase (June-October 2002) involved the development of a conceptual framework based on a literature review and secondary data analysis. The Nursing Interventions Classification (NIC) was selected as a framework for the revision of the original B-NMDS. The second phase (November 2002-September 2003) focused on language development for six care programs evaluated by panels of clinical experts (N=75). These panels identified the following items as priorities for the revised B-NMDS: hospital financing, nurse staffing allocation, assessment of the appropriateness of hospitalisation, and quality management. During this period, we developed a draft instrument with 92 variables using the NIC. This led to an alpha version of a revised B-NMDS. The third phase (October 2003-December 2004) focused on data collection and validation of the new tool. The revised B-NMDS (alpha version) was tested in 158 nursing wards in 66 Belgian hospitals from December 2003 until March 2004. This test generated data for some 95,000 in-patient days. The interrater reliability of the revised B-NMDS was assessed. The criterion-related validity of the revised B-NMDS was compared to that of the original B-NMDS. The discriminative power of the revised B-NMDS was also assessed to select the most relevant variables for data collection. This resulted in a beta version of the revised B-NMDS in December 2004. The records of the revised B-NMDS were linked to the Hospital Discharge Dataset and other mandatory datasets to integrate the revised B-NMDS into the overall healthcare management system. The fourth phase (January 2005-December 2005) is presently focusing on information management. Nationwide implementation is foreseen by January 2007.

  20. Classification of autistic individuals and controls using cross-task characterization of fMRI activity

    PubMed Central

    Chanel, Guillaume; Pichon, Swann; Conty, Laurence; Berthoz, Sylvie; Chevallier, Coralie; Grèzes, Julie

    2015-01-01

    Multivariate pattern analysis (MVPA) has been applied successfully to task-based and resting-based fMRI recordings to investigate which neural markers distinguish individuals with autistic spectrum disorders (ASD) from controls. While most studies have focused on brain connectivity during resting-state episodes and on region-of-interest (ROI) approaches, a wealth of task-based fMRI datasets have been acquired in these populations in the last decade. This calls for techniques that can leverage information not only from a single dataset, but from several existing datasets that might share some common features and biomarkers. We propose a fully data-driven (voxel-based) approach that we apply to two different fMRI experiments with social stimuli (faces and bodies). The method, based on Support Vector Machines (SVMs) and Recursive Feature Elimination (RFE), is first trained for each experiment independently and each output is then combined to obtain a final classification output. Second, this RFE output is used to determine which voxels are most often selected for classification to generate maps of significant discriminative activity. Finally, to further explore the clinical validity of the approach, we correlate phenotypic information with obtained classifier scores. The results reveal good classification accuracy (ranging from 69% to 92.3%). Moreover, we were able to identify discriminative activity patterns pertaining to the social brain without relying on a priori ROI definitions. Finally, social motivation was the only dimension which correlated with classifier scores, suggesting that it is the main dimension captured by the classifiers. Altogether, we believe that the present RFE method proves to be efficient and may help identify relevant biomarkers by taking advantage of acquired task-based fMRI datasets in psychiatric populations. PMID:26793434
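
    The per-experiment step, a linear SVM wrapped in recursive feature elimination over voxel features, can be sketched with scikit-learn as below. The feature matrix is synthetic, and in a full analysis the feature selection would be nested inside cross-validation rather than run on all data as in this compact illustration.

```python
# Sketch of the per-experiment SVM + recursive feature elimination step,
# using synthetic "voxel" features in place of real fMRI data.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))       # 60 subjects x 500 voxel features (synthetic)
y = rng.integers(0, 2, size=60)      # 0 = control, 1 = ASD (synthetic labels)

svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=50, step=0.1)
rfe.fit(X, y)

# Voxels retained by RFE approximate a discriminative map; in practice the
# selection frequency would be accumulated across folds and experiments, and
# RFE would be nested inside the cross-validation loop to avoid optimistic bias.
selected_voxels = np.flatnonzero(rfe.support_)
acc = cross_val_score(svm, X[:, selected_voxels], y, cv=5).mean()
print(f"{len(selected_voxels)} voxels selected, CV accuracy {acc:.2f}")
```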

  1. A statistical metadata model for clinical trials' data management.

    PubMed

    Vardaki, Maria; Papageorgiou, Haralambos; Pentaris, Fragkiskos

    2009-08-01

    We introduce a statistical, process-oriented metadata model to describe the process of medical research data collection, management, results analysis and dissemination. Our approach explicitly provides a structure for pieces of information used in Clinical Study Data Management Systems, enabling a more active role for any associated metadata. Using the object-oriented paradigm, we describe the classes of our model that participate during the design of a clinical trial and the subsequent collection and management of the relevant data. The advantage of our approach is that we focus on presenting the structural inter-relation of these classes when used during dataset manipulation, by proposing certain transformations that model the simultaneous processing of both data and metadata. Our solution reduces the possibility of human error and allows for the tracking of all changes made during the dataset lifecycle. The explicit modeling of processing steps improves data quality and assists in the problem of handling data collected in different clinical trials. The case study illustrates the applicability of the proposed framework, conceptually demonstrating the simultaneous handling of datasets collected during two randomized clinical studies. Finally, we provide the main considerations for implementing the proposed framework in a modern Metadata-enabled Information System.
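
    The core idea, that every transformation processes data and metadata together and leaves a trace, can be illustrated with a small object-oriented sketch. The class and field names below are hypothetical stand-ins, not the paper's actual object model.

```python
# Illustrative sketch of processing data and metadata simultaneously so every
# transformation is recorded. Class and field names are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, List
import pandas as pd

@dataclass(frozen=True)
class VariableMeta:
    name: str
    unit: str
    description: str

@dataclass
class AnnotatedDataset:
    data: pd.DataFrame
    variables: List[VariableMeta]
    history: List[str] = field(default_factory=list)

    def transform(self, fn: Callable[[pd.DataFrame], pd.DataFrame], note: str) -> "AnnotatedDataset":
        """Apply fn to the data and append a provenance note, returning a new object."""
        return AnnotatedDataset(fn(self.data), list(self.variables), self.history + [note])

ds = AnnotatedDataset(
    pd.DataFrame({"sbp_mmHg": [132, 141, 118]}),
    [VariableMeta("sbp_mmHg", "mmHg", "systolic blood pressure")],
)
ds2 = ds.transform(lambda df: df[df["sbp_mmHg"] < 140], "excluded readings >= 140 mmHg")
print(ds2.history)
```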

  2. Uvf - Unified Volume Format: A General System for Efficient Handling of Large Volumetric Datasets.

    PubMed

    Krüger, Jens; Potter, Kristin; Macleod, Rob S; Johnson, Christopher

    2008-01-01

    With the continual increase in computing power, volumetric datasets with sizes ranging from only a few megabytes to petascale are generated thousands of times per day. Such data may come from an ordinary source such as simple everyday medical imaging procedures, while larger datasets may be generated from cluster-based scientific simulations or measurements of large scale experiments. In computer science an incredible amount of work worldwide is put into the efficient visualization of these datasets. As researchers in the field of scientific visualization, we often have to face the task of handling very large data from various sources. This data usually comes in many different data formats. In medical imaging, the DICOM standard is well established, however, most research labs use their own data formats to store and process data. To simplify the task of reading the many different formats used with all of the different visualization programs, we present a system for the efficient handling of many types of large scientific datasets (see Figure 1 for just a few examples). While primarily targeted at structured volumetric data, UVF can store just about any type of structured and unstructured data. The system is composed of a file format specification with a reference implementation of a reader. It is not only a common, easy to implement format but also allows for efficient rendering of most datasets without the need to convert the data in memory.
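
    The shape of such a system, a small self-describing header followed by raw voxel bricks, can be conveyed with a toy reader/writer. The on-disk layout below (magic string, three uint32 dimensions, float32 voxels) is invented purely for illustration and is not the actual UVF specification.

```python
# Toy reader/writer for a simple "header + raw brick" volume layout, in the
# spirit of a reference reader for a common volume format. The layout here is
# invented for illustration and is NOT the UVF specification.
import struct
import numpy as np

def write_toy_volume(path, volume):
    with open(path, "wb") as f:
        f.write(b"TOYV")                                  # magic string
        f.write(struct.pack("<III", *volume.shape))       # little-endian dimensions
        f.write(volume.astype("<f4").tobytes())           # raw float32 voxels

def read_toy_volume(path):
    with open(path, "rb") as f:
        if f.read(4) != b"TOYV":
            raise ValueError("not a toy volume file")
        nx, ny, nz = struct.unpack("<III", f.read(12))
        data = np.frombuffer(f.read(nx * ny * nz * 4), dtype="<f4")
    return data.reshape((nx, ny, nz))

vol = np.random.rand(8, 8, 8).astype(np.float32)
write_toy_volume("example.toyv", vol)
print(read_toy_volume("example.toyv").shape)
```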

  3. Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays

    NASA Astrophysics Data System (ADS)

    Guha, Rajarshi; Schürer, Stephan C.

    2008-06-01

    Computational toxicology is emerging as an encouraging alternative to experimental testing. The Molecular Libraries Screening Center Network (MLSCN) as part of the NIH Molecular Libraries Roadmap has recently started generating large and diverse screening datasets, which are publicly available in PubChem. In this report, we investigate various aspects of developing computational models to predict cell toxicity based on cell proliferation screening data generated in the MLSCN. By capturing feature-based information in those datasets, such predictive models would be useful in evaluating cell-based screening results in general (for example from reporter assays) and could be used as an aid to identify and eliminate potentially undesired compounds. Specifically we present the results of random forest ensemble models developed using different cell proliferation datasets and highlight protocols to take into account their extremely imbalanced nature. Depending on the nature of the datasets and the descriptors employed we were able to achieve percentage correct classification rates between 70% and 85% on the prediction set, though the accuracy rate dropped significantly when the models were applied to in vivo data. In this context we also compare the MLSCN cell proliferation results with animal acute toxicity data to investigate to what extent animal toxicity can be correlated and potentially predicted by proliferation results. Finally, we present a visualization technique that allows one to compare a new dataset to the training set of the models to decide whether the new dataset may be reliably predicted.
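
    A minimal sketch of the modelling step, a random-forest classifier with class weighting to cope with the extreme imbalance typical of proliferation screens, is shown below with synthetic descriptors; it is an illustration of the general approach, not the authors' exact protocol.

```python
# Sketch: random forest on a highly imbalanced (active vs. inactive) dataset,
# using class weighting and a stratified split. Descriptors are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 100))              # synthetic molecular descriptors
y = (rng.random(5000) < 0.05).astype(int)     # ~5% "active/toxic" minority class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```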

  4. Physical environment virtualization for human activities recognition

    NASA Astrophysics Data System (ADS)

    Poshtkar, Azin; Elangovan, Vinayak; Shirkhodaie, Amir; Chan, Alex; Hu, Shuowen

    2015-05-01

    Human activity recognition research relies heavily on extensive datasets to verify and validate performance of activity recognition algorithms. However, obtaining real datasets is expensive and highly time-consuming. A physics-based virtual simulation can accelerate the development of context-based human activity recognition algorithms and techniques by generating relevant training and testing videos simulating diverse operational scenarios. In this paper, we discuss in detail the requisite capabilities of a virtual environment to serve as a test bed for evaluating and enhancing activity recognition algorithms. To demonstrate the numerous advantages of virtual environment development, a newly developed virtual environment simulation modeling (VESM) environment is presented here to generate calibrated multisource imagery datasets suitable for development and testing of recognition algorithms for context-based human activities. The VESM environment serves as a versatile test bed to generate a vast amount of realistic data for training and testing of sensor processing algorithms. To demonstrate the effectiveness of the VESM environment, we present various simulated scenarios and processed results to infer proper semantic annotations from the high-fidelity imagery data for human-vehicle activity recognition under different operational contexts.

  5. Development of an internationally agreed minimal dataset for juvenile dermatomyositis (JDM) for clinical and research use.

    PubMed

    McCann, Liza J; Kirkham, Jamie J; Wedderburn, Lucy R; Pilkington, Clarissa; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Williamson, Paula R; Beresford, Michael W

    2015-06-12

    Juvenile dermatomyositis (JDM) is a rare autoimmune inflammatory disorder associated with significant morbidity and mortality. International collaboration is necessary to better understand the pathogenesis of the disease, response to treatment and long-term outcome. To aid international collaboration, it is essential to have a core set of data that all researchers and clinicians collect in a standardised way for clinical purposes and for research. This should include demographic details, diagnostic data and measures of disease activity, investigations and treatment. Variables in existing clinical registries have been compared to produce a provisional data set for JDM. We now aim to develop this into a consensus-approved minimum core dataset, tested in a wider setting, with the objective of achieving international agreement. A two-stage bespoke Delphi-process will engage the opinion of a large number of key stakeholders through Email distribution via established international paediatric rheumatology and myositis organisations. This, together with a formalised patient/parent participation process will help inform a consensus meeting of international experts that will utilise a nominal group technique (NGT). The resulting proposed minimal dataset will be tested for feasibility within existing database infrastructures. The developed minimal dataset will be sent to all internationally representative collaborators for final comment. The participants of the expert consensus group will be asked to draw together these comments, ratify and 'sign off' the final minimal dataset. An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients to provide a greater understanding of this disease. The final approved minimum core dataset could be rapidly incorporated into national and international collaborative efforts, including existing prospective databases, and be available for use in randomised controlled trials and for treatment/protocol comparisons in cohort studies.

  6. Creating a standardized watersheds database for the Lower Rio Grande/Río Bravo, Texas

    USGS Publications Warehouse

    Brown, J.R.; Ulery, Randy L.; Parcher, Jean W.

    2000-01-01

    This report describes the creation of a large-scale watershed database for the lower Rio Grande/Río Bravo Basin in Texas. The watershed database includes watersheds delineated to all 1:24,000-scale mapped stream confluences and other hydrologically significant points, selected watershed characteristics, and hydrologic derivative datasets. Computer technology allows generation of preliminary watershed boundaries in a fraction of the time needed for manual methods. This automated process reduces development time and results in quality improvements in watershed boundaries and characteristics. These data can then be compiled in a permanent database, eliminating the time-consuming step of data creation at the beginning of a project and providing a stable base dataset that can give users greater confidence when further subdividing watersheds. A standardized dataset of watershed characteristics is a valuable contribution to the understanding and management of natural resources. Vertical integration of the input datasets used to automatically generate watershed boundaries is crucial to the success of such an effort. The optimum situation would be to use the digital orthophoto quadrangles as the source of all the input datasets. While the hydrographic data from the digital line graphs can be revised to match the digital orthophoto quadrangles, hypsography data cannot be revised to match the digital orthophoto quadrangles. Revised hydrography from the digital orthophoto quadrangle should be used to create an updated digital elevation model that incorporates the stream channels as revised from the digital orthophoto quadrangle. Computer-generated, standardized watersheds that are vertically integrated with existing digital line graph hydrographic data will continue to be difficult to create until revisions can be made to existing source datasets. Until such time, manual editing will be necessary to make adjustments for man-made features and changes in the natural landscape that are not reflected in the digital elevation model data.

  7. Creating a standardized watersheds database for the lower Rio Grande/Rio Bravo, Texas

    USGS Publications Warehouse

    Brown, Julie R.; Ulery, Randy L.; Parcher, Jean W.

    2000-01-01

    This report describes the creation of a large-scale watershed database for the lower Rio Grande/Rio Bravo Basin in Texas. The watershed database includes watersheds delineated to all 1:24,000-scale mapped stream confluences and other hydrologically significant points, selected watershed characteristics, and hydrologic derivative datasets. Computer technology allows generation of preliminary watershed boundaries in a fraction of the time needed for manual methods. This automated process reduces development time and results in quality improvements in watershed boundaries and characteristics. These data can then be compiled in a permanent database, eliminating the time-consuming step of data creation at the beginning of a project and providing a stable base dataset that can give users greater confidence when further subdividing watersheds. A standardized dataset of watershed characteristics is a valuable contribution to the understanding and management of natural resources. Vertical integration of the input datasets used to automatically generate watershed boundaries is crucial to the success of such an effort. The optimum situation would be to use the digital orthophoto quadrangles as the source of all the input datasets. While the hydrographic data from the digital line graphs can be revised to match the digital orthophoto quadrangles, hypsography data cannot be revised to match the digital orthophoto quadrangles. Revised hydrography from the digital orthophoto quadrangle should be used to create an updated digital elevation model that incorporates the stream channels as revised from the digital orthophoto quadrangle. Computer-generated, standardized watersheds that are vertically integrated with existing digital line graph hydrographic data will continue to be difficult to create until revisions can be made to existing source datasets. Until such time, manual editing will be necessary to make adjustments for man-made features and changes in the natural landscape that are not reflected in the digital elevation model data.

  8. Creation of clinical research databases in the 21st century: a practical algorithm for HIPAA Compliance.

    PubMed

    Schell, Scott R

    2006-02-01

    Enforcement of the Health Insurance Portability and Accountability Act (HIPAA) began in April, 2003. Designed as a law mandating health insurance availability when coverage was lost, HIPAA imposed sweeping and broad-reaching protections of patient privacy. These changes dramatically altered clinical research by placing sizeable regulatory burdens upon investigators with threat of severe and costly federal and civil penalties. This report describes development of an algorithmic approach to clinical research database design based upon a central key-shared data (CK-SD) model allowing researchers to easily analyze, distribute, and publish clinical research without disclosure of HIPAA Protected Health Information (PHI). Three clinical database formats (small clinical trial, operating room performance, and genetic microchip array datasets) were modeled using standard structured query language (SQL)-compliant databases. The CK database was created to contain PHI data, whereas a shareable SD database was generated in real-time containing relevant clinical outcome information while protecting PHI items. Small (< 100 records), medium (< 50,000 records), and large (> 10^8 records) model databases were created, and the resultant data models were evaluated in consultation with an HIPAA compliance officer. The SD database models complied fully with HIPAA regulations, and resulting "shared" data could be distributed freely. Unique patient identifiers were not required for treatment or outcome analysis. Age data were resolved to single-integer years, grouping patients aged > 89 years. Admission, discharge, treatment, and follow-up dates were replaced with enrollment year, and follow-up/outcome intervals calculated eliminating original data. Two additional data fields identified as PHI (treating physician and facility) were replaced with integer values, and the original data corresponding to these values were stored in the CK database. Use of the algorithm at the time of database design did not increase cost or design effort. The CK-SD model for clinical database design provides an algorithm for investigators to create, maintain, and share clinical research data compliant with HIPAA regulations. This model is applicable to new projects and large institutional datasets, and should decrease regulatory efforts required for conduct of clinical research. Application of the design algorithm early in the clinical research enterprise does not increase cost or the effort of data collection.
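
    The de-identification transformations described (age coarsened to whole years with grouping above 89, dates replaced by enrollment year and computed intervals, physician and facility replaced by integer keys kept only in the keyed table) can be sketched in a few lines. Column names below are illustrative, not the paper's schema.

```python
# Sketch of a central-key / shared-data (CK-SD) style split: PHI stays in the
# keyed table, the shareable table keeps only coarsened, derived values.
# Column names are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "mrn": ["A1", "B2"],
    "birth_date": pd.to_datetime(["1931-05-02", "1980-11-20"]),
    "admit_date": pd.to_datetime(["2004-03-01", "2004-06-15"]),
    "followup_date": pd.to_datetime(["2004-09-01", "2005-06-15"]),
    "physician": ["Dr. Smith", "Dr. Jones"],
})

# Shareable table: ages in whole years (grouped at 90 and above), enrollment
# year only, follow-up expressed as an interval, integer surrogates for PHI.
age = ((raw["admit_date"] - raw["birth_date"]).dt.days // 365).clip(upper=90)
physician_key, physician_names = pd.factorize(raw["physician"])
shared = pd.DataFrame({
    "age_years": age,                     # 90 encodes "90 or older"
    "enrollment_year": raw["admit_date"].dt.year,
    "followup_days": (raw["followup_date"] - raw["admit_date"]).dt.days,
    "physician_key": physician_key,
})

# Central-key table: the mapping back to PHI, never distributed.
central_key = pd.DataFrame({"physician_key": range(len(physician_names)),
                            "physician": physician_names})
print(shared)
```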

  9. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

    PubMed Central

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    Objective The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance adoption of NLP in practice. PMID:25336595
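
    The "simple" end of the trade-off amounts to longest-match lookup of lexicon terms in note text. A minimal sketch is shown below; the lexicon, concept identifiers, and note are invented examples, not the NCBO Annotator's actual behaviour.

```python
# Minimal dictionary-based term recognition of the kind compared against full
# NLP pipelines: longest-match lookup of lexicon terms in note text.
import re

lexicon = {
    "type 2 diabetes": "C0011860",   # invented term -> concept ID mapping
    "diabetes": "C0011849",
    "metformin": "C0025598",
}

def annotate(note, lexicon):
    """Return (term, concept_id, start, end) matches, preferring longer terms."""
    hits = []
    for term in sorted(lexicon, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", note.lower()):
            # Skip spans already covered by a longer term.
            if not any(s <= m.start() < e for _, _, s, e in hits):
                hits.append((term, lexicon[term], m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])

note = "Patient with type 2 diabetes, started on metformin."
for term, cid, start, end in annotate(note, lexicon):
    print(term, cid, start)
```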

  10. High-throughput ocular artifact reduction in multichannel electroencephalography (EEG) using component subspace projection.

    PubMed

    Ma, Junshui; Bayram, Sevinç; Tao, Peining; Svetnik, Vladimir

    2011-03-15

    After a review of the ocular artifact reduction literature, a high-throughput method designed to reduce the ocular artifacts in multichannel continuous EEG recordings acquired at clinical EEG laboratories worldwide is proposed. The proposed method belongs to the category of component-based methods, and does not rely on any electrooculography (EOG) signals. Based on a concept that all ocular artifact components exist in a signal component subspace, the method can uniformly handle all types of ocular artifacts, including eye-blinks, saccades, and other eye movements, by automatically identifying ocular components from decomposed signal components. This study also proposes an improved strategy to objectively and quantitatively evaluate artifact reduction methods. The evaluation strategy uses real EEG signals to synthesize realistic simulated datasets with different amounts of ocular artifacts. The simulated datasets enable us to objectively demonstrate that the proposed method outperforms some existing methods when no high-quality EOG signals are available. Moreover, the results of the simulated datasets improve our understanding of the involved signal decomposition algorithms, and provide us with insights into the inconsistency regarding the performance of different methods in the literature. The proposed method was also applied to two independent clinical EEG datasets involving 28 volunteers and over 1000 EEG recordings. This effort further confirms that the proposed method can effectively reduce ocular artifacts in large clinical EEG datasets in a high-throughput fashion. Copyright © 2011 Elsevier B.V. All rights reserved.
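
    The component-subspace idea, decompose the multichannel recording, flag ocular components, zero them, and reconstruct, can be sketched with FastICA on synthetic data. The identification rule used below (kurtosis of the component time course) is a simple stand-in for illustration, not the paper's automatic criterion.

```python
# Sketch of component-based ocular artifact reduction: decompose multichannel
# EEG, flag likely ocular components, zero them, and reconstruct. The flagging
# rule (kurtosis) is a stand-in, not the proposed method's criterion.
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
n_channels, n_samples = 32, 5000
eeg = rng.normal(size=(n_samples, n_channels))              # synthetic EEG (samples x channels)
blink = np.zeros(n_samples); blink[::500] = 50.0            # spiky, blink-like artifact
eeg += np.outer(blink, np.linspace(1.0, 0.1, n_channels))   # strongest on "frontal" channels

ica = FastICA(n_components=n_channels, random_state=0)
sources = ica.fit_transform(eeg)                            # samples x components

# Blink-like components are highly peaked; flag the most kurtotic ones.
k = kurtosis(sources, axis=0)
ocular = np.argsort(k)[-2:]
sources[:, ocular] = 0.0
clean = ica.inverse_transform(sources)
print("signal variance before vs after:", eeg.var(), clean.var())
```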

  11. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics.

    PubMed

    Liu, Jianfang; Lichtenberg, Tara; Hoadley, Katherine A; Poisson, Laila M; Lazar, Alexander J; Cherniack, Andrew D; Kovatich, Albert J; Benz, Christopher C; Levine, Douglas A; Lee, Adrian V; Omberg, Larsson; Wolf, Denise M; Shriver, Craig D; Thorsson, Vesteinn; Hu, Hai

    2018-04-05

    For a decade, The Cancer Genome Atlas (TCGA) program collected clinicopathologic annotation data along with multi-platform molecular profiles of more than 11,000 human tumors across 33 different cancer types. TCGA clinical data contain key features representing the democratized nature of the data collection process. To ensure proper use of this large clinical dataset associated with genomic features, we developed a standardized dataset named the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR), which includes four major clinical outcome endpoints. In addition to detailing major challenges and statistical limitations encountered during the effort of integrating the acquired clinical data, we present a summary that includes endpoint usage recommendations for each cancer type. These TCGA-CDR findings appear to be consistent with cancer genomics studies independent of the TCGA effort and provide opportunities for investigating cancer biology using clinical correlates at an unprecedented scale. Copyright © 2018 Elsevier Inc. All rights reserved.

  12. Applying quantitative adiposity feature analysis models to predict benefit of bevacizumab-based chemotherapy in ovarian cancer patients

    NASA Astrophysics Data System (ADS)

    Wang, Yunzhi; Qiu, Yuchen; Thai, Theresa; More, Kathleen; Ding, Kai; Liu, Hong; Zheng, Bin

    2016-03-01

    How to rationally identify epithelial ovarian cancer (EOC) patients who will benefit from bevacizumab or other antiangiogenic therapies is a critical issue in EOC treatments. The motivation of this study is to quantitatively measure adiposity features from CT images and investigate the feasibility of predicting potential benefit of EOC patients with or without receiving bevacizumab-based chemotherapy treatment using multivariate statistical models built based on quantitative adiposity image features. A dataset of CT images from 59 advanced EOC patients was included. Among them, 32 patients received maintenance bevacizumab after primary chemotherapy and the remaining 27 patients did not. We developed a computer-aided detection (CAD) scheme to automatically segment subcutaneous fat areas (SFA) and visceral fat areas (VFA) and then extracted 7 adiposity-related quantitative features. Three multivariate data analysis models (linear regression, logistic regression and Cox proportional hazards regression) were applied to investigate the potential association between the model-generated prediction results and the patients' progression-free survival (PFS) and overall survival (OS). The results show that, using all 3 statistical models, a statistically significant association was detected between the model-generated results and both clinical outcomes in the group of patients receiving maintenance bevacizumab (p<0.01), while there was no significant association for either PFS or OS in the group of patients not receiving maintenance bevacizumab. Therefore, this study demonstrates the feasibility of using statistical prediction models built on quantitative adiposity-related CT image features to generate a new clinical marker and predict the clinical outcome of EOC patients receiving maintenance bevacizumab-based chemotherapy.
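
    One of the three modelling steps, relating adiposity features to a binary outcome in the treated subgroup, can be sketched with logistic regression. Feature names and values below are synthetic stand-ins for the CT-derived measurements, and the outcome label is illustrative only.

```python
# Sketch: associate quantitative adiposity features with a binary outcome in
# the bevacizumab-treated subgroup using logistic regression. Data synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 32                                              # patients receiving bevacizumab
features = pd.DataFrame({
    "sfa_cm2": rng.normal(180, 40, n),              # subcutaneous fat area
    "vfa_cm2": rng.normal(120, 35, n),              # visceral fat area
    "vfa_sfa_ratio": rng.normal(0.7, 0.2, n),
})
progressed_by_12mo = rng.integers(0, 2, n)          # synthetic PFS-derived label

model = LogisticRegression(max_iter=1000)
auc = cross_val_score(model, features, progressed_by_12mo, cv=5, scoring="roc_auc").mean()
print("cross-validated AUC:", round(auc, 2))
```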

  13. An eFTD-VP framework for efficiently generating patient-specific anatomically detailed facial soft tissue FE mesh for craniomaxillofacial surgery simulation

    PubMed Central

    Zhang, Xiaoyan; Kim, Daeseung; Shen, Shunyao; Yuan, Peng; Liu, Siting; Tang, Zhen; Zhang, Guangming; Zhou, Xiaobo; Gateno, Jaime

    2017-01-01

    Accurate surgical planning and prediction of craniomaxillofacial surgery outcome requires simulation of soft tissue changes following osteotomy. This can only be achieved by using an anatomically detailed facial soft tissue model. The current state-of-the-art of model generation is not appropriate to clinical applications due to the time-intensive nature of manual segmentation and volumetric mesh generation. The conventional patient-specific finite element (FE) mesh generation methods are to deform a template FE mesh to match the shape of a patient based on registration. However, these methods commonly produce element distortion. Additionally, the mesh density for patients depends on that of the template model. It could not be adjusted to conduct mesh density sensitivity analysis. In this study, we propose a new framework of patient-specific facial soft tissue FE mesh generation. The goal of the developed method is to efficiently generate a high-quality patient-specific hexahedral FE mesh with adjustable mesh density while preserving the accuracy in anatomical structure correspondence. Our FE mesh is generated by eFace template deformation followed by volumetric parametrization. First, the patient-specific anatomically detailed facial soft tissue model (including skin, mucosa, and muscles) is generated by deforming an eFace template model. The adaptation of the eFace template model is achieved by using a hybrid landmark-based morphing and dense surface fitting approach followed by a thin-plate spline interpolation. Then, high-quality hexahedral mesh is constructed by using volumetric parameterization. The user can control the resolution of hexahedron mesh to best reflect clinicians’ need. Our approach was validated using 30 patient models and 4 visible human datasets. The generated patient-specific FE mesh showed high surface matching accuracy, element quality, and internal structure matching accuracy. They can be directly and effectively used for clinical simulation of facial soft tissue change. PMID:29027022
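
    The thin-plate-spline step, mapping template landmarks to the corresponding patient landmarks and applying the interpolated displacement field to every template node, can be sketched with SciPy's radial basis function interpolator. The landmark coordinates below are toy values, not anatomical data.

```python
# Sketch of the thin-plate-spline deformation step: landmark displacements are
# interpolated and applied to all template nodes. Coordinates are toy values.
import numpy as np
from scipy.interpolate import RBFInterpolator

template_landmarks = np.array(
    [[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10], [10, 10, 10]], dtype=float)
patient_landmarks = template_landmarks + np.array(
    [[0.5, 0, 0], [1.0, 0.2, 0], [0, 1.2, 0], [0, 0, 0.8], [1.5, 1.0, 1.2]])

# Interpolate landmark displacements with a thin-plate-spline kernel.
tps = RBFInterpolator(template_landmarks,
                      patient_landmarks - template_landmarks,
                      kernel="thin_plate_spline")

template_nodes = np.random.rand(1000, 3) * 10.0       # stand-in for template mesh nodes
warped_nodes = template_nodes + tps(template_nodes)   # patient-specific node positions
print(warped_nodes.shape)
```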

  14. An eFTD-VP framework for efficiently generating patient-specific anatomically detailed facial soft tissue FE mesh for craniomaxillofacial surgery simulation.

    PubMed

    Zhang, Xiaoyan; Kim, Daeseung; Shen, Shunyao; Yuan, Peng; Liu, Siting; Tang, Zhen; Zhang, Guangming; Zhou, Xiaobo; Gateno, Jaime; Liebschner, Michael A K; Xia, James J

    2018-04-01

    Accurate surgical planning and prediction of craniomaxillofacial surgery outcome requires simulation of soft tissue changes following osteotomy. This can only be achieved by using an anatomically detailed facial soft tissue model. The current state-of-the-art of model generation is not appropriate to clinical applications due to the time-intensive nature of manual segmentation and volumetric mesh generation. The conventional patient-specific finite element (FE) mesh generation methods are to deform a template FE mesh to match the shape of a patient based on registration. However, these methods commonly produce element distortion. Additionally, the mesh density for patients depends on that of the template model. It could not be adjusted to conduct mesh density sensitivity analysis. In this study, we propose a new framework of patient-specific facial soft tissue FE mesh generation. The goal of the developed method is to efficiently generate a high-quality patient-specific hexahedral FE mesh with adjustable mesh density while preserving the accuracy in anatomical structure correspondence. Our FE mesh is generated by eFace template deformation followed by volumetric parametrization. First, the patient-specific anatomically detailed facial soft tissue model (including skin, mucosa, and muscles) is generated by deforming an eFace template model. The adaptation of the eFace template model is achieved by using a hybrid landmark-based morphing and dense surface fitting approach followed by a thin-plate spline interpolation. Then, high-quality hexahedral mesh is constructed by using volumetric parameterization. The user can control the resolution of hexahedron mesh to best reflect clinicians' need. Our approach was validated using 30 patient models and 4 visible human datasets. The generated patient-specific FE mesh showed high surface matching accuracy, element quality, and internal structure matching accuracy. They can be directly and effectively used for clinical simulation of facial soft tissue change.

  15. MatchingLand, geospatial data testbed for the assessment of matching methods.

    PubMed

    Xavier, Emerson M A; Ariza-López, Francisco J; Ureña-Cámara, Manuel A

    2017-12-05

    This article presents datasets prepared with the aim of helping the evaluation of geospatial matching methods for vector data. These datasets were built up from mapping data produced by official Spanish mapping agencies. The testbed supplied encompasses the three geometry types: point, line and area. Initial datasets were submitted to geometric transformations in order to generate synthetic datasets. These transformations represent factors that might influence the performance of geospatial matching methods, like the morphology of linear or areal features, systematic transformations, and random disturbance over initial data. We call our 11 GiB benchmark data 'MatchingLand' and we hope it can be useful for the geographic information science research community.
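
    The kind of perturbation used to derive the synthetic counterparts, a systematic transformation plus random disturbance of vertex coordinates, can be sketched as below; the rotation angle, shift, and noise scale are illustrative values, not those used to build MatchingLand.

```python
# Sketch: derive a synthetic counterpart of a source dataset by applying a
# systematic transformation (rotation + shift) plus random disturbance to
# vertex coordinates. Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
source = rng.uniform(0, 1000, size=(500, 2))        # stand-in point features (x, y)

theta = np.deg2rad(0.5)                             # small systematic rotation
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
shift = np.array([12.0, -7.5])                      # systematic shift in map units
noise = rng.normal(scale=2.0, size=source.shape)    # random disturbance

synthetic = source @ rotation.T + shift + noise

# A matching method under test should recover correspondences between `source`
# and `synthetic` despite these perturbations.
print(np.mean(np.linalg.norm(synthetic - source, axis=1)))
```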

  16. Gene network inference and visualization tools for biologists: application to new human transcriptome datasets

    PubMed Central

    Hurley, Daniel; Araki, Hiromitsu; Tamada, Yoshinori; Dunmore, Ben; Sanders, Deborah; Humphreys, Sally; Affara, Muna; Imoto, Seiya; Yasuda, Kaori; Tomiyasu, Yuki; Tashiro, Kosuke; Savoie, Christopher; Cho, Vicky; Smith, Stephen; Kuhara, Satoru; Miyano, Satoru; Charnock-Jones, D. Stephen; Crampin, Edmund J.; Print, Cristin G.

    2012-01-01

    Gene regulatory networks inferred from RNA abundance data have generated significant interest, but despite this, gene network approaches are used infrequently and often require input from bioinformaticians. We have assembled a suite of tools for analysing regulatory networks, and we illustrate their use with microarray datasets generated in human endothelial cells. We infer a range of regulatory networks, and based on this analysis discuss the strengths and limitations of network inference from RNA abundance data. We welcome contact from researchers interested in using our inference and visualization tools to answer biological questions. PMID:22121215

  17. Non-negative Tensor Factorization for Robust Exploratory Big-Data Analytics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alexandrov, Boian; Vesselinov, Velimir Valentinov; Djidjev, Hristo Nikolov

    Currently, large multidimensional datasets are being accumulated in almost every field. Data are: (1) collected by distributed sensor networks in real time all over the globe, (2) produced by large-scale experimental measurements or engineering activities, (3) generated by high-performance simulations, and (4) gathered by electronic communications and social-network activities, etc. Simultaneous analysis of these ultra-large heterogeneous multidimensional datasets is often critical for scientific discoveries, decision-making, emergency response, and national and global security. The importance of such analyses mandates the development of the next generation of robust machine learning (ML) methods and tools for big-data exploratory analysis.
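
    As a simple stand-in for a full non-negative tensor factorization, a non-negative matrix factorization of the matricized (unfolded) array already conveys the idea of extracting interpretable non-negative latent patterns. The sketch below uses scikit-learn's NMF on synthetic 3-way data; it is not the project's own solver.

```python
# Sketch: non-negative factorization of a matricized (unfolded) 3-way array
# with scikit-learn's NMF, as a stand-in for full non-negative tensor
# factorization. Data are synthetic.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(5)
# Synthetic 3-way data (sensors x time x frequency) built from 3 latent patterns.
A = rng.random((20, 3)); B = rng.random((50, 3)); C = rng.random((10, 3))
tensor = np.einsum("ir,jr,kr->ijk", A, B, C) + 0.01 * rng.random((20, 50, 10))

# Unfold along the first mode (sensors x (time*frequency)) and factorize.
unfolded = tensor.reshape(20, -1)
model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(unfolded)        # non-negative sensor signatures
H = model.components_                    # non-negative time-frequency profiles
print("reconstruction error:", model.reconstruction_err_)
```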

  18. Analyzing How We Do Analysis and Consume Data, Results from the SciDAC-Data Project

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ding, P.; Aliaga, L.; Mubarak, M.

    One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.
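
    The kind of aggregate reported, dataset popularity and co-usage across recorded analysis projects, reduces to straightforward group-by operations over a usage log. The sketch below uses invented column names and experiment labels; it is not the project's analytics pipeline.

```python
# Sketch: dataset popularity and pairwise co-usage from an analysis-project
# usage log. Column names and dataset labels are invented.
import pandas as pd
from itertools import combinations
from collections import Counter

log = pd.DataFrame({
    "project_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "dataset": ["nova_fd", "nova_nd", "nova_fd", "minerva_me", "nova_nd",
                "cdf_runII", "nova_fd", "nova_nd"],
})

# Popularity: how many distinct projects consumed each dataset.
popularity = log.groupby("dataset")["project_id"].nunique().sort_values(ascending=False)
print(popularity)

# Co-usage: datasets consumed together within the same project.
pair_counts = Counter()
for _, datasets in log.groupby("project_id")["dataset"]:
    pair_counts.update(combinations(sorted(set(datasets)), 2))
print(pair_counts.most_common(3))
```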

  19. Analyzing how we do Analysis and Consume Data, Results from the SciDAC-Data Project

    NASA Astrophysics Data System (ADS)

    Ding, P.; Aliaga, L.; Mubarak, M.; Tsaris, A.; Norman, A.; Lyon, A.; Ross, R.

    2017-10-01

    One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.

  20. Quantitative workflow based on NN for weighting criteria in landfill suitability mapping

    NASA Astrophysics Data System (ADS)

    Abujayyab, Sohaib K. M.; Ahamad, Mohd Sanusi S.; Yahya, Ahmad Shukri; Ahmad, Siti Zubaidah; Alkhasawneh, Mutasem Sh.; Aziz, Hamidi Abdul

    2017-10-01

    Our study aims to introduce a new quantitative workflow that integrates neural networks (NNs) and multi-criteria decision analysis (MCDA). Existing MCDA workflows reveal a number of drawbacks because of their reliance on human knowledge in the weighting stage. Thus, a new workflow is presented to form suitability maps at the regional scale for solid waste planning based on NNs. A feed-forward neural network is employed in the workflow. A total of 34 criteria were pre-processed to establish the input dataset for NN modelling. The final learned network is used to acquire the weights of the criteria. Accuracies of 95.2% and 93.2% were achieved for the training dataset and testing dataset, respectively. The workflow was found to be capable of reducing human interference to generate highly reliable maps. The proposed workflow reveals the applicability of NNs in generating landfill suitability maps and the feasibility of integrating them with existing MCDA workflows.
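
    One common way to extract relative criterion weights from a trained feed-forward network is Garson-style partitioning of the connection weights; the sketch below uses that approach with synthetic criteria data, and is an assumption about the weighting step rather than the paper's exact procedure.

```python
# Sketch: train a feed-forward network on criteria layers and derive relative
# criterion weights by Garson-style partitioning of the connection weights.
# Data are synthetic; Garson's method is one common choice, assumed here.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
n_sites, n_criteria = 400, 34
X = rng.random((n_sites, n_criteria))                 # normalised criteria values
y = (X[:, :5].mean(axis=1) > 0.5).astype(int)         # synthetic "suitable" labels

net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0).fit(X, y)
w_ih = np.abs(net.coefs_[0])                          # input -> hidden weights
w_ho = np.abs(net.coefs_[1]).ravel()                  # hidden -> output weights

# Garson partitioning: share of each hidden unit's output weight attributed to
# each input, summed over hidden units and normalised to sum to 1.
contrib = (w_ih / w_ih.sum(axis=0)) * w_ho
criterion_weights = contrib.sum(axis=1) / contrib.sum()
print(criterion_weights.round(3))
```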

  1. Final Technical Report for Collaborative Research: Developing and Implementing Ocean-Atmosphere Reanalyses for Climate Applications (OARCA)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Compo, Gilbert P

    As an important step toward a coupled data assimilation system for generating reanalysis fields needed to assess climate model projections, the Ocean Atmosphere Coupled Reanalysis for Climate Applications (OARCA) project assesses and improves the longest reanalyses currently available of the atmosphere and ocean: the 20th Century Reanalysis Project (20CR) and the Simple Ocean Data Assimilation with sparse observational input (SODAsi) system, respectively. In this project, we make off-line but coordinated improvements in the 20CR and SODAsi datasets, with improvements in one feeding into improvements of the other through an iterative generation of new versions. These datasets now span from the 19th to 21st centuries. We then study the extreme weather and variability from days to decades of the resulting datasets. A total of 24 publications have been produced in this project.

  2. Self-organizing maps: a versatile tool for the automatic analysis of untargeted imaging datasets.

    PubMed

    Franceschi, Pietro; Wehrens, Ron

    2014-04-01

    MS-based imaging approaches allow for location-specific identification of chemical components in biological samples, opening up possibilities of much more detailed understanding of biological processes and mechanisms. Data analysis, however, is challenging, mainly because of the sheer size of such datasets. This article presents a novel approach based on self-organizing maps, extending previous work in order to be able to handle the large number of variables present in high-resolution mass spectra. The key idea is to generate prototype images, representing spatial distributions of ions, rather than prototypical mass spectra. This allows for a two-stage approach, first generating typical spatial distributions and associated m/z bins, and later analyzing the interesting bins in more detail using accurate masses. The possibilities and advantages of the new approach are illustrated on an in-house dataset of apple slices. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  3. TSPmap, a tool making use of traveling salesperson problem solvers in the efficient and accurate construction of high-density genetic linkage maps.

    PubMed

    Monroe, J Grey; Allen, Zachariah A; Tanger, Paul; Mullen, Jack L; Lovell, John T; Moyers, Brook T; Whitley, Darrell; McKay, John K

    2017-01-01

    Recent advances in nucleic acid sequencing technologies have led to a dramatic increase in the number of markers available to generate genetic linkage maps. This increased marker density can be used to improve genome assemblies as well as add much needed resolution for loci controlling variation in ecologically and agriculturally important traits. However, traditional genetic map construction methods from these large marker datasets can be computationally prohibitive and highly error prone. We present TSPmap, a method which implements both approximate and exact Traveling Salesperson Problem solvers to generate linkage maps. We demonstrate that for datasets with large numbers of genomic markers (e.g. 10,000) and in multiple population types generated from inbred parents, TSPmap can rapidly produce high quality linkage maps with low sensitivity to missing and erroneous genotyping data compared to two other benchmark methods, JoinMap and MSTmap. TSPmap is open source and freely available as an R package. With the advancement of low cost sequencing technologies, the number of markers used in the generation of genetic maps is expected to continue to rise. TSPmap will be a useful tool to handle such large datasets into the future, quickly producing high quality maps using a large number of genomic markers.
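
    The core idea, treating markers as TSP "cities" whose pairwise distances are recombination fractions and ordering them with a tour, can be conveyed with a toy nearest-neighbour heuristic on synthetic backcross genotypes. This is a stand-in for the approximate and exact solvers TSPmap actually wraps.

```python
# Sketch of the core idea: order markers by solving a TSP over pairwise
# recombination fractions, here with a toy nearest-neighbour heuristic.
import numpy as np

def recombination_fractions(genotypes):
    """Pairwise recombination fraction between markers (rows: individuals,
    columns: markers, values 0/1 for the two parental alleles)."""
    n_mark = genotypes.shape[1]
    rf = np.zeros((n_mark, n_mark))
    for i in range(n_mark):
        for j in range(n_mark):
            rf[i, j] = np.mean(genotypes[:, i] != genotypes[:, j])
    return rf

def nearest_neighbour_order(dist):
    order = [0]
    remaining = set(range(1, dist.shape[0]))
    while remaining:
        nxt = min(remaining, key=lambda m: dist[order[-1], m])
        order.append(nxt)
        remaining.remove(nxt)
    return order

rng = np.random.default_rng(7)
# Synthetic backcross genotypes: adjacent markers recombine with probability 0.1.
chrom = np.cumsum(rng.random((100, 30)) < 0.1, axis=1) % 2
perm = rng.permutation(30)                       # scramble the true marker order
shuffled = chrom[:, perm]
recovered = nearest_neighbour_order(recombination_fractions(shuffled))
print("recovered order (indices into shuffled columns):", recovered)
```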

  4. Automated validation of patient safety clinical incident classification: macro analysis.

    PubMed

    Gupta, Jaiprakash; Patrick, Jon

    2013-01-01

    Patient safety is the buzzword in healthcare. An Incident Information Management System (IIMS) is electronic software that stores clinical mishap narratives in places where patients are treated. It is estimated that in one state alone over one million electronic text documents are available in IIMS. In this paper we investigate the data density available in the fields entered to notify an incident and the validity of the built-in classification used by clinicians to categorise incidents. Waikato Environment for Knowledge Analysis (WEKA) software was used to test the classes. Four statistical classifiers based on the J48, Naïve Bayes (NB), Naïve Bayes Multinomial (NBM) and Support Vector Machine with radial basis function (SVM_RBF) algorithms were used to validate the classes. The data pool was 10,000 clinical incidents drawn from 7 hospitals in one state in Australia. In the first part of the study, 1000 clinical incidents were selected to determine the type and number of fields worth investigating; in the second part, another 5448 clinical incidents were randomly selected to validate 13 clinical incident types. Results show that 74.6% of the cells were empty and only 23 fields had content over 70% of the time. The percentage of correctly classified incidents across the four algorithms ranged from 42% to 49% using the categorical dataset, from 65% to 77% using the free-text dataset, and from 72% to 79% using both datasets. The kappa statistic ranged from 0.36 to 0.40 for categorical data, from 0.61 to 0.74 for free text, and from 0.67 to 0.77 for both datasets. Similar increases in performance across the three experiments were noted for true positive rate, precision, F-measure and area under the receiver operating characteristic curve (AUC-ROC). The study demonstrates that only 14 of 73 fields in IIMS have data usable for machine learning experiments. Irrespective of the algorithm, performance was better when both datasets were used. The NBM classifier showed the best performance. We think the classifiers can be improved further by reclassifying the most confused classes, and there is scope to apply text-mining tools to patient safety classifications.
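
    One of the benchmarked configurations, a multinomial naive Bayes classifier over free-text incident narratives with bag-of-words features, can be sketched as below. The study used WEKA; this equivalent sketch uses scikit-learn, and the narratives and labels are invented.

```python
# Sketch of one benchmarked setup: multinomial naive Bayes over free-text
# incident narratives with bag-of-words features. Narratives are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

narratives = [
    "patient fell while transferring from bed to chair",
    "wrong dose of insulin administered in the evening round",
    "patient slipped on wet floor near the bathroom",
    "medication chart transcription error, double dose given",
    "unwitnessed fall, found on floor beside bed",
    "omitted antibiotic dose, discovered at handover",
]
incident_type = ["fall", "medication", "fall", "medication", "fall", "medication"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2), min_df=1), MultinomialNB())
print(cross_val_score(model, narratives, incident_type, cv=3).mean())
```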

  5. Querying phenotype-genotype relationships on patient datasets using semantic web technology: the example of Cerebrotendinous xanthomatosis.

    PubMed

    Taboada, María; Martínez, Diego; Pilo, Belén; Jiménez-Escrig, Adriano; Robinson, Peter N; Sobrido, María J

    2012-07-31

    Semantic Web technology can considerably catalyze translational genetics and genomics research in medicine, where the interchange of information between basic research and clinical levels becomes crucial. This exchange involves mapping abstract phenotype descriptions from research resources, such as knowledge databases and catalogs, to unstructured datasets produced through experimental methods and clinical practice. This is especially true for the construction of mutation databases. This paper presents a way of harmonizing abstract phenotype descriptions with patient data from clinical practice, and querying this dataset about relationships between phenotypes and genetic variants, at different levels of abstraction. Due to the current availability of ontological and terminological resources that have already reached some consensus in biomedicine, a reuse-based ontology engineering approach was followed. The proposed approach uses the Web Ontology Language (OWL) to represent the phenotype ontology and the patient model, the Semantic Web Rule Language (SWRL) to bridge the gap between phenotype descriptions and clinical data, and the Semantic Query-Enhanced Web Rule Language (SQWRL) to query relevant phenotype-genotype bidirectional relationships. The work tests the use of semantic web technology in the biomedical research domain of cerebrotendinous xanthomatosis (CTX), using a real dataset and ontologies. A framework to query relevant phenotype-genotype bidirectional relationships is provided. Phenotype descriptions and patient data were harmonized by defining 28 Horn-like rules in terms of the OWL concepts. In total, 24 patterns of SQWRL queries were designed following the initial list of competency questions. As the approach is based on OWL, the semantics of the framework follow the standard open-world assumption. This work demonstrates how semantic web technologies can be used to support the flexible representation and computational inference mechanisms required to query patient datasets at different levels of abstraction. The open-world assumption is especially well suited to describing only partially known phenotype-genotype relationships, in a way that is easily extensible. In the future, this type of approach could offer researchers a valuable resource to infer new data from patient data for statistical analysis in translational research. In conclusion, phenotype description formalization and mapping to clinical data are two key elements for interchanging knowledge between basic and clinical research.

  6. The influence of climate change on Tanzania's hydropower sustainability

    NASA Astrophysics Data System (ADS)

    Sperna Weiland, Frederiek; Boehlert, Brent; Meijer, Karen; Schellekens, Jaap; Magnell, Jan-Petter; Helbrink, Jakob; Kassana, Leonard; Liden, Rikard

    2015-04-01

    Economic costs induced by current climate variability are large for Tanzania and may further increase due to future climate change. The Tanzanian National Climate Change Strategy addressed the need for stabilization of hydropower generation and strengthening of water resources management. Increased hydropower generation can contribute to sustainable use of energy resources and stabilization of the national electricity grid. To support Tanzania, the World Bank financed this study in which the impact of climate change on the water resources and related hydropower generation capacity of Tanzania is assessed. To this end, an ensemble of 78 GCM projections from both the CMIP3 and CMIP5 datasets was bias-corrected and downscaled to 0.5 degrees resolution following the BCSD technique, using the Princeton Global Meteorological Forcing Dataset as a reference. To quantify the hydrological impacts of climate change by 2035, the global hydrological model PCR-GLOBWB was set up for Tanzania at a resolution of 3 minutes and run with all 78 GCM datasets. From the full set of projections, a probable (median) and a worst-case scenario (95th percentile) were selected based upon (1) the country-average Climate Moisture Index and (2) discharge statistics of relevance to hydropower generation. Although precipitation from the Princeton dataset shows deviations from local station measurements and the global hydrological model does not perfectly reproduce local-scale hydrographs, the main discharge characteristics and precipitation patterns are represented well. The modeled natural river flows were adjusted for water demand and irrigation within the water resources model RIBASIM (both historical values and future scenarios). Potential hydropower capacity was assessed with the power market simulation model PoMo-C, which considers both reservoir inflows obtained from RIBASIM and overall electricity generation costs. Results of the study show that climate change is unlikely to negatively affect the average potential of future hydropower production; it will likely make hydropower more profitable. Yet the uncertainty in climate change projections remains large and the risks are significant, so adaptation strategies should ideally consider a worst-case scenario to ensure robust power generation. Overall, a diversified power generation portfolio, anchored in hydropower and supported by other renewables and fossil fuel-based energy sources, is the best solution for Tanzania.

  7. Dissecting psychiatric spectrum disorders by generative embedding

    PubMed Central

    Brodersen, Kay H.; Deserno, Lorenz; Schlagenhauf, Florian; Lin, Zhihao; Penny, Will D.; Buhmann, Joachim M.; Stephan, Klaas E.

    2013-01-01

    This proof-of-concept study examines the feasibility of defining subgroups in psychiatric spectrum disorders by generative embedding, using dynamical system models which infer neuronal circuit mechanisms from neuroimaging data. To this end, we re-analysed an fMRI dataset of 41 patients diagnosed with schizophrenia and 42 healthy controls performing a numerical n-back working-memory task. In our generative-embedding approach, we used parameter estimates from a dynamic causal model (DCM) of a visual–parietal–prefrontal network to define a model-based feature space for the subsequent application of supervised and unsupervised learning techniques. First, using a linear support vector machine for classification, we were able to predict individual diagnostic labels significantly more accurately (78%) from DCM-based effective connectivity estimates than from functional connectivity between (62%) or local activity within the same regions (55%). Second, an unsupervised approach based on variational Bayesian Gaussian mixture modelling provided evidence for two clusters which mapped onto patients and controls with nearly the same accuracy (71%) as the supervised approach. Finally, when restricting the analysis only to the patients, Gaussian mixture modelling suggested the existence of three patient subgroups, each of which was characterised by a different architecture of the visual–parietal–prefrontal working-memory network. Critically, even though this analysis did not have access to information about the patients' clinical symptoms, the three neurophysiologically defined subgroups mapped onto three clinically distinct subgroups, distinguished by significant differences in negative symptom severity, as assessed on the Positive and Negative Syndrome Scale (PANSS). In summary, this study provides a concrete example of how psychiatric spectrum diseases may be split into subgroups that are defined in terms of neurophysiological mechanisms specified by a generative model of network dynamics such as DCM. The results corroborate our previous findings in stroke patients that generative embedding, compared to analyses of more conventional measures such as functional connectivity or regional activity, can significantly enhance both the interpretability and performance of computational approaches to clinical classification. PMID:24363992
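
    The two learning steps applied downstream of the generative model, supervised classification of connectivity parameter estimates with a linear SVM and unsupervised clustering with a variational Bayesian Gaussian mixture, can be sketched with scikit-learn. The connectivity features below are synthetic; the real features would be DCM parameter estimates.

```python
# Sketch of the two learning steps applied to model-based connectivity
# estimates: supervised linear-SVM classification and unsupervised clustering
# with a variational Bayesian Gaussian mixture. Features are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import BayesianGaussianMixture
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
n_params = 12                                         # e.g. effective-connectivity parameters
controls = rng.normal(0.0, 1.0, size=(42, n_params))
patients = rng.normal(0.6, 1.0, size=(41, n_params))  # shifted to mimic group differences
X = np.vstack([controls, patients])
y = np.array([0] * 42 + [1] * 41)

acc = cross_val_score(SVC(kernel="linear"), X, y, cv=10).mean()
print("supervised accuracy:", round(acc, 2))

# Unsupervised: let a variational Bayesian mixture choose the effective number
# of clusters from an upper bound of 5 components.
gmm = BayesianGaussianMixture(n_components=5, weight_concentration_prior=0.1,
                              random_state=0).fit(X)
labels = gmm.predict(X)
print("cluster sizes:", np.bincount(labels))
```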

  8. The Harvard organic photovoltaic dataset

    DOE PAGES

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; ...

    2016-09-27

    Presented in this work is the Harvard Organic Photovoltaic Dataset (HOPV15), a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  9. The Harvard organic photovoltaic dataset

    PubMed Central

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-01-01

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312

  10. The Harvard organic photovoltaic dataset.

    PubMed

    Lopez, Steven A; Pyzer-Knapp, Edward O; Simm, Gregor N; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-09-27

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  11. Four dimensional hybrid ultrasound and optoacoustic imaging via passive element optical excitation in a hand-held probe

    NASA Astrophysics Data System (ADS)

    Fehm, Thomas Felix; Deán-Ben, Xosé Luís; Razansky, Daniel

    2014-10-01

    Ultrasonography and optoacoustic imaging share powerful advantages related to the natural aptitude for real-time image rendering with high resolution, the hand-held operation, and lack of ionizing radiation. The two methods also possess very different yet highly complementary advantages of the mechanical and optical contrast in living tissues. Nonetheless, efficient integration of these modalities remains challenging owing to the fundamental differences in the underlying physical contrast, optimal signal acquisition, and image reconstruction approaches. We report on a method for hybrid acquisition and reconstruction of three-dimensional pulse-echo ultrasound and optoacoustic images in real time based on passive ultrasound generation with an optical absorber, thus avoiding the hardware complexity of active ultrasound generation. In this way, complete hybrid datasets are generated with a single laser interrogation pulse, resulting in simultaneous rendering of ultrasound and optoacoustic images at an unprecedented rate of 10 volumetric frames per second. Performance is subsequently showcased in phantom experiments and in-vivo measurements from a healthy human volunteer, confirming general clinical applicability of the method.

  12. Four dimensional hybrid ultrasound and optoacoustic imaging via passive element optical excitation in a hand-held probe

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fehm, Thomas Felix; Razansky, Daniel, E-mail: dr@tum.de; Faculty of Medicine, Technische Universität München, Munich

    2014-10-27

    Ultrasonography and optoacoustic imaging share powerful advantages related to the natural aptitude for real-time image rendering with high resolution, the hand-held operation, and lack of ionizing radiation. The two methods also possess very different yet highly complementary advantages of the mechanical and optical contrast in living tissues. Nonetheless, efficient integration of these modalities remains challenging owing to the fundamental differences in the underlying physical contrast, optimal signal acquisition, and image reconstruction approaches. We report on a method for hybrid acquisition and reconstruction of three-dimensional pulse-echo ultrasound and optoacoustic images in real time based on passive ultrasound generation with an optical absorber, thus avoiding the hardware complexity of active ultrasound generation. In this way, complete hybrid datasets are generated with a single laser interrogation pulse, resulting in simultaneous rendering of ultrasound and optoacoustic images at an unprecedented rate of 10 volumetric frames per second. Performance is subsequently showcased in phantom experiments and in-vivo measurements from a healthy human volunteer, confirming general clinical applicability of the method.

  13. Indirectly Estimating International Net Migration Flows by Age and Gender: The Community Demographic Model International Migration (CDM-IM) Dataset

    PubMed Central

    Nawrotzki, Raphael J.; Jiang, Leiwen

    2015-01-01

    Although data for the total number of international migrant flows is now available, no global dataset concerning demographic characteristics, such as the age and gender composition of migrant flows, exists. This paper reports on the methods used to generate the CDM-IM dataset of age and gender specific profiles of bilateral net (not gross) migrant flows. We employ raw data from the United Nations Global Migration Database and estimate net migrant flows by age and gender between two time points around the year 2000, accounting for various demographic processes (fertility, mortality). The dataset contains information on 3,713 net migrant flows. Validation analyses against existing data sets and the historical, geopolitical context demonstrate that the CDM-IM dataset is of reasonably high quality. PMID:26692590

  14. SU-E-J-92: Validating Dose Uncertainty Estimates Produced by AUTODIRECT, An Automated Program to Evaluate Deformable Image Registration Accuracy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kim, H; Chen, J; Pouliot, J

    2015-06-15

    Purpose: Deformable image registration (DIR) is a powerful tool with the potential to deformably map dose from one computed-tomography (CT) image to another. Errors in the DIR, however, will produce errors in the transferred dose distribution. We have proposed a software tool, called AUTODIRECT (automated DIR evaluation of confidence tool), which predicts voxel-specific dose mapping errors on a patient-by-patient basis. This work validates the effectiveness of AUTODIRECT to predict dose mapping errors with virtual and physical phantom datasets. Methods: AUTODIRECT requires 4 inputs: moving and fixed CT images and two noise scans of a water phantom (for noise characterization). Then, AUTODIRECT uses algorithms to generate test deformations and applies them to the moving and fixed images (along with processing) to digitally create sets of test images, with known ground-truth deformations that are similar to the actual one. The clinical DIR algorithm is then applied to these test image sets (currently 4). From these tests, AUTODIRECT generates spatial and dose uncertainty estimates for each image voxel based on a Student’s t distribution. This work compares these uncertainty estimates to the actual errors made by the Velocity Deformable Multi Pass algorithm on 11 virtual and 1 physical phantom datasets. Results: For 11 of the 12 tests, the predicted dose error distributions from AUTODIRECT are well matched to the actual error distributions within 1–6% for 10 virtual phantoms, and 9% for the physical phantom. For one of the cases though, the predictions underestimated the errors in the tail of the distribution. Conclusion: Overall, the AUTODIRECT algorithm performed well on the 12 phantom cases for Velocity and was shown to generate accurate estimates of dose warping uncertainty. AUTODIRECT is able to automatically generate patient-, organ-, and voxel-specific DIR uncertainty estimates. This ability would be useful for patient-specific DIR quality assurance.
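
    AUTODIRECT's uncertainty estimates are based on fitting a Student's t distribution to the errors observed across a small number of test registrations. A minimal sketch of that per-voxel calculation, assuming the test-registration errors are already available as an array (shapes and the 95% level are illustrative, not the tool's actual output):

        import numpy as np
        from scipy import stats

        def voxel_uncertainty(test_errors, confidence=0.95):
            """test_errors: array of shape (n_tests, n_voxels) holding the dose-mapping
            error of each test registration at each voxel. Returns the mean error and a
            symmetric uncertainty bound per voxel from a Student's t distribution."""
            test_errors = np.asarray(test_errors, dtype=float)
            n = test_errors.shape[0]
            mean = test_errors.mean(axis=0)
            sem = test_errors.std(axis=0, ddof=1) / np.sqrt(n)
            t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
            return mean, t_crit * sem

        # Hypothetical example: 4 test registrations, 1000 voxels
        errors = np.random.default_rng(1).normal(0.0, 2.0, size=(4, 1000))
        mu, half_width = voxel_uncertainty(errors)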

  15. The theoretical and practical determination of clinical cut-offs for the British Sign Language versions of PHQ-9 and GAD-7.

    PubMed

    Belk, Rachel A; Pilling, Mark; Rogers, Katherine D; Lovell, Karina; Young, Alys

    2016-11-03

    The PHQ-9 and the GAD-7 assess depression and anxiety respectively. There are standardised, reliability-tested versions in BSL (British Sign Language) that are used with Deaf users of the IAPT service. The aim of this study is to determine their appropriate clinical cut-offs when used with Deaf people who sign and to examine the operating characteristics of the PHQ-9 BSL and GAD-7 BSL in a clinical Deaf population. Two datasets were compared: (i) a dataset (n = 502) from a specialist IAPT service for Deaf people; and (ii) a dataset (n = 85) from our existing study of Deaf people who self-reported having no mental health difficulties. Parameter estimates, with precision, were obtained for the AUC, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) to determine the clinical cut-offs. Three statistical choices were included: Maximising (Youden: maximising sensitivity + specificity), Equalising (sensitivity = specificity) and Prioritising treatment (a false negative counted as twice as bad as a false positive). Standard measures (as defined by IAPT) were applied to examine caseness, recovery, reliable change and reliable recovery for the first dataset. The clinical cut-offs for the PHQ-9 BSL and GAD-7 BSL are 8 and 6 respectively. This compares with the original English version cut-offs in the hearing population of 10 and 8 respectively. The three different statistical choices for calculating clinical cut-offs all showed a lower clinical cut-off for the Deaf population with respect to the PHQ-9 BSL and GAD-7 BSL, with the exception of the Maximising criterion when used with the PHQ-9 BSL. Applying the new clinical cut-offs, the percentage of Deaf BSL IAPT service users showing reliable recovery is 54.0% compared to 63.7% using the cut-off scores used for English speaking hearing people. These compare favourably with national IAPT data for the general population. The correct clinical cut-offs for the PHQ-9 BSL and GAD-7 BSL enable meaningful measures of clinical effectiveness and facilitate appropriate access to treatment when required.
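
    The three cut-off criteria named above (Maximising, Equalising, and Prioritising treatment) can all be expressed as simple searches over an ROC curve. The sketch below illustrates this logic with scikit-learn on hypothetical scores and case labels; it is not the study's analysis code:

        import numpy as np
        from sklearn.metrics import roc_curve

        def cutoffs(y_true, scores):
            fpr, tpr, thr = roc_curve(y_true, scores)
            sens, spec = tpr, 1.0 - fpr
            youden = thr[np.argmax(sens + spec - 1.0)]                    # maximise sensitivity + specificity
            equal = thr[np.argmin(np.abs(sens - spec))]                   # sensitivity roughly equals specificity
            weighted = thr[np.argmin(2.0 * (1.0 - sens) + (1.0 - spec))]  # false negative twice as bad as false positive
            return youden, equal, weighted

        # Hypothetical questionnaire scores: cases tend to score higher than non-cases
        rng = np.random.default_rng(0)
        scores = np.concatenate([rng.normal(12, 4, 80), rng.normal(5, 3, 80)])
        labels = np.array([1] * 80 + [0] * 80)
        print(cutoffs(labels, scores))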

  16. The international primary ciliary dyskinesia cohort (iPCD Cohort): methods and first results.

    PubMed

    Goutaki, Myrofora; Maurer, Elisabeth; Halbeisen, Florian S; Amirav, Israel; Barbato, Angelo; Behan, Laura; Boon, Mieke; Casaulta, Carmen; Clement, Annick; Crowley, Suzanne; Haarman, Eric; Hogg, Claire; Karadag, Bulent; Koerner-Rettberg, Cordula; Leigh, Margaret W; Loebinger, Michael R; Mazurek, Henryk; Morgan, Lucy; Nielsen, Kim G; Omran, Heymut; Schwerk, Nicolaus; Scigliano, Sergio; Werner, Claudius; Yiallouros, Panayiotis; Zivkovic, Zorica; Lucas, Jane S; Kuehni, Claudia E

    2017-01-01

    Data on primary ciliary dyskinesia (PCD) epidemiology is scarce and published studies are characterised by low numbers. In the framework of the European Union project BESTCILIA we aimed to combine all available datasets in a retrospective international PCD cohort (iPCD Cohort). We identified eligible datasets by performing a systematic review of published studies containing clinical information on PCD, and by contacting members of past and current European Respiratory Society Task Forces on PCD. We compared the contents of the datasets, clarified definitions and pooled them in a standardised format. As of April 2016 the iPCD Cohort includes data on 3013 patients from 18 countries. It includes data on diagnostic evaluations, symptoms, lung function, growth and treatments. Longitudinal data are currently available for 542 patients. The extent of clinical details per patient varies between centres. More than 50% of patients have a definite PCD diagnosis based on recent guidelines. Children aged 10-19 years are the largest age group, followed by younger children (≤9 years) and young adults (20-29 years). This is the largest observational PCD dataset available to date. It will allow us to answer pertinent questions on clinical phenotype, disease severity, prognosis and effect of treatments, and to investigate genotype-phenotype correlations. Copyright ©ERS 2017.

  17. The international primary ciliary dyskinesia cohort (iPCD Cohort): methods and first results

    PubMed Central

    Goutaki, Myrofora; Maurer, Elisabeth; Halbeisen, Florian S.; Amirav, Israel; Barbato, Angelo; Behan, Laura; Boon, Mieke; Casaulta, Carmen; Clement, Annick; Crowley, Suzanne; Haarman, Eric; Hogg, Claire; Karadag, Bulent; Koerner-Rettberg, Cordula; Leigh, Margaret W.; Loebinger, Michael R.; Mazurek, Henryk; Morgan, Lucy; Nielsen, Kim G.; Omran, Heymut; Schwerk, Nicolaus; Scigliano, Sergio; Werner, Claudius; Yiallouros, Panayiotis; Zivkovic, Zorica; Lucas, Jane S.

    2017-01-01

    Data on primary ciliary dyskinesia (PCD) epidemiology is scarce and published studies are characterised by low numbers. In the framework of the European Union project BESTCILIA we aimed to combine all available datasets in a retrospective international PCD cohort (iPCD Cohort). We identified eligible datasets by performing a systematic review of published studies containing clinical information on PCD, and by contacting members of past and current European Respiratory Society Task Forces on PCD. We compared the contents of the datasets, clarified definitions and pooled them in a standardised format. As of April 2016 the iPCD Cohort includes data on 3013 patients from 18 countries. It includes data on diagnostic evaluations, symptoms, lung function, growth and treatments. Longitudinal data are currently available for 542 patients. The extent of clinical details per patient varies between centres. More than 50% of patients have a definite PCD diagnosis based on recent guidelines. Children aged 10–19 years are the largest age group, followed by younger children (≤9 years) and young adults (20–29 years). This is the largest observational PCD dataset available to date. It will allow us to answer pertinent questions on clinical phenotype, disease severity, prognosis and effect of treatments, and to investigate genotype–phenotype correlations. PMID:28052956

  18. A comparative study of RNA-Seq and microarray data analysis on the two examples of rectal-cancer patients and Burkitt Lymphoma cells.

    PubMed

    Wolff, Alexander; Bayerlová, Michaela; Gaedcke, Jochen; Kube, Dieter; Beißbarth, Tim

    2018-01-01

    Pipeline comparisons for gene expression data are highly valuable for applied real data analyses, as they enable the selection of suitable analysis strategies for the dataset at hand. Such pipelines should include mapping of reads, counting and differential gene expression analysis for RNA-Seq data, or preprocessing, normalization and differential gene expression analysis in the case of microarrays, in order to give a global insight into pipeline performances. Four commonly used RNA-Seq pipelines (STAR/HTSeq-Count/edgeR, STAR/RSEM/edgeR, Sailfish/edgeR, TopHat2/Cufflinks/CuffDiff) were investigated on multiple levels (alignment and counting) and cross-compared with the microarray counterpart on the level of gene expression and gene ontology enrichment. For these comparisons we generated two matched microarray and RNA-Seq datasets: Burkitt Lymphoma cell line data and rectal cancer patient data. The overall mapping rate of STAR was 98.98% for the cell line dataset and 98.49% for the patient dataset. TopHat2's overall mapping rate was 97.02% and 96.73%, respectively, while Sailfish achieved overall mapping rates of only 84.81% and 54.44%. The correlation of gene expression in microarray and RNA-Seq data was moderately worse for the patient dataset (ρ = 0.67-0.69) than for the cell line dataset (ρ = 0.87-0.88). An exception was Cufflinks, whose correlations were substantially lower (ρ = 0.21-0.29 and 0.34-0.53). For both datasets we identified very low numbers of differentially expressed genes using the microarray platform. For RNA-Seq we checked the agreement of differentially expressed genes identified in the different pipelines and of GO-term enrichment results. In conclusion, the combination of the STAR aligner with HTSeq-Count, followed by STAR with RSEM and by Sailfish, generated differentially expressed genes best suited for the dataset at hand and in agreement with most of the other transcriptomics pipelines.
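
    A central comparison in this study is the per-gene correlation between matched microarray and RNA-Seq expression values. Purely as an illustration (gene-level counts and intensities are assumed to be already matched by gene identifier), a Spearman correlation could be computed as follows:

        import numpy as np
        from scipy.stats import spearmanr

        def platform_correlation(rnaseq_counts, array_intensities):
            """Both inputs: 1-D arrays of per-gene values, matched by gene ID.
            RNA-Seq counts are log-transformed before correlating."""
            log_counts = np.log2(np.asarray(rnaseq_counts, dtype=float) + 1.0)
            rho, pval = spearmanr(log_counts, array_intensities)
            return rho, pval

        # Hypothetical matched expression values for 5000 genes
        rng = np.random.default_rng(3)
        truth = rng.gamma(2.0, 200.0, size=5000)
        rho, _ = platform_correlation(truth * rng.lognormal(0, 0.4, 5000),
                                      np.log2(truth) + rng.normal(0, 0.8, 5000))
        print(f"Spearman rho = {rho:.2f}")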

  19. ODM Data Analysis-A tool for the automatic validation, monitoring and generation of generic descriptive statistics of patient data.

    PubMed

    Brix, Tobias Johannes; Bruland, Philipp; Sarfraz, Saad; Ernsting, Jan; Neuhaus, Philipp; Storck, Michael; Doods, Justin; Ständer, Sonja; Dugas, Martin

    2018-01-01

    A required step for presenting the results of clinical studies is the declaration of participants' demographic and baseline characteristics, as required by FDAAA 801. The common workflow to accomplish this task is to export the clinical data from the electronic data capture system used and import it into statistical software like SAS software or IBM SPSS. This software requires trained users, who have to implement the analysis individually for each item. These expenditures may become an obstacle for small studies. The objective of this work is to design, implement and evaluate an open source application, called ODM Data Analysis, for the semi-automatic analysis of clinical study data. The system requires clinical data in the CDISC Operational Data Model format. After a file is uploaded, its syntax and the data type conformity of the collected data are validated. The completeness of the study data is determined and basic statistics, including illustrative charts for each item, are generated. Datasets from four clinical studies have been used to evaluate the application's performance and functionality. The system is implemented as an open source web application (available at https://odmanalysis.uni-muenster.de) and also provided as a Docker image, which enables easy distribution and installation on local systems. Study data are stored in the application only while the calculations are performed, which is compliant with data protection requirements. Analysis times are below half an hour, even for larger studies with over 6000 subjects. Medical experts have confirmed the usefulness of this application for gaining an overview of their collected study data for monitoring purposes and for generating descriptive statistics without further user interaction. The semi-automatic analysis has its limitations and cannot replace the complex analyses performed by statisticians, but it can be used as a starting point for their examination and reporting.

  20. A method for normalizing pathology images to improve feature extraction for quantitative pathology.

    PubMed

    Tam, Allison; Barker, Jocelyn; Rubin, Daniel

    2016-01-01

    With the advent of digital slide scanning technologies and the potential proliferation of large repositories of digital pathology images, many research studies can leverage these data for biomedical discovery and to develop clinical applications. However, quantitative analysis of digital pathology images is impeded by batch effects generated by varied staining protocols and staining conditions of pathological slides. To overcome this problem, this paper proposes a novel, fully automated stain normalization method to reduce batch effects and thus aid research in digital pathology applications. The proposed method, intensity centering and histogram equalization (ICHE), normalizes a diverse set of pathology images by first scaling the centroids of the intensity histograms to a common point and then applying a modified version of contrast-limited adaptive histogram equalization. Normalization was performed on two datasets of digitized hematoxylin and eosin (H&E) slides of different tissue slices from the same lung tumor, and one immunohistochemistry dataset of digitized slides created by restaining one of the H&E datasets. The ICHE method was evaluated based on image intensity values, quantitative features, and the effect on downstream applications, such as computer-aided diagnosis. For comparison, three methods from the literature were reimplemented and evaluated using the same criteria. The authors found that ICHE not only improved performance compared with un-normalized images, but in most cases showed improvement compared with previous methods for correcting batch effects in the literature. ICHE may be a useful preprocessing step in a digital pathology image processing pipeline.
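
    The two ICHE stages described above (shifting intensity-histogram centroids to a common point, then applying contrast-limited adaptive histogram equalization) can be approximated with standard image-processing primitives. The sketch below uses scikit-image on a grayscale image and is a rough illustration only; the authors' implementation operates on stained slides and uses a modified CLAHE:

        import numpy as np
        from skimage import exposure

        def iche_like(image, target_mean=0.5, clip_limit=0.01):
            """image: float grayscale array in [0, 1]. Shift the intensity centroid
            to target_mean, then apply CLAHE."""
            img = np.asarray(image, dtype=float)
            centered = np.clip(img + (target_mean - img.mean()), 0.0, 1.0)
            return exposure.equalize_adapthist(centered, clip_limit=clip_limit)

        # Hypothetical batch of slides normalized to the same centroid
        rng = np.random.default_rng(7)
        slide_a = rng.beta(2, 5, size=(256, 256))   # darker staining batch
        slide_b = rng.beta(5, 2, size=(256, 256))   # lighter staining batch
        norm_a, norm_b = iche_like(slide_a), iche_like(slide_b)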

  1. DeepMitosis: Mitosis detection via deep detection, verification and segmentation networks.

    PubMed

    Li, Chao; Wang, Xinggang; Liu, Wenyu; Latecki, Longin Jan

    2018-04-01

    Mitotic count is a critical predictor of tumor aggressiveness in breast cancer diagnosis. Nowadays mitosis counting is mainly performed by pathologists manually, which is extremely arduous and time-consuming. In this paper, we propose an accurate method for detecting mitotic cells from histopathological slides using a novel multi-stage deep learning framework. Our method consists of a deep segmentation network for generating the mitosis region when only a weak label is given (i.e., only the centroid pixel of mitosis is annotated), an elaborately designed deep detection network for localizing mitosis by using contextual region information, and a deep verification network for improving detection accuracy by removing false positives. We validate the proposed deep learning method on two widely used Mitosis Detection in Breast Cancer Histological Images (MITOSIS) datasets. Experimental results show that we can achieve the highest F-score on the MITOSIS dataset from the ICPR 2012 grand challenge merely using the deep detection network. For the ICPR 2014 MITOSIS dataset, which only provides the centroid location of mitosis, we employ the segmentation model to estimate the bounding box annotation for training the deep detection network. We also apply the verification model to eliminate some false positives produced by the detection model. By fusing the scores of the detection and verification models, we achieve state-of-the-art results. Moreover, our method is very fast with GPU computing, which makes it feasible for clinical practice. Copyright © 2018 Elsevier B.V. All rights reserved.
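
    The abstract does not specify the exact rule used to fuse the detection and verification scores; a common and simple choice is a weighted average followed by thresholding, sketched below purely for illustration (the weights and threshold are assumptions):

        import numpy as np

        def fuse_scores(det_scores, ver_scores, alpha=0.5, threshold=0.5):
            """det_scores, ver_scores: per-candidate probabilities from the detection
            and verification networks. Returns fused scores and accepted candidates."""
            det = np.asarray(det_scores, dtype=float)
            ver = np.asarray(ver_scores, dtype=float)
            fused = alpha * det + (1.0 - alpha) * ver
            return fused, fused >= threshold

        # Hypothetical candidates: the detector is confident, the verifier rejects the second one
        fused, keep = fuse_scores([0.9, 0.8, 0.4], [0.95, 0.2, 0.6])
        print(fused, keep)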

  2. Exploring syndrome differentiation using non-negative matrix factorization and cluster analysis in patients with atopic dermatitis.

    PubMed

    Yun, Younghee; Jung, Wonmo; Kim, Hyunho; Jang, Bo-Hyoung; Kim, Min-Hee; Noh, Jiseong; Ko, Seong-Gyu; Choi, Inhwa

    2017-08-01

    Syndrome differentiation (SD) results in a diagnostic conclusion based on a cluster of concurrent symptoms and signs, including pulse form and tongue color. In Korea, there is a strong interest in the standardization of Traditional Medicine (TM). In order to standardize TM treatment, standardization of SD should be given priority. The aim of this study was to explore the SD, or symptom clusters, of patients with atopic dermatitis (AD) using non-negative factorization methods and k-means clustering analysis. We screened 80 patients and enrolled 73 eligible patients. One TM dermatologist evaluated the symptoms/signs using an existing clinical dataset from patients with AD. This dataset was designed to collect 15 dermatologic and 18 systemic symptoms/signs associated with AD. Non-negative matrix factorization was used to decompose the original data into a matrix with three features and a weight matrix. The point of intersection of the three coordinates from each patient was placed in three-dimensional space. With five clusters, the silhouette score reached 0.484, and this was the best silhouette score obtained from two to nine clusters. Patients were clustered according to the varying severity of concurrent symptoms/signs. Through the distribution of the null hypothesis generated by 10,000 permutation tests, we found significant cluster-specific symptoms/signs from the confidence intervals in the upper and lower 2.5% of the distribution. Patients in each cluster showed differences in symptoms/signs and severity. In a clinical situation, SD and treatment are based on the practitioners' observations and clinical experience. SD, identified through informatics, can contribute to development of standardized, objective, and consistent SD for each disease. Copyright © 2017. Published by Elsevier Ltd.
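
    The decomposition and clustering steps described above (non-negative matrix factorization into three features, k-means over a range of cluster counts, silhouette-based choice of k) map directly onto scikit-learn. A sketch under the assumption that the 33 symptom/sign severities form a non-negative patient-by-symptom matrix (the random data below only stands in for the clinical dataset):

        import numpy as np
        from sklearn.decomposition import NMF
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        rng = np.random.default_rng(0)
        X = rng.integers(0, 5, size=(73, 33)).astype(float)   # hypothetical severity scores

        # Decompose into 3 latent features, as in the study
        W = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500).fit_transform(X)

        # Choose the number of clusters by silhouette score over k = 2..9
        scores = {}
        for k in range(2, 10):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(W)
            scores[k] = silhouette_score(W, labels)
        best_k = max(scores, key=scores.get)
        print(best_k, round(scores[best_k], 3))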

  3. TH-EF-BRA-03: Assessment of Data-Driven Respiratory Motion-Compensation Methods for 4D-CBCT Image Registration and Reconstruction Using Clinical Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Riblett, MJ; Weiss, E; Hugo, GD

    Purpose: To evaluate the performance of a 4D-CBCT registration and reconstruction method that corrects for respiratory motion and enhances image quality under clinically relevant conditions. Methods: Building on previous work, which tested feasibility of a motion-compensation workflow using image datasets superior to clinical acquisitions, this study assesses workflow performance under clinical conditions in terms of image quality improvement. Evaluated workflows utilized a combination of groupwise deformable image registration (DIR) and image reconstruction. Four-dimensional cone beam CT (4D-CBCT) FDK reconstructions were registered to either mean or respiratory phase reference frame images to model respiratory motion. The resulting 4D transformation was used to deform projection data during the FDK backprojection operation to create a motion-compensated reconstruction. To simulate clinically realistic conditions, superior quality projection datasets were sampled using a phase-binned striding method. Tissue interface sharpness (TIS) was defined as the slope of a sigmoid curve fit to the lung-diaphragm boundary or to the carina tissue-airway boundary when no diaphragm was discernable. Image quality improvement was assessed in 19 clinical cases by evaluating mitigation of view-aliasing artifacts, tissue interface sharpness recovery, and noise reduction. Results: For clinical datasets, evaluated average TIS recovery relative to base 4D-CBCT reconstructions was observed to be 87% using fixed-frame registration alone; 87% using fixed-frame with motion-compensated reconstruction; 92% using mean-frame registration alone; and 90% using mean-frame with motion-compensated reconstruction. Soft tissue noise was reduced on average by 43% and 44% for the fixed-frame registration and registration with motion-compensation methods, respectively, and by 40% and 42% for the corresponding mean-frame methods. Considerable reductions in view aliasing artifacts were observed for each method. Conclusion: Data-driven groupwise registration and motion-compensated reconstruction have the potential to improve the quality of 4D-CBCT images acquired under clinical conditions. For clinical image datasets, the addition of motion compensation after groupwise registration visibly reduced artifact impact. This work was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA166119. Hugo and Weiss hold a research agreement with Philips Healthcare and license agreement with Varian Medical Systems. Weiss receives royalties from UpToDate. Christensen receives funds from Roger Koch to support research.
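
    Tissue interface sharpness is defined above as the slope of a sigmoid fitted to the intensity profile across a tissue boundary. A minimal curve-fitting sketch with SciPy, assuming a 1-D profile sampled across the interface (the profile and parameter values are illustrative):

        import numpy as np
        from scipy.optimize import curve_fit

        def sigmoid(x, low, high, center, slope):
            return low + (high - low) / (1.0 + np.exp(-slope * (x - center)))

        def tissue_interface_sharpness(positions, intensities):
            """Fit a sigmoid to an intensity profile across a tissue boundary and
            return the fitted slope parameter as the sharpness measure."""
            p0 = [min(intensities), max(intensities), float(np.median(positions)), 1.0]
            params, _ = curve_fit(sigmoid, positions, intensities, p0=p0, maxfev=10000)
            return params[3]

        # Hypothetical noisy profile across a diaphragm-like boundary
        x = np.linspace(-10, 10, 81)
        profile = sigmoid(x, 50, 950, 0.0, 0.8) + np.random.default_rng(2).normal(0, 20, x.size)
        print(f"TIS (fitted slope): {tissue_interface_sharpness(x, profile):.2f}")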

  4. Evaluation of the sparse coding super-resolution method for improving image quality of up-sampled images in computed tomography

    NASA Astrophysics Data System (ADS)

    Ota, Junko; Umehara, Kensuke; Ishimaru, Naoki; Ohno, Shunsuke; Okamoto, Kentaro; Suzuki, Takanori; Shirai, Naoki; Ishida, Takayuki

    2017-02-01

    As the capability of high-resolution displays grows, high-resolution images are often required in Computed Tomography (CT). However, acquiring high-resolution images takes a higher radiation dose and a longer scanning time. In this study, we applied the Sparse-coding-based Super-Resolution (ScSR) method to generate high-resolution images without increasing the radiation dose. We prepared an over-complete dictionary that learned the mapping between low- and high-resolution patches and sought a sparse representation of each patch of the low-resolution input. These coefficients were used to generate the high-resolution output. For evaluation, 44 CT cases were used as the test dataset. We up-sampled images by a factor of 2 or 4 and compared the image quality of the ScSR scheme with bilinear and bicubic interpolations, which are the traditional interpolation schemes. We also compared the image quality obtained with three learning datasets: a total of 45 CT images, 91 non-medical images, and 93 chest radiographs were used for dictionary preparation, respectively. Image quality was evaluated by measuring the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). The differences in PSNR and SSIM between the ScSR method and the interpolation methods were statistically significant. Visual assessment confirmed that the ScSR method generated high-resolution images with sharpness, whereas the conventional interpolation methods generated over-smoothed images. Comparing the three training datasets, there was no significant difference between the CT, chest radiograph (CXR) and non-medical datasets. These results suggest that ScSR provides a robust approach for up-sampling CT images and yields substantially higher image quality for the enlarged images.
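
    The image-quality comparison relies on PSNR and SSIM between an up-sampled image and its high-resolution reference. As a small illustration only (not the study's evaluation code), both metrics are available in scikit-image:

        import numpy as np
        from skimage.metrics import peak_signal_noise_ratio, structural_similarity
        from skimage.transform import resize

        def upsample_quality(reference, low_res):
            """Up-sample low_res back to the reference grid with cubic interpolation,
            then compute PSNR/SSIM against the reference (float images in [0, 1])."""
            upsampled = resize(low_res, reference.shape, order=3, anti_aliasing=False)
            psnr = peak_signal_noise_ratio(reference, upsampled, data_range=1.0)
            ssim = structural_similarity(reference, upsampled, data_range=1.0)
            return psnr, ssim

        # Hypothetical slice, down-sampled by a factor of 2 and evaluated after up-sampling
        rng = np.random.default_rng(5)
        ref = rng.random((256, 256))
        low = resize(ref, (128, 128), anti_aliasing=True)
        print(upsample_quality(ref, low))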

  5. Real-world datasets for portfolio selection and solutions of some stochastic dominance portfolio models.

    PubMed

    Bruni, Renato; Cesarone, Francesco; Scozzari, Andrea; Tardella, Fabio

    2016-09-01

    A large number of portfolio selection models have appeared in the literature since the pioneering work of Markowitz. However, even when computational and empirical results are described, they are often hard to replicate and compare due to the unavailability of the datasets used in the experiments. We provide here several datasets for portfolio selection generated using real-world price values from several major stock markets. The datasets contain weekly return values, adjusted for dividends and for stock splits, which are cleaned from errors as much as possible. The datasets are available in different formats, and can be used as benchmarks for testing the performances of portfolio selection models and for comparing the efficiency of the algorithms used to solve them. We also provide, for these datasets, the portfolios obtained by several selection strategies based on Stochastic Dominance models (see "On Exact and Approximate Stochastic Dominance Strategies for Portfolio Selection" (Bruni et al. [2])). We believe that testing portfolio models on publicly available datasets greatly simplifies the comparison of the different portfolio selection strategies.

  6. Generation of High Resolution Global DSM from ALOS PRISM

    NASA Astrophysics Data System (ADS)

    Takaku, J.; Tadono, T.; Tsutsui, K.

    2014-04-01

    Panchromatic Remote-sensing Instrument for Stereo Mapping (PRISM), one of the onboard sensors carried on the Advanced Land Observing Satellite (ALOS), was designed to generate worldwide topographic data with its optical stereoscopic observation. The sensor consists of three independent panchromatic radiometers for viewing forward, nadir, and backward at 2.5 m ground resolution, producing a triplet stereoscopic image along its track. The sensor observed a huge amount of stereo imagery all over the world during the mission life of the satellite from 2006 through 2011. We have semi-automatically processed Digital Surface Model (DSM) data from the image archives in some limited areas. The height accuracy of the dataset was estimated at less than 5 m (rms) from the evaluation with ground control points (GCPs) or reference DSMs derived from Light Detection and Ranging (LiDAR). Then, we decided to process global DSM datasets from all available archives of PRISM stereo images by the end of March 2016. This paper briefly reports on the latest processing algorithms for the global DSM datasets as well as their preliminary results on some test sites. The accuracies and error characteristics of the datasets are analyzed and discussed for various sites by comparison with existing global datasets, such as Ice, Cloud, and land Elevation Satellite (ICESat) data and Shuttle Radar Topography Mission (SRTM) data, as well as with the GCPs and the reference airborne LiDAR DSMs.

  7. The South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLAM BRC) case register: development and descriptive data.

    PubMed

    Stewart, Robert; Soremekun, Mishael; Perera, Gayan; Broadbent, Matthew; Callard, Felicity; Denis, Mike; Hotopf, Matthew; Thornicroft, Graham; Lovestone, Simon

    2009-08-12

    Case registers have been used extensively in mental health research. Recent developments in electronic medical records, and in computer software to search and analyse these in anonymised format, have the potential to revolutionise this research tool. We describe the development of the South London and Maudsley NHS Foundation Trust (SLAM) Biomedical Research Centre (BRC) Case Register Interactive Search tool (CRIS) which allows research-accessible datasets to be derived from SLAM, the largest provider of secondary mental healthcare in Europe. All clinical data, including free text, are available for analysis in the form of anonymised datasets. Development involved both the building of the system and setting in place the necessary security (with both functional and procedural elements). Descriptive data are presented for the Register database as of October 2008. The database at that point included 122,440 cases, 35,396 of whom were receiving active case management under the Care Programme Approach. In terms of gender and ethnicity, the database was reasonably representative of the source population. The most common assigned primary diagnoses were within the ICD mood disorders category (n = 12,756), followed by schizophrenia and related disorders (8158), substance misuse (7749), neuroses (7105) and organic disorders (6414). The SLAM BRC Case Register represents a 'new generation' of this research design, built on a long-running system of fully electronic clinical records and allowing in-depth secondary analysis of numerical, string and free text data, whilst preserving anonymity through technical and procedural safeguards.

  8. Sequence Data for Clostridium autoethanogenum using Three Generations of Sequencing Technologies

    DOE PAGES

    Utturkar, Sagar M.; Klingeman, Dawn Marie; Bruno-Barcena, José M.; ...

    2015-04-14

    During the past decade, DNA sequencing output has been mostly dominated by second generation sequencing platforms, which are characterized by low cost, high throughput and shorter read lengths, for example Illumina. The emergence and development of so-called third generation sequencing platforms such as PacBio has permitted exceptionally long reads (over 20 kb) to be generated. Due to read length increases, algorithm improvements and hybrid assembly approaches, the concept of one chromosome, one contig and automated finishing of microbial genomes is now a realistic and achievable task for many microbial laboratories. In this paper, we describe high quality sequence datasets which span three generations of sequencing technologies, containing six types of data from four NGS platforms and originating from a single microorganism, Clostridium autoethanogenum. The dataset reported here will be useful for the scientific community to evaluate upcoming NGS platforms, enabling comparison of existing and novel bioinformatics approaches, and will encourage interest in the development of innovative experimental and computational methods for NGS data.

  9. A reference human genome dataset of the BGISEQ-500 sequencer.

    PubMed

    Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian

    2017-05-01

    BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared it to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than for the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). For insertions and deletions (indels), however, we found lower accuracy for the BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50 respectively, sensitivity = 88.52% and 70.93%) than for the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as a reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.
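
    Sensitivity and false positive rate for a variant callset are obtained by comparing calls against a truth set. A toy sketch of that bookkeeping follows; real benchmarking pipelines for HG001/NA12878 additionally handle variant normalization and confident-region filtering:

        def call_accuracy(called_variants, truth_variants, n_negative_sites):
            """called_variants, truth_variants: sets of (chrom, pos, ref, alt) tuples.
            n_negative_sites: number of assessed sites with no true variant."""
            tp = len(called_variants & truth_variants)
            fp = len(called_variants - truth_variants)
            fn = len(truth_variants - called_variants)
            sensitivity = tp / (tp + fn) if tp + fn else 0.0
            fpr = fp / n_negative_sites if n_negative_sites else 0.0
            return sensitivity, fpr

        # Hypothetical example
        truth = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 500, "G", "A")}
        calls = {("chr1", 1000, "A", "G"), ("chr2", 500, "G", "A"), ("chr2", 900, "T", "C")}
        print(call_accuracy(calls, truth, n_negative_sites=1_000_000))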

  10. Cross-Cultural Detection of Depression from Nonverbal Behaviour.

    PubMed

    Alghowinem, Sharifa; Goecke, Roland; Cohn, Jeffrey F; Wagner, Michael; Parker, Gordon; Breakspear, Michael

    2015-05-01

    Millions of people worldwide suffer from depression. Do commonalities exist in their nonverbal behavior that would enable cross-culturally viable screening and assessment of severity? We investigated the generalisability of an approach to detect depression severity cross-culturally using video-recorded clinical interviews from Australia, the USA and Germany. The material varied in type of interview, subtypes of depression and inclusion of healthy control subjects, cultural background, and recording environment. The analysis focussed on temporal features of participants' eye gaze and head pose. Several approaches to training and testing within and between datasets were evaluated. The strongest results were found for training across all datasets and testing across datasets using leave-one-subject-out cross-validation. In contrast, generalisability was attenuated when training on only one or two of the three datasets and testing on subjects from the dataset(s) not used in training. These findings highlight the importance of using training data exhibiting the expected range of variability.
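
    The strongest configuration above trains on all three datasets and evaluates with leave-one-subject-out cross-validation, which keeps each subject's recordings out of the training fold whenever that subject is tested. A small sketch of that protocol with scikit-learn on synthetic features (subject identifiers serve as the grouping variable; the features and labels are invented):

        import numpy as np
        from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        n_subjects, segments_per_subject, n_features = 30, 4, 16
        subjects = np.repeat(np.arange(n_subjects), segments_per_subject)
        X = rng.normal(size=(subjects.size, n_features))   # stand-in for gaze/head-pose features
        y = np.repeat(rng.integers(0, 2, n_subjects), segments_per_subject)  # depressed vs. control

        logo = LeaveOneGroupOut()
        clf = SVC(kernel="rbf", gamma="scale")
        scores = cross_val_score(clf, X, y, cv=logo, groups=subjects)
        print(f"leave-one-subject-out accuracy: {scores.mean():.2f}")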

  11. Using Functional or Structural Magnetic Resonance Images and Personal Characteristic Data to Identify ADHD and Autism

    PubMed Central

    Ghiassian, Sina; Greiner, Russell; Jin, Ping; Brown, Matthew R. G.

    2016-01-01

    A clinical tool that can diagnose psychiatric illness using functional or structural magnetic resonance (MR) brain images has the potential to greatly assist physicians and improve treatment efficacy. Working toward the goal of automated diagnosis, we propose an approach for automated classification of ADHD and autism based on histogram of oriented gradients (HOG) features extracted from MR brain images, as well as personal characteristic data features. We describe a learning algorithm that can produce effective classifiers for ADHD and autism when run on two large public datasets. The algorithm is able to distinguish ADHD from control with hold-out accuracy of 69.6% (over baseline 55.0%) using personal characteristics and structural brain scan features when trained on the ADHD-200 dataset (769 participants in training set, 171 in test set). It is able to distinguish autism from control with hold-out accuracy of 65.0% (over baseline 51.6%) using functional images with personal characteristic data when trained on the Autism Brain Imaging Data Exchange (ABIDE) dataset (889 participants in training set, 222 in test set). These results outperform all previously presented methods on both datasets. To our knowledge, this is the first demonstration of a single automated learning process that can produce classifiers for distinguishing patients vs. controls from brain imaging data with above-chance accuracy on large datasets for two different psychiatric illnesses (ADHD and autism). Working toward clinical applications requires robustness against real-world conditions, including the substantial variability that often exists among data collected at different institutions. It is therefore important that our algorithm was successful with the large ADHD-200 and ABIDE datasets, which include data from hundreds of participants collected at multiple institutions. While the resulting classifiers are not yet clinically relevant, this work shows that there is a signal in the (f)MRI data that a learning algorithm is able to find. We anticipate this will lead to yet more accurate classifiers, over these and other psychiatric disorders, working toward the goal of a clinical tool for high accuracy differential diagnosis. PMID:28030565
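
    The feature construction described above combines histogram-of-oriented-gradients descriptors from MR images with personal characteristic data before classification. A compact sketch with scikit-image and scikit-learn; the 2-D slices, characteristic columns, and classifier choice are illustrative assumptions, and the published pipeline is considerably more elaborate:

        import numpy as np
        from skimage.feature import hog
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n_subjects = 120
        slices = rng.random((n_subjects, 64, 64))     # stand-in for MR image slices
        personal = rng.normal(size=(n_subjects, 3))   # e.g. age, sex, IQ (hypothetical)
        labels = rng.integers(0, 2, n_subjects)       # patient vs. control

        hog_features = np.array([
            hog(img, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
            for img in slices
        ])
        X = np.hstack([hog_features, personal])       # image + personal-characteristic features

        clf = LogisticRegression(max_iter=1000)
        print(cross_val_score(clf, X, labels, cv=5).mean())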

  12. A DICOM-based 2nd generation Molecular Imaging Data Grid implementing the IHE XDS-i integration profile.

    PubMed

    Lee, Jasper; Zhang, Jianguo; Park, Ryan; Dagliyan, Grant; Liu, Brent; Huang, H K

    2012-07-01

    A Molecular Imaging Data Grid (MIDG) was developed to address current informatics challenges in the archival, sharing, search, and distribution of preclinical imaging studies between animal imaging facilities and investigator sites. This manuscript presents a 2nd generation MIDG that replaces the Globus Toolkit with a new system architecture implementing the IHE XDS-i integration profile. Implementation and evaluation were conducted using a 3-site interdisciplinary test-bed at the University of Southern California. The 2nd generation MIDG architecture replaces the initial design's Globus Toolkit with dedicated web services and XML-based messaging for the management and delivery of multi-modality DICOM imaging datasets. The Cross-enterprise Document Sharing for Imaging (XDS-i) integration profile from the field of enterprise radiology informatics was adopted into the MIDG design because streamlined image registration, management, and distribution dataflows are needed in preclinical imaging informatics systems just as in enterprise PACS applications. Implementation of the MIDG is demonstrated at the University of Southern California Molecular Imaging Center (MIC) and two other sites with specified hardware, software, and network bandwidth. Evaluation of the MIDG involves data upload, download, and fault-tolerance testing scenarios using multi-modality animal imaging datasets collected at the USC Molecular Imaging Center. The upload, download, and fault-tolerance tests of the MIDG were performed multiple times using 12 collected animal study datasets. Upload and download times demonstrated reproducibility and improved real-world performance. Fault-tolerance tests showed that automated failover between Grid Node Servers has minimal impact on normal download times. Building upon the 1st generation concepts and experiences, the 2nd generation MIDG system improves the accessibility of disparate animal-model molecular imaging datasets to users outside a molecular imaging facility's LAN through a new architecture, dataflow, and dedicated DICOM-based management web services. The productivity and efficiency of preclinical research for translational science investigators has been further streamlined for multi-center study data registration, management, and distribution.

  13. A strategy for evaluating pathway analysis methods.

    PubMed

    Yu, Chenggang; Woo, Hyung Jun; Yu, Xueping; Oyama, Tatsuya; Wallqvist, Anders; Reifman, Jaques

    2017-10-13

    Researchers have previously developed a multitude of methods designed to identify biological pathways associated with specific clinical or experimental conditions of interest, with the aim of facilitating biological interpretation of high-throughput data. Before practically applying such pathway analysis (PA) methods, we must first evaluate their performance and reliability, using datasets where the pathways perturbed by the conditions of interest have been well characterized in advance. However, such 'ground truths' (or gold standards) are often unavailable. Furthermore, previous evaluation strategies that have focused on defining 'true answers' are unable to systematically and objectively assess PA methods under a wide range of conditions. In this work, we propose a novel strategy for evaluating PA methods independently of any gold standard, either established or assumed. The strategy involves the use of two mutually complementary metrics, recall and discrimination. Recall measures the consistency between the perturbed pathways identified by applying a particular analysis method to an original large dataset and those identified by applying the same method to a sub-dataset of the original dataset. In contrast, discrimination measures specificity: the degree to which the perturbed pathways identified by applying a particular method to a dataset from one experiment differ from those identified by applying the same method to a dataset from a different experiment. We used these metrics and 24 datasets to evaluate six widely used PA methods. The results highlighted the common challenge in reliably identifying significant pathways from small datasets. Importantly, we confirmed the effectiveness of our proposed dual-metric strategy by showing that previous comparative studies corroborate the performance evaluations of the six methods obtained by our strategy. Unlike any previously proposed strategy for evaluating the performance of PA methods, our dual-metric strategy does not rely on any ground truth, either established or assumed, of the pathways perturbed by a specific clinical or experimental condition. As such, our strategy allows researchers to systematically and objectively evaluate pathway analysis methods by employing any number of datasets for a variety of conditions.
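
    Both metrics are overlap measures on sets of significant pathways. A minimal sketch of how they could be computed once a PA method has been run on the relevant dataset pairs; the simple overlap fraction used here is an assumption, and the paper's exact formulas may differ:

        def overlap(a, b):
            """Fraction of pathways in set a that are also found in set b."""
            return len(set(a) & set(b)) / len(set(a)) if a else 0.0

        def recall_metric(full_dataset_pathways, subdataset_pathways):
            # Consistency: pathways found on the full dataset that are recovered on a sub-dataset
            return overlap(full_dataset_pathways, subdataset_pathways)

        def discrimination_metric(pathways_exp1, pathways_exp2):
            # Specificity: 1 minus the overlap between two unrelated experiments
            return 1.0 - overlap(pathways_exp1, pathways_exp2)

        # Hypothetical pathway lists returned by a PA method
        full = ["apoptosis", "p53 signaling", "cell cycle", "DNA repair"]
        sub = ["apoptosis", "cell cycle", "MAPK signaling"]
        other_experiment = ["olfactory transduction", "cell cycle"]
        print(recall_metric(full, sub), discrimination_metric(full, other_experiment))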

  14. Comparing adult cannabis treatment-seekers enrolled in a clinical trial with national samples of cannabis users in the United States

    PubMed Central

    McClure, Erin A.; King, Jacqueline S.; Wahle, Aimee; Matthews, Abigail G.; Sonne, Susan C.; Lofwall, Michelle R.; McRae-Clark, Aimee L.; Ghitza, Udi E.; Martinez, Melissa; Cloud, Kasie; Virk, Harvir S.; Gray, Kevin M.

    2017-01-01

    Background Cannabis use rates are increasing among adults in the United States (US) while the perception of harm is declining. This may result in an increased prevalence of cannabis use disorder and the need for more clinical trials to evaluate efficacious treatment strategies. Clinical trials are the gold standard for evaluating treatment, yet study samples are rarely representative of the target population. This finding has not yet been established for cannabis treatment trials. This study compared demographic and cannabis use characteristics of a cannabis cessation clinical trial sample (run through National Drug Abuse Treatment Clinical Trials Network) with three nationally representative datasets from the US; 1) National Survey on Drug Use and Health, 2) National Epidemiologic Survey on Alcohol and Related Conditions-III, and 3) Treatment Episodes Data Set – Admissions. Methods Comparisons were made between the clinical trial sample and appropriate cannabis using sub-samples from the national datasets, and propensity scores were calculated to determine the degree of similarity between samples. Results Results showed that the clinical trial sample was significantly different from all three national datasets, with the clinical trial sample having greater representation among older adults, African Americans, Hispanic/Latinos, adults with more education, non-tobacco users, and daily and almost daily cannabis users. Conclusions These results are consistent with previous studies of other substance use disorder populations and extend sample representation issues to a cannabis use disorder population. This illustrates the need to ensure representative samples within cannabis treatment clinical trials to improve the generalizability of promising findings. PMID:28511033

  15. Comparing adult cannabis treatment-seekers enrolled in a clinical trial with national samples of cannabis users in the United States.

    PubMed

    McClure, Erin A; King, Jacqueline S; Wahle, Aimee; Matthews, Abigail G; Sonne, Susan C; Lofwall, Michelle R; McRae-Clark, Aimee L; Ghitza, Udi E; Martinez, Melissa; Cloud, Kasie; Virk, Harvir S; Gray, Kevin M

    2017-07-01

    Cannabis use rates are increasing among adults in the United States (US) while the perception of harm is declining. This may result in an increased prevalence of cannabis use disorder and the need for more clinical trials to evaluate efficacious treatment strategies. Clinical trials are the gold standard for evaluating treatment, yet study samples are rarely representative of the target population. This finding has not yet been established for cannabis treatment trials. This study compared demographic and cannabis use characteristics of a cannabis cessation clinical trial sample (run through the National Drug Abuse Treatment Clinical Trials Network) with three nationally representative datasets from the US: 1) National Survey on Drug Use and Health, 2) National Epidemiologic Survey on Alcohol and Related Conditions-III, and 3) Treatment Episodes Data Set - Admissions. Comparisons were made between the clinical trial sample and appropriate cannabis-using sub-samples from the national datasets, and propensity scores were calculated to determine the degree of similarity between samples. Results showed that the clinical trial sample was significantly different from all three national datasets, with the clinical trial sample having greater representation among older adults, African Americans, Hispanic/Latinos, adults with more education, non-tobacco users, and daily and almost daily cannabis users. These results are consistent with previous studies of other substance use disorder populations and extend sample representation issues to a cannabis use disorder population. This illustrates the need to ensure representative samples within cannabis treatment clinical trials to improve the generalizability of promising findings. Copyright © 2017 Elsevier B.V. All rights reserved.
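
    Propensity scores for comparing a clinical trial sample with a national survey sample are typically obtained by fitting a logistic regression that predicts sample membership from shared covariates. A sketch of that step; the covariates, coding, and use of plain logistic regression are assumptions for illustration, not the study's exact model:

        import numpy as np
        import pandas as pd
        from sklearn.linear_model import LogisticRegression

        # Hypothetical pooled data: 1 = clinical trial participant, 0 = national survey respondent
        rng = np.random.default_rng(0)
        n_trial, n_survey = 300, 3000
        data = pd.DataFrame({
            "in_trial": np.r_[np.ones(n_trial), np.zeros(n_survey)].astype(int),
            "age": np.r_[rng.normal(35, 10, n_trial), rng.normal(28, 9, n_survey)],
            "education_years": np.r_[rng.normal(14, 2, n_trial), rng.normal(13, 2, n_survey)],
            "daily_use": np.r_[rng.binomial(1, 0.7, n_trial), rng.binomial(1, 0.4, n_survey)],
        })

        X = data[["age", "education_years", "daily_use"]]
        model = LogisticRegression(max_iter=1000).fit(X, data["in_trial"])
        data["propensity"] = model.predict_proba(X)[:, 1]

        # Overlap of the two propensity distributions indicates how similar the samples are
        print(data.groupby("in_trial")["propensity"].describe()[["mean", "std"]])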

  16. GeoNotebook: Browser based Interactive analysis and visualization workflow for very large climate and geospatial datasets

    NASA Astrophysics Data System (ADS)

    Ozturk, D.; Chaudhary, A.; Votava, P.; Kotfila, C.

    2016-12-01

    Jointly developed by Kitware and NASA Ames, GeoNotebook is an open source tool designed to give the maximum amount of flexibility to analysts, while dramatically simplifying the process of exploring geospatially indexed datasets. Packages like Fiona (backed by GDAL), Shapely, Descartes, Geopandas, and PySAL provide a stack of technologies for reading, transforming, and analyzing geospatial data. Combined with the Jupyter notebook and libraries like matplotlib/Basemap, it is possible to generate detailed geospatial visualizations. Unfortunately, the visualizations generated are either static or do not perform well for very large datasets. Also, this setup requires a great deal of boilerplate code to create and maintain. Other extensions exist to remedy these problems, but they provide a separate map for each input cell and do not support map interactions that feed back into the Python environment. To support interactive data exploration and visualization on large datasets, we have developed an extension to the Jupyter notebook that provides a single dynamic map that can be managed from the Python environment and that can communicate back with a server which can perform operations like data subsetting on a cloud-based cluster.

  17. Microbial bebop: creating music from complex dynamics in microbial ecology.

    PubMed

    Larsen, Peter; Gilbert, Jack

    2013-01-01

    In order for society to make effective policy decisions on complex and far-reaching subjects, such as appropriate responses to global climate change, scientists must effectively communicate complex results to the non-scientifically specialized public. However, there are few ways to transform highly complicated scientific data into formats that are engaging to the general community. Taking inspiration from patterns observed in nature and from some of the principles of jazz bebop improvisation, we have generated Microbial Bebop, a method by which microbial environmental data are transformed into music. Microbial Bebop uses meter, pitch, duration, and harmony to highlight the relationships between multiple data types in complex biological datasets. We use a comprehensive microbial ecology time course dataset collected at the L4 marine monitoring station in the Western English Channel as an example of microbial ecological data that can be transformed into music. Four compositions were generated from L4 Station data using Microbial Bebop (www.bio.anl.gov/MicrobialBebop.htm). Each composition, though deriving from the same dataset, is created to highlight different relationships between environmental conditions and microbial community structure. The approach presented here can be applied to a wide variety of complex biological datasets.

  18. Thermalnet: a Deep Convolutional Network for Synthetic Thermal Image Generation

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.; Gorbatsevich, V. S.; Mizginov, V. A.

    2017-05-01

    Deep convolutional neural networks have dramatically changed the landscape of modern computer vision. Nowadays methods based on deep neural networks show the best performance among image recognition and object detection algorithms. While the polishing of network architectures has received a lot of scholarly attention, from the practical point of view the preparation of a large image dataset for successful training of a neural network has become one of the major challenges. This challenge is particularly profound for image recognition in wavelengths lying outside the visible spectrum. For example, no infrared or radar image datasets large enough for successful training of a deep neural network are available to date in the public domain. Recent advances in deep neural networks prove that they are also capable of performing arbitrary image transformations such as super-resolution image generation, grayscale image colorisation and imitation of the style of a given artist. Thus a natural question arises: how could deep neural networks be used for the augmentation of existing large image datasets? This paper is focused on the development of the Thermalnet deep convolutional neural network for the augmentation of existing large visible image datasets with synthetic thermal images. The Thermalnet network architecture is inspired by colorisation deep neural networks.

  19. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks.

    PubMed

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  20. BEAT: Bioinformatics Exon Array Tool to store, analyze and visualize Affymetrix GeneChip Human Exon Array data from disease experiments

    PubMed Central

    2012-01-01

    Background It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. The latest-generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile, providing information on AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and allows analyses at both the gene and exon levels. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon array datasets. It combines a data warehouse approach with rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at http://beat.ba.itb.cnr.it. Results BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza. To highlight all possible AS events, alternative names, accession Ids, Gene Ontology terms and biochemical pathway annotations are integrated with exon- and gene-level expression plots. The user can customize the results by choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis. Conclusions Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics. PMID:22536968

  1. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data.

    PubMed

    Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M

    2018-06-01

    A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.
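    For orientation, the sketch below shows cross-sectional MICE-style imputation with scikit-learn's IterativeImputer on toy data. It illustrates only the cross-sectional component of the idea; the paper's 3D-MICE additionally integrates a Gaussian-process model along the time dimension, which is not reproduced here, and the data are random stand-ins for 13 analytes.

      import numpy as np
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
      from sklearn.impute import IterativeImputer

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 13))            # 100 time points x 13 analytes (toy data)
      mask = rng.random(X.shape) < 0.1          # randomly mask ~10% of the results
      X_missing = X.copy()
      X_missing[mask] = np.nan

      imputer = IterativeImputer(max_iter=10, random_state=0)
      X_filled = imputer.fit_transform(X_missing)

      # Evaluate only on the deliberately masked entries, mirroring the paper's masking strategy.
      nrmse = np.sqrt(np.mean((X_filled[mask] - X[mask]) ** 2)) / X[mask].std()
      print(f"normalized RMSE on masked entries: {nrmse:.3f}")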

  2. Next-Generation Machine Learning for Biological Networks.

    PubMed

    Camacho, Diogo M; Collins, Katherine M; Powers, Rani K; Costello, James C; Collins, James J

    2018-06-14

    Machine learning, a collection of data-analytical techniques aimed at building predictive models from multi-dimensional datasets, is becoming integral to modern biological research. By enabling one to generate models that learn from large datasets and make predictions on likely outcomes, machine learning can be used to study complex cellular systems such as biological networks. Here, we provide a primer on machine learning for life scientists, including an introduction to deep learning. We discuss opportunities and challenges at the intersection of machine learning and network biology, which could impact disease biology, drug discovery, microbiome research, and synthetic biology. Copyright © 2018 Elsevier Inc. All rights reserved.

  3. Neuropsychological Test Selection for Cognitive Impairment Classification: A Machine Learning Approach

    PubMed Central

    Williams, Jennifer A.; Schmitter-Edgecombe, Maureen; Cook, Diane J.

    2016-01-01

    Introduction Reducing the amount of testing required to accurately detect cognitive impairment is clinically relevant. The aim of this research was to determine the fewest number of clinical measures required to accurately classify participants as healthy older adult, mild cognitive impairment (MCI) or dementia using a suite of classification techniques. Methods Two variable selection machine learning models (i.e., naive Bayes, decision tree), a logistic regression, and two participant datasets (i.e., clinical diagnosis, clinical dementia rating; CDR) were explored. Participants classified using clinical diagnosis criteria included 52 individuals with dementia, 97 with MCI, and 161 cognitively healthy older adults. Participants classified using CDR included 154 individuals with CDR = 0, 93 individuals with CDR = 0.5, and 25 individuals with CDR = 1.0+. Twenty-seven demographic, psychological, and neuropsychological variables were available for variable selection. Results No significant difference was observed between naive Bayes, decision tree, and logistic regression models for classification of both clinical diagnosis and CDR datasets. Participant classification (70.0 – 99.1%), geometric mean (60.9 – 98.1%), sensitivity (44.2 – 100%), and specificity (52.7 – 100%) were generally satisfactory. Unsurprisingly, the MCI/CDR = 0.5 participant group was the most challenging to classify. Through variable selection, only 2 – 9 variables were required for classification and varied between datasets in a clinically meaningful way. Conclusions The current study results reveal that machine learning techniques can accurately classify cognitive impairment and reduce the number of measures required for diagnosis. PMID:26332171

  4. Automatic segmentation of coronary arteries from computed tomography angiography data cloud using optimal thresholding

    NASA Astrophysics Data System (ADS)

    Ansari, Muhammad Ahsan; Zai, Sammer; Moon, Young Shik

    2017-01-01

    Manual analysis of the bulk data generated by computed tomography angiography (CTA) is time consuming, and interpretation of such data requires previous knowledge and expertise of the radiologist. Therefore, an automatic method that can isolate the coronary arteries from a given CTA dataset is required. We present an automatic yet effective segmentation method to delineate the coronary arteries from a three-dimensional CTA data cloud. Instead of a region growing process, which is usually time consuming and prone to leakages, the method is based on the optimal thresholding, which is applied globally on the Hessian-based vesselness measure in a localized way (slice by slice) to track the coronaries carefully to their distal ends. Moreover, to make the process automatic, we detect the aorta using the Hough transform technique. The proposed segmentation method is independent of the starting point to initiate its process and is fast in the sense that coronary arteries are obtained without any preprocessing or postprocessing steps. We used 12 real clinical datasets to show the efficiency and accuracy of the presented method. Experimental results reveal that the proposed method achieves 95% average accuracy.
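    A hedged sketch of the general idea follows: slice-wise thresholding of a Hessian-based vesselness measure. It assumes the Frangi filter as the vesselness measure and Otsu's method as the "optimal" threshold, and it omits the authors' aorta detection via the Hough transform and the tracking of vessels to their distal ends.

      import numpy as np
      from skimage.filters import frangi, threshold_otsu

      def segment_vessels(volume):
          """volume: 3D numpy array of CTA intensities (slices, rows, cols)."""
          mask = np.zeros_like(volume, dtype=bool)
          for k in range(volume.shape[0]):
              vesselness = frangi(volume[k])          # Hessian-based vesselness, one slice at a time
              if vesselness.max() > 0:
                  mask[k] = vesselness > threshold_otsu(vesselness)  # slice-wise optimal threshold
          return mask

      toy = np.random.rand(4, 64, 64)                 # random stand-in for a CTA volume
      print(segment_vessels(toy).sum(), "voxels flagged as vessel-like")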

  5. A Web Server and Mobile App for Computing Hemolytic Potency of Peptides

    NASA Astrophysics Data System (ADS)

    Chaudhary, Kumardeep; Kumar, Ritesh; Singh, Sandeep; Tuknait, Abhishek; Gautam, Ankur; Mathur, Deepika; Anand, Priya; Varshney, Grish C.; Raghava, Gajendra P. S.

    2016-03-01

    Numerous therapeutic peptides do not enter clinical trials simply because of their high hemolytic activity. Recently, we developed a database, Hemolytik, for maintaining experimentally validated hemolytic and non-hemolytic peptides. The present study describes a web server and mobile app developed for predicting and screening peptides with hemolytic potency. Firstly, we generated a dataset HemoPI-1 that contains 552 hemolytic peptides extracted from the Hemolytik database and 552 random non-hemolytic peptides (from Swiss-Prot). The sequence analysis of these peptides revealed that certain residues (e.g., L, K, F, W) and motifs (e.g., “FKK”, “LKL”, “KKLL”, “KWK”, “VLK”, “CYCR”, “CRR”, “RFC”, “RRR”, “LKKL”) are more abundant in hemolytic peptides. Therefore, we developed models for discriminating hemolytic and non-hemolytic peptides using various machine learning techniques and achieved more than 95% accuracy. We also developed models for discriminating peptides having high and low hemolytic potential on different datasets called HemoPI-2 and HemoPI-3. In order to serve the scientific community, we developed a web server, mobile app and JAVA-based standalone software (http://crdd.osdd.net/raghava/hemopi/).
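    The sketch below illustrates the kind of sequence features the abstract mentions (residue composition plus presence of the reported motifs); it is an assumed feature encoding for illustration only, not the authors' exact descriptors or models, and the sample peptide is an arbitrary example.

      AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
      MOTIFS = ["FKK", "LKL", "KKLL", "KWK", "VLK", "CYCR", "CRR", "RFC", "RRR", "LKKL"]

      def peptide_features(seq):
          seq = seq.upper()
          comp = {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}   # fractional residue composition
          motifs = {m: int(m in seq) for m in MOTIFS}                   # binary motif-presence flags
          return {**comp, **motifs}

      print(peptide_features("KWKLFKKIGAVLKVL"))   # arbitrary sample peptide sequence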

  6. Understanding user intents in online health forums.

    PubMed

    Zhang, Thomas; Cho, Jason H D; Zhai, Chengxiang

    2015-07-01

    Online health forums provide a convenient way for patients to obtain medical information and connect with physicians and peers outside of clinical settings. However, large quantities of unstructured and diversified content generated on these forums make it difficult for users to digest and extract useful information. Understanding user intents would enable forums to find and recommend relevant information to users by filtering out threads that do not match particular intents. In this paper, we derive a taxonomy of intents to capture user information needs in online health forums and propose novel pattern-based features for use with a multiclass support vector machine (SVM) classifier to classify original thread posts according to their underlying intents. Since no dataset existed for this task, we employ three annotators to manually label a dataset of 1192 HealthBoards posts spanning four forum topics. Experimental results show that an SVM using pattern-based features is highly capable of identifying user intents in forum posts, reaching a maximum precision of 75%, and that an SVM-based hierarchical classifier using both pattern and word features outperforms its SVM counterpart that uses only word features. Furthermore, comparable classification performance can be achieved by training and testing on posts from different forum topics.
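    As a minimal baseline sketch, the snippet below trains a word-feature SVM on a few invented posts with invented intent labels; the paper's intent taxonomy, pattern-based features, and hierarchical classifier are not reproduced.

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC

      # Toy posts and hypothetical intent labels, purely for illustration.
      posts = [
          "what dose of ibuprofen is safe for a child",
          "looking for others who had knee replacement surgery",
          "sharing my recovery story after chemo",
      ]
      intents = ["seeking_information", "seeking_connection", "sharing_experience"]

      clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
      clf.fit(posts, intents)
      print(clf.predict(["anyone else recovering from hip surgery"]))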

  7. Fully automated macular pathology detection in retina optical coherence tomography images using sparse coding and dictionary learning

    NASA Astrophysics Data System (ADS)

    Sun, Yankui; Li, Shan; Sun, Zhongyang

    2017-01-01

    We propose a framework for automated detection of dry age-related macular degeneration (AMD) and diabetic macular edema (DME) from retina optical coherence tomography (OCT) images, based on sparse coding and dictionary learning. The study aims to improve the classification performance of state-of-the-art methods. First, our method presents a general approach to automatically align and crop retina regions; then it obtains global representations of images by using sparse coding and a spatial pyramid; finally, a multiclass linear support vector machine classifier is employed for classification. We apply two datasets for validating our algorithm: Duke spectral domain OCT (SD-OCT) dataset, consisting of volumetric scans acquired from 45 subjects: 15 normal subjects, 15 AMD patients, and 15 DME patients; and clinical SD-OCT dataset, consisting of 678 OCT retina scans acquired from clinics in Beijing: 168, 297, and 213 OCT images for AMD, DME, and normal retinas, respectively. For the former dataset, our classifier correctly identifies 100%, 100%, and 93.33% of the volumes with DME, AMD, and normal subjects, respectively, and thus performs much better than the conventional method; for the latter dataset, our classifier leads to a correct classification rate of 99.67%, 99.67%, and 100.00% for DME, AMD, and normal images, respectively.

  8. Fully automated macular pathology detection in retina optical coherence tomography images using sparse coding and dictionary learning.

    PubMed

    Sun, Yankui; Li, Shan; Sun, Zhongyang

    2017-01-01

    We propose a framework for automated detection of dry age-related macular degeneration (AMD) and diabetic macular edema (DME) from retina optical coherence tomography (OCT) images, based on sparse coding and dictionary learning. The study aims to improve the classification performance of state-of-the-art methods. First, our method presents a general approach to automatically align and crop retina regions; then it obtains global representations of images by using sparse coding and a spatial pyramid; finally, a multiclass linear support vector machine classifier is employed for classification. We apply two datasets for validating our algorithm: Duke spectral domain OCT (SD-OCT) dataset, consisting of volumetric scans acquired from 45 subjects—15 normal subjects, 15 AMD patients, and 15 DME patients; and clinical SD-OCT dataset, consisting of 678 OCT retina scans acquired from clinics in Beijing—168, 297, and 213 OCT images for AMD, DME, and normal retinas, respectively. For the former dataset, our classifier correctly identifies 100%, 100%, and 93.33% of the volumes with DME, AMD, and normal subjects, respectively, and thus performs much better than the conventional method; for the latter dataset, our classifier leads to a correct classification rate of 99.67%, 99.67%, and 100.00% for DME, AMD, and normal images, respectively.

  9. MVIRI/SEVIRI TOA Radiation Datasets within the Climate Monitoring SAF

    NASA Astrophysics Data System (ADS)

    Urbain, Manon; Clerbaux, Nicolas; Ipe, Alessandro; Baudrez, Edward; Velazquez Blazquez, Almudena; Moreels, Johan

    2016-04-01

    Within CM SAF, Interim Climate Data Records (ICDR) of Top-Of-Atmosphere (TOA) radiation products from the Geostationary Earth Radiation Budget (GERB) instruments on the Meteosat Second Generation (MSG) satellites were released in 2013. These datasets (referred to as CM-113 and CM-115 for shortwave (SW) and longwave (LW) radiation, respectively) are based on the instantaneous TOA fluxes from the GERB Edition-1 dataset. They cover the time period 2004-2011. Extending these datasets backward in time is not possible as no GERB instruments were available on the Meteosat First Generation (MFG) satellites. As an alternative, it is proposed to rely on the Meteosat Visible and InfraRed Imager (MVIRI - from 1982 until 2004) and the Spinning Enhanced Visible and Infrared Imager (SEVIRI - from 2004 onward) to generate a long Thematic Climate Data Record (TCDR) from Meteosat instruments. Combining MVIRI and SEVIRI allows an unprecedented temporal (30 minutes / 15 minutes) and spatial (2.5 km / 3 km) resolution compared to the Clouds and the Earth's Radiant Energy System (CERES) products. This is a step forward as it helps to increase the knowledge of the diurnal cycle and the small-scale spatial variations of radiation. The MVIRI/SEVIRI datasets (referred to as CM-23311 and CM-23341 for SW and LW radiation, respectively) will provide daily and monthly averaged TOA Reflected Solar (TRS) and Emitted Thermal (TET) radiation in "all-sky" conditions (no clear-sky conditions for this first version of the datasets), as well as monthly averages of the hourly integrated values. The SEVIRI Solar Channels Calibration (SSCC) and the operational calibration have been used for the SW and LW channels, respectively. For MFG, it is foreseen to replace the latter by the EUMETSAT/GSICS recalibration of MVIRI using HIRS. The CERES TRMM angular dependency models have been used to compute TRS fluxes, while theoretical models have been used for TET fluxes. The CM-23311 and CM-23341 datasets will cover a 32-year time period, from 1st February 1982 to 31st January 2014. TRS and TET fluxes will be provided on a regular latitude-longitude grid at a spatial resolution of 0.05° (i.e. about 5.5 km) to ensure consistency with other CM SAF products. Validation will be performed at lower resolution (e.g. 1° x 1°) by intercomparison with several other datasets (CERES EBAF, CERES SYN 1deg-day, HIRS OLR, ISCCP-FD, NCDC daily OLR, etc.).

  10. Automated identification of wound information in clinical notes of patients with heart diseases: Developing and validating a natural language processing application.

    PubMed

    Topaz, Maxim; Lai, Kenneth; Dowding, Dawn; Lei, Victor J; Zisberg, Anna; Bowles, Kathryn H; Zhou, Li

    2016-12-01

    Electronic health records are being increasingly used by nurses, with up to 80% of the health data recorded as free text. However, only a few studies have developed nursing-relevant tools that help busy clinicians to identify information they need at the point of care. This study developed and validated one of the first automated natural language processing applications to extract wound information (wound type, pressure ulcer stage, wound size, anatomic location, and wound treatment) from free text clinical notes. First, two human annotators manually reviewed a purposeful training sample (n=360) and random test sample (n=1100) of clinical notes (including 50% discharge summaries and 50% outpatient notes), identified wound cases, and created a gold standard dataset. We then trained and tested our natural language processing system (known as MTERMS) to process the wound information. Finally, we assessed our automated approach by comparing system-generated findings against the gold standard. We also compared the prevalence of wound cases identified from free-text data with coded diagnoses in the structured data. The testing dataset included 101 notes (9.2%) with wound information. The overall system performance was good, with an F-measure (a composite measure of the system's accuracy) of 92.7%; results were best for wound treatment (F-measure=95.7%) and poorest for wound size (F-measure=81.9%). Only 46.5% of wound notes had a structured code for a wound diagnosis. The natural language processing system achieved good performance on a subset of randomly selected discharge summaries and outpatient notes. In more than half of the wound notes, there were no coded wound diagnoses, which highlights the significance of using natural language processing to enrich clinical decision making. Our future steps will include expansion of the application's information coverage to other relevant wound factors and validation of the model with external data. Copyright © 2016 Elsevier Ltd. All rights reserved.

  11. A gene expression signature of RAS pathway dependence predicts response to PI3K and RAS pathway inhibitors and expands the population of RAS pathway activated tumors.

    PubMed

    Loboda, Andrey; Nebozhyn, Michael; Klinghoffer, Rich; Frazier, Jason; Chastain, Michael; Arthur, William; Roberts, Brian; Zhang, Theresa; Chenard, Melissa; Haines, Brian; Andersen, Jannik; Nagashima, Kumiko; Paweletz, Cloud; Lynch, Bethany; Feldman, Igor; Dai, Hongyue; Huang, Pearl; Watters, James

    2010-06-30

    Hyperactivation of the Ras signaling pathway is a driver of many cancers, and RAS pathway activation can predict response to targeted therapies. Therefore, optimal methods for measuring Ras pathway activation are critical. The main focus of our work was to develop a gene expression signature that is predictive of RAS pathway dependence. We used the coherent expression of RAS pathway-related genes across multiple datasets to derive a RAS pathway gene expression signature and generate RAS pathway activation scores in pre-clinical cancer models and human tumors. We then related this signature to KRAS mutation status and drug response data in pre-clinical and clinical datasets. The RAS signature score is predictive of KRAS mutation status in lung tumors and cell lines with high (> 90%) sensitivity but relatively low (50%) specificity due to samples that have apparent RAS pathway activation in the absence of a KRAS mutation. In lung and breast cancer cell line panels, the RAS pathway signature score correlates with pMEK and pERK expression, and predicts resistance to AKT inhibition and sensitivity to MEK inhibition within both KRAS mutant and KRAS wild-type groups. The RAS pathway signature is upregulated in breast cancer cell lines that have acquired resistance to AKT inhibition, and is downregulated by inhibition of MEK. In lung cancer cell lines knockdown of KRAS using siRNA demonstrates that the RAS pathway signature is a better measure of dependence on RAS compared to KRAS mutation status. In human tumors, the RAS pathway signature is elevated in ER negative breast tumors and lung adenocarcinomas, and predicts resistance to cetuximab in metastatic colorectal cancer. These data demonstrate that the RAS pathway signature is superior to KRAS mutation status for the prediction of dependence on RAS signaling, can predict response to PI3K and RAS pathway inhibitors, and is likely to have the most clinical utility in lung and breast tumors.

  12. Web-based platform for collaborative medical imaging research

    NASA Astrophysics Data System (ADS)

    Rittner, Leticia; Bento, Mariana P.; Costa, André L.; Souza, Roberto M.; Machado, Rubens C.; Lotufo, Roberto A.

    2015-03-01

    Medical imaging research depends fundamentally on the availability of large image collections, image processing and analysis algorithms, hardware and a multidisciplinary research team. It has to be reproducible, free of errors, fast, accessible through a large variety of devices spread around research centers and conducted simultaneously by a multidisciplinary team. Therefore, we propose a collaborative research environment, named Adessowiki, where tools and datasets are integrated and readily available on the Internet through a web browser. Moreover, processing history and all intermediate results are stored and displayed in automatically generated web pages for each object in the research project or clinical study. It requires no installation or configuration on the client side and offers centralized tools and specialized hardware resources, since processing takes place in the cloud.

  13. Data science for mental health: a UK perspective on a global challenge.

    PubMed

    McIntosh, Andrew M; Stewart, Robert; John, Ann; Smith, Daniel J; Davis, Katrina; Sudlow, Cathie; Corvin, Aiden; Nicodemus, Kristin K; Kingdon, David; Hassan, Lamiece; Hotopf, Matthew; Lawrie, Stephen M; Russ, Tom C; Geddes, John R; Wolpert, Miranda; Wölbert, Eva; Porteous, David J

    2016-10-01

    Data science uses computer science and statistics to extract new knowledge from high-dimensional datasets (ie, those with many different variables and data types). Mental health research, diagnosis, and treatment could benefit from data science that uses cohort studies, genomics, and routine health-care and administrative data. The UK is well placed to trial these approaches through robust NHS-linked data science projects, such as the UK Biobank, Generation Scotland, and the Clinical Record Interactive Search (CRIS) programme. Data science has great potential as a low-cost, high-return catalyst for improved mental health recognition, understanding, support, and outcomes. Lessons learnt from such studies could have global implications. Copyright © 2016 Elsevier Ltd. All rights reserved.

  14. All the World's a Stage: Facilitating Discovery Science and Improved Cancer Care through the Global Alliance for Genomics and Health.

    PubMed

    Lawler, Mark; Siu, Lillian L; Rehm, Heidi L; Chanock, Stephen J; Alterovitz, Gil; Burn, John; Calvo, Fabien; Lacombe, Denis; Teh, Bin Tean; North, Kathryn N; Sawyers, Charles L

    2015-11-01

    The recent explosion of genetic and clinical data generated from tumor genome analysis presents an unparalleled opportunity to enhance our understanding of cancer, but this opportunity is compromised by the reluctance of many in the scientific community to share datasets and the lack of interoperability between different data platforms. The Global Alliance for Genomics and Health is addressing these barriers and challenges through a cooperative framework that encourages "team science" and responsible data sharing, complemented by the development of a series of application program interfaces that link different data platforms, thus breaking down traditional silos and liberating the data to enable new discoveries and ultimately benefit patients. ©2015 American Association for Cancer Research.

  15. Combining independent de novo assemblies optimizes the coding transcriptome for nonconventional model eukaryotic organisms.

    PubMed

    Cerveau, Nicolas; Jackson, Daniel J

    2016-12-09

    Next-generation sequencing (NGS) technologies are arguably the most revolutionary technical development to join the list of tools available to molecular biologists since PCR. For researchers working with nonconventional model organisms, one major problem with the currently dominant NGS platform (Illumina) stems from the obligatory fragmentation of nucleic acid material that occurs prior to sequencing during library preparation. This step creates a significant bioinformatic challenge for accurate de novo assembly of novel transcriptome data. This challenge becomes apparent when a variety of modern assembly tools (of which there is no shortage) are applied to the same raw NGS dataset. With the same assembly parameters, these tools can generate markedly different assembly outputs. In this study we present an approach that generates an optimized consensus de novo assembly of eukaryotic coding transcriptomes. This approach does not represent a new assembler; rather, it combines the outputs of a variety of established assembly packages, and removes redundancy via a series of clustering steps. We test and validate our approach using Illumina datasets from six phylogenetically diverse eukaryotes (three metazoans, two plants and a yeast) and two simulated datasets derived from metazoan reference genome annotations. All of these datasets were assembled using three currently popular assembly packages (CLC, Trinity and IDBA-tran). In addition, we experimentally demonstrate that transcripts unique to one particular assembly package are likely to be bioinformatic artefacts. For all eight datasets our pipeline generates more concise transcriptomes that in fact possess more unique annotatable protein domains than any of the three individual assemblers we employed. Another measure of assembly completeness (using the purpose-built BUSCO databases) also confirmed that our approach yields more information. Our approach yields coding transcriptome assemblies that are more likely to be closer to biological reality than those produced by any of the three individual assembly packages we investigated. This approach (freely available as a simple perl script) will be of use to researchers working with species for which there is little or no reference data against which the assembly of a transcriptome can be performed.

  16. Automatic segmentation of male pelvic anatomy on computed tomography images: a comparison with multiple observers in the context of a multicentre clinical trial.

    PubMed

    Geraghty, John P; Grogan, Garry; Ebert, Martin A

    2013-04-30

    This study investigates the variation in segmentation of several pelvic anatomical structures on computed tomography (CT) between multiple observers and a commercial automatic segmentation method, in the context of quality assurance and evaluation during a multicentre clinical trial. CT scans of two prostate cancer patients ('benchmarking cases'), one high risk (HR) and one intermediate risk (IR), were sent to multiple radiotherapy centres for segmentation of prostate, rectum and bladder structures according to the TROG 03.04 "RADAR" trial protocol definitions. The same structures were automatically segmented using iPlan software for the same two patients, allowing structures defined by automatic segmentation to be quantitatively compared with those defined by multiple observers. A sample of twenty trial patient datasets were also used to automatically generate anatomical structures for quantitative comparison with structures defined by individual observers for the same datasets. There was considerable agreement amongst all observers and automatic segmentation of the benchmarking cases for bladder (mean spatial variations < 0.4 cm across the majority of image slices). Although there was some variation in interpretation of the superior-inferior (cranio-caudal) extent of rectum, human-observer contours were typically within a mean 0.6 cm of automatically-defined contours. Prostate structures were more consistent for the HR case than the IR case with all human observers segmenting a prostate with considerably more volume (mean +113.3%) than that automatically segmented. Similar results were seen across the twenty sample datasets, with disagreement between iPlan and observers dominant at the prostatic apex and superior part of the rectum, which is consistent with observations made during quality assurance reviews during the trial. This study has demonstrated quantitative analysis for comparison of multi-observer segmentation studies. For automatic segmentation algorithms based on image-registration as in iPlan, it is apparent that agreement between observer and automatic segmentation will be a function of patient-specific image characteristics, particularly for anatomy with poor contrast definition. For this reason, it is suggested that automatic registration based on transformation of a single reference dataset adds a significant systematic bias to the resulting volumes and their use in the context of a multicentre trial should be carefully considered.

  17. Generation of Ground Truth Datasets for the Analysis of 3d Point Clouds in Urban Scenes Acquired via Different Sensors

    NASA Astrophysics Data System (ADS)

    Xu, Y.; Sun, Z.; Boerner, R.; Koch, T.; Hoegner, L.; Stilla, U.

    2018-04-01

    In this work, we report a novel way of generating ground truth datasets for analyzing point clouds from different sensors and for the validation of algorithms. Instead of directly labeling a large number of 3D points, which requires time-consuming manual work, a multi-resolution 3D voxel grid for the testing site is generated. Then, with the help of a set of basic labeled points from the reference dataset, we can generate a 3D labeled space of the entire testing site at different resolutions. Specifically, an octree-based voxel structure is applied to voxelize the annotated reference point cloud, by which all the points are organized in 3D grids of multiple resolutions. When automatically annotating new testing point clouds, a voting-based approach is applied to the labeled points within voxels of multiple resolutions, in order to assign a semantic label to the 3D space represented by each voxel. Lastly, robust line- and plane-based fast registration methods are developed for aligning point clouds obtained via various sensors. Benefiting from the labeled 3D spatial information, we can easily create new annotated 3D point clouds from different sensors of the same scene simply by considering the labels of the 3D space in which the points are located, which is convenient for the validation and evaluation of algorithms related to point cloud interpretation and semantic segmentation.
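    A minimal sketch of the voting idea follows: each point of a new cloud receives the majority label of the reference points falling in the same voxel. It assumes a single fixed voxel size and random toy data; the paper's octree with multiple resolutions and its registration steps are omitted.

      import numpy as np
      from collections import Counter

      def voxel_vote(ref_points, ref_labels, new_points, voxel_size=0.5):
          # Collect reference labels per voxel index (floor of coordinates / voxel size).
          votes = {}
          for p, lab in zip(ref_points, ref_labels):
              votes.setdefault(tuple(np.floor(p / voxel_size).astype(int)), []).append(lab)
          out = []
          for p in new_points:
              key = tuple(np.floor(p / voxel_size).astype(int))
              out.append(Counter(votes[key]).most_common(1)[0][0] if key in votes else -1)
          return np.array(out)

      ref = np.random.rand(1000, 3) * 10           # annotated reference cloud (toy)
      labels = np.random.randint(0, 3, size=1000)  # toy semantic labels
      new = np.random.rand(200, 3) * 10            # new cloud to annotate
      print(voxel_vote(ref, labels, new)[:10])     # -1 marks voxels with no reference points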

  18. Assessing the role of a medication-indication resource in the treatment relation extraction from clinical text

    PubMed Central

    Bejan, Cosmin Adrian; Wei, Wei-Qi; Denny, Joshua C

    2015-01-01

    Objective To evaluate the contribution of the MEDication Indication (MEDI) resource and SemRep for identifying treatment relations in clinical text. Materials and methods We first processed clinical documents with SemRep to extract the Unified Medical Language System (UMLS) concepts and the treatment relations between them. Then, we incorporated MEDI into a simple algorithm that identifies treatment relations between two concepts if they match a medication-indication pair in this resource. For a better coverage, we expanded MEDI using ontology relationships from RxNorm and UMLS Metathesaurus. We also developed two ensemble methods, which combined the predictions of SemRep and the MEDI algorithm. We evaluated our selected methods on two datasets, a Vanderbilt corpus of 6864 discharge summaries and the 2010 Informatics for Integrating Biology and the Bedside (i2b2)/Veteran's Affairs (VA) challenge dataset. Results The Vanderbilt dataset included 958 manually annotated treatment relations. A double annotation was performed on 25% of relations with high agreement (Cohen's κ = 0.86). The evaluation consisted of comparing the manual annotated relations with the relations identified by SemRep, the MEDI algorithm, and the two ensemble methods. On the first dataset, the best F1-measure results achieved by the MEDI algorithm and the union of the two resources (78.7 and 80, respectively) were significantly higher than the SemRep results (72.3). On the second dataset, the MEDI algorithm achieved better precision and significantly lower recall values than the best system in the i2b2 challenge. The two systems obtained comparable F1-measure values on the subset of i2b2 relations with both arguments in MEDI. Conclusions Both SemRep and MEDI can be used to extract treatment relations from clinical text. Knowledge-based extraction with MEDI outperformed use of SemRep alone, but superior performance was achieved by integrating both systems. The integration of knowledge-based resources such as MEDI into information extraction systems such as SemRep and the i2b2 relation extractors may improve treatment relation extraction from clinical text. PMID:25336593

  19. Lightweight fuzzy processes in clinical computing.

    PubMed

    Hurdle, J F

    1997-09-01

    In spite of advances in computing hardware, many hospitals still have a hard time finding extra capacity in their production clinical information system to run artificial intelligence (AI) modules, for example: to support real-time drug-drug or drug-lab interactions; to track infection trends; to monitor compliance with case-specific clinical guidelines; or to monitor/control biomedical devices such as an intelligent ventilator. Historically, adding AI functionality was not a major design concern when a typical clinical system was originally specified. AI technology is usually retrofitted 'on top of the old system' or 'run off line' in tandem with the old system to ensure that the routine workload would still get done (with as little impact from the AI side as possible). To compound the burden on system performance, most institutions have witnessed a long and increasing trend for intramural and extramural reporting (e.g. the collection of data for a quality-control report in microbiology, or a meta-analysis of a suite of coronary artery bypass graft techniques, etc.), and these place an ever-growing burden on the typical computer system's performance. We discuss a promising approach to adding extra AI processing power to a heavily-used system based on the notion of 'lightweight fuzzy processing' (LFP), that is, fuzzy modules designed from the outset to impose a small computational load. A formal model for a useful subclass of fuzzy systems is defined below and is used as a framework for the automated generation of LFPs. By seeking to reduce the arithmetic complexity of the model (a hand-crafted process) and the data complexity of the model (an automated process), we show how LFPs can be generated for three sample datasets of clinical relevance.
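    Purely as an illustration of the low-cost arithmetic such a lightweight fuzzy module is built from (this is not the paper's LFP model), the fragment below defines a triangular membership function and a single min-rule over inputs assumed to be normalized to [0, 1].

      def tri(x, a, b, c):
          """Triangular membership: 0 at or outside [a, c], 1 at b."""
          if x <= a or x >= c:
              return 0.0
          return (x - a) / (b - a) if x < b else (c - x) / (c - b)

      def alert_strength(lab_value, dose):
          low_lab = tri(lab_value, 0.0, 0.2, 0.4)   # membership in "low lab value"
          high_dose = tri(dose, 0.6, 0.8, 1.0)      # membership in "high dose"
          return min(low_lab, high_dose)            # rule: IF low lab AND high dose THEN alert

      print(alert_strength(0.25, 0.85))             # -> 0.75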

  20. Global evaluation of ammonia bidirectional exchange and livestock diurnal variation schemes

    EPA Pesticide Factsheets

    There is no EPA generated dataset in this study. This dataset is associated with the following publication: Zhu, L., D. Henze, J. Bash, G. Jeong, K. Cady-Pereira, M. Shephard, M. Luo, F. Poulot, and S. Capps. Global evaluation of ammonia bidirectional exchange and livestock diurnal variation schemes. Atmospheric Chemistry and Physics. Copernicus Publications, Katlenburg-Lindau, GERMANY, 15: 12823-12843, (2015).

  1. Automatic estimation of heart boundaries and cardiothoracic ratio from chest x-ray images

    NASA Astrophysics Data System (ADS)

    Dallal, Ahmed H.; Agarwal, Chirag; Arbabshirani, Mohammad R.; Patel, Aalpen; Moore, Gregory

    2017-03-01

    Cardiothoracic ratio (CTR) is a widely used radiographic index to assess heart size on chest X-rays (CXRs). Recent studies have suggested that two-dimensional CTR might also contain clinical information about heart function. However, manual measurement of such indices is both subjective and time consuming. This study proposes a fast algorithm to automatically estimate CTR indices based on CXRs. The algorithm has three main steps: 1) model-based lung segmentation, 2) estimation of heart boundaries from lung contours, and 3) computation of cardiothoracic indices from the estimated boundaries. We extended a previously employed lung detection algorithm to automatically estimate heart boundaries without using ground truth heart markings. We used two datasets: a publicly available dataset with 247 images as well as a clinical dataset with 167 studies from Geisinger Health System. The models of lung fields are learned from both datasets. The lung regions in a given test image are estimated by registering the learned models to patient CXRs. Then, the heart region is estimated by applying the Harris operator to the segmented lung fields to detect the corner points corresponding to the heart boundaries. The algorithm calculates three indices: CTR1D, CTR2D, and the cardiothoracic area ratio (CTAR). The method was tested on 103 clinical CXRs and average error rates of 7.9%, 25.5%, and 26.4% (for CTR1D, CTR2D, and CTAR, respectively) were achieved. The proposed method outperforms previous CTR estimation methods without using any heart templates. This method can have important clinical implications as it can provide fast and accurate estimates of cardiothoracic indices.
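    For clarity, the sketch below computes the one-dimensional index itself (maximal horizontal heart width over maximal thoracic width) from hypothetical binary masks; it is not the authors' segmentation pipeline, and approximating the thoracic extent by the union of lung and heart masks is an assumption made for the toy example.

      import numpy as np

      def ctr_1d(heart_mask, lung_mask):
          """Binary masks of equal shape (rows, cols); returns max heart width / max thoracic width."""
          def max_width(mask):
              cols = np.where(mask.any(axis=0))[0]
              return 0 if cols.size == 0 else cols.max() - cols.min() + 1
          thorax = heart_mask | lung_mask            # thoracic extent approximated by lungs + heart
          return max_width(heart_mask) / max_width(thorax)

      heart = np.zeros((10, 20), dtype=bool); heart[4:7, 8:13] = True
      lungs = np.zeros((10, 20), dtype=bool); lungs[2:9, 2:8] = True; lungs[2:9, 13:19] = True
      print(f"CTR1D = {ctr_1d(heart, lungs):.2f}")   # 5 / 17 ≈ 0.29 for this toy example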

  2. Magnetic resonance-transcranial ultrasound fusion imaging: A novel tool for brain electrode location.

    PubMed

    Walter, Uwe; Müller, Jan-Uwe; Rösche, Johannes; Kirsch, Michael; Grossmann, Annette; Benecke, Reiner; Wittstock, Matthias; Wolters, Alexander

    2016-03-01

    A combination of preoperative magnetic resonance imaging (MRI) with real-time transcranial ultrasound, known as fusion imaging, may improve postoperative control of deep brain stimulation (DBS) electrode location. Fusion imaging, however, employs a weak magnetic field for tracking the position of the ultrasound transducer and the patient's head. Here we assessed its feasibility, safety, and clinical relevance in patients with DBS. Eighteen imaging sessions were conducted in 15 patients (7 women; aged 52.4 ± 14.4 y) with DBS of subthalamic nucleus (n = 6), globus pallidus interna (n = 5), ventro-intermediate (n = 3), or anterior (n = 1) thalamic nucleus and clinically suspected lead displacement. Minimum distance between DBS generator and magnetic field transmitter was kept at 65 cm. The pre-implantation MRI dataset was loaded into the ultrasound system for the fusion imaging examination. The DBS lead position was rated using validated criteria. Generator DBS parameters and neurological state of patients were monitored. Magnetic resonance-ultrasound fusion imaging and volume navigation were feasible in all cases and provided with real-time imaging capabilities of DBS lead and its location within the superimposed magnetic resonance images. Of 35 assessed lead locations, 30 were rated optimal, three suboptimal, and two displaced. In two cases, electrodes were re-implanted after confirming their inappropriate location on computed tomography (CT) scan. No influence of fusion imaging on clinical state of patients, or on DBS implantable pulse generator function, was found. Magnetic resonance-ultrasound real-time fusion imaging of DBS electrodes is safe with distinct precautions and improves assessment of electrode location. It may lower the need for repeated CT or MRI scans in DBS patients. © 2015 International Parkinson and Movement Disorder Society.

  3. A Neuroelectrical Brain Imaging Study on the Perception of Figurative Paintings against Only their Color or Shape Contents.

    PubMed

    Maglione, Anton G; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio

    2017-01-01

    In this study, the cortical activity correlated with the perception and appreciation of different sets of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistically educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style datasets, respectively). Results suggest that the verbal appreciation of the Orig dataset, when compared to the Color and Style ones, was mainly correlated with the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Within the first 10 s of observation: (1) the Orig dataset induced more emotion and was perceived with more appreciation than the Color and Style datasets; (2) the Style dataset was perceived with more attentional effort than the other investigated datasets. During the whole 30 s observation period: (1) the emotion induced by the Color and Style datasets increased over time, while that induced by the Orig dataset remained stable; (2) the Color and Style datasets were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is consistent with the notion that active perception of the images, with sustained cognitive attention in parietal and central areas, caused the generation of the judgment about their aesthetic appreciation in frontal areas.

  4. A Neuroelectrical Brain Imaging Study on the Perception of Figurative Paintings against Only their Color or Shape Contents

    PubMed Central

    Maglione, Anton G.; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio

    2017-01-01

    In this study, the cortical activity correlated with the perception and appreciation of different sets of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistically educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style datasets, respectively). Results suggest that the verbal appreciation of the Orig dataset, when compared to the Color and Style ones, was mainly correlated with the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Within the first 10 s of observation: (1) the Orig dataset induced more emotion and was perceived with more appreciation than the Color and Style datasets; (2) the Style dataset was perceived with more attentional effort than the other investigated datasets. During the whole 30 s observation period: (1) the emotion induced by the Color and Style datasets increased over time, while that induced by the Orig dataset remained stable; (2) the Color and Style datasets were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is consistent with the notion that active perception of the images, with sustained cognitive attention in parietal and central areas, caused the generation of the judgment about their aesthetic appreciation in frontal areas. PMID:28790907

  5. Predicting Classifier Performance with Limited Training Data: Applications to Computer-Aided Diagnosis in Breast and Prostate Cancer

    PubMed Central

    Basavanhally, Ajay; Viswanath, Satish; Madabhushi, Anant

    2015-01-01

    Clinical trials increasingly employ medical imaging data in conjunction with supervised classifiers, where the latter require large amounts of training data to accurately model the system. Yet, a classifier selected at the start of the trial based on smaller and more accessible datasets may yield inaccurate and unstable classification performance. In this paper, we aim to address two common concerns in classifier selection for clinical trials: (1) predicting expected classifier performance for large datasets based on error rates calculated from smaller datasets and (2) the selection of appropriate classifiers based on expected performance for larger datasets. We present a framework for comparative evaluation of classifiers using only limited amounts of training data by using random repeated sampling (RRS) in conjunction with a cross-validation sampling strategy. Extrapolated error rates are subsequently validated via comparison with leave-one-out cross-validation performed on a larger dataset. The ability to predict error rates as dataset size increases is demonstrated on both synthetic data as well as three different computational imaging tasks: detecting cancerous image regions in prostate histopathology, differentiating high and low grade cancer in breast histopathology, and detecting cancerous metavoxels in prostate magnetic resonance spectroscopy. For each task, the relationships between 3 distinct classifiers (k-nearest neighbor, naive Bayes, Support Vector Machine) are explored. Further quantitative evaluation in terms of interquartile range (IQR) suggests that our approach consistently yields error rates with lower variability (mean IQRs of 0.0070, 0.0127, and 0.0140) than a traditional RRS approach (mean IQRs of 0.0297, 0.0779, and 0.305) that does not employ cross-validation sampling for all three datasets. PMID:25993029
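    The snippet below sketches repeated random sampling (RRS) error estimates at increasing training sizes, the kind of curve the framework above extrapolates from; the synthetic data, the k-nearest-neighbor classifier, and the split sizes are all illustrative assumptions, and the paper's cross-validated sampling strategy and extrapolation model are not reproduced.

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.model_selection import ShuffleSplit, cross_val_score
      from sklearn.neighbors import KNeighborsClassifier

      X, y = make_classification(n_samples=400, n_features=20, random_state=0)
      for n_train in (50, 100, 200, 300):
          # 25 repeated random splits at each training-set size, fixed 100-sample test sets.
          cv = ShuffleSplit(n_splits=25, train_size=n_train, test_size=100, random_state=0)
          scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
          errs = 1 - scores
          q75, q25 = np.percentile(errs, [75, 25])
          print(f"n_train={n_train:3d}  mean error={errs.mean():.3f}  IQR={q75 - q25:.3f}")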

  6. Machine Learning Techniques in Clinical Vision Sciences.

    PubMed

    Caixinha, Miguel; Nunes, Sandrina

    2017-01-01

    This review presents and discusses the contribution of machine learning techniques for diagnosis and disease monitoring in the context of clinical vision science. Many ocular diseases leading to blindness can be halted or delayed when detected and treated at their earliest stages. With the recent developments in diagnostic devices, imaging and genomics, new sources of data for early disease detection and patients' management are now available. Machine learning techniques emerged in the biomedical sciences as clinical decision-support techniques to improve sensitivity and specificity of disease detection and monitoring, and to increase the objectivity of the clinical decision-making process. This manuscript presents a review of multimodal ocular disease diagnosis and monitoring based on machine learning approaches. In the first section, the technical issues related to the different machine learning approaches will be presented. Machine learning techniques are used to automatically recognize complex patterns in a given dataset. These techniques allow creating homogeneous groups (unsupervised learning), or creating a classifier that predicts group membership of new cases (supervised learning), when a group label is available for each case. To ensure a good performance of the machine learning techniques in a given dataset, all possible sources of bias should be removed or minimized. For that, the representativeness of the input dataset for the true population should be confirmed, the noise should be removed, the missing data should be treated and the data dimensionality (i.e., the number of parameters/features and the number of cases in the dataset) should be adjusted. The application of machine learning techniques in ocular disease diagnosis and monitoring will be presented and discussed in the second section of this manuscript. To show the clinical benefits of machine learning in clinical vision sciences, several examples will be presented in glaucoma, age-related macular degeneration, and diabetic retinopathy, these ocular pathologies being the major causes of irreversible visual impairment.

  7. The use of Data Mining in the categorization of patients with Azoospermia.

    PubMed

    Mikos, Themistoklis; Maglaveras, Nikolaos; Pantazis, Konstantinos; Goulis, Dimitrios G; Bontis, John N; Papadimas, John

    2005-01-01

    Data Mining is a relatively new field of Medical Informatics. The aim of this study was to compare Data Mining diagnosis with clinical diagnosis by applying a Data Miner (DM) to a clinical dataset of infertile men with azoospermia. One hundred and forty-seven azoospermic men were clinically classified into four groups: a) obstructive azoospermia (n=63), b) non-obstructive azoospermia (n=71), c) hypergonadotropic hypogonadism (n=2), and d) hypogonadotropic hypogonadism (n=11). The DM (IBM's DB2/Intelligent Miner for Data 6.1) was asked to reproduce a four-cluster model. The DM formed four groups of patients: a) eugonadal men with normal testicular volume and normal FSH levels (n=86), b) eugonadal men with significantly reduced testicular volume (median 6.5 cm3) and very high FSH levels (n=29), c) eugonadal men with moderately reduced testicular volume (median 14.5 cm3) and raised FSH levels (n=20), and d) hypogonadal men (n=12). The overall DM concordance rate was 92% in hypogonadal men, 73% in obstructive azoospermia, and 69% in non-obstructive azoospermia. Data Mining produces results that are clinically meaningful but different from those of the clinical diagnosis. It is possible that the use of large sets of structured and formalised data and the continuous evaluation of DM results will generate a useful methodology for the clinician.

  8. The creation of future daily gridded datasets of precipitation and temperature with a spatial weather generator, Cyprus 2020-2050

    NASA Astrophysics Data System (ADS)

    Camera, Corrado; Bruggeman, Adriana; Hadjinicolaou, Panos; Pashiardis, Stelios; Lange, Manfred

    2014-05-01

    High-resolution gridded daily datasets are essential for natural resource management and the analysis of climate changes and their effects. This study aimed to create gridded datasets of daily precipitation and daily minimum and maximum temperature for the future (2020-2050). The horizontal resolution of the developed datasets is 1 x 1 km2, covering the area under control of the Republic of Cyprus (5760 km2). The study is divided into two parts. The first consists of the evaluation of the performance of different interpolation techniques for daily rainfall and temperature data (1980-2010) for the creation of the gridded datasets. Rainfall data recorded at 145 stations and temperature data from 34 stations were used. For precipitation, inverse distance weighting (IDW) performs best for local events, while a combination of step-wise geographically weighted regression and IDW proves to be the best method for large-scale events. For minimum and maximum temperature, a combination of step-wise linear multiple regression and thin plate splines is recognized as the best method. Six Regional Climate Models (RCMs) for the A1B SRES emission scenario from the EU ENSEMBLE project database were selected as sources for future climate projections. The RCMs were evaluated for their capacity to simulate Cyprus climatology for the period 1980-2010. Data for the period 2020-2050 from the three best-performing RCMs were downscaled, using the change factors approach, at the locations of the observational stations. Daily time series were created with a stochastic rainfall and temperature generator. The RainSim V3 software (Burton et al., 2008) was used to generate spatial-temporal coherent rainfall fields. The temperature generator was developed in R and modeled temperature as a weakly stationary process with the daily mean and standard deviation conditioned on the wet and dry state of the day (Richardson, 1981). Finally, gridded datasets depicting projected future climate conditions were created with the identified best interpolation methods. The difference between the input and simulated mean daily rainfall, averaged over all the stations, was 0.03 mm (2.2%), while the error related to the number of dry days was 2 (0.6%). For mean daily minimum temperature the error was 0.005 ºC (0.04%), while for maximum temperature it was 0.01 ºC (0.04%). Overall, the weather generators were found to be reliable instruments for the downscaling of precipitation and temperature. The resulting datasets indicate a decrease of the mean annual rainfall over the study area between 5 and 70 mm (1-15%) for 2020-2050, relative to 1980-2010. Average annual minimum and maximum temperatures over the Republic of Cyprus are projected to increase between 1.2 and 1.5 ºC. The dataset is currently used to compute agricultural production and water use indicators, as part of the AGWATER project (AEIFORIA/GEORGO/0311(BIE)/06), co-financed by the European Regional Development Fund and the Republic of Cyprus through the Research Promotion Foundation. Burton, A., Kilsby, C.G., Fowler, H.J., Cowpertwait, P.S.P., and O'Connell, P.E.: RainSim: A spatial-temporal stochastic rainfall modelling system. Environ. Model. Software 23, 1356-1369, 2008. Richardson, C.W.: Stochastic simulation of daily precipitation, temperature, and solar radiation. Water Resour. Res. 17, 182-190, 1981.
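    A minimal sketch of inverse distance weighting (IDW), the interpolator the study found best for local rainfall events, follows; the station coordinates, rainfall values, and grid cells are made-up toy numbers.

      import numpy as np

      def idw(station_xy, station_values, grid_xy, power=2.0):
          # Distances from every grid cell to every station, shape (n_grid, n_stations).
          d = np.linalg.norm(grid_xy[:, None, :] - station_xy[None, :, :], axis=2)
          d = np.where(d == 0, 1e-12, d)            # avoid division by zero at station locations
          w = 1.0 / d ** power
          return (w * station_values).sum(axis=1) / w.sum(axis=1)

      stations = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])   # hypothetical station coords (km)
      rain_mm = np.array([12.0, 3.0, 7.5])                          # daily rainfall at each station
      grid = np.array([[2.0, 1.0], [7.0, 4.0]])                     # two grid cells to interpolate
      print(idw(stations, rain_mm, grid))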

  9. PIVOT: platform for interactive analysis and visualization of transcriptomics data.

    PubMed

    Zhu, Qin; Fisher, Stephen A; Dueck, Hannah; Middleton, Sarah; Khaladkar, Mugdha; Kim, Junhyong

    2018-01-05

    Many R packages have been developed for transcriptome analysis but their use often requires familiarity with R and integrating results of different packages requires scripts to wrangle the datatypes. Furthermore, exploratory data analyses often generate multiple derived datasets such as data subsets or data transformations, which can be difficult to track. Here we present PIVOT, an R-based platform that wraps open source transcriptome analysis packages with a uniform user interface and graphical data management that allows non-programmers to interactively explore transcriptomics data. PIVOT supports more than 40 popular open source packages for transcriptome analysis and provides an extensive set of tools for statistical data manipulations. A graph-based visual interface is used to represent the links between derived datasets, allowing easy tracking of data versions. PIVOT further supports automatic report generation, publication-quality plots, and program/data state saving, such that all analysis can be saved, shared and reproduced. PIVOT will allow researchers with broad background to easily access sophisticated transcriptome analysis tools and interactively explore transcriptome datasets.

  10. Polyester: simulating RNA-seq datasets with differential transcript expression.

    PubMed

    Frazee, Alyssa C; Jaffe, Andrew E; Langmead, Ben; Leek, Jeffrey T

    2015-09-01

    Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user. Polyester is freely available from Bioconductor (http://bioconductor.org/). jtleek@gmail.com Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  11. Benchmarking Deep Learning Models on Large Healthcare Datasets.

    PubMed

    Purushotham, Sanjay; Meng, Chuizheng; Che, Zhengping; Liu, Yan

    2018-06-04

    Deep learning models (aka Deep Neural Networks) have revolutionized many fields including computer vision, natural language processing, and speech recognition, and are increasingly being used in clinical healthcare applications. However, few works have benchmarked the performance of deep learning models against state-of-the-art machine learning models and prognostic scoring systems on publicly available healthcare datasets. In this paper, we present the benchmarking results for several clinical prediction tasks such as mortality prediction, length of stay prediction, and ICD-9 code group prediction using Deep Learning models, an ensemble of machine learning models (Super Learner algorithm), and SAPS II and SOFA scores. We used the Medical Information Mart for Intensive Care III (MIMIC-III) (v1.4) publicly available dataset, which includes all patients admitted to an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012, for the benchmarking tasks. Our results show that deep learning models consistently outperform all the other approaches, especially when the 'raw' clinical time series data is used as input features to the models. Copyright © 2018 Elsevier Inc. All rights reserved.
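
    A minimal sketch of this kind of benchmarking comparison (a linear baseline versus a small neural network, evaluated by AUROC) using synthetic tabular features as a stand-in for MIMIC-III, which requires credentialed access:

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import train_test_split
      from sklearn.neural_network import MLPClassifier

      # synthetic stand-in for aggregated ICU features with a rare outcome (not MIMIC-III data)
      X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                                 weights=[0.9, 0.1], random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

      models = {
          "logistic regression": LogisticRegression(max_iter=1000),
          "feed-forward net": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
      }
      for name, model in models.items():
          model.fit(X_tr, y_tr)
          auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
          print(f"{name}: AUROC = {auc:.3f}")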

  12. A Locally Adaptive Regularization Based on Anisotropic Diffusion for Deformable Image Registration of Sliding Organs

    PubMed Central

    Pace, Danielle F.; Aylward, Stephen R.; Niethammer, Marc

    2014-01-01

    We propose a deformable image registration algorithm that uses anisotropic smoothing for regularization to find correspondences between images of sliding organs. In particular, we apply the method for respiratory motion estimation in longitudinal thoracic and abdominal computed tomography scans. The algorithm uses locally adaptive diffusion tensors to determine the direction and magnitude with which to smooth the components of the displacement field that are normal and tangential to an expected sliding boundary. Validation was performed using synthetic, phantom, and 14 clinical datasets, including the publicly available DIR-Lab dataset. We show that motion discontinuities caused by sliding can be effectively recovered, unlike conventional regularizations that enforce globally smooth motion. In the clinical datasets, target registration error showed improved accuracy for lung landmarks compared to the diffusive regularization. We also present a generalization of our algorithm to other sliding geometries, including sliding tubes (e.g., needles sliding through tissue, or contrast agent flowing through a vessel). Potential clinical applications of this method include longitudinal change detection and radiotherapy for lung or abdominal tumours, especially those near the chest or abdominal wall. PMID:23899632

  13. A locally adaptive regularization based on anisotropic diffusion for deformable image registration of sliding organs.

    PubMed

    Pace, Danielle F; Aylward, Stephen R; Niethammer, Marc

    2013-11-01

    We propose a deformable image registration algorithm that uses anisotropic smoothing for regularization to find correspondences between images of sliding organs. In particular, we apply the method for respiratory motion estimation in longitudinal thoracic and abdominal computed tomography scans. The algorithm uses locally adaptive diffusion tensors to determine the direction and magnitude with which to smooth the components of the displacement field that are normal and tangential to an expected sliding boundary. Validation was performed using synthetic, phantom, and 14 clinical datasets, including the publicly available DIR-Lab dataset. We show that motion discontinuities caused by sliding can be effectively recovered, unlike conventional regularizations that enforce globally smooth motion. In the clinical datasets, target registration error showed improved accuracy for lung landmarks compared to the diffusive regularization. We also present a generalization of our algorithm to other sliding geometries, including sliding tubes (e.g., needles sliding through tissue, or contrast agent flowing through a vessel). Potential clinical applications of this method include longitudinal change detection and radiotherapy for lung or abdominal tumours, especially those near the chest or abdominal wall.
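
    As a heavily simplified illustration of direction-dependent regularization (not the authors' locally adaptive diffusion-tensor scheme), the sketch below smooths a 2D displacement field with a larger Gaussian kernel along an assumed axis-aligned sliding boundary than across it, which largely preserves the sliding discontinuity; the field and kernel widths are invented:

      import numpy as np
      from scipy.ndimage import gaussian_filter

      # hypothetical 2D displacement field (uy, ux) with a sliding discontinuity at row 32
      u = np.zeros((64, 64, 2))
      u[:32, :, 1] = 2.0   # upper region slides +2 voxels in x
      u[32:, :, 1] = -2.0  # lower region slides -2 voxels in x

      def anisotropic_smooth(field, sigma_tangential=3.0, sigma_normal=0.5):
          """Smooth each displacement component more along the boundary (x) than across it (y)."""
          out = np.empty_like(field)
          for c in range(field.shape[-1]):
              out[..., c] = gaussian_filter(field[..., c], sigma=(sigma_normal, sigma_tangential))
          return out

      smoothed = anisotropic_smooth(u)
      # the discontinuity across row 32 survives, unlike with a single large isotropic kernel
      print(smoothed[30, 32, 1], smoothed[34, 32, 1])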

  14. Sensitivity of a numerical wave model on wind re-analysis datasets

    NASA Astrophysics Data System (ADS)

    Lavidas, George; Venugopal, Vengatesan; Friedrich, Daniel

    2017-03-01

    Wind is the dominant process for wave generation. Detailed evaluation of metocean conditions strengthens our understanding of issues concerning potential offshore applications. However, the scarcity of buoys and the high cost of monitoring systems pose a barrier to properly defining offshore conditions. Through the use of numerical wave models, metocean conditions can be hindcasted and forecasted, providing reliable characterisations. This study reports the sensitivity of a numerical wave model to wind inputs for the Scottish region. Two re-analysis wind datasets with different spatio-temporal characteristics are used, the ERA-Interim Re-Analysis and the CFSR-NCEP Re-Analysis dataset. Different wind products alter results, affecting the accuracy obtained. The scope of this study is to assess different available wind databases and provide information concerning the most appropriate wind dataset for the specific region, based on temporal, spatial and geographic terms, for wave modelling and offshore applications. Both wind input datasets delivered results from the numerical wave model with good correlation. Wave results from the 1-h dataset have higher peaks and lower biases, at the expense of a high scatter index. On the other hand, the 6-h dataset has lower scatter but higher biases. The study shows how the wind dataset affects numerical wave modelling performance, and that depending on location and study needs, different wind inputs should be considered.

  15. Querying phenotype-genotype relationships on patient datasets using semantic web technology: the example of cerebrotendinous xanthomatosis

    PubMed Central

    2012-01-01

    Background Semantic Web technology can considerably catalyze translational genetics and genomics research in medicine, where the interchange of information between basic research and clinical levels becomes crucial. This exchange involves mapping abstract phenotype descriptions from research resources, such as knowledge databases and catalogs, to unstructured datasets produced through experimental methods and clinical practice. This is especially true for the construction of mutation databases. This paper presents a way of harmonizing abstract phenotype descriptions with patient data from clinical practice, and querying this dataset about relationships between phenotypes and genetic variants, at different levels of abstraction. Methods Due to the current availability of ontological and terminological resources that have already reached some consensus in biomedicine, a reuse-based ontology engineering approach was followed. The proposed approach uses the Ontology Web Language (OWL) to represent the phenotype ontology and the patient model, the Semantic Web Rule Language (SWRL) to bridge the gap between phenotype descriptions and clinical data, and the Semantic Query Web Rule Language (SQWRL) to query relevant phenotype-genotype bidirectional relationships. The work tests the use of semantic web technology in the biomedical research domain named cerebrotendinous xanthomatosis (CTX), using a real dataset and ontologies. Results A framework to query relevant phenotype-genotype bidirectional relationships is provided. Phenotype descriptions and patient data were harmonized by defining 28 Horn-like rules in terms of the OWL concepts. In total, 24 patterns of SQWRL queries were designed following the initial list of competency questions. As the approach is based on OWL, the semantics of the framework adopt the standard logical model of the open-world assumption. Conclusions This work demonstrates how semantic web technologies can be used to support the flexible representation and computational inference mechanisms required to query patient datasets at different levels of abstraction. The open-world assumption is especially well suited to describing only partially known phenotype-genotype relationships, in a way that is easily extensible. In the future, this type of approach could offer researchers a valuable resource to infer new data from patient data for statistical analysis in translational research. In conclusion, phenotype description formalization and mapping to clinical data are two key elements for interchanging knowledge between basic and clinical research. PMID:22849591

  16. Multimodal Event Detection in Twitter Hashtag Networks

    DOE PAGES

    Yilmaz, Yasin; Hero, Alfred O.

    2016-07-01

    In this study, event detection in a multimodal Twitter dataset is considered. We treat the hashtags in the dataset as instances with two modes: text and geolocation features. The text feature consists of a bag-of-words representation. The geolocation feature consists of geotags (i.e., geographical coordinates) of the tweets. Fusing the multimodal data, we aim to detect, in terms of topic and geolocation, the interesting events and the associated hashtags. To this end, a generative latent variable model is assumed, and a generalized expectation-maximization (EM) algorithm is derived to learn the model parameters. The proposed method is computationally efficient, and lends itself to big datasets. Lastly, experimental results on a Twitter dataset from August 2014 show the efficacy of the proposed method.

  17. A Merged Dataset for Solar Probe Plus FIELDS Magnetometers

    NASA Astrophysics Data System (ADS)

    Bowen, T. A.; Dudok de Wit, T.; Bale, S. D.; Revillet, C.; MacDowall, R. J.; Sheppard, D.

    2016-12-01

    The Solar Probe Plus FIELDS experiment will observe turbulent magnetic fluctuations deep in the inner heliosphere. The FIELDS magnetometer suite implements a set of three magnetometers: two vector DC fluxgate magnetometers (MAGs), sensitive from DC to 100 Hz, as well as a vector search coil magnetometer (SCM), sensitive from 10 Hz to 50 kHz. Single axis measurements are additionally made up to 1 MHz. To study the full range of observations, we propose merging data from the individual magnetometers into a single dataset. A merged dataset will improve the quality of observations in the range of frequencies observed by both magnetometers (10-100 Hz). Here we present updates on the individual MAG and SCM calibrations as well as our results on generating a cross-calibrated and merged dataset.

  18. Genomics dataset of unidentified disclosed isolates.

    PubMed

    Rekadwad, Bhagwan N

    2016-09-01

    Analysis of DNA sequences is necessary for higher hierarchical classification of the organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset was chosen to find complexities in the unidentified DNA in the disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated, and analysis of the AT/GC content of the DNA sequences was carried out. The QR codes are helpful for quick identification of isolates, and the AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset of cleavage codes and enzyme codes from the restriction digestion study, which is helpful for performing studies using short DNA sequences, was reported. The dataset disclosed here provides new revelatory data for the exploration of unique DNA sequences for evaluation, identification, comparison and analysis.
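
    A minimal sketch of the AT/GC content calculation described above (the sequence below is a made-up example, not one of the 17 disclosed patent sequences):

      def at_gc_content(seq):
          """Return (AT%, GC%) of a DNA sequence, counting only unambiguous bases."""
          seq = seq.upper()
          at = sum(seq.count(b) for b in "AT")
          gc = sum(seq.count(b) for b in "GC")
          total = at + gc
          return 100.0 * at / total, 100.0 * gc / total

      # hypothetical sequence for illustration
      seq = "ATGCGCGTATATTAGCGCGCATATGCCGGC"
      at_pct, gc_pct = at_gc_content(seq)
      print(f"AT: {at_pct:.1f}%  GC: {gc_pct:.1f}%")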

  19. Development of large scale riverine terrain-bathymetry dataset by integrating NHDPlus HR with NED, CoNED and HAND data

    NASA Astrophysics Data System (ADS)

    Li, Z.; Clark, E. P.

    2017-12-01

    Large scale and fine resolution riverine bathymetry data is critical for flood inundation modeling but not available over the continental United States (CONUS). Previously we implemented bankfull hydraulic geometry based approaches to simulate bathymetry for individual rivers using NHDPlus v2.1 data and the 10 m National Elevation Dataset (NED). USGS has recently developed High Resolution NHD data (NHDPlus HR Beta) (USGS, 2017), and this enhanced dataset has a significant improvement in its spatial correspondence with the 10 m DEM. In this study, we used this high resolution data, specifically NHDFlowline and NHDArea, to create bathymetry/terrain for CONUS river channels and floodplains. A software package, NHDPlus Inundation Modeler v5.0 Beta, was developed for this project as an Esri ArcGIS hydrological analysis extension. With the updated tools, the raw 10 m DEM was first hydrologically treated to remove artificial blockages (e.g., overpasses, bridges and even roadways) using low pass moving window filters. Cross sections were then automatically constructed along each flowline to extract elevation from the hydrologically treated DEM. In this study, river channel shapes were approximated using quadratic curves to reduce uncertainties from commonly used trapezoids. We calculated the underwater channel elevation at each cross section sampling point using bankfull channel dimensions that were estimated from physiographic province/division based regression equations (Bieger et al. 2015). These elevation points were then interpolated to generate a bathymetry raster. The simulated bathymetry raster was integrated with the USGS NED and the Coastal National Elevation Database (CoNED) (wherever available) to make a seamless terrain-bathymetry dataset. Channel bathymetry was also integrated into the HAND (Height Above Nearest Drainage) dataset to improve large scale inundation modeling. The generated terrain-bathymetry was processed at the Watershed Boundary Dataset Hydrologic Unit 4 (WBDHU4) level.
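
    A small sketch of the quadratic (parabolic) channel-shape approximation mentioned above, with hypothetical bankfull width and depth standing in for the regression-based estimates of Bieger et al. (2015):

      import numpy as np

      def quadratic_channel_bed(bank_elev, bankfull_width, bankfull_depth, n_points=11):
          """Bed elevations across a cross section, assuming a parabolic channel shape.

          The thalweg sits bankfull_depth below the bank elevation at the channel centre,
          and the bed rises quadratically to the bank elevation at both edges."""
          x = np.linspace(-bankfull_width / 2, bankfull_width / 2, n_points)
          bed = bank_elev - bankfull_depth * (1.0 - (2.0 * x / bankfull_width) ** 2)
          return x, bed

      # hypothetical cross section: 40 m wide, 2.5 m deep, banks at 100 m elevation
      x, bed = quadratic_channel_bed(bank_elev=100.0, bankfull_width=40.0, bankfull_depth=2.5)
      for xi, zi in zip(x, bed):
          print(f"offset {xi:6.1f} m  bed elevation {zi:7.2f} m")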

  20. Measurement properties of comorbidity indices in maternal health research: a systematic review.

    PubMed

    Aoyama, Kazuyoshi; D'Souza, Rohan; Inada, Eiichi; Lapinsky, Stephen E; Fowler, Robert A

    2017-11-13

    Maternal critical illness occurs in 1.2 to 4.7 of every 1000 live births in the United States, and approximately 1 in 100 women who become critically ill will die. Patient characteristics and comorbid conditions are commonly summarized as an index or score for the purpose of predicting the likelihood of dying; however, most such indices have arisen from non-pregnant patient populations. We sought to systematically review comorbidity indices used in health administrative datasets of pregnant women, in order to critically appraise their measurement properties and recommend optimal tools for clinicians and maternal health researchers. We conducted a systematic search of MEDLINE and EMBASE to identify studies published from 1946 and 1947, respectively, to May 2017 that describe the predictive validity of comorbidity indices using health administrative datasets in the field of maternal health research. We applied a methodological PubMed search filter to identify all studies of measurement properties for each index. Our initial search retrieved 8944 citations. The full text of 61 articles was retrieved and assessed for final eligibility. Finally, two eligible articles remained, describing three comorbidity indices appropriate for health administrative data: the Maternal Comorbidity Index, the Charlson Comorbidity Index and the Elixhauser Comorbidity Index. These studies of the identified indices had a low risk of bias. The lack of an established consensus-building methodology in generating each index resulted in marginal sensibility for all indices. Only the Maternal Comorbidity Index was derived and validated specifically from a cohort of pregnant and postpartum women, using an administrative dataset, and had an associated c-statistic of 0.675 (95% Confidence Interval 0.647-0.666) in predicting mortality. Only the Maternal Comorbidity Index directly evaluated measurement properties relevant to pregnant women in health administrative datasets; however, it has only modest predictive ability for mortality among development and validation studies. Further research to investigate the feasibility of applying this index in clinical research, and its reliability across a variety of health administrative datasets, would be incrementally helpful. Evolution of this and other tools for risk prediction and risk adjustment in pregnant and post-partum patients is an important area for ongoing study.

  1. Discovering New Global Climate Patterns: Curating a 21-Year High Temporal (Hourly) and Spatial (40km) Resolution Reanalysis Dataset

    NASA Astrophysics Data System (ADS)

    Hou, C. Y.; Dattore, R.; Peng, G. S.

    2014-12-01

    The National Center for Atmospheric Research's Global Climate Four-Dimensional Data Assimilation (CFDDA) Hourly 40km Reanalysis dataset is a dynamically downscaled dataset with high temporal and spatial resolution. The dataset contains three-dimensional hourly analyses in netCDF format for the global atmospheric state from 1985 to 2005 on a 40km horizontal grid (0.4° grid increment) with 28 vertical levels, providing good representation of local forcing and diurnal variation of processes in the planetary boundary layer. This project aimed to make the dataset publicly available, accessible, and usable in order to provide a unique resource to allow and promote studies of new climate characteristics. When the curation project started, it had been five years since the data files were generated. Also, although the Principal Investigator (PI) had generated a user document at the end of the project in 2009, the document had not been maintained. Furthermore, the PI had moved to a new institution, and the remaining team members were reassigned to other projects. These factors made data curation especially challenging in the areas of verifying data quality, harvesting metadata descriptions, and documenting provenance information. As a result, the project's curation process found that: the data curator's skill and knowledge helped make decisions, such as file format and structure and workflow documentation, that had a significant, positive impact on the ease of the dataset's management and long-term preservation; use of data curation tools, such as the Data Curation Profiles Toolkit's guidelines, revealed important information for promoting the data's usability and enhancing preservation planning; and involving data curators during each stage of the data curation life cycle, instead of at the end, could improve the curation process's efficiency. Overall, the project showed that proper resources invested in the curation process would give datasets the best chance to fulfill their potential to help with new climate pattern discovery.

  2. Clinical high-resolution mapping of the proteoglycan-bound water fraction in articular cartilage of the human knee joint.

    PubMed

    Bouhrara, Mustapha; Reiter, David A; Sexton, Kyle W; Bergeron, Christopher M; Zukley, Linda M; Spencer, Richard G

    2017-11-01

    We applied our recently introduced Bayesian analytic method to achieve clinically-feasible in-vivo mapping of the proteoglycan water fraction (PgWF) of human knee cartilage with improved spatial resolution and stability as compared to existing methods. Multicomponent driven equilibrium single-pulse observation of T1 and T2 (mcDESPOT) datasets were acquired from the knees of two healthy young subjects and one older subject with previous knee injury. Each dataset was processed using Bayesian Monte Carlo (BMC) analysis incorporating a two-component tissue model. We assessed the performance and reproducibility of BMC and of the conventional analysis of stochastic region contraction (SRC) in the estimation of PgWF. Stability of the BMC analysis of PgWF was tested by comparing independent high-resolution (HR) datasets from each of the two young subjects. Unlike SRC, the BMC-derived maps from the two HR datasets were essentially identical. Furthermore, SRC maps showed substantial random variation in estimated PgWF, and mean values that differed from those obtained using BMC. In addition, PgWF maps derived from conventional low-resolution (LR) datasets exhibited partial volume and magnetic susceptibility effects. These artifacts were absent in HR PgWF images. Finally, our analysis showed regional variation in PgWF estimates, and substantially higher values in the younger subjects as compared to the older subject. BMC-mcDESPOT permits HR in-vivo mapping of PgWF in human knee cartilage in a clinically-feasible acquisition time. HR mapping reduces the impact of partial volume and magnetic susceptibility artifacts compared to LR mapping. Finally, BMC-mcDESPOT demonstrated excellent reproducibility in the determination of PgWF. Published by Elsevier Inc.

  3. On Interestingness Measures for Mining Statistically Significant and Novel Clinical Associations from EMRs

    PubMed Central

    Abar, Orhan; Charnigo, Richard J.; Rayapati, Abner

    2017-01-01

    Association rule mining has received significant attention from both the data mining and machine learning communities. While data mining researchers focus more on designing efficient algorithms to mine rules from large datasets, the learning community has explored applications of rule mining to classification. A major problem with rule mining algorithms is the explosion of rules even for moderate sized datasets, making it very difficult for end users to identify both statistically significant and potentially novel rules that could lead to interesting new insights and hypotheses. Researchers have proposed many domain independent interestingness measures with which one can rank the rules and potentially glean useful rules from the top ranked ones. However, these measures have not been fully explored for rule mining in clinical datasets owing to the relatively large sizes of the datasets often encountered in healthcare and also due to limited access to domain experts for review/analysis. In this paper, using an electronic medical record (EMR) dataset of diagnoses and medications from over three million patient visits to the University of Kentucky medical center and affiliated clinics, we conduct a thorough evaluation of dozens of interestingness measures proposed in the data mining literature, including some new composite measures. Using cumulative relevance metrics from information retrieval, we compare these interestingness measures against human judgments obtained from a practicing psychiatrist for association rules involving the depressive disorders class as the consequent. Our results not only surface new interesting associations for depressive disorders but also indicate classes of interestingness measures that weight rule novelty and statistical strength in contrasting ways, offering new insights for end users in identifying interesting rules. PMID:28736771
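
    For readers unfamiliar with interestingness measures, a minimal sketch of three common ones (support, confidence and lift) on a toy set of visit-level code sets; the transactions are invented for illustration and are not the EMR data evaluated above:

      # toy "visits": each is a set of diagnosis/medication codes (hypothetical)
      visits = [
          {"depression", "ssri", "insomnia"},
          {"depression", "ssri"},
          {"hypertension", "ace_inhibitor"},
          {"depression", "insomnia"},
          {"hypertension", "depression", "ssri"},
      ]

      def measures(antecedent, consequent, transactions):
          n = len(transactions)
          both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
          ante = sum(1 for t in transactions if antecedent <= t)
          cons = sum(1 for t in transactions if consequent <= t)
          support = both / n
          confidence = both / ante if ante else 0.0
          lift = confidence / (cons / n) if cons else 0.0
          return support, confidence, lift

      s, c, l = measures({"ssri"}, {"depression"}, visits)
      print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")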

  4. An interactive website for analytical method comparison and bias estimation.

    PubMed

    Bahar, Burak; Tuncel, Ayse F; Holmes, Earle W; Holmes, Daniel T

    2017-12-01

    Regulatory standards mandate laboratories to perform studies to ensure accuracy and reliability of their test results. Method comparison and bias estimation are important components of these studies. We developed an interactive website for evaluating the relative performance of two analytical methods using R programming language tools. The website can be accessed at https://bahar.shinyapps.io/method_compare/. The site has an easy-to-use interface that allows both copy-pasting and manual entry of data. It also allows selection of a regression model and creation of regression and difference plots. Available regression models include Ordinary Least Squares, Weighted-Ordinary Least Squares, Deming, Weighted-Deming, Passing-Bablok and Passing-Bablok for large datasets. The server processes the data and generates downloadable reports in PDF or HTML format. Our website provides clinical laboratories a practical way to assess the relative performance of two analytical methods. Copyright © 2017 The Canadian Society of Clinical Chemists. Published by Elsevier Inc. All rights reserved.
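
    As a hedged illustration of one of the regression models offered by the site, a minimal closed-form Deming regression (with the error-variance ratio assumed equal to 1, i.e. orthogonal regression); the paired measurements are made up:

      import numpy as np

      def deming(x, y, lam=1.0):
          """Closed-form Deming regression; lam is the ratio of the y-error variance
          to the x-error variance (conventions differ between references)."""
          x, y = np.asarray(x, float), np.asarray(y, float)
          xm, ym = x.mean(), y.mean()
          sxx = np.sum((x - xm) ** 2)
          syy = np.sum((y - ym) ** 2)
          sxy = np.sum((x - xm) * (y - ym))
          slope = (syy - lam * sxx + np.sqrt((syy - lam * sxx) ** 2 + 4 * lam * sxy ** 2)) / (2 * sxy)
          intercept = ym - slope * xm
          return slope, intercept

      # hypothetical paired results from two analytical methods
      method_a = [1.1, 2.0, 3.2, 4.1, 5.0, 6.2, 7.1, 8.3]
      method_b = [1.3, 2.2, 3.1, 4.4, 5.2, 6.5, 7.0, 8.6]
      slope, intercept = deming(method_a, method_b)
      print(f"Deming fit: y = {slope:.3f} x + {intercept:.3f}")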

  5. Non-invasive pressure difference estimation from PC-MRI using the work-energy equation

    PubMed Central

    Donati, Fabrizio; Figueroa, C. Alberto; Smith, Nicolas P.; Lamata, Pablo; Nordsletten, David A.

    2015-01-01

    Pressure difference is an accepted clinical biomarker for cardiovascular disease conditions such as aortic coarctation. Currently, measurements of pressure differences in the clinic rely on invasive techniques (catheterization), prompting development of non-invasive estimates based on blood flow. In this work, we propose a non-invasive estimation procedure deriving pressure difference from the work-energy equation for a Newtonian fluid. Spatial and temporal convergence is demonstrated on in silico Phase Contrast Magnetic Resonance Image (PC-MRI) phantoms with steady and transient flow fields. The method is also tested on an image dataset generated in silico from a 3D patient-specific Computational Fluid Dynamics (CFD) simulation and finally evaluated on a cohort of 9 subjects. The performance is compared to existing approaches based on steady and unsteady Bernoulli formulations as well as the pressure Poisson equation. The new technique shows good accuracy, robustness to noise, and robustness to the image segmentation process, illustrating the potential of this approach for non-invasive pressure difference estimation. PMID:26409245
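
    For context, the steady simplified Bernoulli estimate that such flow-based methods are compared against can be sketched in a couple of lines (this is the conventional clinical rule of thumb, not the authors' work-energy formulation; the peak velocity is a made-up value):

      # simplified Bernoulli: dP [mmHg] ~= 4 * v_max^2, with v_max in m/s
      v_max = 3.2                      # hypothetical peak jet velocity across a coarctation
      delta_p_mmHg = 4.0 * v_max ** 2  # ~= 41 mmHg
      print(f"Estimated pressure drop: {delta_p_mmHg:.1f} mmHg")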

  6. Information management systems for pharmacogenomics.

    PubMed

    Thallinger, Gerhard G; Trajanoski, Slave; Stocker, Gernot; Trajanoski, Zlatko

    2002-09-01

    The value of high-throughput genomic research is dramatically enhanced by association with key patient data. These data are generally available but of disparate quality and not typically directly associated. A system that could bring these disparate data sources into a common resource connected with functional genomic data would be tremendously advantageous. However, the integration of clinical data and the accurate interpretation of the generated functional genomic data require the development of information management systems capable of effectively capturing the data, as well as tools to make that data accessible to the laboratory scientist or to the clinician. In this review these challenges and current information technology solutions associated with the management, storage and analysis of high-throughput data are highlighted. It is suggested that the development of a pharmacogenomic data management system which integrates public and proprietary databases, clinical datasets, and data mining tools embedded in a high-performance computing environment should include the following components: parallel processing systems, storage technologies, network technologies, databases and database management systems (DBMS), and application services.

  7. Integrity, standards, and QC-related issues with big data in pre-clinical drug discovery.

    PubMed

    Brothers, John F; Ung, Matthew; Escalante-Chong, Renan; Ross, Jermaine; Zhang, Jenny; Cha, Yoonjeong; Lysaght, Andrew; Funt, Jason; Kusko, Rebecca

    2018-06-01

    The tremendous expansion of data analytics and public and private big datasets presents an important opportunity for pre-clinical drug discovery and development. In the field of life sciences, the growth of genetic, genomic, transcriptomic and proteomic data is partly driven by a rapid decline in experimental costs as biotechnology improves throughput, scalability, and speed. Yet far too many researchers tend to underestimate the challenges and consequences involving data integrity and quality standards. Given the effect of data integrity on scientific interpretation, these issues have significant implications during preclinical drug development. We describe standardized approaches for maximizing the utility of publicly available or privately generated biological data and address some of the common pitfalls. We also discuss the increasing interest to integrate and interpret cross-platform data. Principles outlined here should serve as a useful broad guide for existing analytical practices and pipelines and as a tool for developing additional insights into therapeutics using big data. Copyright © 2018 Elsevier Inc. All rights reserved.

  8. Cheminformatic models based on machine learning for pyruvate kinase inhibitors of Leishmania mexicana.

    PubMed

    Jamal, Salma; Scaria, Vinod

    2013-11-19

    Leishmaniasis is a neglected tropical disease which affects approximately 12 million individuals worldwide and is caused by the parasite Leishmania. The current drugs used in the treatment of Leishmaniasis are highly toxic and have seen the widespread emergence of drug-resistant strains, which necessitates the development of new therapeutic options. The available high-throughput screen data have made it possible to generate computational predictive models which can assess the active scaffolds in a chemical library, followed by their ADME/toxicity properties in biological trials. In the present study, we have used publicly available, high-throughput screen datasets of chemical moieties which have been adjudged to target the pyruvate kinase enzyme of L. mexicana (LmPK). A machine learning approach was used to create computational models capable of predicting the biological activity of novel antileishmanial compounds. Further, we evaluated the molecules using a substructure-based approach to identify the common substructures contributing to their activity. We generated computational models based on machine learning methods and evaluated the performance of these models based on various statistical figures of merit. The random forest based approach was determined to be the most sensitive, with better accuracy as well as ROC. We further added a substructure-based approach to analyze the molecules and identify potentially enriched substructures in the active dataset. We believe that the models developed in the present study would lead to a reduction in the cost and length of clinical studies, and hence newer drugs would appear faster in the market, providing better healthcare options to patients.
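
    A minimal sketch of the kind of random-forest activity classifier described, using random binary fingerprint-like features and a contrived activity label in place of the actual LmPK screening descriptors:

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(0)
      n_compounds, n_bits = 2000, 256
      X = rng.integers(0, 2, size=(n_compounds, n_bits))     # fake binary fingerprints
      # fake activity label loosely tied to a few "pharmacophore" bits
      y = ((X[:, 3] & X[:, 17]) | X[:, 42]).astype(int)

      clf = RandomForestClassifier(n_estimators=200, random_state=0)
      auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
      print(f"5-fold ROC AUC: {auc.mean():.3f} +/- {auc.std():.3f}")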

  9. A vascular biology network model focused on inflammatory processes to investigate atherogenesis and plaque instability

    PubMed Central

    2014-01-01

    Background Numerous inflammation-related pathways have been shown to play important roles in atherogenesis. Rapid and efficient assessment of the relative influence of each of those pathways is a challenge in the era of “omics” data generation. The aim of the present work was to develop a network model of inflammation-related molecular pathways underlying vascular disease to assess the degree of translatability of preclinical molecular data to the human clinical setting. Methods We constructed and evaluated the Vascular Inflammatory Processes Network (V-IPN), a model representing a collection of vascular processes modulated by inflammatory stimuli that lead to the development of atherosclerosis. Results Utilizing the V-IPN as a platform for biological discovery, we have identified key vascular processes and mechanisms captured by gene expression profiling data from four independent datasets from human endothelial cells (ECs) and human and murine intact vessels. Primary ECs in culture from multiple donors revealed a richer mapping of mechanisms identified by the V-IPN compared to an immortalized EC line. Furthermore, an evaluation of gene expression datasets from aortas of old ApoE-/- mice (78 weeks) and human coronary arteries with advanced atherosclerotic lesions identified significant commonalities in the two species, as well as several mechanisms specific to human arteries that are consistent with the development of unstable atherosclerotic plaques. Conclusions We have generated a new biological network model of atherogenic processes that demonstrates the power of network analysis to advance integrative, systems biology-based knowledge of cross-species translatability, plaque development and potential mechanisms leading to plaque instability. PMID:24965703

  10. Poster — Thur Eve — 66: Robustness Assessment of a Novel IMRT Planning Method for Lung Radiotherapy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ahanj, M.; Bissonnette, J.-P.; Heath, E.

    2014-08-15

    Conventional radiotherapy treatment planning for lung cancer accounts for tumour motion by increasing the beam apertures. We recently developed an IMRT planning strategy which uses reduced beam apertures in combination with an edge enhancing boost of 110% of the prescription dose to the volume that corresponds to the portion of the CTV that moves outside of the reduced beam. Previous results showed that this approach ensures target coverage while reducing lung dose. In the current study, we evaluate the robustness of this boost volume approach to changes in respiratory motion, including amplitude and phase weight variations. ITV and boost volume plans were generated for 5 NSCLC patients with respiratory motion amplitudes ranging from 1 to 2 cm. A standard 5 mm PTV margin was used for all plans. The ORBIT treatment planning tool was used to plan and accumulate dose over 10 respiratory phases defined by the 4DCT datasets. For the phase weight variation study, dose was accumulated for three scenarios: equally-weighted phases, higher weight assigned to exhale phases and higher weight assigned to inhale phases. For the amplitude variation study, a numerical phantom was used to generate 4DCT datasets corresponding to 7 mm, 10 mm and 14 mm motion amplitudes. Preliminary results found that delivered plans for all phase weight scenarios were clinically acceptable. When normalized to mean lung dose, the boost volume plan delivered 5% more dose to the CTV, which indicates the potential for dose escalation using this approach.

  11. Identification of Patients with Family History of Pancreatic Cancer--Investigation of an NLP System Portability.

    PubMed

    Mehrabi, Saeed; Krishnan, Anand; Roch, Alexandra M; Schmidt, Heidi; Li, DingCheng; Kesterson, Joe; Beesley, Chris; Dexter, Paul; Schmidt, Max; Palakal, Mathew; Liu, Hongfang

    2015-01-01

    In this study we have developed a rule-based natural language processing (NLP) system to identify patients with a family history of pancreatic cancer. The algorithm was developed in an Unstructured Information Management Architecture (UIMA) framework and consisted of section segmentation, relation discovery, and negation detection. The system was evaluated on data from two institutions. The family history identification precision was consistent across the institutions, shifting from 88.9% on the Indiana University (IU) dataset to 87.8% on the Mayo Clinic dataset. Customizing the algorithm on the Mayo Clinic data increased its precision to 88.1%. The family member relation discovery achieved precision, recall, and F-measure of 75.3%, 91.6% and 82.6%, respectively. Negation detection resulted in a precision of 99.1%. The results show that rule-based NLP approaches for specific information extraction tasks are portable across institutions; however, customization of the algorithm on the new dataset improves its performance.
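
    A toy sketch of the kind of rules the system combines (family-member relation discovery plus negation detection); the patterns and sentences below are illustrative only and far simpler than the UIMA pipeline described:

      import re

      FAMILY = r"(mother|father|sister|brother|aunt|uncle|grandmother|grandfather)"
      NEGATION = r"\b(no|denies|negative for|without)\b"

      def family_history_pancreatic_cancer(sentence):
          """Return True if the sentence asserts a non-negated family history of pancreatic cancer."""
          s = sentence.lower()
          has_relation = re.search(FAMILY, s) is not None and "pancreatic cancer" in s
          negated = re.search(NEGATION, s) is not None
          return has_relation and not negated

      examples = [
          "Her mother was diagnosed with pancreatic cancer at age 62.",
          "Patient denies any family history of pancreatic cancer.",
          "Father with hypertension; no pancreatic cancer in the family.",
      ]
      for s in examples:
          print(family_history_pancreatic_cancer(s), "-", s)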

  12. User's Guide for the Agricultural Non-Point Source (AGNPS) Pollution Model Data Generator

    USGS Publications Warehouse

    Finn, Michael P.; Scheidt, Douglas J.; Jaromack, Gregory M.

    2003-01-01

    BACKGROUND Throughout this user guide, we refer to datasets that we used in conjunction with the development of this software for supporting cartographic research and producing datasets to conduct research. However, this software can be used with these datasets or with more 'generic' versions of data of the appropriate type. For example, throughout the guide, we refer to national land cover data (NLCD) and digital elevation model (DEM) data from the U.S. Geological Survey (USGS) at a 30-m resolution, but any digital terrain model or land cover data at any appropriate resolution will produce results. Another key point to keep in mind is to use a consistent data resolution for all the datasets per model run. The U.S. Department of Agriculture (USDA) developed the Agricultural Nonpoint Source (AGNPS) pollution model of watershed hydrology in response to the complex problem of managing nonpoint sources of pollution. AGNPS simulates the behavior of runoff, sediment, and nutrient transport from watersheds that have agriculture as their prime use. The model operates on a cell basis and is a distributed parameter, event-based model. The model requires 22 input parameters. Output parameters are grouped primarily by hydrology, sediment, and chemical output (Young and others, 1995). Elevation, land cover, and soil are the base data from which to extract the 22 input parameters required by the AGNPS. For automatic parameter extraction, follow the general process described in this guide of extraction from the geospatial data through the AGNPS Data Generator to generate the input parameters required by the pollution model (Finn and others, 2002).

  13. Evaluation and inter-comparison of modern day reanalysis datasets over Africa and the Middle East

    NASA Astrophysics Data System (ADS)

    Shukla, S.; Arsenault, K. R.; Hobbins, M.; Peters-Lidard, C. D.; Verdin, J. P.

    2015-12-01

    Reanalysis datasets are potentially very valuable for otherwise data-sparse regions such as Africa and the Middle East. They are potentially useful for long-term climate and hydrologic analyses and, given their availability in real-time, they are particularly attractive for real-time hydrologic monitoring purposes (e.g. to monitor flood and drought events). Generally in data-sparse regions, reanalysis variables such as precipitation, temperature, radiation and humidity are used in conjunction with in-situ and/or satellite-based datasets to generate long-term gridded atmospheric forcing datasets. These atmospheric forcing datasets are used to drive offline land surface models and simulate soil moisture and runoff, which are natural indicators of hydrologic conditions. Therefore, any uncertainty or bias in the reanalysis datasets contributes to uncertainties in hydrologic monitoring estimates. In this presentation, we report on a comprehensive analysis that evaluates several modern-day reanalysis products (such as NASA's MERRA-1 and -2, ECMWF's ERA-Interim and NCEP's CFS Reanalysis) over Africa and the Middle East region. We compare the precipitation and temperature from the reanalysis products with other independent gridded datasets such as the GPCC, CRU, and USGS/UCSB CHIRPS precipitation datasets, and CRU's temperature datasets. The evaluations are conducted at a monthly time scale, since some of these independent datasets are only available at this temporal resolution. The evaluations range from the comparison of the monthly mean climatology to inter-annual variability and long-term changes. Finally, we also present the results of inter-comparisons of radiation and humidity variables from the different reanalysis datasets.

  14. Data-driven gating in PET: Influence of respiratory signal noise on motion resolution.

    PubMed

    Büther, Florian; Ernst, Iris; Frohwein, Lynn Johann; Pouw, Joost; Schäfers, Klaus Peter; Stegger, Lars

    2018-05-21

    Data-driven gating (DDG) approaches for positron emission tomography (PET) are interesting alternatives to conventional hardware-based gating methods. In DDG, the measured PET data themselves are utilized to calculate a respiratory signal that is subsequently used for gating purposes. The success of gating is then highly dependent on the statistical quality of the PET data. In this study, we investigate how this quality determines signal noise and thus motion resolution in clinical PET scans using a center-of-mass-based (COM) DDG approach, specifically with regard to motion management of target structures in future radiotherapy planning applications. PET list mode datasets acquired in one bed position of 19 different radiotherapy patients undergoing pretreatment [18F]FDG PET/CT or [18F]FDG PET/MRI were included in this retrospective study. All scans were performed over a region with organs (myocardium, kidneys) or tumor lesions of high tracer uptake and under free breathing. Aside from the original list mode data, datasets with progressively decreasing PET statistics were generated. From these, COM DDG signals were derived for subsequent amplitude-based gating of the original list mode file. The apparent respiratory shift d from end-expiration to end-inspiration was determined from the gated images and expressed as a function of the signal-to-noise ratio SNR of the determined gating signals. This relation was tested against an additional 25 [18F]FDG PET/MRI list mode datasets, where high-precision MR navigator-like respiratory signals were available as reference signals for respiratory gating of the PET data, and against data from a dedicated thorax phantom scan. All 19 original high-quality list mode datasets demonstrated the same behavior in terms of motion resolution when reducing the amount of list mode events for DDG signal generation. Ratios and directions of respiratory shifts between end-respiratory gates and the respective nongated image were constant over all statistic levels. Motion resolution could be modeled as d/d_max = 1 - exp[-1.52 (SNR - 1)^0.52], with d_max as the actual respiratory shift. Determining d_max from d and SNR in the 25 test datasets and the phantom scan demonstrated no significant differences from the MR navigator-derived shift values and the predefined shift, respectively. The SNR can serve as a general metric to assess the success of COM-based DDG, even in different scanners and patients. The derived formula for motion resolution can be used to estimate the actual motion extent reasonably well in cases of limited PET raw data statistics. This may be of interest for individualized radiotherapy treatment planning procedures of target structures subjected to respiratory motion. © 2018 American Association of Physicists in Medicine.
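
    The reported relation can be turned into a small calculator that recovers the actual motion extent d_max from an apparent gated shift d and the gating-signal SNR (a direct transcription of the fitted formula above, valid for SNR > 1; the numbers below are hypothetical):

      import math

      def motion_recovery_fraction(snr):
          """Fraction of the true shift recovered at a given gating-signal SNR, i.e. d/d_max."""
          return 1.0 - math.exp(-1.52 * (snr - 1.0) ** 0.52)

      def estimate_true_shift(observed_shift_mm, snr):
          """Invert the relation to estimate d_max from the apparent (gated) shift."""
          return observed_shift_mm / motion_recovery_fraction(snr)

      # hypothetical example: 8 mm apparent shift with a modest-quality gating signal
      snr = 2.5
      print(f"recovered fraction at SNR={snr}: {motion_recovery_fraction(snr):.2f}")
      print(f"estimated true shift: {estimate_true_shift(8.0, snr):.1f} mm")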

  15. A hybrid approach to select features and classify diseases based on medical data

    NASA Astrophysics Data System (ADS)

    AbdelLatif, Hisham; Luo, Jiawei

    2018-03-01

    Feature selection is a popular problem in the classification of diseases in clinical medicine. Here, we develop a hybrid methodology to classify diseases, based on three medical datasets: the Arrhythmia, Breast cancer, and Hepatitis datasets. This methodology, called k-means ANOVA Support Vector Machine (K-ANOVA-SVM), uses k-means clustering with the ANOVA statistic to preprocess the data and select the significant features, and Support Vector Machines in the classification process. To compare and evaluate the performance, we chose three classification algorithms (decision tree, Naïve Bayes, and Support Vector Machines) and applied the medical datasets directly to these algorithms. Our methodology gave much better classification accuracy: 98% on the Arrhythmia dataset, 92% on the Breast cancer dataset and 88% on the Hepatitis dataset, compared to using the medical data directly with decision tree, Naïve Bayes, and Support Vector Machines. The ROC curve and precision with K-ANOVA-SVM also achieved better results than the other algorithms.
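
    A simplified sklearn sketch of the feature-selection-plus-SVM part of such a pipeline (ANOVA F-test selection feeding an SVM); the k-means preprocessing step and the three medical datasets are omitted, and the data here are synthetic:

      from sklearn.datasets import make_classification
      from sklearn.feature_selection import SelectKBest, f_classif
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      # synthetic stand-in for a medical dataset such as Arrhythmia
      X, y = make_classification(n_samples=600, n_features=60, n_informative=12, random_state=0)

      pipeline = Pipeline([
          ("scale", StandardScaler()),
          ("anova", SelectKBest(f_classif, k=15)),   # keep the 15 most significant features
          ("svm", SVC(kernel="rbf", C=1.0)),
      ])
      scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
      print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")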

  16. Use of Electronic Health-Related Datasets in Nursing and Health-Related Research.

    PubMed

    Al-Rawajfah, Omar M; Aloush, Sami; Hewitt, Jeanne Beauchamp

    2015-07-01

    Datasets of gigabyte size are common in medical sciences. There is increasing consensus that significant untapped knowledge lies hidden in these large datasets. This review article aims to discuss Electronic Health-Related Datasets (EHRDs) in terms of types, features, advantages, limitations, and possible use in nursing and health-related research. Major scientific databases, MEDLINE, ScienceDirect, and Scopus, were searched for studies or review articles regarding the use of EHRDs in research. A total of 442 articles were located. After application of study inclusion criteria, 113 articles were included in the final review. EHRDs were categorized into Electronic Administrative Health-Related Datasets and Electronic Clinical Health-Related Datasets. Subcategories of each major category were identified. EHRDs are invaluable assets for nursing and health-related research. Advanced research skills, such as using analytical software, applying advanced statistical procedures, and dealing with missing data and missing variables, will maximize the efficient utilization of EHRDs in research. © The Author(s) 2014.

  17. ATACseqQC: a Bioconductor package for post-alignment quality assessment of ATAC-seq data.

    PubMed

    Ou, Jianhong; Liu, Haibo; Yu, Jun; Kelliher, Michelle A; Castilla, Lucio H; Lawson, Nathan D; Zhu, Lihua Julie

    2018-03-01

    ATAC-seq (Assays for Transposase-Accessible Chromatin using sequencing) is a recently developed technique for genome-wide analysis of chromatin accessibility. Compared to earlier methods for assaying chromatin accessibility, ATAC-seq is faster and easier to perform, does not require cross-linking, has higher signal to noise ratio, and can be performed on small cell numbers. However, to ensure a successful ATAC-seq experiment, step-by-step quality assurance processes, including both wet lab quality control and in silico quality assessment, are essential. While several tools have been developed or adopted for assessing read quality, identifying nucleosome occupancy and accessible regions from ATAC-seq data, none of the tools provide a comprehensive set of functionalities for preprocessing and quality assessment of aligned ATAC-seq datasets. We have developed a Bioconductor package, ATACseqQC, for easily generating various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data. In addition, this package contains functions to preprocess aligned ATAC-seq data for subsequent peak calling. Here we demonstrate the utilities of our package using 25 publicly available ATAC-seq datasets from four studies. We also provide guidelines on what the diagnostic plots should look like for an ideal ATAC-seq dataset. This software package has been used successfully for preprocessing and assessing several in-house and public ATAC-seq datasets. Diagnostic plots generated by this package will facilitate the quality assessment of ATAC-seq data, and help researchers to evaluate their own ATAC-seq experiments as well as select high-quality ATAC-seq datasets from public repositories such as GEO to avoid generating hypotheses or drawing conclusions from low-quality ATAC-seq experiments. The software, source code, and documentation are freely available as a Bioconductor package at https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html .

  18. Measuring Performance with Library Automated Systems.

    ERIC Educational Resources Information Center

    OFarrell, John P.

    2000-01-01

    Investigates the capability of three library automated systems to generate some of the datasets necessary to form the ISO (International Standards Organization) standard on performance measurement within libraries, based on research in Liverpool John Moores University (United Kingdom). Concludes that the systems are weak in generating the…

  19. Direct reconstruction of parametric images for brain PET with event-by-event motion correction: evaluation in two tracers across count levels

    NASA Astrophysics Data System (ADS)

    Germino, Mary; Gallezot, Jean-Dominque; Yan, Jianhua; Carson, Richard E.

    2017-07-01

    Parametric images for dynamic positron emission tomography (PET) are typically generated by an indirect method, i.e. reconstructing a time series of emission images, then fitting a kinetic model to each voxel time activity curve. Alternatively, ‘direct reconstruction’ incorporates the kinetic model into the reconstruction algorithm itself, directly producing parametric images from projection data. Direct reconstruction has been shown to achieve parametric images with lower standard error than the indirect method. Here, we present direct reconstruction for brain PET using event-by-event motion correction of list-mode data, applied to two tracers. Event-by-event motion correction was implemented for direct reconstruction in the Parametric Motion-compensation OSEM List-mode Algorithm for Resolution-recovery reconstruction. The direct implementation was tested on simulated and human datasets with tracers [11C]AFM (serotonin transporter) and [11C]UCB-J (synaptic density), which follow the 1-tissue compartment model. Rigid head motion was tracked with the Vicra system. Parametric images of K1 and distribution volume (VT = K1/k2) were compared to those generated by the indirect method by regional coefficient of variation (CoV). Performance across count levels was assessed using sub-sampled datasets. For simulated and real datasets at high counts, the two methods estimated K1 and VT with comparable accuracy. At lower count levels, the direct method was substantially more robust to outliers than the indirect method. Compared to the indirect method, direct reconstruction reduced regional K1 CoV by 35-48% (simulated dataset), 39-43% ([11C]AFM dataset) and 30-36% ([11C]UCB-J dataset) across count levels (averaged over regions at matched iteration); VT CoV was reduced by 51-58%, 54-60% and 30-46%, respectively. Motion correction played an important role in the dataset with larger motion: correction increased regional VT by 51% on average in the [11C]UCB-J dataset. Direct reconstruction of dynamic brain PET with event-by-event motion correction is achievable and dramatically more robust to noise in VT images than the indirect method.

  20. Emory University: High-Throughput Protein-Protein Interaction Dataset for Lung Cancer-Associated Genes | Office of Cancer Genomics

    Cancer.gov

    To discover novel PPI signaling hubs for lung cancer, the CTD2 Center at Emory utilized large-scale genomics datasets and the literature to compile a set of lung cancer-associated genes. A library of expression vectors was generated for these genes and utilized for detecting pairwise PPIs with cell lysate-based TR-FRET assays in a high-throughput screening format. Read the abstract.

  1. Bayesian correlated clustering to integrate multiple datasets

    PubMed Central

    Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.

    2012-01-01

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558

  2. State-of-the-Art Fusion-Finder Algorithms Sensitivity and Specificity

    PubMed Central

    Carrara, Matteo; Beccuti, Marco; Lazzarato, Fulvio; Cavallo, Federica; Cordero, Francesca; Donatelli, Susanna; Calogero, Raffaele A.

    2013-01-01

    Background. Gene fusions arising from chromosomal translocations have been implicated in cancer. RNA-seq has the potential to discover such rearrangements generating functional proteins (chimera/fusion). Recently, many methods for chimera detection have been published. However, the specificity and sensitivity of those tools were not extensively investigated in a comparative way. Results. We tested eight fusion-detection tools (FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse, Bellerophontes, ChimeraScan, and TopHat-fusion) to detect fusion events using synthetic and real datasets encompassing chimeras. The comparison analysis run only on synthetic data could generate misleading results, since we found no counterpart in the real dataset. Furthermore, most tools report a very high number of false positive chimeras. In particular, the most sensitive tool, ChimeraScan, reports a large number of false positives that we were able to significantly reduce by devising and applying two filters to remove fusions not supported by fusion junction-spanning reads or encompassing large intronic regions. Conclusions. The discordant results obtained using synthetic and real datasets suggest that synthetic datasets encompassing fusion events may not fully catch the complexity of RNA-seq experiments. Moreover, fusion detection tools are still limited in sensitivity or specificity; thus, there is space for further improvement in the fusion-finder algorithms. PMID:23555082

  3. Flood Inundation Modelling in the Kuantan River Basin using 1D-2D Flood Modeller coupled with ASTER-GDEM

    NASA Astrophysics Data System (ADS)

    Ng, Z. F.; Gisen, J. I.; Akbari, A.

    2018-03-01

    A topography dataset is an important input for flood inundation modelling. However, it is often difficult to obtain high-resolution topography that provides accurate elevation information. Fortunately, some open-source topography datasets with reasonable resolution are available, such as SRTM and ASTER-GDEM. In Malaysia, particularly in Kuantan, modelling research on the floodplain area is still lacking. This research aims to: a) investigate the suitability of ASTER-GDEM for 1D-2D flood inundation modelling of the Kuantan River Basin; and b) generate a flood inundation map for the Kuantan River Basin. The ASTER-GDEM topography dataset is used to derive the physical characteristics of the watershed, to perform rainfall-runoff modelling for hydrological studies, and to delineate the flood inundation area in Flood Modeller. The results show that the 30 m resolution ASTER-GDEM is applicable as an input for 1D-2D flood modelling: the simulated water level for the 2013 event has an NSE of 0.644 and an RMSE of 1.259. In conclusion, ASTER-GDEM can be used as an alternative topography dataset for flood inundation modelling, although the flood levels obtained from the hydraulic modelling show low accuracy in flat urban areas.
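
    For reference, the two goodness-of-fit metrics quoted above can be computed from observed and simulated water levels as follows. This is a generic sketch of the standard definitions, not the study's code; the water levels are invented.

        import numpy as np

        def rmse(observed, simulated):
            """Root mean square error between observed and simulated series."""
            observed, simulated = np.asarray(observed), np.asarray(simulated)
            return np.sqrt(np.mean((simulated - observed) ** 2))

        def nse(observed, simulated):
            """Nash-Sutcliffe efficiency: 1 is a perfect fit, <= 0 is no better than the mean."""
            observed, simulated = np.asarray(observed), np.asarray(simulated)
            return 1.0 - np.sum((simulated - observed) ** 2) / np.sum((observed - observed.mean()) ** 2)

        # Hypothetical hourly water levels (m) during a flood event.
        obs = [1.2, 2.5, 4.1, 3.8, 2.9]
        sim = [1.0, 2.9, 4.6, 3.2, 2.5]
        print(f"NSE = {nse(obs, sim):.3f}, RMSE = {rmse(obs, sim):.3f} m")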

  4. Big data and data repurposing - using existing data to answer new questions in vascular dementia research.

    PubMed

    Doubal, Fergus N; Ali, Myzoon; Batty, G David; Charidimou, Andreas; Eriksdotter, Maria; Hofmann-Apitius, Martin; Kim, Yun-Hee; Levine, Deborah A; Mead, Gillian; Mucke, Hermann A M; Ritchie, Craig W; Roberts, Charlotte J; Russ, Tom C; Stewart, Robert; Whiteley, William; Quinn, Terence J

    2017-04-17

    Traditional approaches to clinical research have, as yet, failed to provide effective treatments for vascular dementia (VaD). Novel approaches to collation and synthesis of data may allow for time- and cost-efficient hypothesis generating and testing. These approaches may have particular utility in helping us understand and treat a complex condition such as VaD. We present an overview of new uses for existing data to progress VaD research. The overview is the result of consultation with various stakeholders, focused literature review and learning from the group's experience of successful approaches to data repurposing. In particular, we benefitted from the expert discussion and input of delegates at the 9th International Congress on Vascular Dementia (Ljubljana, 16-18th October 2015). We agreed on key areas that could be of relevance to VaD research: systematic review of existing studies; individual patient level analyses of existing trials and cohorts; and linking electronic health record data to other datasets. We illustrated each theme with a case study of an existing project that has utilised this approach. There are many opportunities for the VaD research community to make better use of existing data. The volume of potentially available data is increasing and the opportunities for using these resources to progress the VaD research agenda are exciting. Of course, these approaches come with inherent limitations and biases, as bigger datasets are not necessarily better datasets, and maintaining rigour and critical analysis will be key to optimising data use.

  5. Computer-aided classification of mammographic masses using the deep learning technology: a preliminary study

    NASA Astrophysics Data System (ADS)

    Qiu, Yuchen; Yan, Shiju; Tan, Maxine; Cheng, Samuel; Liu, Hong; Zheng, Bin

    2016-03-01

    Although mammography is the only clinically accepted imaging modality for population-based breast cancer screening, its efficacy remains controversial. One of the major challenges is how to help radiologists more accurately classify lesions as benign or malignant. The purpose of this study is to investigate a new mammographic mass classification scheme based on a deep learning method. In this study, we used an image dataset of 560 regions of interest (ROIs) extracted from digital mammograms, comprising 280 malignant and 280 benign mass ROIs. An eight-layer deep learning network was applied, which employs three pairs of convolution and max-pooling layers for automatic feature extraction and a multilayer perceptron (MLP) classifier for feature categorization. In order to improve the robustness of the selected features, each convolution layer is connected to a max-pooling layer. The 1st, 2nd, and 3rd convolution layers used 20, 10, and 5 feature maps, respectively. The convolution networks are followed by the MLP classifier, which generates a classification score predicting the likelihood that an ROI depicts a malignant mass. Of the 560 ROIs, 420 were used as a training dataset and the remaining 140 as a validation dataset. The results show that the new deep learning based classifier yielded an area under the receiver operating characteristic curve (AUC) of 0.810 ± 0.036. This study demonstrates the potential of a deep learning based classifier to distinguish malignant from benign breast masses without segmenting the lesions or extracting pre-defined image features.
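
    A minimal PyTorch sketch of the architecture described above (three convolution/max-pooling pairs with 20, 10, and 5 feature maps followed by an MLP classifier). The kernel sizes, 64x64 ROI input size, and MLP width are assumptions, since the abstract does not state them; this is an illustration, not the authors' implementation.

        import torch
        import torch.nn as nn

        class MassClassifier(nn.Module):
            """Three conv/max-pool pairs (20, 10, 5 feature maps) followed by an MLP,
            loosely following the architecture described in the abstract."""
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(20, 10, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(10, 5, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                )
                self.classifier = nn.Sequential(
                    nn.Flatten(),
                    nn.Linear(5 * 4 * 4, 64), nn.ReLU(),
                    nn.Linear(64, 1), nn.Sigmoid(),  # likelihood that the ROI depicts a malignant mass
                )

            def forward(self, x):
                return self.classifier(self.features(x))

        scores = MassClassifier()(torch.randn(8, 1, 64, 64))  # batch of 8 hypothetical ROIs
        print(scores.shape)  # torch.Size([8, 1])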

  6. MassImager: A software for interactive and in-depth analysis of mass spectrometry imaging data.

    PubMed

    He, Jiuming; Huang, Luojiao; Tian, Runtao; Li, Tiegang; Sun, Chenglong; Song, Xiaowei; Lv, Yiwei; Luo, Zhigang; Li, Xin; Abliz, Zeper

    2018-07-26

    Mass spectrometry imaging (MSI) has become a powerful tool to probe molecular events in biological tissue. However, one of the biggest challenges is the lack of easy-to-use data-processing software for extracting the underlying biological information from complicated and huge MSI datasets. Here, a user-friendly and full-featured MSI software package named MassImager, comprising three subsystems (Solution, Visualization and Intelligence), is developed with a focus on interactive visualization, in-situ biomarker discovery and artificial-intelligence-assisted pathological diagnosis. Simplified data preprocessing and high-throughput MSI data exchange and serialization jointly guarantee quick reconstruction of ion images and rapid analysis of datasets of dozens of gigabytes. It also offers diverse self-defined operations for visual processing, including multiple ion visualization, multiple channel superposition, image normalization, visual resolution enhancement and image filtering. Region-of-interest analysis can be performed precisely through interactive visualization of ion images and mass spectra, guided by an overlaid optical image, to directly identify region-specific biomarkers. Moreover, automatic pattern recognition can be achieved immediately through supervised or unsupervised multivariate statistical modeling. Clear discrimination between cancer tissue and adjacent tissue within an MSI dataset can be seen in the generated pattern image, which shows great potential for visual in-situ biomarker discovery and artificial-intelligence-assisted pathological diagnosis of cancer. All the features are integrated in MassImager to provide a deep MSI processing solution at the in-situ metabolomics level for biomarker discovery and future clinical pathological diagnosis. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.

  7. A method for normalizing pathology images to improve feature extraction for quantitative pathology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tam, Allison; Barker, Jocelyn; Rubin, Daniel

    Purpose: With the advent of digital slide scanning technologies and the potential proliferation of large repositories of digital pathology images, many research studies can leverage these data for biomedical discovery and to develop clinical applications. However, quantitative analysis of digital pathology images is impeded by batch effects generated by varied staining protocols and staining conditions of pathological slides. Methods: To overcome this problem, this paper proposes a novel, fully automated stain normalization method to reduce batch effects and thus aid research in digital pathology applications. The authors' method, intensity centering and histogram equalization (ICHE), normalizes a diverse set of pathology images by first scaling the centroids of the intensity histograms to a common point and then applying a modified version of contrast-limited adaptive histogram equalization. Normalization was performed on two datasets of digitized hematoxylin and eosin (H&E) slides of different tissue slices from the same lung tumor, and one immunohistochemistry dataset of digitized slides created by restaining one of the H&E datasets. Results: The ICHE method was evaluated based on image intensity values, quantitative features, and the effect on downstream applications, such as computer-aided diagnosis. For comparison, three methods from the literature were reimplemented and evaluated using the same criteria. The authors found that ICHE not only improved performance compared with un-normalized images, but in most cases showed improvement compared with previous methods for correcting batch effects in the literature. Conclusions: ICHE may be a useful preprocessing step in a digital pathology image processing pipeline.
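
    A rough sketch of the two steps named above (intensity centering followed by contrast-limited adaptive histogram equalization), using scikit-image. This is a deliberate simplification of ICHE, not the authors' implementation; the target centroid and clip limit are arbitrary assumptions.

        import numpy as np
        from skimage import exposure

        def iche_like_normalize(image, target_centroid=0.5, clip_limit=0.01):
            """Shift the image's mean intensity toward a common target, then apply CLAHE.
            A simplified stand-in for intensity centering + histogram equalization."""
            image = image.astype(np.float64)
            image = (image - image.min()) / (image.max() - image.min() + 1e-12)  # rescale to [0, 1]
            image = np.clip(image + (target_centroid - image.mean()), 0.0, 1.0)  # centre the intensities
            return exposure.equalize_adapthist(image, clip_limit=clip_limit)     # contrast-limited AHE

        # Hypothetical grayscale H&E patch.
        patch = np.random.rand(256, 256)
        normalized = iche_like_normalize(patch)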

  8. Multilayered complex network datasets for three supply chain network archetypes on an urban road grid.

    PubMed

    Viljoen, Nadia M; Joubert, Johan W

    2018-02-01

    This article presents the multilayered complex network formulation for three different supply chain network archetypes on an urban road grid and describes how 500 instances were randomly generated for each archetype. Both the supply chain network layer and the urban road network layer are directed unweighted networks. The shortest path set is calculated for each of the 1500 experimental instances. The datasets are used to empirically explore the impact that the supply chain's dependence on the transport network has on its vulnerability in Viljoen and Joubert (2017) [1]. The datasets are publicly available on Mendeley (Joubert and Viljoen, 2017) [2].
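
    Shortest path sets on a directed unweighted road layer of this kind can be computed with networkx; a toy sketch with invented edges (not taken from the published datasets):

        import networkx as nx

        # Toy directed, unweighted road layer; node labels are illustrative only.
        road = nx.DiGraph()
        road.add_edges_from([("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")])

        # All shortest paths between one supply-chain origin/destination pair
        # mapped onto the road layer (here: from "A" to "E").
        shortest_paths = list(nx.all_shortest_paths(road, source="A", target="E"))
        print(shortest_paths)  # e.g. [['A', 'B', 'C', 'E'], ['A', 'D', 'C', 'E']]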

  9. Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets.

    PubMed

    Washburne, Alex D; Silverman, Justin D; Leff, Jonathan W; Bennett, Dominic J; Darcy, John L; Mukherjee, Sayan; Fierer, Noah; David, Lawrence A

    2017-01-01

    Marker gene sequencing of microbial communities has generated big datasets of microbial relative abundances varying across environmental conditions, sample sites and treatments. These data often come with putative phylogenies, providing unique opportunities to investigate how shared evolutionary history affects microbial abundance patterns. Here, we present a method to identify the phylogenetic factors driving patterns in microbial community composition. We use the method, "phylofactorization," to re-analyze datasets from the human body and soil microbial communities, demonstrating how phylofactorization is a dimensionality-reducing tool, an ordination-visualization tool, and an inferential tool for identifying edges in the phylogeny along which putative functional ecological traits may have arisen.

  10. Interpolation of diffusion weighted imaging datasets.

    PubMed

    Dyrby, Tim B; Lundell, Henrik; Burke, Mark W; Reislev, Nina L; Paulson, Olaf B; Ptito, Maurice; Siebner, Hartwig R

    2014-12-01

    Diffusion weighted imaging (DWI) is used to study white-matter fibre organisation, orientation and structural connectivity by means of fibre reconstruction algorithms and tractography. For clinical settings, limited scan time compromises the possibilities to achieve high image resolution for finer anatomical details and the signal-to-noise ratio needed for reliable fibre reconstruction. We assessed the potential benefits of interpolating DWI datasets to a higher image resolution before fibre reconstruction using a diffusion tensor model. Simulations of straight and curved crossing tracts smaller than or equal to the voxel size showed that conventional higher-order interpolation methods improved the geometrical representation of white-matter tracts with reduced partial-volume effect (PVE), except at tract boundaries. Simulations and interpolation of ex-vivo monkey brain DWI datasets revealed that conventional interpolation methods fail to disentangle fine anatomical details if PVE is too pronounced in the original data. For validation, we used ex-vivo DWI datasets acquired at various image resolutions as well as Nissl-stained sections. Increasing the image resolution by a factor of eight yielded finer geometrical resolution and more anatomical details in complex regions such as tract boundaries and cortical layers, which are normally only visualized at higher image resolutions. Similar results were found with a typical clinical human DWI dataset. However, a possible bias in quantitative values imposed by the interpolation method used should be considered. The results indicate that conventional interpolation methods can be successfully applied to DWI datasets for mining anatomical details that are normally seen only at higher resolutions, which will aid in tractography and microstructural mapping of tissue compartments. Copyright © 2014. Published by Elsevier Inc.
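
    As a generic illustration of up-sampling a DWI volume with a conventional higher-order interpolator before fibre reconstruction, one could use scipy's spline-based zoom. The volume shape, zoom factor, and spline order are arbitrary assumptions; this is not the authors' pipeline.

        import numpy as np
        from scipy.ndimage import zoom

        # Hypothetical DWI volume: x, y, z, diffusion-weighted directions.
        dwi = np.random.rand(32, 32, 20, 6)

        # Double the resolution along each spatial axis (8x more voxels) with
        # cubic spline interpolation; the diffusion dimension is left untouched.
        dwi_highres = zoom(dwi, zoom=(2, 2, 2, 1), order=3)
        print(dwi_highres.shape)  # (64, 64, 40, 6)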

  11. A framework for automatic creation of gold-standard rigid 3D-2D registration datasets.

    PubMed

    Madan, Hennadii; Pernuš, Franjo; Likar, Boštjan; Špiclin, Žiga

    2017-02-01

    Advanced image-guided medical procedures incorporate 2D intra-interventional information into pre-interventional 3D image and plan of the procedure through 3D/2D image registration (32R). To enter clinical use, and even for publication purposes, novel and existing 32R methods have to be rigorously validated. The performance of a 32R method can be estimated by comparing it to an accurate reference or gold standard method (usually based on fiducial markers) on the same set of images (gold standard dataset). Objective validation and comparison of methods are possible only if evaluation methodology is standardized, and the gold standard dataset is made publicly available. Currently, very few such datasets exist and only one contains images of multiple patients acquired during a procedure. To encourage the creation of gold standard 32R datasets, we propose an automatic framework. The framework is based on rigid registration of fiducial markers. The main novelty is spatial grouping of fiducial markers on the carrier device, which enables automatic marker localization and identification across the 3D and 2D images. The proposed framework was demonstrated on clinical angiograms of 20 patients. Rigid 32R computed by the framework was more accurate than that obtained manually, with the respective target registration error below 0.027 mm compared to 0.040 mm. The framework is applicable for gold standard setup on any rigid anatomy, provided that the acquired images contain spatially grouped fiducial markers. The gold standard datasets and software will be made publicly available.
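
    The target registration error quoted above is, in its standard definition, the distance between target points mapped by the estimated rigid transform and their gold-standard positions; a generic numpy sketch (the rotation, translation and points below are hypothetical, not from the study):

        import numpy as np

        def target_registration_error(targets_3d, R, t, reference_targets):
            """Mean Euclidean distance (e.g. in mm) between targets mapped by the
            estimated rigid transform (R, t) and their gold-standard positions."""
            mapped = targets_3d @ R.T + t
            return np.mean(np.linalg.norm(mapped - reference_targets, axis=1))

        # Hypothetical target points (mm) and an estimated rigid transform.
        targets   = np.array([[10.0, 5.0, 2.0], [12.0, 7.5, 3.0]])
        reference = np.array([[10.1, 5.0, 1.9], [12.0, 7.6, 3.1]])
        R, t = np.eye(3), np.zeros(3)          # identity transform, for illustration only
        print(f"TRE = {target_registration_error(targets, R, t, reference):.3f} mm")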

  12. Software for illustrative presentation of basic clinical characteristics of laboratory tests--GraphROC for Windows.

    PubMed

    Kairisto, V; Poola, A

    1995-01-01

    GraphROC for Windows is a program for clinical test evaluation. It was designed for the handling of large datasets obtained from clinical laboratory databases. In the user interface, graphical and numerical presentations are combined. For simplicity, numerical data is not shown unless requested. Relevant numbers can be "picked up" from the graph by simple mouse operations. Reference distributions can be displayed by using automatically optimized bin widths. Any percentile of the distribution with corresponding confidence limits can be chosen for display. In sensitivity-specificity analysis, both illness- and health-related distributions are shown in the same graph. The following data for any cutoff limit can be shown in a separate click window: clinical sensitivity and specificity with corresponding confidence limits, positive and negative likelihood ratios, positive and negative predictive values and efficiency. Predictive values and clinical efficiency of the cutoff limit can be updated for any prior probability of disease. Receiver Operating Characteristics (ROC) curves can be generated and combined into the same graph for comparison of several different tests. The area under the curve with corresponding confidence interval is calculated for each ROC curve. Numerical results of analyses and graphs can be printed or exported to other Microsoft Windows programs. GraphROC for Windows also employs a new method, developed by us, for the indirect estimation of health-related limits and change limits from mixed distributions of clinical laboratory data.
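
    The quantities GraphROC reports can be reproduced with standard tools; below is a brief scikit-learn/numpy sketch of the ROC area, sensitivity and specificity at a chosen cutoff, and predictive values updated for an assumed prior probability of disease via Bayes' rule. The test values, cutoff, and prior are invented; this is not GraphROC's code.

        import numpy as np
        from sklearn.metrics import roc_auc_score

        # Hypothetical laboratory test values and disease status (1 = diseased).
        values = np.array([1.2, 2.3, 3.1, 0.8, 2.9, 4.0, 1.5, 3.5])
        status = np.array([0,   0,   1,   0,   1,   1,   0,   1])

        print(f"AUC = {roc_auc_score(status, values):.2f}")

        cutoff = 2.5                                   # candidate decision limit
        positive = values >= cutoff
        sensitivity = (positive & (status == 1)).sum() / (status == 1).sum()
        specificity = (~positive & (status == 0)).sum() / (status == 0).sum()

        prior = 0.10                                   # assumed prior probability of disease
        ppv = sensitivity * prior / (sensitivity * prior + (1 - specificity) * (1 - prior))
        npv = specificity * (1 - prior) / (specificity * (1 - prior) + (1 - sensitivity) * prior)
        print(f"Se = {sensitivity:.2f}, Sp = {specificity:.2f}, PPV = {ppv:.2f}, NPV = {npv:.2f}")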

  13. Toward a complete dataset of drug-drug interaction information from publicly available sources.

    PubMed

    Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D

    2015-06-01

    Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) Corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. Future work will focus on improvement of the methods for mapping between PDDI information sources, identifying methods to improve the use of the merged dataset in PDDI NLP algorithms, integrating high-quality PDDI information from the merged dataset into Wikidata, and making the combined dataset accessible as Semantic Web Linked Data. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
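
    Overlap between two PDDI sources can be summarized with simple set operations over order-independent drug pairs; a toy sketch with invented pairs (the real study used fourteen mapped sources and more careful normalization):

        # Hypothetical interacting drug pairs from two sources, stored order-independently.
        source_a = {frozenset(p) for p in [("warfarin", "aspirin"), ("simvastatin", "clarithromycin")]}
        source_b = {frozenset(p) for p in [("warfarin", "aspirin"), ("digoxin", "amiodarone")]}

        shared = source_a & source_b
        jaccard = len(shared) / len(source_a | source_b)
        print(f"{len(shared)} shared pair(s), Jaccard overlap = {jaccard:.2f}")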

  14. An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records

    PubMed Central

    Kavuluru, Ramakanth; Rios, Anthony; Lu, Yuan

    2015-01-01

    Background Diagnosis codes are assigned to medical records in healthcare facilities by trained coders by reviewing all physician authored documents associated with a patient's visit. This is a necessary and complex task involving coders adhering to coding guidelines and coding all assignable codes. With the popularity of electronic medical records (EMRs), computational approaches to code assignment have been proposed in the recent years. However, most efforts have focused on single and often short clinical narratives, while realistic scenarios warrant full EMR level analysis for code assignment. Objective We evaluate supervised learning approaches to automatically assign international classification of diseases (ninth revision) - clinical modification (ICD-9-CM) codes to EMRs by experimenting with a large realistic EMR dataset. The overall goal is to identify methods that offer superior performance in this task when considering such datasets. Methods We use a dataset of 71,463 EMRs corresponding to in-patient visits with discharge date falling in a two year period (2011–2012) from the University of Kentucky (UKY) Medical Center. We curate a smaller subset of this dataset and also use a third gold standard dataset of radiology reports. We conduct experiments using different problem transformation approaches with feature and data selection components and employing suitable label calibration and ranking methods with novel features involving code co-occurrence frequencies and latent code associations. Results Over all codes with at least 50 training examples we obtain a micro F-score of 0.48. On the set of codes that occur at least in 1% of the two year dataset, we achieve a micro F-score of 0.54. For the smaller radiology report dataset, the classifier chaining approach yields best results. For the smaller subset of the UKY dataset, feature selection, data selection, and label calibration offer best performance. Conclusions We show that datasets at different scale (size of the EMRs, number of distinct codes) and with different characteristics warrant different learning approaches. For shorter narratives pertaining to a particular medical subdomain (e.g., radiology, pathology), classifier chaining is ideal given the codes are highly related with each other. For realistic in-patient full EMRs, feature and data selection methods offer high performance for smaller datasets. However, for large EMR datasets, we observe that the binary relevance approach with learning-to-rank based code reranking offers the best performance. Regardless of the training dataset size, for general EMRs, label calibration to select the optimal number of labels is an indispensable final step. PMID:26054428

  15. An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records.

    PubMed

    Kavuluru, Ramakanth; Rios, Anthony; Lu, Yuan

    2015-10-01

    Diagnosis codes are assigned to medical records in healthcare facilities by trained coders by reviewing all physician authored documents associated with a patient's visit. This is a necessary and complex task involving coders adhering to coding guidelines and coding all assignable codes. With the popularity of electronic medical records (EMRs), computational approaches to code assignment have been proposed in the recent years. However, most efforts have focused on single and often short clinical narratives, while realistic scenarios warrant full EMR level analysis for code assignment. We evaluate supervised learning approaches to automatically assign international classification of diseases (ninth revision) - clinical modification (ICD-9-CM) codes to EMRs by experimenting with a large realistic EMR dataset. The overall goal is to identify methods that offer superior performance in this task when considering such datasets. We use a dataset of 71,463 EMRs corresponding to in-patient visits with discharge date falling in a two year period (2011-2012) from the University of Kentucky (UKY) Medical Center. We curate a smaller subset of this dataset and also use a third gold standard dataset of radiology reports. We conduct experiments using different problem transformation approaches with feature and data selection components and employing suitable label calibration and ranking methods with novel features involving code co-occurrence frequencies and latent code associations. Over all codes with at least 50 training examples we obtain a micro F-score of 0.48. On the set of codes that occur at least in 1% of the two year dataset, we achieve a micro F-score of 0.54. For the smaller radiology report dataset, the classifier chaining approach yields best results. For the smaller subset of the UKY dataset, feature selection, data selection, and label calibration offer best performance. We show that datasets at different scale (size of the EMRs, number of distinct codes) and with different characteristics warrant different learning approaches. For shorter narratives pertaining to a particular medical subdomain (e.g., radiology, pathology), classifier chaining is ideal given the codes are highly related with each other. For realistic in-patient full EMRs, feature and data selection methods offer high performance for smaller datasets. However, for large EMR datasets, we observe that the binary relevance approach with learning-to-rank based code reranking offers the best performance. Regardless of the training dataset size, for general EMRs, label calibration to select the optimal number of labels is an indispensable final step. Copyright © 2015 Elsevier B.V. All rights reserved.
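
    The two multi-label strategies compared in the entries above, binary relevance and classifier chaining, are available in scikit-learn; a minimal sketch with a synthetic multi-label dataset standing in for EMR features and ICD-9-CM codes (the real pipeline additionally used feature/data selection, label calibration, and learning-to-rank reranking):

        from sklearn.datasets import make_multilabel_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import f1_score
        from sklearn.model_selection import train_test_split
        from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

        # Synthetic stand-in for EMR features (X) and ICD-9-CM code assignments (Y).
        X, Y = make_multilabel_classification(n_samples=500, n_features=40, n_classes=10, random_state=0)
        X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

        binary_relevance = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
        chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0).fit(X_tr, Y_tr)

        for name, model in [("binary relevance", binary_relevance), ("classifier chain", chain)]:
            print(name, "micro-F1 =", round(f1_score(Y_te, model.predict(X_te), average="micro"), 3))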

  16. Isfahan MISP Dataset.

    PubMed

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website, entitled "biosigdata.com," is a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users can download the datasets and can also share their own supplementary materials while maintaining their privacy (citation and fee). Commenting is also available for all datasets, and automatic sitemap and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary file (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).

  17. A Q-backpropagated time delay neural network for diagnosing severity of gait disturbances in Parkinson's disease.

    PubMed

    Nancy Jane, Y; Khanna Nehemiah, H; Arputharaj, Kannan

    2016-04-01

    Parkinson's disease (PD) is a movement disorder that affects the patient's nervous system, and health-care applications mostly use wearable sensors to collect movement data. Since these sensors generate time-stamped data, analyzing gait disturbances in PD becomes a challenging task. The objective of this paper is to develop an effective clinical decision-making system (CDMS) that aids the physician in diagnosing the severity of gait disturbances in PD-affected patients. This paper presents a Q-backpropagated time delay neural network (Q-BTDNN) classifier that builds a temporal classification model, which performs the tasks of classification and prediction in the CDMS. The proposed Q-learning induced backpropagation (Q-BP) training algorithm trains the Q-BTDNN by generating a reinforced error signal. The network's weights are adjusted by backpropagating the generated error signal. For experimentation, the proposed work uses a PD gait database, which contains gait measures collected through wearable sensors in three different PD research studies. The experimental results demonstrate the efficiency of Q-BP, with improved classification accuracies of 91.49%, 92.19% and 90.91% on the three datasets, respectively, compared to other neural network training algorithms. Copyright © 2016 Elsevier Inc. All rights reserved.

  18. Validation of "AW3D" Global Dsm Generated from Alos Prism

    NASA Astrophysics Data System (ADS)

    Takaku, Junichi; Tadono, Takeo; Tsutsui, Ken; Ichikawa, Mayumi

    2016-06-01

    The Panchromatic Remote-sensing Instrument for Stereo Mapping (PRISM), one of the onboard sensors carried by the Advanced Land Observing Satellite (ALOS), was designed to generate worldwide topographic data through optical stereoscopic observation. It has the exclusive ability to perform triplet stereo observation, viewing forward, nadir, and backward along the satellite track at 2.5 m ground resolution, and it collected images all over the world during the mission life of the satellite from 2006 through 2011. A new project, which generates global elevation datasets from these image archives, was started in 2014. The data are processed at an unprecedented 5 m grid spacing utilizing the original triplet stereo images at 2.5 m resolution. As the amount of processed data has grown steadily, so that the global land areas are now almost covered, trends in global data quality have become apparent. This paper reports up-to-date results of validations of the accuracy of the data products as well as the status of data coverage in global areas. The accuracies and error characteristics of the datasets are analyzed by comparison with existing global datasets such as Ice, Cloud, and land Elevation Satellite (ICESat) data, as well as ground control points (GCPs) and reference Digital Elevation Models (DEMs) derived from airborne Light Detection and Ranging (LiDAR).

  19. A novel approach for incremental uncertainty rule generation from databases with missing values handling: application to dynamic medical databases.

    PubMed

    Konias, Sokratis; Chouvarda, Ioanna; Vlahavas, Ioannis; Maglaveras, Nicos

    2005-09-01

    Current approaches for mining association rules usually assume that the mining is performed in a static database, where the problem of missing attribute values does not practically exist. However, these assumptions are not preserved in some medical databases, like in a home care system. In this paper, a novel uncertainty rule algorithm is illustrated, namely URG-2 (Uncertainty Rule Generator), which addresses the problem of mining dynamic databases containing missing values. This algorithm requires only one pass from the initial dataset in order to generate the item set, while new metrics corresponding to the notion of Support and Confidence are used. URG-2 was evaluated over two medical databases, introducing randomly multiple missing values for each record's attribute (rate: 5-20% by 5% increments) in the initial dataset. Compared with the classical approach (records with missing values are ignored), the proposed algorithm was more robust in mining rules from datasets containing missing values. In all cases, the difference in preserving the initial rules ranged between 30% and 60% in favour of URG-2. Moreover, due to its incremental nature, URG-2 saved over 90% of the time required for thorough re-mining. Thus, the proposed algorithm can offer a preferable solution for mining in dynamic relational databases.

  20. Cancer Transcriptome Dataset Analysis: Comparing Methods of Pathway and Gene Regulatory Network-Based Cluster Identification.

    PubMed

    Nam, Seungyoon

    2017-04-01

    Cancer transcriptome analysis is one of the leading areas of Big Data science, biomarker, and pharmaceutical discovery, not to forget personalized medicine. Yet, cancer transcriptomics and postgenomic medicine require innovation in bioinformatics as well as comparison of the performance of available algorithms. In this data analytics context, the value of network generation and algorithms has been widely underscored for addressing the salient questions in cancer pathogenesis. Analysis of cancer transcriptomes often results in complicated networks where identification of network modularity remains critical, for example, in delineating the "druggable" molecular targets. Network clustering is useful, but depends on the network topology in and of itself. Notably, the performance of different network-generating tools for network cluster (NC) identification has been little investigated to date. Hence, using gastric cancer (GC) transcriptomic datasets, we compared two algorithms for generating pathway- versus gene regulatory network-based NCs, showing that the pathway-based approach agrees better with a reference set of cancer-functional contexts. Finally, by applying pathway-based NC identification to GC transcriptome datasets, we describe cancer NCs that associate with candidate therapeutic targets and biomarkers in GC. These observations collectively inform future research on cancer transcriptomics, drug discovery, and rational development of new analysis tools for optimal harnessing of omics data.

  1. Effective Vehicle-Based Kangaroo Detection for Collision Warning Systems Using Region-Based Convolutional Networks.

    PubMed

    Saleh, Khaled; Hossny, Mohammed; Nahavandi, Saeid

    2018-06-12

    Traffic collisions between kangaroos and motorists are on the rise on Australian roads. According to a recent report, it was estimated that more than 20,000 kangaroo-vehicle collisions occurred during the year 2015 alone in Australia. In this work, we propose a vehicle-based framework for kangaroo detection in urban and highway traffic environments that could be used for collision warning systems. Our proposed framework is based on region-based convolutional neural networks (RCNN). Given the scarcity of labeled data of kangaroos in traffic environments, we utilized our state-of-the-art data generation pipeline to generate 17,000 synthetic depth images of traffic scenes with kangaroo instances annotated in them. We trained our proposed RCNN-based framework on a subset of the generated synthetic depth image dataset. The proposed framework achieved a high average precision (AP) score of 92% over all the testing synthetic depth image datasets. We compared our proposed framework against other baseline approaches and outperformed them by more than 37% in AP score over all the testing datasets. Additionally, we evaluated the generalization performance of the proposed framework on real live data and achieved resilient detection accuracy without any further fine-tuning of our proposed RCNN-based framework.

  2. The South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLAM BRC) case register: development and descriptive data

    PubMed Central

    Stewart, Robert; Soremekun, Mishael; Perera, Gayan; Broadbent, Matthew; Callard, Felicity; Denis, Mike; Hotopf, Matthew; Thornicroft, Graham; Lovestone, Simon

    2009-01-01

    Background Case registers have been used extensively in mental health research. Recent developments in electronic medical records, and in computer software to search and analyse these in anonymised format, have the potential to revolutionise this research tool. Methods We describe the development of the South London and Maudsley NHS Foundation Trust (SLAM) Biomedical Research Centre (BRC) Case Register Interactive Search tool (CRIS) which allows research-accessible datasets to be derived from SLAM, the largest provider of secondary mental healthcare in Europe. All clinical data, including free text, are available for analysis in the form of anonymised datasets. Development involved both the building of the system and setting in place the necessary security (with both functional and procedural elements). Results Descriptive data are presented for the Register database as of October 2008. The database at that point included 122,440 cases, 35,396 of whom were receiving active case management under the Care Programme Approach. In terms of gender and ethnicity, the database was reasonably representative of the source population. The most common assigned primary diagnoses were within the ICD mood disorders (n = 12,756) category followed by schizophrenia and related disorders (8158), substance misuse (7749), neuroses (7105) and organic disorders (6414). Conclusion The SLAM BRC Case Register represents a 'new generation' of this research design, built on a long-running system of fully electronic clinical records and allowing in-depth secondary analysis of numerical, string and free-text data, whilst preserving anonymity through technical and procedural safeguards. PMID:19674459

  3. The cross-sectional GRAS sample: A comprehensive phenotypical data collection of schizophrenic patients

    PubMed Central

    2010-01-01

    Background Schizophrenia is the collective term for an exclusively clinically diagnosed, heterogeneous group of mental disorders with still obscure biological roots. Based on the assumption that valuable information about relevant genetic and environmental disease mechanisms can be obtained by association studies on patient cohorts of ≥ 1000 patients, if performed on detailed clinical datasets and quantifiable biological readouts, we generated a new schizophrenia data base, the GRAS (Göttingen Research Association for Schizophrenia) data collection. GRAS is the necessary ground to study genetic causes of the schizophrenic phenotype in a 'phenotype-based genetic association study' (PGAS). This approach is different from and complementary to the genome-wide association studies (GWAS) on schizophrenia. Methods For this purpose, 1085 patients were recruited between 2005 and 2010 by an invariable team of traveling investigators in a cross-sectional field study that comprised 23 German psychiatric hospitals. Additionally, chart records and discharge letters of all patients were collected. Results The corresponding dataset extracted and presented in form of an overview here, comprises biographic information, disease history, medication including side effects, and results of comprehensive cross-sectional psychopathological, neuropsychological, and neurological examinations. With >3000 data points per schizophrenic subject, this data base of living patients, who are also accessible for follow-up studies, provides a wide-ranging and standardized phenotype characterization of as yet unprecedented detail. Conclusions The GRAS data base will serve as prerequisite for PGAS, a novel approach to better understanding 'the schizophrenias' through exploring the contribution of genetic variation to the schizophrenic phenotypes. PMID:21067598

  4. How do we estimate survival? External validation of a tool for survival estimation in patients with metastatic bone disease-decision analysis and comparison of three international patient populations.

    PubMed

    Piccioli, Andrea; Spinelli, M Silvia; Forsberg, Jonathan A; Wedin, Rikard; Healey, John H; Ippolito, Vincenzo; Daolio, Primo Andrea; Ruggieri, Pietro; Maccauro, Giulio; Gasbarrini, Alessandro; Biagini, Roberto; Piana, Raimondo; Fazioli, Flavio; Luzzati, Alessandro; Di Martino, Alberto; Nicolosi, Francesco; Camnasio, Francesco; Rosa, Michele Attilio; Campanacci, Domenico Andrea; Denaro, Vincenzo; Capanna, Rodolfo

    2015-05-22

    We recently developed a clinical decision support tool capable of estimating the likelihood of survival at 3 and 12 months following surgery for patients with operable skeletal metastases. After making it publicly available on www.PATHFx.org, we attempted to externally validate it using independent, international data. We collected data from patients treated at 13 Italian orthopaedic oncology referral centers between 2010 and 2013, then applied PATHFx to each record, generating a probability of survival at three and 12 months for each patient. We assessed accuracy using the area under the receiver-operating characteristic curve (AUC) and clinical utility using Decision Curve Analysis (DCA), and compared the Italian patient data to the training set (United States) and the first external validation set (Scandinavia). The Italian dataset contained 287 records with at least 12 months of follow-up information. The AUCs for the three-month and 12-month estimates were 0.80 and 0.77, respectively. There were missing data, including the surgeon's estimate of survival, which was missing in the majority of records. Physiologically, Italian patients were similar to patients in the training and first validation sets. However, notable differences were observed in the proportion of those surviving three and 12 months, suggesting differences in referral patterns and perhaps indications for surgery. PATHFx was successfully validated in an Italian dataset containing missing data. This study demonstrates its broad applicability to European patients, even in centers with treatment philosophies that differ from those previously studied.
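
    Decision Curve Analysis, as used above, rests on the standard net-benefit formula; a compact numpy sketch with invented predictions (not the PATHFx data):

        import numpy as np

        def net_benefit(y_true, y_prob, threshold):
            """Net benefit at a given threshold probability pt:
            TP/n - FP/n * (pt / (1 - pt)), as used in decision curve analysis."""
            y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
            n = len(y_true)
            predicted_positive = y_prob >= threshold
            tp = np.sum(predicted_positive & (y_true == 1))
            fp = np.sum(predicted_positive & (y_true == 0))
            return tp / n - fp / n * (threshold / (1 - threshold))

        # Hypothetical 12-month survival outcomes and model probabilities.
        survived = [1, 0, 1, 1, 0, 1]
        p_model  = [0.9, 0.3, 0.7, 0.8, 0.4, 0.6]
        for pt in (0.2, 0.5):
            print(f"net benefit at pt={pt}: {net_benefit(survived, p_model, pt):.3f}")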

  5. Genome-wide methylomic and transcriptomic analyses identify subtype-specific epigenetic signatures commonly dysregulated in glioma stem cells and glioblastoma.

    PubMed

    Pangeni, Rajendra P; Zhang, Zhou; Alvarez, Angel A; Wan, Xuechao; Sastry, Namratha; Lu, Songjian; Shi, Taiping; Huang, Tianzhi; Lei, Charles X; James, C David; Kessler, John A; Brennan, Cameron W; Nakano, Ichiro; Lu, Xinghua; Hu, Bo; Zhang, Wei; Cheng, Shi-Yuan

    2018-06-21

    Glioma stem cells (GSCs), a subpopulation of tumor cells, contribute to tumor heterogeneity and therapy resistance. Gene expression profiling classified glioblastoma (GBM) and GSCs into four transcriptomically-defined subtypes. Here, we determined the DNA methylation signatures in transcriptomically pre-classified GSC and GBM bulk tumors subtypes. We hypothesized that these DNA methylation signatures correlate with gene expression and are uniquely associated either with only GSCs or only GBM bulk tumors. Additional methylation signatures may be commonly associated with both GSCs and GBM bulk tumors, i.e., common to non-stem-like and stem-like tumor cell populations and correlating with the clinical prognosis of glioma patients. We analyzed Illumina 450K methylation array and expression data from a panel of 23 patient-derived GSCs. We referenced these results with The Cancer Genome Atlas (TCGA) GBM datasets to generate methylomic and transcriptomic signatures for GSCs and GBM bulk tumors of each transcriptomically pre-defined tumor subtype. Survival analyses were carried out for these signature genes using publicly available datasets, including from TCGA. We report that DNA methylation signatures in proneural and mesenchymal tumor subtypes are either unique to GSCs, unique to GBM bulk tumors, or common to both. Further, dysregulated DNA methylation correlates with gene expression and clinical prognoses. Additionally, many previously identified transcriptionally-regulated markers are also dysregulated due to DNA methylation. The subtype-specific DNA methylation signatures described in this study could be useful for refining GBM sub-classification, improving prognostic accuracy, and making therapeutic decisions.

  6. Undersampling strategies for compressed sensing accelerated MR spectroscopic imaging

    NASA Astrophysics Data System (ADS)

    Vidya Shankar, Rohini; Hu, Houchun Harry; Bikkamane Jayadev, Nutandev; Chang, John C.; Kodibagkar, Vikram D.

    2017-03-01

    Compressed sensing (CS) can accelerate magnetic resonance spectroscopic imaging (MRSI), facilitating its widespread clinical integration. The objective of this study was to assess the effect of different undersampling strategies on CS-MRSI reconstruction quality. Phantom data were acquired on a Philips 3 T Ingenia scanner. Four types of undersampling masks, one for each strategy (low resolution, variable density, iterative design, and a priori), were simulated in Matlab and retrospectively applied to the fully sampled (1X) MRSI data to generate undersampled datasets corresponding to 2X-5X and 7X accelerations for each type of mask. Reconstruction parameters were kept the same in each case (all masks and accelerations) to ensure that any resulting differences could be attributed to the type of mask employed. The reconstructed datasets from each mask were statistically compared with the reference 1X data and assessed using metrics such as the root mean square error and metabolite ratios. Simulation results indicate that both the a priori and variable density undersampling masks maintain high fidelity with the 1X data up to five-fold acceleration. The low resolution mask based reconstructions showed statistically significant differences from the 1X data, with the reconstruction failing at 3X, while the iterative design reconstructions maintained fidelity with the 1X data up to 4X acceleration. In summary, a pilot study was conducted to identify an optimal sampling mask for CS-MRSI. Simulation results demonstrate that the a priori and variable density masks can provide statistically similar results to the fully sampled reference. Future work will involve implementing these two masks prospectively on a clinical scanner.
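
    One of the four strategies above, variable-density random undersampling, can be sketched by drawing k-space samples with a probability that decays with distance from the centre. The decay exponent and target acceleration below are arbitrary choices, not the study's parameters, and clipping near the centre makes the achieved acceleration approximate.

        import numpy as np

        def variable_density_mask(shape, acceleration=4, decay=2.0, seed=0):
            """Random k-space sampling mask whose sampling probability falls off with
            distance from the centre, scaled toward keeping ~1/acceleration of points."""
            rng = np.random.default_rng(seed)
            ky, kx = np.indices(shape)
            radius = np.hypot(ky - shape[0] / 2, kx - shape[1] / 2)
            prob = 1.0 / (1.0 + radius) ** decay
            prob *= (np.prod(shape) / acceleration) / prob.sum()  # normalize toward target density
            return rng.random(shape) < np.clip(prob, 0, 1)

        mask = variable_density_mask((64, 64), acceleration=4)
        print(f"sampled fraction: {mask.mean():.2f}")  # approximate, due to clipping near the centre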

  7. The Isprs Benchmark on Indoor Modelling

    NASA Astrophysics Data System (ADS)

    Khoshelham, K.; Díaz Vilariño, L.; Peter, M.; Kang, Z.; Acharya, D.

    2017-09-01

    Automated generation of 3D indoor models from point cloud data has been a topic of intensive research in recent years. While results on various datasets have been reported in literature, a comparison of the performance of different methods has not been possible due to the lack of benchmark datasets and a common evaluation framework. The ISPRS benchmark on indoor modelling aims to address this issue by providing a public benchmark dataset and an evaluation framework for performance comparison of indoor modelling methods. In this paper, we present the benchmark dataset comprising several point clouds of indoor environments captured by different sensors. We also discuss the evaluation and comparison of indoor modelling methods based on manually created reference models and appropriate quality evaluation criteria. The benchmark dataset is available for download at: http://www2.isprs.org/commissions/comm4/wg5/benchmark-on-indoor-modelling.html.

  8. Disk storage management for LHCb based on Data Popularity estimator

    NASA Astrophysics Data System (ADS)

    Hushchyn, Mikhail; Charpentier, Philippe; Ustyuzhanin, Andrey

    2015-12-01

    This paper presents an algorithm providing recommendations for optimizing the LHCb data storage. The LHCb data storage system is a hybrid system. All datasets are kept as archives on magnetic tapes. The most popular datasets are kept on disks. The algorithm takes the dataset usage history and metadata (size, type, configuration etc.) to generate a recommendation report. This article presents how we use machine learning algorithms to predict future data popularity. Using these predictions it is possible to estimate which datasets should be removed from disk. We use regression algorithms and time series analysis to find the optimal number of replicas for datasets that are kept on disk. Based on the data popularity and the number of replicas optimization, the algorithm minimizes a loss function to find the optimal data distribution. The loss function represents all requirements for data distribution in the data storage system. We demonstrate how our algorithm helps to save disk space and to reduce waiting times for jobs using this data.

  9. Remote visual analysis of large turbulence databases at multiple scales

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  10. Remote visual analysis of large turbulence databases at multiple scales

    DOE PAGES

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...

    2018-06-15

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.
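
    Wavelet-based multi-resolution decomposition of the kind the framework relies on can be illustrated with PyWavelets on a small 3D block; the wavelet choice, level, and block size are arbitrary assumptions, and this is not the framework's compression code.

        import numpy as np
        import pywt

        # Small hypothetical block of a turbulence field (e.g. one velocity component).
        block = np.random.rand(32, 32, 32)

        # Two-level 3D multi-resolution decomposition: coeffs[0] is the coarse
        # approximation, the remaining entries hold detail coefficients per scale.
        coeffs = pywt.wavedecn(block, wavelet="db2", level=2)
        print("coarse approximation shape:", coeffs[0].shape)
        print("detail subbands at the finest scale:", sorted(coeffs[-1].keys()))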

  11. MHK Hydrofoils Design, Wind Tunnel Optimization and CFD Analysis Report for the Aquantis 2.5MW Ocean Current Generation Device

    DOE Data Explorer

    Shiu, Henry; Swales, Henry; Van Damn, Case

    2015-06-03

    Dataset contains MHK Hydrofoils Design and Optimization and CFD Analysis Report for the Aquantis 2.5 MW Ocean Current Generation Device, as well as MHK Hydrofoils Wind Tunnel Test Plan and Checkout Test Report.

  12. A metadata approach for clinical data management in translational genomics studies in breast cancer.

    PubMed

    Papatheodorou, Irene; Crichton, Charles; Morris, Lorna; Maccallum, Peter; Davies, Jim; Brenton, James D; Caldas, Carlos

    2009-11-30

    In molecular profiling studies of cancer patients, experimental and clinical data are combined in order to understand the clinical heterogeneity of the disease: clinical information for each subject needs to be linked to tumour samples, macromolecules extracted, and experimental results. This may involve the integration of clinical data sets from several different sources: these data sets may employ different data definitions and some may be incomplete. In this work we employ semantic web techniques developed within the CancerGrid project, in particular the use of metadata elements and logic-based inference to annotate heterogeneous clinical information, integrate and query it. We show how this integration can be achieved automatically, following the declaration of appropriate metadata elements for each clinical data set; we demonstrate the practicality of this approach through application to experimental results and clinical data from five hospitals in the UK and Canada, undertaken as part of the METABRIC project (Molecular Taxonomy of Breast Cancer International Consortium). We describe a metadata approach for managing similarities and differences in clinical datasets in a standardized way that uses Common Data Elements (CDEs). We apply and evaluate the approach by integrating the five different clinical datasets of METABRIC.

  13. Quantitative PET/CT scanner performance characterization based upon the society of nuclear medicine and molecular imaging clinical trials network oncology clinical simulator phantom.

    PubMed

    Sunderland, John J; Christian, Paul E

    2015-01-01

    The Clinical Trials Network (CTN) of the Society of Nuclear Medicine and Molecular Imaging (SNMMI) operates a PET/CT phantom imaging program using the CTN's oncology clinical simulator phantom, designed to validate scanners at sites that wish to participate in oncology clinical trials. Since its inception in 2008, the CTN has collected 406 well-characterized phantom datasets from 237 scanners at 170 imaging sites covering the spectrum of commercially available PET/CT systems. The combined and collated phantom data describe a global profile of quantitative performance and variability of PET/CT data used in both clinical practice and clinical trials. Individual sites filled and imaged the CTN oncology PET phantom according to detailed instructions. Standard clinical reconstructions were requested and submitted. The phantom itself contains uniform regions suitable for scanner calibration assessment, lung fields, and 6 hot spheric lesions with diameters ranging from 7 to 20 mm at a 4:1 contrast ratio with primary background. The CTN Phantom Imaging Core evaluated the quality of the phantom fill and imaging and measured background standardized uptake values to assess scanner calibration and maximum standardized uptake values of all 6 lesions to review quantitative performance. Scanner make-and-model-specific measurements were pooled and then subdivided by reconstruction to create scanner-specific quantitative profiles. Different makes and models of scanners predictably demonstrated different quantitative performance profiles including, in some cases, small calibration bias. Differences in site-specific reconstruction parameters increased the quantitative variability among similar scanners, with postreconstruction smoothing filters being the most influential parameter. Quantitative assessment of this intrascanner variability over this large collection of phantom data gives, for the first time, estimates of reconstruction variance introduced into trials from allowing trial sites to use their preferred reconstruction methodologies. Predictably, time-of-flight-enabled scanners exhibited less size-based partial-volume bias than non-time-of-flight scanners. The CTN scanner validation experience over the past 5 y has generated a rich, well-curated phantom dataset from which PET/CT make-and-model and reconstruction-dependent quantitative behaviors were characterized for the purposes of understanding and estimating scanner-based variances in clinical trials. These results should make it possible to identify and recommend make-and-model-specific reconstruction strategies to minimize measurement variability in cancer clinical trials. © 2015 by the Society of Nuclear Medicine and Molecular Imaging, Inc.
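
    The standardized uptake values assessed above follow a simple definition; a generic computation is shown below with invented values (decay correction and other refinements omitted for brevity).

        def standardized_uptake_value(activity_kbq_per_ml, injected_dose_mbq, body_weight_kg):
            """SUV = tissue activity concentration / (injected dose / body weight).
            With activity in kBq/mL, dose in MBq and weight in kg, the units cancel
            (1 MBq/kg = 1 kBq/g, and 1 mL of tissue is taken as 1 g)."""
            return activity_kbq_per_ml / (injected_dose_mbq / body_weight_kg)

        # Hypothetical phantom background reading.
        suv = standardized_uptake_value(activity_kbq_per_ml=5.3, injected_dose_mbq=370, body_weight_kg=70)
        print(f"SUV = {suv:.2f}")  # ~1.0, as expected for a well-calibrated uniform background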

  14. The Problem with Big Data: Operating on Smaller Datasets to Bridge the Implementation Gap.

    PubMed

    Mann, Richard P; Mushtaq, Faisal; White, Alan D; Mata-Cervantes, Gabriel; Pike, Tom; Coker, Dalton; Murdoch, Stuart; Hiles, Tim; Smith, Clare; Berridge, David; Hinchliffe, Suzanne; Hall, Geoff; Smye, Stephen; Wilkie, Richard M; Lodge, J Peter A; Mon-Williams, Mark

    2016-01-01

    Big datasets have the potential to revolutionize public health. However, there is a mismatch between the political and scientific optimism surrounding big data and the public's perception of its benefit. We suggest a systematic and concerted emphasis on developing models derived from smaller datasets to illustrate to the public how big data can produce tangible benefits in the long term. In order to highlight the immediate value of a small data approach, we produced a proof-of-concept model predicting hospital length of stay. The results demonstrate that existing small datasets can be used to create models that generate a reasonable prediction, facilitating health-care delivery. We propose that greater attention (and funding) needs to be directed toward the utilization of existing information resources in parallel with current efforts to create and exploit "big data."

  15. Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care: euroCAT.

    PubMed

    Deist, Timo M; Jochems, A; van Soest, Johan; Nalbantov, Georgi; Oberije, Cary; Walsh, Seán; Eble, Michael; Bulens, Paul; Coucke, Philippe; Dries, Wim; Dekker, Andre; Lambin, Philippe

    2017-06-01

    Machine learning applications for personalized medicine are highly dependent on access to sufficient data. For personalized radiation oncology, datasets representing the variation in the entire cancer patient population need to be acquired and used to learn prediction models. Ethical and legal boundaries to ensure data privacy hamper collaboration between research institutes. We hypothesize that data sharing is possible without identifiable patient data leaving the radiation clinics and that building machine learning applications on distributed datasets is feasible. We developed and implemented an IT infrastructure in five radiation clinics across three countries (Belgium, Germany, and The Netherlands). We present here a proof-of-principle for future 'big data' infrastructures and distributed learning studies. Lung cancer patient data was collected in all five locations and stored in local databases. Exemplary support vector machine (SVM) models were learned using the Alternating Direction Method of Multipliers (ADMM) from the distributed databases to predict post-radiotherapy dyspnea grade [Formula: see text]. The discriminative performance was assessed by the area under the curve (AUC) in a five-fold cross-validation (learning on four sites and validating on the fifth). The performance of the distributed learning algorithm was compared to centralized learning where datasets of all institutes are jointly analyzed. The euroCAT infrastructure has been successfully implemented in five radiation clinics across three countries. SVM models can be learned on data distributed over all five clinics. Furthermore, the infrastructure provides a general framework to execute learning algorithms on distributed data. The ongoing expansion of the euroCAT network will facilitate machine learning in radiation oncology. The resulting access to larger datasets with sufficient variation will pave the way for generalizable prediction models and personalized medicine.
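
    As an illustration of the distributed-learning idea (only model parameters, never patient records, leave each site), the sketch below uses consensus ADMM with a local ridge-regression loss as a stand-in for the SVM objective in the paper; it is not the euroCAT implementation, and the site data, penalty, and parameter choices are toy assumptions.

```python
# Minimal sketch of consensus ADMM for distributed learning. A local ridge-
# regression loss stands in for the paper's SVM objective; only parameter vectors
# are exchanged, while the raw data (A_k, b_k) never leave each site.
import numpy as np

def local_update(A, b, z, u, rho):
    """x-update at one site: argmin 0.5*||Ax - b||^2 + (rho/2)*||x - z + u||^2."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + rho * np.eye(d), A.T @ b + rho * (z - u))

def consensus_admm(sites, rho=1.0, lam=0.1, iters=200):
    d = sites[0][0].shape[1]
    z = np.zeros(d)                      # global consensus model
    us = [np.zeros(d) for _ in sites]    # scaled dual variables, one per site
    for _ in range(iters):
        xs = [local_update(A, b, z, u, rho) for (A, b), u in zip(sites, us)]
        x_bar, u_bar = np.mean(xs, axis=0), np.mean(us, axis=0)
        # z-update for a global ridge penalty lam*||z||^2
        z = rho * len(sites) * (x_bar + u_bar) / (2 * lam + rho * len(sites))
        us = [u + x - z for u, x in zip(us, xs)]
    return z

# Toy usage: three "clinics" holding private samples from the same underlying model
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
sites = [(A, A @ w_true + 0.1 * rng.normal(size=40))
         for A in (rng.normal(size=(40, 5)) for _ in range(3))]
print(consensus_admm(sites))
```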

  16. Data Portal | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The CPTAC Data Portal is a centralized repository for the public dissemination of proteomic sequence datasets collected by CPTAC, along with corresponding genomic sequence datasets. In addition, analyses of CPTAC's raw mass spectrometry-based data files (mapping of spectra to peptide sequences and protein identification), performed by individual CPTAC investigators and by a Common Data Analysis Pipeline, are also available.

  17. Improving resolution of MR images with an adversarial network incorporating images with different contrast.

    PubMed

    Kim, Ki Hwan; Do, Won-Joon; Park, Sung-Hong

    2018-05-04

    The routine MRI scan protocol consists of multiple pulse sequences that acquire images of varying contrast. Since high frequency contents such as edges are not significantly affected by image contrast, down-sampled images in one contrast may be improved by high resolution (HR) images acquired in another contrast, reducing the total scan time. In this study, we propose a new deep learning framework that uses HR MR images in one contrast to generate HR MR images from highly down-sampled MR images in another contrast. The proposed convolutional neural network (CNN) framework consists of two CNNs: (a) a reconstruction CNN for generating HR images from the down-sampled images using HR images acquired with a different MRI sequence and (b) a discriminator CNN for improving the perceptual quality of the generated HR images. The proposed method was evaluated using a public brain tumor database and in vivo datasets. The performance of the proposed method was assessed in tumor and no-tumor cases separately, with perceptual image quality being judged by a radiologist. To overcome the challenge of training the network with a small number of available in vivo datasets, the network was pretrained using the public database and then fine-tuned using the small number of in vivo datasets. The performance of the proposed method was also compared to that of several compressed sensing (CS) algorithms. Incorporating HR images of another contrast improved the quantitative assessments of the generated HR image in reference to ground truth. Also, incorporating a discriminator CNN yielded perceptually higher image quality. These results were verified in regions of normal tissue as well as tumors for various MRI sequences from pseudo k-space data generated from the public database. The combination of pretraining with the public database and fine-tuning with the small number of real k-space datasets enhanced the performance of CNNs in in vivo application compared to training CNNs from scratch. The proposed method outperformed the compressed sensing methods. The proposed method can be a good strategy for accelerating routine MRI scanning. © 2018 American Association of Physicists in Medicine.
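
    The sketch below is a minimal PyTorch illustration of the two-network idea described above (a reconstruction CNN fed with the upsampled low-resolution target-contrast image concatenated with a high-resolution reference-contrast image, plus a small discriminator supplying an adversarial term on top of an L1 loss). It is not the paper's architecture or training scheme; the layer sizes, loss weighting, and random tensors standing in for MR images are assumptions.

```python
# Minimal sketch (not the paper's architecture): a reconstruction CNN maps a
# 2-channel input (upsampled low-res target contrast + HR reference contrast) to an
# HR image, trained with L1 + adversarial losses from a small discriminator CNN.
import torch
import torch.nn as nn

class ReconCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))
    def forward(self, x):
        return self.net(x)

G, D = ReconCNN(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

# Random stand-ins for (upsampled low-res target, HR reference contrast, HR ground truth)
lowres_up = torch.rand(4, 1, 64, 64)
ref_hr    = torch.rand(4, 1, 64, 64)
target_hr = torch.rand(4, 1, 64, 64)

for _ in range(3):  # a few illustrative steps
    fake = G(torch.cat([lowres_up, ref_hr], dim=1))
    # --- discriminator step ---
    d_loss = bce(D(target_hr), torch.ones(4, 1)) + bce(D(fake.detach()), torch.zeros(4, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # --- generator step: L1 fidelity + small adversarial term ---
    g_loss = l1(fake, target_hr) + 1e-3 * bce(D(fake), torch.ones(4, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```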

  18. Hydrology Research with the North American Land Data Assimilation System (NLDAS) Datasets at the NASA GES DISC Using Giovanni

    NASA Technical Reports Server (NTRS)

    Mocko, David M.; Rui, Hualan; Acker, James G.

    2013-01-01

    The North American Land Data Assimilation System (NLDAS) is a collaboration project between NASA/GSFC, NOAA, Princeton Univ., and the Univ. of Washington. NLDAS has created a surface meteorology dataset using the best-available observations and reanalyses; the backbone of this dataset is a gridded precipitation analysis from rain gauges. This dataset is used to drive four separate land-surface models (LSMs) to produce datasets of soil moisture, snow, runoff, and surface fluxes. NLDAS datasets are available hourly and extend from Jan 1979 to near real-time with a typical 4-day lag. The datasets are available at 1/8th-degree resolution over CONUS and portions of Canada and Mexico from 25-53 North. The datasets have been extensively evaluated against observations and are also used as part of a drought monitor. NLDAS datasets are available from the NASA GES DISC and can be accessed via ftp, GDS, Mirador, and Giovanni. GES DISC news articles were published showing figures from the heat wave of 2011, Hurricane Irene, Tropical Storm Lee, and the low-snow winter of 2011-2012. For this presentation, Giovanni-generated figures using NLDAS data from the derecho across the U.S. Midwest and Mid-Atlantic will be presented. Similar figures will also be presented from the landfall of Hurricane Isaac and the before-and-after drought conditions along the path of tropical moisture into the central United States. Updates on future products and datasets from the NLDAS project will also be introduced.

  19. The Greenwich Photo-heliographic Results (1874 - 1976): Summary of the Observations, Applications, Datasets, Definitions and Errors

    NASA Astrophysics Data System (ADS)

    Willis, D. M.; Coffey, H. E.; Henwood, R.; Erwin, E. H.; Hoyt, D. V.; Wild, M. N.; Denig, W. F.

    2013-11-01

    The measurements of sunspot positions and areas that were published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), as the Greenwich Photo-heliographic Results (GPR), 1874 - 1976, exist in both printed and digital forms. These printed and digital sunspot datasets have been archived in various libraries and data centres. Unfortunately, however, typographic, systematic and isolated errors can be found in the various datasets. The purpose of the present paper is to begin the task of identifying and correcting these errors. In particular, the intention is to provide in one foundational paper all the necessary background information on the original solar observations, their various applications in scientific research, the format of the different digital datasets, the necessary definitions of the quantities measured, and the initial identification of errors in both the printed publications and the digital datasets. Two companion papers address the question of specific identifiable errors; namely, typographic errors in the printed publications, and both isolated and systematic errors in the digital datasets. The existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the programme of solar observations supported for more than a century by the Royal Observatory, Greenwich, and the Royal Greenwich Observatory. This improved dataset should be of value in many future scientific investigations.

  20. Assessment of NASA's Physiographic and Meteorological Datasets as Input to HSPF and SWAT Hydrological Models

    NASA Technical Reports Server (NTRS)

    Alarcon, Vladimir J.; Nigro, Joseph D.; McAnally, William H.; O'Hara, Charles G.; Engman, Edwin Ted; Toll, David

    2011-01-01

    This paper documents the use of simulated Moderate Resolution Imaging Spectroradiometer land use/land cover (MODIS-LULC), NASA-LIS generated precipitation and evapo-transpiration (ET), and Shuttle Radar Topography Mission (SRTM) datasets (in conjunction with standard land use, topographical and meteorological datasets) as input to hydrological models routinely used by the watershed hydrology modeling community. The study focuses on coastal watersheds of the Mississippi Gulf Coast, although one of the test cases covers an inland watershed located in the northeastern part of the State of Mississippi, USA. The decision support tools (DSTs) into which the NASA datasets were assimilated were the Soil and Water Assessment Tool (SWAT) and the Hydrological Simulation Program-FORTRAN (HSPF). These DSTs are endorsed by several US government agencies (EPA, FEMA, USGS) for water resources management strategies. These models use physiographic and meteorological data extensively. Precipitation gages and USGS gage stations in the region were used to calibrate several HSPF and SWAT model applications. Land use and topographical datasets were swapped to assess model output sensitivities. NASA-LIS meteorological data were introduced into the calibrated model applications to simulate watershed hydrology for a time period in which no weather data were available (1997-2006). The performance of the NASA datasets in the context of hydrological modeling was assessed through comparison of measured and model-simulated hydrographs. Overall, NASA datasets were as useful as standard land use, topographical, and meteorological datasets. Moreover, NASA datasets made possible analyses that could not be performed with the standard datasets, e.g., the introduction of land-use dynamics into hydrological simulations.

  1. Optimizing tertiary storage organization and access for spatio-temporal datasets

    NASA Technical Reports Server (NTRS)

    Chen, Ling Tony; Rotem, Doron; Shoshani, Arie; Drach, Bob; Louis, Steve; Keating, Meridith

    1994-01-01

    We address in this paper data management techniques for efficiently retrieving requested subsets of large datasets stored on mass storage devices. This problem represents a major bottleneck that can negate the benefits of fast networks, because the time to access a subset from a large dataset stored on a mass storage system is much greater than the time to transmit that subset over a network. This paper focuses on very large spatial and temporal datasets generated by simulation programs in the area of climate modeling, but the techniques developed can be applied to other applications that deal with large multidimensional datasets. The main requirement we have addressed is the efficient access of subsets of information contained within much larger datasets, for the purpose of analysis and interactive visualization. We have developed data partitioning techniques that partition datasets into 'clusters' based on analysis of data access patterns and storage device characteristics. The goal is to minimize the number of clusters read from mass storage systems when subsets are requested. We emphasize in this paper proposed enhancements to current storage server protocols to permit control over the physical placement of data on storage devices. We also discuss in some detail the aspects of the interface between the application programs and the mass storage system, as well as a workbench to help scientists design the best reorganization of a dataset for anticipated access patterns.

  2. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    PubMed Central

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    SUMMARY In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. Most existing integrative analyses assume the homogeneity model, which postulates that different datasets share the same set of markers, and several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. A simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach. PMID:23938111
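
    As a reference point for the penalty named above (not necessarily the paper's exact group-level formulation), the standard univariate minimax concave penalty with tuning parameters λ and γ > 1 is:

```latex
% Standard (univariate) MCP penalty; the sparse group variant applies this form at
% both the group level and within groups.
\[
  \rho(t;\lambda,\gamma) \;=\;
  \begin{cases}
    \lambda |t| - \dfrac{t^{2}}{2\gamma}, & |t| \le \gamma\lambda,\\[6pt]
    \dfrac{\gamma\lambda^{2}}{2}, & |t| > \gamma\lambda.
  \end{cases}
\]
```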

  3. Climate Forcing Datasets for Agricultural Modeling: Merged Products for Gap-Filling and Historical Climate Series Estimation

    NASA Technical Reports Server (NTRS)

    Ruane, Alex C.; Goldberg, Richard; Chryssanthacopoulos, James

    2014-01-01

    The AgMERRA and AgCFSR climate forcing datasets provide daily, high-resolution, continuous meteorological series over the 1980-2010 period designed for applications examining the agricultural impacts of climate variability and climate change. These datasets combine daily resolution data from retrospective analyses (the Modern-Era Retrospective Analysis for Research and Applications, MERRA, and the Climate Forecast System Reanalysis, CFSR) with in situ and remotely-sensed observational datasets for temperature, precipitation, and solar radiation, leading to substantial reductions in bias in comparison to a network of 2324 agricultural-region stations from the Hadley Integrated Surface Dataset (HadISD). Results compare favorably against the original reanalyses as well as the leading climate forcing datasets (Princeton, WFD, WFD-EI, and GRASP), and AgMERRA distinguishes itself with substantially improved representation of daily precipitation distributions and extreme events owing to its use of the MERRA-Land dataset. These datasets also peg relative humidity to the maximum temperature time of day, allowing for more accurate representation of the diurnal cycle of near-surface moisture in agricultural models. AgMERRA and AgCFSR enable a number of ongoing investigations in the Agricultural Model Intercomparison and Improvement Project (AgMIP) and related research networks, and may be used to fill gaps in historical observations as well as to provide a basis for the generation of future climate scenarios.

  4. A multi-data source surveillance system to detect a bioterrorism attack during the G8 Summit in Scotland.

    PubMed

    Meyer, N; McMenamin, J; Robertson, C; Donaghy, M; Allardice, G; Cooper, D

    2008-07-01

    In 18 weeks, Health Protection Scotland (HPS) deployed a syndromic surveillance system to detect natural or intentional disease outbreaks early during the G8 Summit 2005 at Gleneagles, Scotland. The system integrated clinical and non-clinical datasets. Clinical datasets included Accident & Emergency (A&E) syndromes and General Practice (GP) codes grouped into syndromes. Non-clinical data included telephone calls to a nurse helpline, laboratory test orders, and hotel staff absenteeism. A cumulative sum-based detection algorithm and a log-linear regression model identified signals in the data. The system also had a fax-based track for real-time identification of unusual presentations. Ninety-five signals were triggered by the detection algorithms and four forms were faxed to HPS. Thirteen signals were investigated. The system successfully complemented a traditional surveillance system in identifying a small cluster of gastroenteritis among the police force and triggered interventions to prevent further cases.
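
    The sketch below shows a generic one-sided CUSUM on daily syndromic counts, in the spirit of the cumulative sum-based detection described above; it is not HPS's exact algorithm, and the baseline mean, allowance k, and threshold h are illustrative parameters.

```python
# Generic one-sided CUSUM for daily syndromic counts (illustrative, not the exact
# HPS algorithm): signal when the cumulative excess over baseline exceeds h.
def cusum_alarms(counts, baseline_mean, k=0.5, h=4.0):
    """counts: daily counts; k: allowance (reference value); h: decision threshold."""
    s, alarms = 0.0, []
    for day, x in enumerate(counts):
        s = max(0.0, s + (x - baseline_mean - k))
        if s > h:
            alarms.append(day)
            s = 0.0  # reset after signalling so a sustained rise can trigger again
    return alarms

# Example: a flat background of ~3 calls/day with a small cluster on days 10-13
daily_calls = [3, 2, 4, 3, 3, 2, 3, 4, 3, 2, 6, 7, 8, 6, 3, 2]
print(cusum_alarms(daily_calls, baseline_mean=3.0))
```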

  5. Increasing conclusiveness of clinical breath analysis by improved baseline correction of multi capillary column - ion mobility spectrometry (MCC-IMS) data.

    PubMed

    Szymańska, Ewa; Tinnevelt, Gerjen H; Brodrick, Emma; Williams, Mark; Davies, Antony N; van Manen, Henk-Jan; Buydens, Lutgarde M C

    2016-08-05

    Current challenges of clinical breath analysis include large data size and non-clinically relevant variations observed in exhaled breath measurements, which should be urgently addressed with competent scientific data tools. In this study, three different baseline correction methods are evaluated within a previously developed data size reduction strategy for multi capillary column - ion mobility spectrometry (MCC-IMS) datasets. Introduced for the first time in breath data analysis, the Top-hat method is presented as the optimum baseline correction method. A refined data size reduction strategy is employed in the analysis of a large breathomic dataset on a healthy and respiratory disease population. New insights into MCC-IMS spectra differences associated with respiratory diseases are provided, demonstrating the additional value of the refined data analysis strategy in clinical breath analysis. Copyright © 2016 Elsevier B.V. All rights reserved.
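
    To illustrate the Top-hat idea named above, the sketch below applies a morphological white top-hat to a synthetic 1-D spectrum with SciPy; the structuring-element width is an assumed tuning parameter, and the paper applies the operation to 2-D MCC-IMS spectra rather than this toy signal.

```python
# Illustrative morphological (white) top-hat baseline correction of a 1-D spectrum.
import numpy as np
from scipy.ndimage import white_tophat

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 1000)
baseline = 0.5 + 0.3 * x                      # slowly varying drift
peaks = np.exp(-0.5 * ((x - 3) / 0.05) ** 2) + 0.7 * np.exp(-0.5 * ((x - 7) / 0.08) ** 2)
spectrum = baseline + peaks + 0.01 * rng.normal(size=x.size)

# White top-hat = signal minus its morphological opening: baseline structures wider
# than the structuring element are suppressed while narrow peaks survive.
corrected = white_tophat(spectrum, size=101)
```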

  6. Re-Construction of Reference Population and Generating Weights by Decision Tree

    DTIC Science & Technology

    2017-07-21

    Claflin University (Orangeburg, SC) / Defense Equal Opportunity Management Institute report. The project dataset contains the reference population at the unit level, by group and gender, against which the sample data can be compared. Listed figures: Figure 1, Flow and Components of Project; Figure 2, Decision Tree; Figure 3, Effects of Weight.

  7. EPA GHG Certification of Medium- and Heavy-Duty Vehicles: Development of Road Grade Profiles Representative of US Controlled Access Highways

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wood, Eric; Duran, Adam; Burton, Evan

    This report includes a detailed comparison of the TomTom national road grade database with a local road grade dataset generated by Southwest Research Institute and with a national elevation dataset publicly available from the U.S. Geological Survey. This analysis concluded that the TomTom national road grade database was a suitable source of road grade data for the purposes of this study.

  8. Computer Simulation of Classic Studies in Psychology.

    ERIC Educational Resources Information Center

    Bradley, Drake R.

    This paper describes DATASIM, a comprehensive software package which generates simulated data for actual or hypothetical research designs. DATASIM is primarily intended for use in statistics and research methods courses, where it is used to generate "individualized" datasets for students to analyze, and later to correct their answers.…

  9. SU-F-I-45: An Automated Technique to Measure Image Contrast in Clinical CT Images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanders, J; Abadi, E; Meng, B

    Purpose: To develop and validate an automated technique for measuring image contrast in chest computed tomography (CT) exams. Methods: An automated computer algorithm was developed to measure the distribution of Hounsfield units (HUs) inside four major organs: the lungs, liver, aorta, and bones. These organs were first segmented or identified using computer vision and image processing techniques. Regions of interest (ROIs) were automatically placed inside the lungs, liver, and aorta, and histograms of the HUs inside the ROIs were constructed. The mean and standard deviation of each histogram were computed for each CT dataset. Comparison of the mean and standard deviation of the HUs in the different organs provides different contrast values. The ROI for the bones is simply the segmentation mask of the bones. Since the histogram for bones does not follow a Gaussian distribution, the 25th and 75th percentiles were computed instead of the mean. The sensitivity and accuracy of the algorithm were investigated by comparing the automated measurements with manual measurements. Fifteen contrast-enhanced and fifteen non-contrast-enhanced chest CT clinical datasets were examined in the validation procedure. Results: The algorithm successfully measured the histograms of the four organs in both contrast and non-contrast enhanced chest CT exams. The automated measurements were in agreement with manual measurements. The algorithm has sufficient sensitivity as indicated by the near-unity slope of the automated versus manual measurement plots. Furthermore, the algorithm has sufficient accuracy as indicated by the high coefficient of determination, R2, values ranging from 0.879 to 0.998. Conclusion: Patient-specific image contrast can be measured from clinical datasets. The algorithm can be run on both contrast-enhanced and non-enhanced clinical datasets. The method can be applied to automatically assess the contrast characteristics of clinical chest CT images and quantify dependencies that may not be captured in phantom data.
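
    A minimal sketch of the ROI histogram statistics described above is shown below, assuming organ segmentation masks are already available; the segmentation step itself, and the toy volume and masks used here, are placeholders rather than the authors' implementation.

```python
# Minimal sketch of per-organ HU histogram statistics (mean/SD for soft-tissue ROIs,
# quartiles for bone), assuming segmentation masks already exist.
import numpy as np

def roi_stats(hu_volume, masks):
    """hu_volume: 3-D array of Hounsfield units; masks: dict of boolean masks."""
    stats = {}
    for organ, mask in masks.items():
        values = hu_volume[mask]
        if organ == "bone":
            # Bone HUs are not Gaussian, so report quartiles instead of mean/SD
            stats[organ] = {"p25": np.percentile(values, 25),
                            "p75": np.percentile(values, 75)}
        else:
            stats[organ] = {"mean": values.mean(), "std": values.std()}
    return stats

# Toy volume and masks just to show the call signature
vol = np.random.normal(40, 10, size=(8, 64, 64))
masks = {"liver": vol > 45, "bone": vol > 60}
print(roi_stats(vol, masks))
```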

  10. Using Global Total Electron Content to Understand Interminimum Changes in Solar EUV Irradiance and Thermospheric Composition

    NASA Astrophysics Data System (ADS)

    McDonald, S. E.; Emmert, J. T.; Krall, J.; Mannucci, A. J.; Vergados, P.

    2017-12-01

    To understand how and why the distribution of geospace plasma in the ionosphere/plasmasphere is evolving over multi-decadal time scales in response to solar, heliospheric and atmospheric forcing, it is critically important to have long-term, stable datasets. In this study, we use a newly constructed dataset of GPS-based total electron content (TEC) developed by JPL. The JPL Global Ionosphere Mapping (GIM) algorithm was used to generate a 35-station dataset spanning two solar minimum periods (1993-2014). We also use altimeter-derived TEC measurements from TOPEX/Poseidon and Jason-1 to construct a continuous dataset for the 1995-2014 time period. The two long-term datasets are compared with each other to study interminimum changes in global TEC (during 1995-1996 and 2008-2009). We use the SAMI3 physics-based model of the ionosphere to compare simulations of 1995-2014 with the JPL TEC and TOPEX/Jason-1 datasets. To drive SAMI3, we use the Naval Research Laboratory Solar Spectral Irradiance (NRLSSI) model to specify the EUV irradiances, and NRLMSIS to specify the thermosphere. We adjust the EUV irradiances and thermospheric constituents to match the TEC datasets and draw conclusions regarding sources of the differences between the two solar minimum periods.

  11. V-FOR-WaTer - a new virtual research environment for environmental research

    NASA Astrophysics Data System (ADS)

    Strobl, Marcus; Azmi, Elnaz; Hassler, Sibylle; Mälicke, Mirko; Meyer, Jörg; Zehe, Erwin

    2017-04-01

    The preparation of heterogeneous datasets for scientific analysis is still a demanding task. Data preprocessing for hydrological models typically involves gathering datasets from different sources, extensive work within geoinformation systems, data transformation, the generation of computational grids, and the definition of initial and boundary conditions. V-FOR-WaTer, a standardized and scalable data hub with compatible analysis tools, will ease comprehensive studies and significantly reduce data preparation time. The idea behind V-FOR-WaTer is to bring together various datasets (e.g. point measurements, 2D/3D data, time series data) from different sources (e.g. gathered in research projects, or as part of regular monitoring by state offices) and to provide common as well as innovative scaling tools in space and time to generate a coherent data grid. Each dataset holds detailed standardized metadata to ensure usability of the data, offer a comprehensive search function, and provide reference information for appropriate citation of the dataset creators. V-FOR-WaTer includes a base set of data and tools, but it is designed to grow as users extend the virtual research environment with their own tools and research data. Researchers who upload new data or tools can receive a digital object identifier, or protect their data and tools from others until publication. Access to the data and tools provided by V-FOR-WaTer is through an easy-to-use web portal. Due to its modular architecture, the portal is ready to be extended with new tools and features and also offers interfaces to Matlab, Python and R.

  12. Data-driven probability concentration and sampling on manifold

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Soize, C., E-mail: christian.soize@univ-paris-est.fr; Ghanem, R., E-mail: ghanem@usc.edu

    2016-09-15

    A new methodology is proposed for generating realizations of a random vector with values in a finite-dimensional Euclidean space that are statistically consistent with a dataset of observations of this vector. The probability distribution of this random vector, while a priori not known, is presumed to be concentrated on an unknown subset of the Euclidean space. A random matrix is introduced whose columns are independent copies of the random vector and for which the number of columns is the number of data points in the dataset. The approach is based on the use of (i) the multidimensional kernel-density estimation method for estimating the probability distribution of the random matrix, (ii) a MCMC method for generating realizations for the random matrix, (iii) the diffusion-maps approach for discovering and characterizing the geometry and the structure of the dataset, and (iv) a reduced-order representation of the random matrix, which is constructed using the diffusion-maps vectors associated with the first eigenvalues of the transition matrix relative to the given dataset. The convergence aspects of the proposed methodology are analyzed and a numerical validation is explored through three applications of increasing complexity. The proposed method is found to be robust to noise levels and data complexity as well as to the intrinsic dimension of data and the size of experimental datasets. Both the methodology and the underlying mathematical framework presented in this paper contribute new capabilities and perspectives at the interface of uncertainty quantification, statistical data analysis, stochastic modeling and associated statistical inverse problems.
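
    The sketch below illustrates only the diffusion-maps step (iii): a Gaussian affinity on the dataset, row-normalization into a Markov transition matrix, and an embedding on the leading non-trivial eigenvectors. The kernel bandwidth and toy data are assumptions, and the KDE and MCMC parts of the methodology are not reproduced.

```python
# Minimal diffusion-maps sketch: Gaussian kernel -> Markov transition matrix ->
# embedding on the leading non-trivial eigenvectors. KDE/MCMC steps are not shown.
import numpy as np

def diffusion_map(X, epsilon, n_components=2):
    # Pairwise squared distances and Gaussian kernel
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / epsilon)
    # Row-normalize to a Markov transition matrix
    P = K / K.sum(axis=1, keepdims=True)
    # Eigendecomposition, eigenvalues sorted in decreasing order
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Skip the trivial constant eigenvector (eigenvalue 1)
    return vecs[:, 1:n_components + 1] * vals[1:n_components + 1]

# Toy dataset concentrated near a circle (a simple nonlinear "manifold")
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.default_rng(2).normal(size=(200, 2))
embedding = diffusion_map(X, epsilon=0.5)
```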

  13. Bi-directional gene set enrichment and canonical correlation analysis identify key diet-sensitive pathways and biomarkers of metabolic syndrome.

    PubMed

    Morine, Melissa J; McMonagle, Jolene; Toomey, Sinead; Reynolds, Clare M; Moloney, Aidan P; Gormley, Isobel C; Gaora, Peadar O; Roche, Helen M

    2010-10-07

    Currently, a number of bioinformatics methods are available to generate appropriate lists of genes from a microarray experiment. While these lists represent an accurate primary analysis of the data, fewer options exist to contextualise those lists. The development and validation of such methods is crucial to the wider application of microarray technology in the clinical setting. Two key challenges in clinical bioinformatics involve appropriate statistical modelling of dynamic transcriptomic changes, and extraction of clinically relevant meaning from very large datasets. Here, we apply an approach to gene set enrichment analysis that allows for detection of bi-directional enrichment within a gene set. Furthermore, we apply canonical correlation analysis and Fisher's exact test, using plasma marker data with known clinical relevance to aid identification of the most important gene and pathway changes in our transcriptomic dataset. After a 28-day dietary intervention with high-CLA beef, a range of plasma markers indicated a marked improvement in the metabolic health of genetically obese mice. Tissue transcriptomic profiles indicated that the effects were most dramatic in liver (1270 genes significantly changed; p < 0.05), followed by muscle (601 genes) and adipose (16 genes). Results from modified GSEA showed that the high-CLA beef diet affected diverse biological processes across the three tissues, and that the majority of pathway changes reached significance only with the bi-directional test. Combining the liver tissue microarray results with plasma marker data revealed 110 CLA-sensitive genes showing strong canonical correlation with one or more plasma markers of metabolic health, and 9 significantly overrepresented pathways among this set; each of these pathways was also significantly changed by the high-CLA diet. Closer inspection of two of these pathways--selenoamino acid metabolism and steroid biosynthesis--illustrated clear diet-sensitive changes in constituent genes, as well as strong correlations between gene expression and plasma markers of metabolic syndrome independent of the dietary effect. Bi-directional gene set enrichment analysis more accurately reflects dynamic regulatory behaviour in biochemical pathways, and as such highlighted biologically relevant changes that were not detected using a traditional approach. In such cases where transcriptomic response to treatment is exceptionally large, canonical correlation analysis in conjunction with Fisher's exact test highlights the subset of pathways showing strongest correlation with the clinical markers of interest. In this case, we have identified selenoamino acid metabolism and steroid biosynthesis as key pathways mediating the observed relationship between metabolic health and high-CLA beef. These results indicate that this type of analysis has the potential to generate novel transcriptome-based biomarkers of disease.
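
    As a hedged illustration of the canonical-correlation step described above, the sketch below correlates a block of gene-expression values with a block of plasma markers using scikit-learn and inspects the first pair of canonical variates; the random data, dimensions, and injected signal are placeholders, not the study's liver microarray or plasma panel.

```python
# Sketch of canonical correlation between gene expression and plasma markers.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_mice = 30
genes = rng.normal(size=(n_mice, 10))                  # expression of 10 candidate genes
shared = genes[:, :3] @ rng.normal(size=(3, 4))        # inject a shared signal
markers = shared + 0.5 * rng.normal(size=(n_mice, 4))  # 4 plasma markers of metabolic health

cca = CCA(n_components=2).fit(genes, markers)
U, V = cca.transform(genes, markers)
# Correlation of the first canonical variate pair summarizes the gene-marker relationship
print(np.corrcoef(U[:, 0], V[:, 0])[0, 1])
```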

  14. Bi-directional gene set enrichment and canonical correlation analysis identify key diet-sensitive pathways and biomarkers of metabolic syndrome

    PubMed Central

    2010-01-01

    Background Currently, a number of bioinformatics methods are available to generate appropriate lists of genes from a microarray experiment. While these lists represent an accurate primary analysis of the data, fewer options exist to contextualise those lists. The development and validation of such methods is crucial to the wider application of microarray technology in the clinical setting. Two key challenges in clinical bioinformatics involve appropriate statistical modelling of dynamic transcriptomic changes, and extraction of clinically relevant meaning from very large datasets. Results Here, we apply an approach to gene set enrichment analysis that allows for detection of bi-directional enrichment within a gene set. Furthermore, we apply canonical correlation analysis and Fisher's exact test, using plasma marker data with known clinical relevance to aid identification of the most important gene and pathway changes in our transcriptomic dataset. After a 28-day dietary intervention with high-CLA beef, a range of plasma markers indicated a marked improvement in the metabolic health of genetically obese mice. Tissue transcriptomic profiles indicated that the effects were most dramatic in liver (1270 genes significantly changed; p < 0.05), followed by muscle (601 genes) and adipose (16 genes). Results from modified GSEA showed that the high-CLA beef diet affected diverse biological processes across the three tissues, and that the majority of pathway changes reached significance only with the bi-directional test. Combining the liver tissue microarray results with plasma marker data revealed 110 CLA-sensitive genes showing strong canonical correlation with one or more plasma markers of metabolic health, and 9 significantly overrepresented pathways among this set; each of these pathways was also significantly changed by the high-CLA diet. Closer inspection of two of these pathways - selenoamino acid metabolism and steroid biosynthesis - illustrated clear diet-sensitive changes in constituent genes, as well as strong correlations between gene expression and plasma markers of metabolic syndrome independent of the dietary effect. Conclusion Bi-directional gene set enrichment analysis more accurately reflects dynamic regulatory behaviour in biochemical pathways, and as such highlighted biologically relevant changes that were not detected using a traditional approach. In such cases where transcriptomic response to treatment is exceptionally large, canonical correlation analysis in conjunction with Fisher's exact test highlights the subset of pathways showing strongest correlation with the clinical markers of interest. In this case, we have identified selenoamino acid metabolism and steroid biosynthesis as key pathways mediating the observed relationship between metabolic health and high-CLA beef. These results indicate that this type of analysis has the potential to generate novel transcriptome-based biomarkers of disease. PMID:20929581

  15. Web-based Toolkit for Dynamic Generation of Data Processors

    NASA Astrophysics Data System (ADS)

    Patel, J.; Dascalu, S.; Harris, F. C.; Benedict, K. K.; Gollberg, G.; Sheneman, L.

    2011-12-01

    All computation-intensive scientific research uses structured datasets, including hydrology and all other types of climate-related research. When it comes to testing their hypotheses, researchers might use the same dataset differently, and modify, transform, or convert it to meet their research needs. Currently, many researchers spend a good amount of time performing data processing and building tools to speed up this process. They might routinely repeat the same process activities for new research projects, spending precious time that otherwise could be dedicated to analyzing and interpreting the data. Numerous tools are available to run tests on prepared datasets and many of them work with datasets in different formats. However, there is still a significant need for applications that can comprehensively handle data transformation and conversion activities and help prepare the various processed datasets required by the researchers. We propose a web-based application (a software toolkit) that dynamically generates data processors capable of performing data conversions, transformations, and customizations based on user-defined mappings and selections. As a first step, the proposed solution allows the users to define various data structures; in the next step, they can select various file formats and data conversions for their datasets of interest. In a simple scenario, the core of the proposed web-based toolkit allows the users to define direct mappings between input and output data structures. The toolkit will also support defining complex mappings involving the use of pre-defined sets of mathematical, statistical, date/time, and text manipulation functions. Furthermore, the users will be allowed to define logical cases for input data filtering and sampling. At the end of the process, the toolkit is designed to generate reusable source code and executable binary files for download and use by the scientists. The application is also designed to store all data structures and mappings defined by a user (an author), and allow the original author to modify them using standard authoring techniques. The users can change or define new mappings to create new data processors for download and use. In essence, when executed, the generated data processor binary file can take an input data file in a given format and output this data, possibly transformed, in a different file format. If they so desire, the users will be able to modify the source code directly in order to define more complex mappings and transformations that are not currently supported by the toolkit. Initially aimed at supporting research in hydrology, the toolkit's functions and features can be either directly used or easily extended to other areas of climate-related research. The proposed web-based data processing toolkit will be able to generate various custom software processors for data conversion and transformation in a matter of seconds or minutes, saving a significant amount of researchers' time and allowing them to focus on core research issues.
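
    The toy sketch below illustrates the user-defined-mapping idea in miniature: a mapping table of output fields to input fields plus optional transformation functions, applied to tabular records. It mirrors the concept behind the toolkit's generated processors and is not the toolkit's actual generated code; the field names and the Fahrenheit-to-Celsius transform are made-up examples.

```python
# Toy illustration of a user-defined mapping between input and output data structures:
# each output field names an input field and an optional transform function.
import csv
import io

def make_processor(mapping):
    """mapping: {output_field: (input_field, transform_or_None)}."""
    def process(record):
        out = {}
        for out_field, (in_field, fn) in mapping.items():
            value = record[in_field]
            out[out_field] = fn(value) if fn else value
        return out
    return process

# Example: rename fields and convert a temperature column from Fahrenheit to Celsius
mapping = {
    "station_id": ("id", None),
    "temp_c": ("temp_f", lambda v: round((float(v) - 32) * 5 / 9, 2)),
}
processor = make_processor(mapping)

raw = "id,temp_f\nA1,68\nA2,95\n"
rows = [processor(r) for r in csv.DictReader(io.StringIO(raw))]
print(rows)   # [{'station_id': 'A1', 'temp_c': 20.0}, {'station_id': 'A2', 'temp_c': 35.0}]
```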

  16. Association Between a Germline OCA2 Polymorphism at Chromosome 15q13.1 and Estrogen Receptor–Negative Breast Cancer Survival

    PubMed Central

    Tyrer, Jonathan; Fasching, Peter A.; Beckmann, Matthias W.; Ekici, Arif B.; Schulz-Wendtland, Rüdiger; Bojesen, Stig E.; Nordestgaard, Børge G.; Flyger, Henrik; Milne, Roger L.; Arias, José Ignacio; Menéndez, Primitiva; Benítez, Javier; Chang-Claude, Jenny; Hein, Rebecca; Wang-Gohrke, Shan; Nevanlinna, Heli; Heikkinen, Tuomas; Aittomäki, Kristiina; Blomqvist, Carl; Margolin, Sara; Mannermaa, Arto; Kosma, Veli-Matti; Kataja, Vesa; Beesley, Jonathan; Chen, Xiaoqing; Chenevix-Trench, Georgia; Couch, Fergus J.; Olson, Janet E.; Fredericksen, Zachary S.; Wang, Xianshu; Giles, Graham G.; Severi, Gianluca; Baglietto, Laura; Southey, Melissa C.; Devilee, Peter; Tollenaar, Rob A. E. M.; Seynaeve, Caroline; García-Closas, Montserrat; Lissowska, Jolanta; Sherman, Mark E.; Bolton, Kelly L.; Hall, Per; Czene, Kamila; Cox, Angela; Brock, Ian W.; Elliott, Graeme C.; Reed, Malcolm W. R.; Greenberg, David; Anton-Culver, Hoda; Ziogas, Argyrios; Humphreys, Manjeet; Easton, Douglas F.; Caporaso, Neil E.; Pharoah, Paul D. P.

    2010-01-01

    Background Traditional prognostic factors for survival and treatment response of patients with breast cancer do not fully account for observed survival variation. We used available genotype data from a previously conducted two-stage, breast cancer susceptibility genome-wide association study (ie, Studies of Epidemiology and Risk factors in Cancer Heredity [SEARCH]) to investigate associations between variation in germline DNA and overall survival. Methods We evaluated possible associations between overall survival after a breast cancer diagnosis and 10 621 germline single-nucleotide polymorphisms (SNPs) from up to 3761 patients with invasive breast cancer (including 647 deaths and 26 978 person-years at risk) that were genotyped previously in the SEARCH study with high-density oligonucleotide microarrays (ie, hypothesis-generating set). Associations with all-cause mortality were assessed for each SNP by use of Cox regression analysis, generating a per rare allele hazard ratio (HR). To validate putative associations, we used patient genotype information that had been obtained with 5′ nuclease assay or mass spectrometry and overall survival information for up to 14 096 patients with invasive breast cancer (including 2303 deaths and 70 019 person-years at risk) from 15 international case–control studies (ie, validation set). Fixed-effects meta-analysis was used to generate an overall effect estimate in the validation dataset and in combined SEARCH and validation datasets. All statistical tests were two-sided. Results In the hypothesis-generating dataset, SNP rs4778137 (C>G) of the OCA2 gene at 15q13.1 was statistically significantly associated with overall survival among patients with estrogen receptor–negative tumors, with the rare G allele being associated with increased overall survival (HR of death per rare allele carried = 0.56, 95% confidence interval [CI] = 0.41 to 0.75, P = 9.2 × 10−5). This association was also observed in the validation dataset (HR of death per rare allele carried = 0.88, 95% CI = 0.78 to 0.99, P = .03) and in the combined dataset (HR of death per rare allele carried = 0.82, 95% CI = 0.73 to 0.92, P = 5 × 10−4). Conclusion The rare G allele of the OCA2 polymorphism, rs4778137, may be associated with improved overall survival among patients with estrogen receptor–negative breast cancer. PMID:20308648
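
    The sketch below shows a generic inverse-variance fixed-effects pooling of per-study hazard ratios, with standard errors recovered from 95% confidence intervals, as a hedged illustration of the meta-analysis step described above; the numbers are made-up placeholders, not values from the study.

```python
# Illustrative inverse-variance fixed-effects pooling of per-study hazard ratios.
import math

def fixed_effects_pool(hrs_with_ci):
    """hrs_with_ci: list of (HR, lower95, upper95) tuples from individual studies."""
    num, den = 0.0, 0.0
    for hr, lo, hi in hrs_with_ci:
        log_hr = math.log(hr)
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # SE of log HR from the CI
        w = 1.0 / se ** 2                                  # inverse-variance weight
        num += w * log_hr
        den += w
    pooled_log, pooled_se = num / den, math.sqrt(1.0 / den)
    ci = (math.exp(pooled_log - 1.96 * pooled_se), math.exp(pooled_log + 1.96 * pooled_se))
    return math.exp(pooled_log), ci

# Made-up example studies
print(fixed_effects_pool([(0.80, 0.65, 0.98), (0.90, 0.75, 1.08), (0.85, 0.70, 1.03)]))
```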

  17. Topic modeling for cluster analysis of large biological and medical datasets

    PubMed Central

    2014-01-01

    Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy and effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. PMID:25350106

  18. Topic modeling for cluster analysis of large biological and medical datasets.

    PubMed

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy and effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
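
    The sketch below illustrates the "highest probable topic assignment" idea with scikit-learn LDA on a toy text corpus; for biological data such as PFGE band patterns, the count matrix would come from binned features rather than word counts, and the corpus, topic count, and settings here are illustrative assumptions.

```python
# Minimal sketch of "highest probable topic assignment" clustering with LDA.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "tumor biopsy sequencing mutation",
    "mutation sequencing tumor panel",
    "rainfall gauge precipitation runoff",
    "runoff precipitation rainfall basin",
]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)    # rows: per-document topic proportions

clusters = np.argmax(doc_topic, axis=1)  # assign each document its most probable topic
print(clusters)
```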

  19. Diagnostic Pathology and Laboratory Medicine in the Age of “Omics”

    PubMed Central

    Finn, William G.

    2007-01-01

    Functional genomics and proteomics involve the simultaneous analysis of hundreds or thousands of expressed genes or proteins and have spawned the modern discipline of computational biology. Novel informatic applications, including sophisticated dimensionality reduction strategies and cancer outlier profile analysis, can distill clinically exploitable biomarkers from enormous experimental datasets. Diagnostic pathologists are now charged with translating the knowledge generated by the “omics” revolution into clinical practice. Food and Drug Administration-approved proprietary testing platforms based on microarray technologies already exist and will expand greatly in the coming years. However, for diagnostic pathology, the greatest promise of the “omics” age resides in the explosion in information technology (IT). IT applications allow for the digitization of histological slides, transforming them into minable data and enabling content-based searching and archiving of histological materials. IT will also allow for the optimization of existing (and often underused) clinical laboratory technologies such as flow cytometry and high-throughput core laboratory functions. The state of pathology practice does not always keep up with the pace of technological advancement. However, to use fully the potential of these emerging technologies for the benefit of patients, pathologists and clinical scientists must embrace the changes and transformational advances that will characterize this new era. PMID:17652635

  20. Evaluation of image features and classification methods for Barrett's cancer detection using VLE imaging

    NASA Astrophysics Data System (ADS)

    Klomp, Sander; van der Sommen, Fons; Swager, Anne-Fré; Zinger, Svitlana; Schoon, Erik J.; Curvers, Wouter L.; Bergman, Jacques J.; de With, Peter H. N.

    2017-03-01

    Volumetric Laser Endomicroscopy (VLE) is a promising technique for the detection of early neoplasia in Barrett's Esophagus (BE). VLE generates hundreds of high-resolution, grayscale, cross-sectional images of the esophagus. However, at present, classifying these images is a time-consuming and cumbersome effort performed by an expert using a clinical prediction model. This paper explores the feasibility of using computer vision techniques to accurately predict the presence of dysplastic tissue in VLE BE images. Our contribution is threefold. First, a benchmarking is performed for widely applied machine learning techniques and feature extraction methods. Second, three new features based on the clinical detection model are proposed, having superior classification accuracy and speed compared to earlier work. Third, we evaluate automated parameter tuning by applying simple grid search and feature selection methods. The results are evaluated on a clinically validated dataset of 30 dysplastic and 30 non-dysplastic VLE images. Optimal classification accuracy is obtained by applying a support vector machine and using our modified Haralick features and optimal image cropping, obtaining an area under the receiver operating characteristic curve of 0.95, compared to 0.81 for the clinical prediction model. Optimal execution time is achieved using the proposed mean and median features, which are extracted at least a factor of 2.5 faster than alternative features with comparable performance.
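
    The sketch below uses standard GLCM (Haralick-style) texture features from scikit-image plus an SVM and AUC, as a stand-in for the paper's modified Haralick features and cropping scheme; it assumes a recent scikit-image release (graycomatrix; older releases spell it greycomatrix), and the random images are placeholders for VLE frames, so the AUC will be near chance.

```python
# Sketch of GLCM texture features plus an SVM classifier evaluated with AUC.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def glcm_features(img):
    glcm = graycomatrix(img, distances=[1, 3], angles=[0, np.pi / 2], levels=256,
                        symmetric=True, normed=True)
    return np.hstack([graycoprops(glcm, p).ravel()
                      for p in ("contrast", "homogeneity", "energy", "correlation")])

rng = np.random.default_rng(0)
# 60 random grayscale "frames" with arbitrary labels (placeholders for VLE images)
images = rng.integers(0, 256, size=(60, 64, 64), dtype=np.uint8)
labels = np.array([0] * 30 + [1] * 30)
X = np.array([glcm_features(im) for im in images])

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                          random_state=0, stratify=labels)
clf = SVC(probability=True).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```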

  1. Generating and Visualizing Climate Indices using Google Earth Engine

    NASA Astrophysics Data System (ADS)

    Erickson, T. A.; Guentchev, G.; Rood, R. B.

    2017-12-01

    Climate change is expected to have its largest impacts at regional and local scales. Relevant and credible climate information is needed to support the planning and adaptation efforts in our communities. The volume of climate projections of temperature and precipitation is steadily increasing, as datasets are being generated on finer spatial and temporal grids with an increasing number of ensembles to characterize uncertainty. Despite advancements in tools for querying and retrieving subsets of these large, multi-dimensional datasets, ease of access remains a barrier for many existing and potential users who want to derive useful information from these data, particularly for those outside of the climate modelling research community. Climate indices that can be derived from daily temperature and precipitation data, such as the annual number of frost days or the growing season length, can provide useful information to practitioners and stakeholders. For this work, the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset was loaded into Google Earth Engine, a cloud-based geospatial processing platform. Algorithms that use the Earth Engine API to generate several climate indices were written. The indices were chosen from the set developed by the joint CCl/CLIVAR/JCOMM Expert Team on Climate Change Detection and Indices (ETCCDI). Simple user interfaces were created that allow users to query the data, produce maps and graphs of the indices, and download results for additional analyses. These browser-based interfaces could allow users in low-bandwidth environments to access climate information. This research shows that calculating climate indices from global downscaled climate projection datasets and sharing them widely using cloud computing technologies is feasible. Further development will focus on exposing the climate indices to existing applications via the Earth Engine API and building custom user interfaces for presenting climate indices to a diverse set of user groups.
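
    As a hedged illustration of computing an ETCCDI-style index with the Earth Engine Python API, the sketch below counts annual frost days (FD: days with daily minimum temperature below 0 degC). The collection ID 'NASA/NEX-GDDP', the 'tasmin' band in Kelvin, and the 'scenario' and 'model' properties are assumptions that should be checked against the current Earth Engine data catalog.

```python
# Sketch of an annual "frost days" index from daily downscaled minimum temperature.
import ee

ee.Initialize()

daily_tmin = (ee.ImageCollection('NASA/NEX-GDDP')           # assumed collection ID
              .filter(ee.Filter.eq('scenario', 'rcp85'))     # assumed property names
              .filter(ee.Filter.eq('model', 'ACCESS1-0'))
              .filterDate('2050-01-01', '2051-01-01')
              .select('tasmin'))                             # assumed band, in Kelvin

# Each image becomes a 0/1 mask of sub-freezing days; summing over the year yields FD
frost_days_2050 = daily_tmin.map(lambda img: img.lt(273.15)).sum()

# Example query: mean frost days over a sample region
region = ee.Geometry.Rectangle([-105, 39, -104, 40])
print(frost_days_2050.reduceRegion(ee.Reducer.mean(), region, 25000).getInfo())
```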

  2. miRCat2: accurate prediction of plant and animal microRNAs from next-generation sequencing datasets

    PubMed Central

    Paicu, Claudia; Mohorianu, Irina; Stocks, Matthew; Xu, Ping; Coince, Aurore; Billmeier, Martina; Dalmay, Tamas; Moulton, Vincent; Moxon, Simon

    2017-01-01

    Motivation MicroRNAs are a class of ∼21–22 nt small RNAs which are excised from a stable hairpin-like secondary structure. They have important gene regulatory functions and are involved in many pathways including developmental timing, organogenesis and development in eukaryotes. There are several computational tools for miRNA detection from next-generation sequencing datasets. However, many of these tools suffer from high false positive and false negative rates. Here we present a novel miRNA prediction algorithm, miRCat2. miRCat2 incorporates a new entropy-based approach to detect miRNA loci, which is designed to cope with the high sequencing depth of current next-generation sequencing datasets. It has a user-friendly interface and produces graphical representations of the hairpin structure and plots depicting the alignment of sequences on the secondary structure. Results We test miRCat2 on a number of animal and plant datasets and present a comparative analysis with miRCat, miRDeep2, miRPlant and miReap. We also use mutants in the miRNA biogenesis pathway to evaluate the predictions of these tools. Results indicate that miRCat2 has an improved accuracy compared with other methods tested. Moreover, miRCat2 predicts several new miRNAs that are differentially expressed in wild-type versus mutants in the miRNA biogenesis pathway. Availability and Implementation miRCat2 is part of the UEA small RNA Workbench and is freely available from http://srna-workbench.cmp.uea.ac.uk/. Contact v.moulton@uea.ac.uk or s.moxon@uea.ac.uk Supplementary information Supplementary data are available at Bioinformatics online. PMID:28407097
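
    The sketch below illustrates the intuition behind an entropy-based criterion (it is not miRCat2's actual statistic): a genuine miRNA locus concentrates read 5'-start positions at a few sites and therefore has low positional entropy, whereas a degradation locus spreads reads across the region. The example read-start lists are made-up.

```python
# Generic Shannon entropy of read 5'-start positions within a locus.
import math
from collections import Counter

def positional_entropy(read_starts):
    counts = Counter(read_starts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

mirna_like    = [100] * 950 + [101] * 40 + [122] * 10   # reads piled on a few positions
degraded_like = list(range(100, 200)) * 10              # reads spread across the locus
print(positional_entropy(mirna_like), positional_entropy(degraded_like))
```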

  3. Heterogeneous Optimization Framework: Reproducible Preprocessing of Multi-Spectral Clinical MRI for Neuro-Oncology Imaging Research.

    PubMed

    Milchenko, Mikhail; Snyder, Abraham Z; LaMontagne, Pamela; Shimony, Joshua S; Benzinger, Tammie L; Fouke, Sarah Jost; Marcus, Daniel S

    2016-07-01

    Neuroimaging research often relies on clinically acquired magnetic resonance imaging (MRI) datasets that can originate from multiple institutions. Such datasets are characterized by high heterogeneity of modalities and variability of sequence parameters. This heterogeneity complicates the automation of image processing tasks such as spatial co-registration and physiological or functional image analysis. Given this heterogeneity, conventional processing workflows developed for research purposes are not optimal for clinical data. In this work, we describe an approach called Heterogeneous Optimization Framework (HOF) for developing image analysis pipelines that can handle the high degree of clinical data non-uniformity. HOF provides a set of guidelines for configuration, algorithm development, deployment, interpretation of results and quality control for such pipelines. At each step, we illustrate the HOF approach using the implementation of an automated pipeline for Multimodal Glioma Analysis (MGA) as an example. The MGA pipeline computes tissue diffusion characteristics of diffusion tensor imaging (DTI) acquisitions, hemodynamic characteristics using a perfusion model of susceptibility contrast (DSC) MRI, and spatial cross-modal co-registration of available anatomical, physiological and derived patient images. Developing MGA within HOF enabled the processing of neuro-oncology MR imaging studies to be fully automated. MGA has been successfully used to analyze over 160 clinical tumor studies to date within several research projects. Introduction of the MGA pipeline improved image processing throughput and, most importantly, effectively produced co-registered datasets that were suitable for advanced analysis despite high heterogeneity in acquisition protocols.

  4. Learning for VMM + WTA Embedded Classifiers

    DTIC Science & Technology

    2016-03-31

    The system enables classification of acoustic signals from three sources (generator, idle car, and idle truck), with the classification structure measured on an SoC FPAA IC. The test input is composed of urban-environment signals for the three objects, and classifier results are reported for a rural truck dataset, an urban generator dataset, and an urban idle-car dataset.

  5. Clinical Value of Prognosis Gene Expression Signatures in Colorectal Cancer: A Systematic Review

    PubMed Central

    Cordero, David; Riccadonna, Samantha; Solé, Xavier; Crous-Bou, Marta; Guinó, Elisabet; Sanjuan, Xavier; Biondo, Sebastiano; Soriano, Antonio; Jurman, Giuseppe; Capella, Gabriel; Furlanello, Cesare; Moreno, Victor

    2012-01-01

    Introduction The traditional staging system is inadequate to identify those patients with stage II colorectal cancer (CRC) at high risk of recurrence or those with stage III CRC at low risk. A number of gene expression signatures to predict CRC prognosis have been proposed, but none is routinely used in the clinic. The aim of this work was to assess the prediction ability and potential clinical usefulness of these signatures in a series of independent datasets. Methods A literature review identified 31 gene expression signatures that used gene expression data to predict prognosis in CRC tissue. The search was based on the PubMed database and was restricted to papers published from January 2004 to December 2011. Eleven CRC gene expression datasets with outcome information were identified and downloaded from public repositories. A Random Forest classifier was used to build predictors from the gene lists. The Matthews correlation coefficient was chosen as a measure of classification accuracy, and its associated p-value was used to assess association with prognosis. For the evaluation of clinical usefulness, positive and negative post-test probabilities were computed in stage II and III samples. Results Five gene signatures showed significant association with prognosis and provided reasonable prediction accuracy in their own training datasets. Nevertheless, all signatures showed low reproducibility in independent data. Stratified analyses by stage or microsatellite instability status showed significant association but limited discrimination ability, especially in stage II tumors. From a clinical perspective, the most predictive signatures showed a minor but significant improvement over the classical staging system. Conclusions The published signatures show low prediction accuracy but moderate clinical usefulness. Although gene expression data may inform prognosis, better strategies for signature validation are needed to encourage their widespread use in the clinic. PMID:23145004
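
    For reference, the Matthews correlation coefficient used above as the accuracy measure is defined for a binary confusion matrix as:

```latex
% Matthews correlation coefficient for a binary confusion matrix (TP, TN, FP, FN);
% values range from -1 (total disagreement) to +1 (perfect prediction).
\[
  \mathrm{MCC} \;=\;
  \frac{TP \cdot TN \;-\; FP \cdot FN}
       {\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
\]
```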

  6. VIP: an integrated pipeline for metagenomics of virus identification and discovery

    PubMed Central

    Li, Yang; Wang, Hao; Nie, Kai; Zhang, Chen; Zhang, Yi; Wang, Ji; Niu, Peihua; Ma, Xuejun

    2016-01-01

    Identification and discovery of viruses using next-generation sequencing (NGS) technology is a fast-developing area with potential wide application in clinical diagnostics, public health monitoring and novel virus discovery. However, the tremendous volume of sequence data produced by NGS studies poses great challenges, in both accuracy and speed, for their application. Here we describe VIP (“Virus Identification Pipeline”), a one-touch computational pipeline for virus identification and discovery from metagenomic NGS data. VIP performs the following steps to achieve its goal: (i) map and filter out background-related reads, (ii) extensively classify reads on the basis of nucleotide and remote amino acid homology, and (iii) perform multiple k-mer based de novo assembly and phylogenetic analysis to provide evolutionary insight. We validated the feasibility and veracity of this pipeline with sequencing results from various types of clinical samples and public datasets. VIP has also contributed to timely virus diagnosis (~10 min) in acutely ill patients, demonstrating its potential for unbiased NGS-based clinical studies that demand a short turnaround time. VIP is released under GPLv3 and is available for free download at: https://github.com/keylabivdc/VIP. PMID:27026381
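    Step (i), removing background-related reads, amounts conceptually to subtracting the set of reads that align to a host or background reference from the input. The sketch below illustrates that idea with plain SAM/FASTQ parsing and hypothetical file names; it is not VIP's actual implementation or tooling.

```python
# Conceptual illustration of background-read filtering, not VIP's actual code.
# Assumes reads were already aligned to a host/background reference and the
# alignments written to 'host_hits.sam' (hypothetical file names).

def mapped_read_ids(sam_path):
    """Collect IDs of reads that aligned to the background reference."""
    ids = set()
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):          # skip SAM header lines
                continue
            qname, flag = line.split("\t")[:2]
            if not int(flag) & 4:             # flag bit 4 set => read unmapped
                ids.add(qname)
    return ids

def filter_fastq(in_fastq, out_fastq, background_ids):
    """Write only reads whose ID is not in the background set."""
    with open(in_fastq) as src, open(out_fastq, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]   # FASTQ = 4 lines/read
            if not record[0]:
                break
            read_id = record[0][1:].split()[0]
            if read_id not in background_ids:
                dst.writelines(record)

filter_fastq("sample.fastq", "filtered.fastq", mapped_read_ids("host_hits.sam"))
```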

  7. Isfahan MISP Dataset

    PubMed Central

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled “biosigdata.com.” It is a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users can download the datasets and can also share their own supplementary materials while retaining their own rights (citation and fees). Commenting is available for all datasets, and automatic sitemap generation and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary file (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf). PMID:28487832

  8. Hydrological research in Ethiopia

    NASA Astrophysics Data System (ADS)

    Gebremichael, M.

    2012-12-01

    Almost all major development problems in Ethiopia are water-related: food insecurity, low economic development, recurrent droughts, disastrous floods, poor health conditions, and limited energy supply. In order to develop and manage existing water resources in a sustainable manner, knowledge is required about water availability, water quality, water demand in various sectors, and the impacts of water resource projects on health and the environment. The lack of ground-based data has been a major challenge for generating this knowledge. Current advances in remote sensing and computer simulation technology could provide alternative sources of data. In this talk, I will present the challenges and opportunities in using remote sensing datasets and hydrological models in regions such as Africa where ground-based datasets are scarce.

  9. Hypergraph Based Feature Selection Technique for Medical Diagnosis.

    PubMed

    Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar

    2016-11-01

    The impact of the internet and information systems across various domains has resulted in the substantial generation of multidimensional datasets. Data mining and knowledge discovery techniques that extract the information contained in these multidimensional datasets play a significant role in exploiting their full benefit. However, the presence of a large number of features in high dimensional datasets incurs a high computational cost in terms of computing power and time. Hence, feature selection techniques are commonly used to build robust machine learning models by selecting a subset of relevant features that projects the maximal information content of the original dataset. In this paper, a novel Rough Set based K-Helly feature selection technique (RSKHT), which hybridizes Rough Set Theory (RST) and the K-Helly property of hypergraph representation, is designed to identify the optimal feature subset, or reduct, for medical diagnostic applications. Experiments carried out using medical datasets from the UCI repository demonstrate the dominance of RSKHT over other feature selection techniques with respect to reduct size, classification accuracy and time complexity. The performance of RSKHT was validated using the WEKA tool, which shows that RSKHT is computationally attractive and flexible over massive datasets.
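    The rough-set half of the idea can be conveyed with a much simpler greedy reduct search based on the dependency (positive-region) measure. The sketch below is a QuickReduct-style illustration on a tiny hypothetical symptom table; it is not the paper's RSKHT algorithm, which additionally uses the K-Helly property of a hypergraph representation.

```python
# Simplified QuickReduct-style illustration of rough-set feature selection.
# NOT the paper's RSKHT algorithm; it only shows the rough-set dependency idea
# on toy, hypothetical data.
from itertools import groupby

def dependency(rows, features, decision):
    """gamma(B) = |positive region of B| / |U| for feature subset B."""
    if not features:
        return 0.0
    key = lambda r: tuple(r[f] for f in features)
    rows = sorted(rows, key=key)
    consistent = 0
    for _, block in groupby(rows, key=key):
        block = list(block)
        if len({r[decision] for r in block}) == 1:   # block is decision-pure
            consistent += len(block)
    return consistent / len(rows)

def quick_reduct(rows, features, decision):
    reduct, best = [], 0.0
    full = dependency(rows, features, decision)
    while best < full:
        gains = {f: dependency(rows, reduct + [f], decision)
                 for f in features if f not in reduct}
        f_star = max(gains, key=gains.get)
        reduct.append(f_star)
        best = gains[f_star]
    return reduct

toy = [  # hypothetical symptom table
    {"fever": 1, "cough": 0, "age": "old",   "flu": 1},
    {"fever": 1, "cough": 1, "age": "young", "flu": 1},
    {"fever": 0, "cough": 1, "age": "old",   "flu": 0},
    {"fever": 0, "cough": 0, "age": "young", "flu": 0},
]
print(quick_reduct(toy, ["fever", "cough", "age"], "flu"))   # e.g. ['fever']
```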

  10. Dataset from chemical gas sensor array in turbulent wind tunnel.

    PubMed

    Fonollosa, Jordi; Rodríguez-Luján, Irene; Trincavelli, Marco; Huerta, Ramón

    2015-06-01

    The dataset includes the acquired time series of a chemical detection platform exposed to different gas conditions in a turbulent wind tunnel. The chemo-sensory elements sampled the environment directly. In contrast to traditional approaches that include measurement chambers, open sampling systems are sensitive to the dispersion mechanisms of gaseous chemical analytes, namely diffusion, turbulence, and advection, making the identification and monitoring of chemical substances more challenging. The sensing platform included 72 metal-oxide gas sensors that were positioned at 6 different locations in the wind tunnel. At each location, 10 distinct chemical gases were released in the wind tunnel, the sensors were evaluated at 5 different operating temperatures, and 3 different wind speeds were generated to induce different levels of turbulence. Moreover, each configuration was repeated 20 times, yielding a dataset of 18,000 measurements collected over a period of 16 months. The data are related to "On the performance of gas sensor arrays in open sampling systems using Inhibitory Support Vector Machines", by Vergara et al. [1]. The dataset can be accessed publicly at the UCI repository upon citation of [1]: http://archive.ics.uci.edu/ml/datasets/Gas+sensor+arrays+in+open+sampling+settings.
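    The description implies a full factorial design; the small sketch below (with hypothetical level labels; the authoritative metadata is in the UCI repository entry) enumerates that grid and reproduces the stated total of 18,000 measurements.

```python
# Sketch of the full factorial design described above (labels are hypothetical;
# the authoritative metadata is in the UCI repository entry).
from itertools import product

locations    = range(1, 7)    # 6 positions along the wind tunnel
gases        = range(1, 11)   # 10 distinct chemical analytes
temperatures = range(1, 6)    # 5 sensor operating temperatures
wind_speeds  = range(1, 4)    # 3 fan settings (levels of turbulence)
repetitions  = range(1, 21)   # 20 repeats per configuration

trials = list(product(locations, gases, temperatures, wind_speeds, repetitions))
print(len(trials))            # 18000, matching the dataset description
# Each trial corresponds to one 72-channel time series (72 metal-oxide sensors).
```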

  11. Research resource: Tissue-specific transcriptomics and cistromics of nuclear receptor signaling: a web research resource.

    PubMed

    Ochsner, Scott A; Watkins, Christopher M; LaGrone, Benjamin S; Steffen, David L; McKenna, Neil J

    2010-10-01

    Nuclear receptors (NRs) are ligand-regulated transcription factors that recruit coregulators and other transcription factors to gene promoters to effect regulation of tissue-specific transcriptomes. The prodigious rate at which the NR signaling field has generated high-content gene expression and, more recently, genome-wide location analysis datasets has not been matched by a committed effort to archive this information for routine access by bench and clinical scientists. As a first step towards this goal, we searched the MEDLINE database for studies that referenced expression microarray and/or genome-wide location analysis datasets in which an NR or NR ligand was an experimental variable. A total of 1122 studies encompassing 325 unique organs, tissues, primary cells, and cell lines, 35 NRs, and 91 NR ligands were retrieved and annotated. The data were incorporated into a new section of the Nuclear Receptor Signaling Atlas Molecule Pages, Transcriptomics and Cistromics, for which we designed an intuitive, freely accessible user interface to browse the studies. Each study links to an abstract, the MEDLINE record, and, where available, Gene Expression Omnibus and ArrayExpress records. The resource will be updated on a regular basis to provide a current and comprehensive entrée into the sum of transcriptomic and cistromic research in this field.

  12. Gene selection for microarray data classification via subspace learning and manifold regularization.

    PubMed

    Tang, Chang; Cao, Lijuan; Zheng, Xiao; Wang, Minhui

    2017-12-19

    With the rapid development of DNA microarray technology, large amounts of genomic data have been generated. Classification of these microarray data is a challenging task since gene expression data often comprise thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data, with irrelevant and redundant genes removed. Compared with the original data, the selected gene subset can benefit the classification task. We formulate gene selection as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high dimensional microarray data into a lower dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of the original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator for the different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better than other state-of-the-art methods in terms of microarray data classification.
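    A generic way to realize this kind of objective, and not necessarily the authors' exact formulation, is to minimize a self-representation loss plus a graph-Laplacian term and an l2,1 row-sparsity penalty, solving for the projection matrix with an iteratively reweighted update and ranking genes by the row norms of the result. A compact numpy sketch on toy data:

```python
# Generic sketch of manifold-regularized subspace learning for gene ranking.
# This follows a common formulation (self-representation + graph Laplacian +
# l2,1 row sparsity) and is NOT claimed to be the paper's exact algorithm;
# alpha, beta, k and the toy data are hypothetical.
import numpy as np

def knn_laplacian(X, k=5):
    """Unnormalized graph Laplacian of a symmetrized k-NN sample graph."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    S = np.zeros_like(d2)
    for i, idx in enumerate(np.argsort(d2, axis=1)[:, :k]):
        S[i, idx] = 1.0
    S = np.maximum(S, S.T)                     # symmetrize the adjacency
    return np.diag(S.sum(1)) - S

def rank_genes(X, alpha=1.0, beta=1.0, iters=30):
    """Return gene indices ordered by the row norms of the learned projection W."""
    n, d = X.shape
    L = knn_laplacian(X)
    A = X.T @ X + alpha * X.T @ L @ X          # fixed part of the linear system
    W = np.eye(d)
    for _ in range(iters):                     # iteratively reweighted l2,1 step
        D = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + 1e-8))
        W = np.linalg.solve(A + beta * D, X.T @ X)
    return np.argsort(-np.linalg.norm(W, axis=1))

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 200))             # 40 samples, 200 "genes" (toy)
print(rank_genes(X)[:10])                      # indices of top-ranked genes
```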

  13. Targeted metabolomics and medication classification data from participants in the ADNI1 cohort.

    PubMed

    St John-Williams, Lisa; Blach, Colette; Toledo, Jon B; Rotroff, Daniel M; Kim, Sungeun; Klavins, Kristaps; Baillie, Rebecca; Han, Xianlin; Mahmoudiandehkordi, Siamak; Jack, John; Massaro, Tyler J; Lucas, Joseph E; Louie, Gregory; Motsinger-Reif, Alison A; Risacher, Shannon L; Saykin, Andrew J; Kastenmüller, Gabi; Arnold, Matthias; Koal, Therese; Moseley, M Arthur; Mangravite, Lara M; Peters, Mette A; Tenenbaum, Jessica D; Thompson, J Will; Kaddurah-Daouk, Rima

    2017-10-17

    Alzheimer's disease (AD) is the most common neurodegenerative disease, presenting major health and economic challenges that continue to grow. Mechanisms of disease are poorly understood, but significant data point to metabolic defects that might contribute to disease pathogenesis. The Alzheimer Disease Metabolomics Consortium (ADMC), in partnership with the Alzheimer Disease Neuroimaging Initiative (ADNI), is creating a comprehensive biochemical database for AD. Using targeted and non-targeted metabolomics and lipidomics platforms, we are mapping metabolic pathway and network failures across the trajectory of disease. In this report we present quantitative metabolomics data generated on serum from 199 control, 356 mild cognitive impairment and 175 AD subjects enrolled in ADNI1 using the AbsoluteIDQ-p180 platform, along with the pipeline for data preprocessing and medication classification for confound correction. The dataset presented here is the first of eight metabolomics datasets being generated for broad biochemical investigation of the AD metabolome. We expect that these collective metabolomics datasets will provide valuable resources for researchers to identify novel molecular mechanisms contributing to AD pathogenesis and disease phenotypes.
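    Medication confound correction is commonly implemented by residualizing each metabolite on binary medication-class indicators. The sketch below illustrates that generic idea with hypothetical column names and toy data; it is not the ADMC/ADNI preprocessing pipeline.

```python
# Generic illustration of medication-confound adjustment by residualization.
# Column names and data are hypothetical; this is not the ADMC/ADNI pipeline.
import numpy as np
import pandas as pd

def residualize(metabolites: pd.DataFrame, med_flags: pd.DataFrame) -> pd.DataFrame:
    """Regress each metabolite on binary medication-class indicators and return
    the residuals (metabolite levels with medication effects removed)."""
    X = np.column_stack([np.ones(len(med_flags)), med_flags.to_numpy(float)])
    adjusted = {}
    for col in metabolites.columns:
        y = metabolites[col].to_numpy(float)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        adjusted[col] = y - X @ beta + y.mean()      # re-center to original scale
    return pd.DataFrame(adjusted, index=metabolites.index)

# Toy example with hypothetical analyte and medication-class names
rng = np.random.default_rng(1)
mets = pd.DataFrame(rng.normal(size=(10, 3)), columns=["C3", "PC_aa_C36_2", "Serotonin"])
meds = pd.DataFrame(rng.integers(0, 2, size=(10, 2)), columns=["statin", "metformin"])
print(residualize(mets, meds).head())
```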

  14. Sub-sampling genetic data to estimate black bear population size: A case study

    USGS Publications Warehouse

    Tredick, C.A.; Vaughan, M.R.; Stauffer, D.F.; Simek, S.L.; Eason, T.

    2007-01-01

    Costs for genetic analysis of hair samples collected for individual identification of bears average approximately US$50 [2004] per sample. This can easily exceed budgetary allowances for large-scale studies or studies of high-density bear populations. We used 2 genetic datasets from 2 areas in the southeastern United States to explore how reducing costs of analysis by sub-sampling affected precision and accuracy of resulting population estimates. We used several sub-sampling scenarios to create subsets of the full datasets and compared summary statistics, population estimates, and precision of estimates generated from these subsets to estimates generated from the complete datasets. Our results suggested that bias and precision of estimates improved as the proportion of total samples used increased, and heterogeneity models (e.g., Mh[CHAO]) were more robust to reduced sample sizes than other models (e.g., behavior models). We recommend that only high-quality samples (>5 hair follicles) be used when budgets are constrained, and efforts should be made to maximize capture and recapture rates in the field.
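    The heterogeneity model referred to as Mh[CHAO] rests on capture-frequency counts, specifically the numbers of individuals detected exactly once and exactly twice. The sketch below shows the bias-corrected Chao (1987) form on hypothetical capture histories; the study itself would rely on dedicated mark-recapture software.

```python
# Illustrative sketch of the Chao (1987) heterogeneity estimator underlying
# models such as Mh[CHAO]; capture histories here are hypothetical.
from collections import Counter

def chao_mh(capture_counts):
    """capture_counts: number of times each detected individual was captured.
    Returns the bias-corrected Chao estimate of population size."""
    freq = Counter(capture_counts)          # f_k = # individuals captured k times
    s_obs = len(capture_counts)             # individuals actually detected
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    return s_obs + (f1 * (f1 - 1)) / (2 * (f2 + 1))

# 12 bears detected via hair samples: 7 captured once, 3 twice, 2 three times
captures = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
print(chao_mh(captures))                    # estimate > 12, reflecting missed bears
```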

  15. Targeted metabolomics and medication classification data from participants in the ADNI1 cohort

    PubMed Central

    St John-Williams, Lisa; Blach, Colette; Toledo, Jon B.; Rotroff, Daniel M.; Kim, Sungeun; Klavins, Kristaps; Baillie, Rebecca; Han, Xianlin; Mahmoudiandehkordi, Siamak; Jack, John; Massaro, Tyler J.; Lucas, Joseph E.; Louie, Gregory; Motsinger-Reif, Alison A.; Risacher, Shannon L.; Saykin, Andrew J.; Kastenmüller, Gabi; Arnold, Matthias; Koal, Therese; Moseley, M. Arthur; Mangravite, Lara M.; Peters, Mette A.; Tenenbaum, Jessica D.; Thompson, J. Will; Kaddurah-Daouk, Rima

    2017-01-01

    Alzheimer’s disease (AD) is the most common neurodegenerative disease, presenting major health and economic challenges that continue to grow. Mechanisms of disease are poorly understood, but significant data point to metabolic defects that might contribute to disease pathogenesis. The Alzheimer Disease Metabolomics Consortium (ADMC), in partnership with the Alzheimer Disease Neuroimaging Initiative (ADNI), is creating a comprehensive biochemical database for AD. Using targeted and non-targeted metabolomics and lipidomics platforms, we are mapping metabolic pathway and network failures across the trajectory of disease. In this report we present quantitative metabolomics data generated on serum from 199 control, 356 mild cognitive impairment and 175 AD subjects enrolled in ADNI1 using the AbsoluteIDQ-p180 platform, along with the pipeline for data preprocessing and medication classification for confound correction. The dataset presented here is the first of eight metabolomics datasets being generated for broad biochemical investigation of the AD metabolome. We expect that these collective metabolomics datasets will provide valuable resources for researchers to identify novel molecular mechanisms contributing to AD pathogenesis and disease phenotypes. PMID:29039849

  16. Glycan array data management at Consortium for Functional Glycomics.

    PubMed

    Venkataraman, Maha; Sasisekharan, Ram; Raman, Rahul

    2015-01-01

    Glycomics, the study of structure-function relationships of complex glycans, has reshaped post-genomics biology. Glycans mediate fundamental biological functions via their specific interactions with a variety of proteins. Recognizing the importance of glycomics, large-scale research initiatives such as the Consortium for Functional Glycomics (CFG) were established to address the challenges of the field. Over the past decade, the CFG has generated novel reagents and technologies for glycomics analyses, which in turn have led to the generation of diverse datasets. These datasets have contributed to understanding glycan diversity and structure-function relationships at the molecular (glycan-protein interactions), cellular (gene expression and glycan analysis), and whole organism (mouse phenotyping) levels. Among these analyses and datasets, screening of glycan-protein interactions on glycan array platforms has gained particular prominence and has contributed to cross-disciplinary recognition of the importance of glycomics in areas such as immunology, infectious diseases and cancer biomarkers. This manuscript outlines methodologies for capturing data from glycan array experiments and online tools to access and visualize glycan array data implemented at the CFG.

  17. Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.

    PubMed

    Kohli, Marc D; Summers, Ronald M; Geis, J Raymond

    2017-08-01

    At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning (ML) identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, as well as a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.

  18. Prediction of brain tissue temperature using near-infrared spectroscopy.

    PubMed

    Holper, Lisa; Mitra, Subhabrata; Bale, Gemma; Robertson, Nicola; Tachtsidis, Ilias

    2017-04-01

    Broadband near-infrared spectroscopy (NIRS) can provide an endogenous indicator of tissue temperature based on the temperature dependence of the water absorption spectrum. We describe a first evaluation of the calibration and prediction of brain tissue temperature, obtained during hypothermia in newborn piglets (animal dataset) and rewarming in newborn infants (human dataset), based on measured body (rectal) temperature. Calibration using partial least squares regression proved to be a reliable method to predict brain tissue temperature with respect to core body temperature in the wavelength interval of 720 to 880 nm, with a strong mean predictive power of [Formula: see text] (animal dataset) and [Formula: see text] (human dataset). In addition, we applied regression receiver operating characteristic curves for the first time to evaluate the temperature prediction, which provided an overall mean error bias between NIRS-predicted brain temperature and body temperature of [Formula: see text] (animal dataset) and [Formula: see text] (human dataset). We discuss the main methodological aspects, particularly the well-known issue of over- versus underestimation between brain and body temperature, which is relevant for potential clinical applications.
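    The calibration step, partial least squares regression of broadband spectra on a reference temperature, can be sketched with scikit-learn. The example below uses synthetic spectra standing in for 720-880 nm attenuation data and is not the study's code.

```python
# Minimal PLS calibration sketch with synthetic spectra; not the study's code.
# Real inputs would be broadband NIRS attenuation spectra (720-880 nm) paired
# with core body (rectal) temperature measurements.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
wavelengths = np.arange(720, 881)                 # 1 nm steps, 720-880 nm
n_samples = 120
temperature = rng.uniform(32.0, 38.5, n_samples)  # simulated core temperatures

# Synthetic spectra: a temperature-dependent absorption-like feature near 740 nm
# plus noise (purely illustrative; not a physical water-absorption model).
peak = np.exp(-((wavelengths - 740) ** 2) / (2 * 8.0 ** 2))
spectra = (temperature[:, None] - 35.0) * peak + \
          rng.normal(0, 0.05, (n_samples, wavelengths.size))

X_tr, X_te, y_tr, y_te = train_test_split(spectra, temperature, random_state=0)
pls = PLSRegression(n_components=3).fit(X_tr, y_tr)
print("R^2 on held-out spectra:", round(pls.score(X_te, y_te), 3))
print("Predicted temperatures:", pls.predict(X_te[:3]).ravel())
```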

  19. Integration of Neuroimaging and Microarray Datasets through Mapping and Model-Theoretic Semantic Decomposition of Unstructured Phenotypes

    PubMed Central

    Pantazatos, Spiro P.; Li, Jianrong; Pavlidis, Paul; Lussier, Yves A.

    2009-01-01

    An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n = 50), and precision of the semantic mapping between these terms across datasets was 98% (n = 100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets. PMID:20495688

  20. Measurement Properties of the NIH-Minimal Dataset Dutch Language Version in Patients With Chronic Low Back Pain.

    PubMed

    Boer, Annemarie; Dutmer, Alisa L; Schiphorst Preuper, Henrica R; van der Woude, Lucas H V; Stewart, Roy E; Deyo, Richard A; Reneman, Michiel F; Soer, Remko

    2017-10-01

    Validation study with cross-sectional and longitudinal measurements. The objectives were to translate the US National Institutes of Health (NIH) minimal dataset for clinical research on chronic low back pain into the Dutch language and to test its validity and reliability among people with chronic low back pain. The NIH developed a minimal dataset to encourage more complete and consistent reporting of clinical research and to enable comparison of studies across countries in patients with low back pain. In the Netherlands, the NIH minimal dataset had not been translated before and its measurement properties were unknown. Cross-cultural validity was tested by a formal forward-backward translation. Structural validity was tested with exploratory factor analyses (comparative fit index, Tucker-Lewis index, and root mean square error of approximation). Hypothesis testing was performed to compare subscales of the NIH dataset with the Pain Disability Index and the EuroQol-5D (Pearson correlation coefficients). Internal consistency was tested with Cronbach α, and test-retest reliability at 2 weeks was calculated in a subsample of patients with intraclass correlation coefficients and weighted kappa (κω). In total, 452 patients were included, of which 52 were included in the test-retest study. Factor analysis for structural validity pointed toward a seven-factor model (Cronbach α = 0.78). Factors and the total score of the NIH minimal dataset showed fair to good correlations with the Pain Disability Index (r = 0.43-0.70) and EuroQol-5D (r = -0.41 to -0.64). Test-retest reliability per item showed substantial agreement (κω = 0.65), and test-retest reliability per factor was moderate to good (intraclass correlation coefficient = 0.71). The measurement properties of the Dutch language version of the NIH minimal dataset were satisfactory.
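    Two of the reliability statistics reported here, Cronbach's alpha and the weighted kappa, are easy to compute directly. The sketch below uses hypothetical item scores and ratings, not the study data.

```python
# Illustrative computations of two reliability statistics on hypothetical
# scores; the study itself reports Cronbach's alpha, ICC and weighted kappa.
import numpy as np

def cronbach_alpha(items):
    """items: (n_subjects, n_items) array of item scores."""
    items = np.asarray(items, float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def weighted_kappa(r1, r2, n_cat):
    """Quadratic-weighted kappa between two ratings on categories 0..n_cat-1."""
    obs = np.zeros((n_cat, n_cat))
    for a, b in zip(r1, r2):
        obs[a, b] += 1
    obs /= obs.sum()
    exp = np.outer(obs.sum(1), obs.sum(0))        # chance agreement from marginals
    w = np.array([[(i - j) ** 2 for j in range(n_cat)] for i in range(n_cat)])
    w = w / (n_cat - 1) ** 2
    return 1 - (w * obs).sum() / (w * exp).sum()

scores = np.array([[3, 4, 3, 5], [2, 2, 3, 2], [4, 5, 4, 4], [1, 2, 1, 2], [3, 3, 4, 3]])
print("Cronbach's alpha:", round(cronbach_alpha(scores), 2))
test   = [0, 1, 2, 2, 3, 1, 0, 2]
retest = [0, 1, 2, 3, 3, 1, 1, 2]
print("Weighted kappa :", round(weighted_kappa(test, retest, n_cat=4), 2))
```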
