Sample records for xiii common dataset

  1. Racial and genetic determinants of plasma factor XIII activity.

    PubMed

    Saha, N; Aston, C E; Low, P S; Kamboh, M I

    2000-12-01

    Factor XIII (F XIII), a plasma transglutaminase, is essential for normal hemostasis and fibrinolysis. Plasma F XIII consists of two catalytic A (F XIIIA) and two non-catalytic B (F XIIIB) subunits. Activated F XIII is involved in the formation of the fibrin gel by covalently crosslinking fibrin monomers. As the characteristics of the fibrin gel structure have been shown to be associated with the risk of coronary heart disease (CHD), F XIII activity may play a seminal role in its etiology. In this investigation, we determined plasma F XIII activity in two racial groups, Asian Indians (n = 258) and Chinese (n = 385). Adjusted plasma F XIII activity was significantly higher in Indian men (142 vs. 110%; P<0.0001) and women (158 vs. 111%; P<0.0001) than in their Chinese counterparts. In contrast to Indians, in whom the distribution of F XIII activity was almost normal, in Chinese it was skewed towards low activity. In both racial groups, bivariate and multivariate analyses showed strong correlation of F XIII activity with plasma fibrinogen and plasminogen levels. Race explained about 25% of the variation in F XIII activity even after adjustment for significant correlates. We also determined the contribution of common genetic polymorphisms in the F XIIIA and F XIIIB genes to plasma F XIII activity. Both loci showed significant and independent effects on plasma F XIII activity in Indians (F XIIIA, P<0.01; F XIIIB, P<0.05) and Chinese (F XIIIA, P<0.0001; F XIIIB, P<0.13) in a gene-dosage fashion. This study shows that both racial and genetic components play a significant role in determining plasma F XIII activity, and consequently may affect the quantitative risk of CHD. Copyright 2000 Wiley-Liss, Inc.

  2. Identification of eight novel coagulation factor XIII subunit A mutations: implied consequences for structure and function

    PubMed Central

    Ivaskevicius, Vytautas; Biswas, Arijit; Bevans, Carville; Schroeder, Verena; Kohler, Hans Peter; Rott, Hannelore; Halimeh, Susan; Petrides, Petro E.; Lenk, Harald; Krause, Manuele; Miterski, Bruno; Harbrecht, Ursula; Oldenburg, Johannes

    2010-01-01

    Background: Severe hereditary coagulation factor XIII deficiency is a rare homozygous bleeding disorder affecting one person in every two million individuals. In contrast, heterozygous factor XIII deficiency is more common, but usually not associated with severe hemorrhage such as intracranial bleeding or hemarthrosis. In most cases, the disease is caused by F13A gene mutations. Causative mutations associated with the F13B gene are rarer. Design and Methods: We analyzed ten index patients and three relatives for factor XIII activity using a photometric assay and sequenced their F13A and F13B genes. Additionally, structural analysis of the wild-type protein structure from a previously reported X-ray crystallographic model identified potential structural and functional effects of the missense mutations. Results: All individuals except one were heterozygous for factor XIIIA mutations (average factor XIII activity 51%), while the remaining homozygous individual was found to have severe factor XIII deficiency (<5% of normal factor XIII activity). Eight of the 12 heterozygous patients exhibited a bleeding tendency upon provocation. Conclusions: The identified missense (Pro289Arg, Arg611His, Asp668Gly) and nonsense (Gly390X, Trp664X) mutations are causative for factor XIII deficiency. A Gly592Ser variant identified in three unrelated index patients, as well as in 200 healthy controls (minor allele frequency 0.005), and two further Tyr167Cys and Arg540Gln variants, represent possible candidates for rare F13A gene polymorphisms since they apparently do not have a significant influence on the structure of the factor XIIIA protein. Future in vitro expression studies of the factor XIII mutations are required to confirm their pathological mechanisms. PMID:20179087

  3. Identification of eight novel coagulation factor XIII subunit A mutations: implied consequences for structure and function.

    PubMed

    Ivaskevicius, Vytautas; Biswas, Arijit; Bevans, Carville; Schroeder, Verena; Kohler, Hans Peter; Rott, Hannelore; Halimeh, Susan; Petrides, Petro E; Lenk, Harald; Krause, Manuele; Miterski, Bruno; Harbrecht, Ursula; Oldenburg, Johannes

    2010-06-01

    Severe hereditary coagulation factor XIII deficiency is a rare homozygous bleeding disorder affecting one person in every two million individuals. In contrast, heterozygous factor XIII deficiency is more common, but usually not associated with severe hemorrhage such as intracranial bleeding or hemarthrosis. In most cases, the disease is caused by F13A gene mutations. Causative mutations associated with the F13B gene are rarer. We analyzed ten index patients and three relatives for factor XIII activity using a photometric assay and sequenced their F13A and F13B genes. Additionally, structural analysis of the wild-type protein structure from a previously reported X-ray crystallographic model identified potential structural and functional effects of the missense mutations. All individuals except one were heterozygous for factor XIIIA mutations (average factor XIII activity 51%), while the remaining homozygous individual was found to have severe factor XIII deficiency (<5% of normal factor XIII activity). Eight of the 12 heterozygous patients exhibited a bleeding tendency upon provocation. The identified missense (Pro289Arg, Arg611His, Asp668Gly) and nonsense (Gly390X, Trp664X) mutations are causative for factor XIII deficiency. A Gly592Ser variant identified in three unrelated index patients, as well as in 200 healthy controls (minor allele frequency 0.005), and two further Tyr167Cys and Arg540Gln variants, represent possible candidates for rare F13A gene polymorphisms since they apparently do not have a significant influence on the structure of the factor XIIIA protein. Future in vitro expression studies of the factor XIII mutations are required to confirm their pathological mechanisms.

  4. Muscle-derived collagen XIII regulates maturation of the skeletal neuromuscular junction.

    PubMed

    Latvanlehto, Anne; Fox, Michael A; Sormunen, Raija; Tu, Hongmin; Oikarainen, Tuomo; Koski, Anu; Naumenko, Nikolay; Shakirzyanova, Anastasia; Kallio, Mika; Ilves, Mika; Giniatullin, Rashid; Sanes, Joshua R; Pihlajaniemi, Taina

    2010-09-15

    Formation, maturation, stabilization, and functional efficacy of the neuromuscular junction (NMJ) are orchestrated by transsynaptic and autocrine signals embedded within the synaptic cleft. Here, we demonstrate that collagen XIII, a nonfibrillar transmembrane collagen, is another such signal. We show that collagen XIII is expressed by muscle and its ectodomain can be proteolytically shed into the extracellular matrix. The collagen XIII protein was found present in the postsynaptic membrane and synaptic basement membrane. To identify a role for collagen XIII at the NMJ, mice were generated lacking this collagen. Morphological and ultrastructural analysis of the NMJ revealed incomplete adhesion of presynaptic and postsynaptic specializations in collagen XIII-deficient mice of both genders. Strikingly, Schwann cells erroneously enwrapped nerve terminals and invaginated into the synaptic cleft, resulting in a decreased contact surface for neurotransmission. Consistent with morphological findings, electrophysiological studies indicated both postsynaptic and presynaptic defects in Col13a1(-/-) mice, such as decreased amplitude of postsynaptic potentials, diminished probabilities of spontaneous release and reduced readily releasable neurotransmitter pool. To identify the role of collagen XIII at the NMJ, shed ectodomain of collagen XIII was applied to cultured myotubes, and it was found to advance acetylcholine receptor (AChR) cluster maturation. Together with the delay in AChR cluster development observed in collagen XIII-deficient mutants in vivo, these results suggest that collagen XIII plays an autocrine role in postsynaptic maturation of the NMJ. Altogether, the results presented here reveal that collagen XIII is a novel muscle-derived cue necessary for the maturation and function of the vertebrate NMJ.

  5. [Factor XIII-guided treatment algorithm reduces blood transfusion in burn surgery].

    PubMed

    Carneiro, João Miguel Gonçalves Valadares de Morais; Alves, Joana; Conde, Patrícia; Xambre, Fátima; Almeida, Emanuel; Marques, Céline; Luís, Mariana; Godinho, Ana Maria Mano Garção; Fernandez-Llimos, Fernando

    Major burn surgery causes large hemorrhage and coagulation dysfunction. Treatment algorithms guided by ROTEM® and factor VIIa reduce the need for blood products, but there is no evidence regarding factor XIII. Factor XIII deficiency changes clot stability and decreases wound healing. This study evaluates the efficacy and safety of factor XIII correction and its repercussion on transfusion requirements in burn surgery. Randomized retrospective study with 40 patients undergoing surgery at the Burn Unit, allocated into Group A, those with factor XIII assessment (n = 20), and Group B, those without assessment (n = 20). Erythrocyte transfusion was guided by a hemoglobin trigger of 10 g.dL⁻¹ and the other blood products by routine coagulation and ROTEM® tests. Analysis of blood product consumption included units of erythrocytes, fresh frozen plasma, platelets, and fibrinogen. The coagulation biomarker analysis compared the pre- and post-operative values. Group A (with factor XIII study) and Group B had identical total body surface area burned. All patients in Group A had a preoperative factor XIII deficiency, whose correction significantly reduced erythrocyte concentrate transfusion (1.95 vs. 4.05 units, p = 0.001). Pre- and post-operative coagulation biomarkers were similar between groups, revealing that routine coagulation tests did not identify factor XIII deficiency. There were no recorded thromboembolic events. Correction of factor XIII deficiency in burn surgery proved to be safe and effective for reducing perioperative transfusion of erythrocyte units. Copyright © 2017 Sociedade Brasileira de Anestesiologia. Published by Elsevier Editora Ltda. All rights reserved.

  6. 21 CFR 866.5330 - Factor XIII, A, S, immunological test system.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... SERVICES (CONTINUED) MEDICAL DEVICES IMMUNOLOGY AND MICROBIOLOGY DEVICES Immunological Test Systems § 866.5330 Factor XIII, A, S, immunological test system. (a) Identification. A factor XIII, A, S... 21 Food and Drugs 8 2010-04-01 2010-04-01 false Factor XIII, A, S, immunological test system. 866...

  7. [Factor XIII deficiency in burns].

    PubMed

    Burkhardt, H; Zellner, P R; Möller, I

    1977-08-01

    In 34 patients with severe burn injuries, platelets, fibrinogen, prothrombin time, partial thromboplastin time, thrombin time and factor XIII were measured daily. Half of the patients were administered 15,000 IU of heparin per 24 hours. In the first 4 days there was a rapid fall of factor XIII to a value of approximately 30%. Values remained very low during the whole observation period of up to 20 days; however, in patients treated with heparin, values tended to be 10-15% higher. After an initial decline, platelet counts had risen to the lower limit of normal by the tenth day. Platelet counts were identical in both groups. The causes of the changes in these haemostasis parameters, their significance, and possible therapeutic consequences are discussed.

  8. Genetics Home Reference: factor XIII deficiency

    MedlinePlus

    ... XIII deficiency tend to have heavy or prolonged menstrual bleeding (menorrhagia) and may experience recurrent pregnancy losses ( ... inheritance, which means that it results when both copies of either the F13A1 gene or the F13B ...

  9. The Most Common Geometric and Semantic Errors in CityGML Datasets

    NASA Astrophysics Data System (ADS)

    Biljecki, F.; Ledoux, H.; Du, X.; Stoter, J.; Soon, K. H.; Khoo, V. H. S.

    2016-10-01

    To be used as input in most simulation and modelling software, 3D city models should be geometrically and topologically valid, and semantically rich. In this paper we investigate the quality of currently available CityGML datasets: we validate the geometry/topology of the 3D primitives (Solid and MultiSurface), and we validate whether the semantics of the boundary surfaces of buildings are correct. We have analysed all the CityGML datasets we could find, both from portals of cities and on different websites, plus a few that were made available to us. We have thus validated 40M surfaces in 16M 3D primitives and 3.6M buildings found in 37 CityGML datasets originating from 9 countries, and produced by several companies with diverse software and acquisition techniques. The results indicate that CityGML datasets without errors are rare, and those that are nearly valid are mostly simple LOD1 models. We report on the most common errors we have found, and analyse them. One main observation is that many of these errors could be automatically fixed or prevented with simple modifications to the modelling software. Our principal aim is to highlight the most common errors so that these are not repeated in the future. We hope that our paper and the open-source software we have developed will help raise awareness of data quality among data providers and 3D GIS software producers.
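
    A quick way to get a feel for such a validation task is to inventory the primitives and semantic surfaces in a file before running a full geometric validator. The sketch below is a minimal, hypothetical example (the input file name and the use of CityGML 2.0 namespaces are assumptions, and it performs no geometric validity checks itself):

```python
# Hypothetical sketch: count the 3D primitives and semantic boundary surfaces
# in a CityGML 2.0 file. This only inventories elements; actual geometric
# validation (planarity, 2-manifoldness, etc.) needs a dedicated validator.
import xml.etree.ElementTree as ET

NS = {
    "gml": "http://www.opengis.net/gml",
    "bldg": "http://www.opengis.net/citygml/building/2.0",
}

root = ET.parse("city_model.gml").getroot()   # assumed input file

counts = {
    "Building": len(root.findall(".//bldg:Building", NS)),
    "Solid": len(root.findall(".//gml:Solid", NS)),
    "MultiSurface": len(root.findall(".//gml:MultiSurface", NS)),
}
# Presence of semantic boundary surfaces (the semantics the paper validates)
for tag in ("WallSurface", "RoofSurface", "GroundSurface"):
    counts[tag] = len(root.findall(f".//bldg:{tag}", NS))

print(counts)
```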

  10. Factor XIII as a modulator of plasma fibronectin alterations during experimental bacteremia.

    PubMed

    Kiener, J L; Cho, E; Saba, T M

    1986-11-01

    Fibronectin is found in plasma as well as in association with connective tissue and cell surfaces. Depletion of plasma fibronectin is often observed in septic trauma and burned patients, while experimental rats often manifest hyperfibronectinemia with sepsis. Since Factor XIII may influence the rate of clearance and deposition of plasma fibronectin into tissues, we evaluated the temporal changes in plasma fibronectin and plasma Factor XIII following bacteremia and RE blockade in rats, in an attempt to understand the mechanism leading to elevation of fibronectin levels in bacteremic rats, which is distinct from that observed with RE blockade. Clearance of exogenously administered fibronectin after bacteremia was also determined. Rats received either saline, Pseudomonas aeruginosa (1 × 10⁹ organisms), gelatinized RE test lipid emulsion (50 mg/100 gm B.W.), or emulsion followed by Pseudomonas. Plasma fibronectin and Factor XIII were determined at 0, 2, 24, and 48 hours post-blockade or bacteremia. At 24 and 48 hr following bacteremia alone or bacteremia after RE blockade, there was a significant elevation (p < 0.05) of plasma fibronectin and a concomitant decrease (p < 0.05) of plasma Factor XIII activity. Extractable tissue fibronectin from liver and spleen was also increased at 24 and 48 hours following RE blockade plus bacteremia. In addition, the plasma clearance of human fibronectin was significantly prolonged (p < 0.05) following bacterial challenge. Infusion of activated Factor XIII (20 units/rat) during a period of hyperfibronectinemia (908.0 ± 55.1 micrograms/ml) resulted in a significant (p < 0.05) decrease in plasma fibronectin (548.5 ± 49.9 micrograms/ml) within 30 min. Thus Factor XIII deficiency in rats with bacteremia may contribute to the elevation in plasma fibronectin by altering kinetics associated with the clearance of fibronectin from the blood.

  11. Stable expression of recombinant human coagulation factor XIII in protein-free suspension culture of Chinese hamster ovary cells.

    PubMed

    Chun, B H; Bang, W G; Park, Y K; Woo, S K

    2001-11-01

    The recombinant A and B subunits of human coagulation factor XIII were transfected into Chinese hamster ovary (CHO) cells. CHO cells were amplified and selected with methotrexate in adherent cultures containing serum, and CHO 1-62 cells were later selected in protein-free medium. To develop a recombinant factor XIII production process in suspension culture, we investigated the growth characteristics of the CHO cells and the maintenance of factor XIII expression in the culture medium. Suspension adaptation of the CHO cells was performed in a protein-free medium, GC-CHO-PI, by two methods: serum weaning and direct switching from serum-containing to protein-free medium. Although the growth of CHO cells in suspension culture was initially affected by serum depletion, the cell-specific productivity of factor XIII showed only minor changes upon direct switching to protein-free medium during suspension culture. As for the long-term stability of factor XIII expression, CHO 1-62 cells stably expressed factor XIII under protein-free conditions for 1000 h. These results indicate that CHO 1-62 cells can be adapted to express recombinant human factor XIII in a stable manner in suspension culture using a protein-free medium. Our results demonstrate that enhanced cell growth in a continuous manner is achievable for factor XIII production in a protein-free medium when a perfusion bioreactor culture system with a spin filter is employed.

  12. BABAR: an R package to simplify the normalisation of common reference design microarray-based transcriptomic datasets

    PubMed Central

    2010-01-01

    Background: The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log₂ ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data, but many data have yet to be analysed as existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental-group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR) algorithm and software package, which uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results: The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions: BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets. We show
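
    BABAR itself is an R/Bioconductor package, but its core 'between-array' idea, cyclic loess normalisation, is easy to sketch. The following Python version is my own illustration (not BABAR's implementation) of removing pairwise intensity-dependent trends between arrays:

```python
# Illustrative cyclic loess normalisation across arrays (not BABAR's R code).
# `data` is a genes x arrays matrix of log2 intensities.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def cyclic_loess(data, n_iter=2, frac=0.4):
    data = data.copy()
    n_arrays = data.shape[1]
    for _ in range(n_iter):
        for i in range(n_arrays):
            for j in range(i + 1, n_arrays):
                m = data[:, i] - data[:, j]           # M: log-ratio
                a = 0.5 * (data[:, i] + data[:, j])   # A: mean log-intensity
                trend = lowess(m, a, frac=frac, return_sorted=False)
                data[:, i] -= trend / 2               # split the correction
                data[:, j] += trend / 2               # between the two arrays
    return data

rng = np.random.default_rng(0)
normalised = cyclic_loess(rng.normal(8.0, 1.0, size=(500, 4)))
```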

  13. A calorimetric study on the low temperature dynamics of doped ice V and its reversible phase transition to hydrogen ordered ice XIII.

    PubMed

    Salzmann, Christoph G; Radaelli, Paolo G; Finney, John L; Mayer, Erwin

    2008-11-07

    Doped ice V samples made from solutions containing 0.01 M HCl (DCl), HF (DF), or KOH (KOD) in H₂O (D₂O) were slow-cooled from 250 to 77 K at 0.5 GPa. The effect of the dopant on the hydrogen disorder → order transition and formation of hydrogen ordered ice XIII was studied by differential scanning calorimetry (DSC) with samples recovered at 77 K. DSC scans of acid-doped samples are consistent with a reversible ice XIII ↔ ice V phase transition at ambient pressure, showing an endothermic peak on heating due to the hydrogen ordered ice XIII → disordered ice V phase transition, and an exothermic peak on subsequent cooling due to the ice V → ice XIII phase transition. The equilibrium temperature (T₀) for the ice V ↔ ice XIII phase transition is 112 K for both HCl-doped H₂O and DCl-doped D₂O. From the maximal enthalpy change of 250 J mol⁻¹ on the ice XIII → ice V phase transition and T₀ of 112 K, the change in configurational entropy for the ice XIII → ice V transition is calculated as 2.23 J mol⁻¹ K⁻¹, which is 66% of the Pauling entropy. For HCl, the most effective dopant, the influence of HCl concentration on the formation of ice XIII was determined: on decreasing the concentration of HCl from 0.01 to 0.001 M, its effectiveness is only slightly lowered. However, a further decrease of HCl to 0.0001 M drastically lowered its effectiveness. HF (DF) doping is less effective in inducing formation of ice XIII than HCl (DCl) doping. On heating at a rate of 5 K min⁻¹, kinetic unfreezing starts in pure ice V at approximately 132 K, whereas in acid-doped ice XIII it starts at about 105 K due to acceleration of the reorientation of water molecules. KOH doping does not lead to formation of hydrogen ordered ice XIII, a result which is consistent with our powder neutron diffraction study (C. G. Salzmann, P. G. Radaelli, A. Hallbrucker, E. Mayer, J. L. Finney, Science, 2006, 311, 1758). We further conjecture whether or not ice XIII has a stable region in
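
    The entropy arithmetic in this abstract can be checked directly; the only outside ingredient is the standard Pauling entropy R ln(3/2) for fully hydrogen-disordered ice:

```latex
\Delta S = \frac{\Delta H}{T_0}
         = \frac{250\ \mathrm{J\,mol^{-1}}}{112\ \mathrm{K}}
         \approx 2.23\ \mathrm{J\,mol^{-1}\,K^{-1}},
\qquad
\Delta S_{\text{Pauling}} = R\ln\tfrac{3}{2} \approx 3.37\ \mathrm{J\,mol^{-1}\,K^{-1}},
\qquad
\frac{2.23}{3.37} \approx 0.66 .
```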

  14. Platelet factor XIII increases the fibrinolytic resistance of platelet-rich clots by accelerating the crosslinking of alpha 2-antiplasmin to fibrin

    NASA Technical Reports Server (NTRS)

    Reed, G. L.; Matsueda, G. R.; Haber, E.

    1992-01-01

    Platelet clots resist fibrinolysis by plasminogen activators. We hypothesized that platelet factor XIII may enhance the fibrinolytic resistance of platelet-rich clots by catalyzing the crosslinking of α2-antiplasmin (α2AP) to fibrin. Analysis of plasma clot structure by polyacrylamide gel electrophoresis and immunoblotting revealed accelerated α2AP-fibrin crosslinking in platelet-rich compared with platelet-depleted plasma clots. A similar study of clots formed with purified fibrinogen (depleted of factor XIII activity), isolated platelets, and specific factor XIII inhibitors indicated that this accelerated crosslinking was due to the catalytic activity of platelet factor XIII. Moreover, when washed platelets were aggregated by thrombin, there was evidence of platelet factor XIII-mediated crosslinking between platelet α2AP and platelet fibrin(ogen). Specific inhibition (by a monoclonal antibody) of the α2AP associated with washed platelet aggregates accelerated the fibrinolysis of the platelet aggregate. Thus in platelet-rich plasma clots, and in thrombin-induced platelet aggregates, platelet factor XIII actively formed α2AP-fibrin crosslinks, which appeared to enhance the resistance of platelet-rich clots to fibrinolysis.

  15. Characterization of carbonic anhydrase XIII in the erythrocytes of the Burmese python, Python molurus bivittatus.

    PubMed

    Esbaugh, A J; Secor, S M; Grosell, M

    2015-09-01

    Carbonic anhydrase (CA) is one of the most abundant proteins found in vertebrate erythrocytes, with the majority of species expressing a low-activity CA I and a high-activity CA II. However, several phylogenetic gaps remain in our understanding of the expansion of cytoplasmic CA in vertebrate erythrocytes. In particular, very little is known about isoforms from reptiles. The current study sought to characterize the erythrocyte isoforms from two squamate species, Python molurus and Nerodia rhombifer, which was combined with information from recent genome projects to address this important phylogenetic gap. The obtained sequences grouped closely with CA XIII in phylogenetic analyses. CA II mRNA transcripts were also found in erythrocytes, but at less than half the levels of CA XIII. Structural analysis suggested biochemical activity similar to that of the respective mammalian isoforms, with CA XIII being a low-activity isoform. Biochemical characterization verified that the majority of CA activity in the erythrocytes was due to a high-activity CA II-like isoform; however, titration with copper supported the presence of two CA pools. The CA II-like pool accounted for 90% of the total activity. To assess potential disparate roles of these isoforms, a feeding stress was used to up-regulate CO₂ excretion pathways. Significant up-regulation of CA II and the anion exchanger was observed; CA XIII was strongly down-regulated. While these results do not provide insight into the role of CA XIII in the erythrocytes, they do suggest that the presence of two isoforms is not simply a case of physiological redundancy. Copyright © 2015. Published by Elsevier Inc.

  16. A specific colorimetric assay for measuring transglutaminase 1 and factor XIII activities.

    PubMed

    Hitomi, Kiyotaka; Kitamura, Miyako; Alea, Mileidys Perez; Ceylan, Ismail; Thomas, Vincent; El Alaoui, Saïd

    2009-11-15

    Transglutaminase (TGase) is an enzyme that catalyzes both isopeptide cross-linking and incorporation of primary amines into proteins. Eight TGases have been identified in humans, and each of these TGases has a unique tissue distribution and physiological significance. Although several assays for TGase enzymatic activity have been reported, it has been difficult to establish an assay for discriminating each of these different TGase activities. Using a random peptide library, we recently identified the preferred substrate sequences for three major TGases: TGase 1, TGase 2, and factor XIII. In this study, we use these substrates in specific tests for measuring the activities of TGase 1 and factor XIII.

  17. 25 CFR 36.40 - Standard XIII-Library/media program.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... 25 Indians 1 2010-04-01 2010-04-01 false Standard XIII-Library/media program. 36.40 Section 36.40... § 36.40 Standard XIII—Library/media program. (a) Each school shall provide a library/media program... objectives have been met. (2) A written policy for the selection of materials and equipment shall be...

  18. 25 CFR 36.40 - Standard XIII-Library/media program.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... 25 Indians 1 2012-04-01 2011-04-01 true Standard XIII-Library/media program. 36.40 Section 36.40... § 36.40 Standard XIII—Library/media program. (a) Each school shall provide a library/media program... developed by a library committee in collaboration with the librarian and be approved by the school board...

  19. 25 CFR 36.40 - Standard XIII-Library/media program.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 25 Indians 1 2014-04-01 2014-04-01 false Standard XIII-Library/media program. 36.40 Section 36.40... § 36.40 Standard XIII—Library/media program. (a) Each school shall provide a library/media program... developed by a library committee in collaboration with the librarian and be approved by the school board...

  20. 25 CFR 36.40 - Standard XIII-Library/media program.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 25 Indians 1 2013-04-01 2013-04-01 false Standard XIII-Library/media program. 36.40 Section 36.40... § 36.40 Standard XIII—Library/media program. (a) Each school shall provide a library/media program... developed by a library committee in collaboration with the librarian and be approved by the school board...

  1. Phylogenetic analysis of Newcastle disease viruses from Bangladesh suggests continuing evolution of genotype XIII.

    PubMed

    Barman, Lalita Rani; Nooruzzaman, Mohammed; Sarker, Rahul Deb; Rahman, Md Tazinur; Saife, Md Rajib Bin; Giasuddin, Mohammad; Das, Bidhan Chandra; Das, Priya Mohan; Chowdhury, Emdadul Haque; Islam, Mohammad Rafiqul

    2017-10-01

    A total of 23 Newcastle disease virus (NDV) isolates from Bangladesh, collected between 2010 and 2012, were characterized on the basis of partial F gene sequences. All the isolates belonged to genotype XIII of class II NDV but segregated into three sub-clusters. One sub-cluster, with 17 isolates, aligned with sub-genotype XIIIc. The other two sub-clusters were phylogenetically distinct from the previously described sub-genotypes XIIIa, XIIIb and XIIIc and could be candidates for new sub-genotypes; however, this needs to be validated with full-length F gene sequence data. The results of the present study suggest that genotype XIII NDVs are undergoing continuing evolution in Bangladesh.
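
    For readers unfamiliar with this kind of analysis, a distance-based tree from aligned partial F gene sequences can be sketched in a few lines of Biopython. This is only an illustration: the input file name is hypothetical, and genotyping studies typically use maximum-likelihood trees rather than neighbour joining:

```python
# Illustrative neighbour-joining tree from pre-aligned partial F gene
# sequences; the file name is hypothetical, and NJ stands in for the ML
# methods usually used in NDV genotyping.
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("ndv_partial_F_aligned.fasta", "fasta")
distance_matrix = DistanceCalculator("identity").get_distance(alignment)
tree = DistanceTreeConstructor().nj(distance_matrix)
Phylo.draw_ascii(tree)   # inspect sub-clusters, e.g. isolates grouping with XIIIc
```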

  2. Impact of combined C1 esterase inhibitor/coagulation factor XIII or N-acetylcysteine/tirilazad mesylate administration on leucocyte adherence and cytokine release in experimental endotoxaemia.

    PubMed

    Birnbaum, J; Klotz, E; Spies, C D; Mueller, J; Vargas Hein, O; Feller, J; Lehmann, C

    2008-01-01

    We determined the effects of combinations of C1 esterase inhibitor (C1-INH) with factor XIII and of N-acetylcysteine (NAC) with tirilazad mesylate (TM) during lipopolysaccharide (LPS)-induced endotoxaemia in rats. Forty Wistar rats were divided into four groups: the control (CON) group received no LPS; the LPS, C1-INH + factor XIII and NAC + TM groups received endotoxin infusions (5 mg/kg per h). After 30 min of endotoxaemia, 100 U/kg C1-INH + 50 U/kg factor XIII was administered to the C1-INH + factor XIII group, and 150 mg/kg NAC + 10 mg/kg TM was administered to the NAC + TM group. Administration of C1-INH + factor XIII and of NAC + TM both resulted in reduced leucocyte adherence and reduced levels of interleukin-1β (IL-1β). The LPS-induced increase in IL-6 levels was amplified by both drug combinations. There was no significant effect on mesenteric plasma extravasation. In conclusion, the administration of C1-INH + factor XIII and NAC + TM reduced endothelial leucocyte adherence and IL-1β plasma levels, but increased IL-6 levels.

  3. Dirac R-matrix calculations of photoionization cross sections of Ni XII and atomic structure data of Ni XIII

    NASA Astrophysics Data System (ADS)

    Nazir, R. T.; Bari, M. A.; Bilal, M.; Sardar, S.; Nasim, M. H.; Salahuddin, M.

    2017-02-01

    We performed R-matrix calculations for photoionization cross sections of the two ground-state configuration 3s²3p⁵ (²P°₃/₂,₁/₂) levels and 12 excited states of Ni XII using the relativistic Dirac Atomic R-matrix Codes (DARC), across the photon energy range between the ionization thresholds of the corresponding states and well above the threshold of the last level of the Ni XIII target ion. Generally, good agreement is obtained between our results and the earlier theoretical photoionization cross sections. Moreover, we have used two independent fully relativistic codes, GRASP and FAC, to calculate fine-structure energy levels, wavelengths, oscillator strengths, and transition rates among the lowest 48 levels belonging to the configurations (3s²3p⁴, 3s3p⁵, 3p⁶, 3s²3p³3d) in Ni XIII. Additionally, radiative lifetimes of all the excited states of Ni XIII are presented. Our results for the atomic structure of Ni XIII show good agreement with other theoretical and experimental results available in the literature. Good agreement is also found between our calculated lifetimes and the experimental ones. Our present results are useful for plasma diagnostics of fusion and astrophysical plasmas.
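
    The radiative lifetimes mentioned here follow from the computed transition rates by a one-line formula: the lifetime of an excited level j is the reciprocal of the sum of the Einstein A coefficients for all downward transitions from j. A toy illustration (the A values are placeholders, not Ni XIII data):

```python
# tau_j = 1 / sum_i A_ji, with Einstein A coefficients in s^-1.
# The values below are placeholders, not Ni XIII results.
a_coefficients = [3.2e8, 1.1e7, 4.5e5]   # downward transition rates from level j
lifetime = 1.0 / sum(a_coefficients)     # radiative lifetime in seconds
print(f"tau = {lifetime:.3e} s")
```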

  4. Traffic analysis toolbox volume XIII : integrated corridor management analysis, modeling, and simulation guide

    DOT National Transportation Integrated Search

    2017-02-01

    As part of the Federal Highway Administration (FHWA) Traffic Analysis Toolbox (Volume XIII), this guide was designed to help corridor stakeholders implement the Integrated Corridor Management (ICM) Analysis, Modeling, and Simulation (AMS) methodology...

  5. Traffic analysis toolbox volume XIII : integrated corridor management analysis, modeling, and simulation guide.

    DOT National Transportation Integrated Search

    2017-02-01

    As part of the Federal Highway Administration (FHWA) Traffic Analysis Toolbox (Volume XIII), this guide was designed to help corridor stakeholders implement the Integrated Corridor Management (ICM) Analysis, Modeling, and Simulation (AMS) methodology...

  6. 40 CFR Appendix Xiii to Part 86 - State Requirements Incorporated by Reference in Part 86 of the Code of Federal Regulations

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... AND IN-USE HIGHWAY VEHICLES AND ENGINES (CONTINUED) Pt. 86, App. XIII Appendix XIII to Part 86—State...-Line Test Procedures for 1983 Through 1997 Model-Year Passenger Cars, Light-Duty Trucks and Medium-Duty...: California Assembly-Line Test Procedures for 1998 and Subsequent Model-Year Passenger Cars, Light-Duty Trucks...

  7. 40 CFR Appendix Xiii to Part 86 - State Requirements Incorporated by Reference in Part 86 of the Code of Federal Regulations

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... AND IN-USE HIGHWAY VEHICLES AND ENGINES (CONTINUED) Pt. 86, App. XIII Appendix XIII to Part 86—State...-Line Test Procedures for 1983 Through 1997 Model-Year Passenger Cars, Light-Duty Trucks and Medium-Duty...: California Assembly-Line Test Procedures for 1998 and Subsequent Model-Year Passenger Cars, Light-Duty Trucks...

  8. 40 CFR Appendix Xiii to Part 86 - State Requirements Incorporated by Reference in Part 86 of the Code of Federal Regulations

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... AND IN-USE HIGHWAY VEHICLES AND ENGINES (CONTINUED) Pt. 86, App. XIII Appendix XIII to Part 86—State...-Line Test Procedures for 1983 Through 1997 Model-Year Passenger Cars, Light-Duty Trucks and Medium-Duty...: California Assembly-Line Test Procedures for 1998 and Subsequent Model-Year Passenger Cars, Light-Duty Trucks...

  9. A Meta-Analysis: Identification of Common Mir-145 Target Genes that have Similar Behavior in Different GEO Datasets.

    PubMed

    Pashaei, Elnaz; Guzel, Esra; Ozgurses, Mete Emir; Demirel, Goksun; Aydin, Nizamettin; Ozen, Mustafa

    MicroRNAs, which are small regulatory RNAs, post-transcriptionally regulate gene expression by binding the 3'-UTR of their mRNA targets. Their deregulation has been shown to cause increased proliferation, migration, invasion, and apoptosis. miR-145, an important tumor suppressor microRNA, has been shown to be downregulated in many cancer types and has crucial roles in tumor initiation, progression, metastasis, invasion, recurrence, and chemo-radioresistance. Our aim is to investigate potential common target genes of miR-145, and to help understand the underlying molecular pathways of tumor pathogenesis in association with those common target genes. Eight published microarray datasets, in which targets of miR-145 were investigated in cell lines upon miR-145 overexpression, were included in this study for meta-analysis. Inter-group variability was assessed by box-plot analysis. Microarray datasets were analyzed using the GEOquery package in Bioconductor 3.2 with R version 3.2.2, and two-way hierarchical clustering was used for gene expression data analysis. Meta-analysis of the different GEO datasets showed that UNG, FUCA2, DERA, GMFB, TF, and SNX2 were commonly downregulated genes, whereas MYL9 and TAGLN were found to be commonly upregulated upon miR-145 overexpression in prostate, breast, esophageal, and bladder cancer, and head and neck squamous cell carcinoma. Biological process, molecular function, and pathway analysis of these potential targets of miR-145 through functional enrichment in the PPI network demonstrated that those genes are significantly involved in telomere maintenance, DNA binding and repair mechanisms. In conclusion, our results indicate that miR-145, through targeting its common potential targets, may significantly contribute to tumor pathogenesis in distinct cancer types and might serve as an important target for cancer therapy.
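
    The study performed its two-way hierarchical clustering in R/Bioconductor; a Python equivalent of that step (illustrative only, with placeholder data) clusters genes and samples independently and reorders the matrix:

```python
# Two-way (genes x samples) hierarchical clustering, the reordering step
# behind a clustered heatmap. Placeholder data; the study used R/Bioconductor.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(1)
expr = rng.normal(0.0, 1.0, size=(200, 12))    # genes x samples

gene_order = leaves_list(linkage(expr, method="average", metric="correlation"))
sample_order = leaves_list(linkage(expr.T, method="average", metric="correlation"))
clustered = expr[np.ix_(gene_order, sample_order)]   # matrix ready for a heatmap
```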

  10. FE-XIII Infrared / FE-XIV Green Line Ratio Diagnostics (P55)

    NASA Astrophysics Data System (ADS)

    Srivastava, A. K.; et al.

    2006-11-01

    We consider the first 27-level atomic model of Fe XIII (5.9 < log Te < 6.4) to estimate its ground-level populations, taking account of electron as well as proton collisional excitations and de-excitations, radiative cascades, and radiative excitations and de-excitations. Radiative cascade is important, but the effect of the dilution factor is negligible at higher electron densities. The ³P₁-³P₀ and ³P₂-³P₁ transitions in the ground configuration 3s²3p² of Fe XIII result in two forbidden coronal emission lines in the infrared region, namely 10747 Å and 10798 Å, while the 5303 Å green line is formed in the 3s²3p ground configuration of Fe XIV as a result of the ²P₃/₂-²P₁/₂ magnetic dipole transition. The line widths of an appropriate pair of forbidden coronal emission lines observed simultaneously can be a useful diagnostic tool to deduce temperature and non-thermal velocity in large-scale coronal structures, using the intensity ratios of the lines as the temperature signature instead of assuming the ion temperature to be equal to the electron temperature. Since the line intensity ratios I(5303 Å)/I(10747 Å) and I(5303 Å)/I(10798 Å) have very weak density dependence, they are ideal monitors for temperature mapping in the solar corona.
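
    The diagnostic logic rests on the standard Doppler-width relation: once the intensity ratio fixes the ion temperature T_i, the measured line width yields the non-thermal velocity ξ (the notation below is mine, not the authors'):

```latex
\Delta\lambda_D = \frac{\lambda_0}{c}\sqrt{\frac{2 k_B T_i}{M_{\mathrm{Fe}}} + \xi^2}
\quad\Longrightarrow\quad
\xi = \sqrt{\left(\frac{c\,\Delta\lambda_D}{\lambda_0}\right)^{2} - \frac{2 k_B T_i}{M_{\mathrm{Fe}}}}
```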

  11. Newly-recognized roles of factor XIII in thrombosis

    PubMed Central

    Byrnes, James R.; Wolberg, Alisa S.

    2017-01-01

    Arterial and venous thrombosis are major contributors to coagulation-associated morbidity and mortality. Greater understanding of mechanisms leading to thrombus formation and stability is expected to lead to improved treatment strategies. Factor XIII (FXIII) is a transglutaminase found in plasma and platelets. During thrombosis, activated FXIII crosslinks fibrin and promotes thrombus stability. Recent studies have provided new information about FXIII activity during coagulation and its effects on clot composition and function. These findings reveal newly-recognized roles for FXIII in thrombosis. Herein, we review published literature on FXIII biology and effects on fibrin structure and stability, epidemiologic data associating FXIII with thrombosis, and evidence from animal models indicating FXIII has an essential role in determining thrombus stability, composition, and size. PMID:27056150

  12. Free factor XIII activation peptide (fAP-FXIII) is a regulator of factor XIII activity via factor XIII-B.

    PubMed

    Dodt, Johannes; Pasternack, Ralf; Seitz, Rainer; Volkers, Peter

    2016-02-01

    In a factor XIIIa (FXIIIa) generation assay with recombinant FXIII-A₂ (rFXIII-A₂), free FXIII activation peptide (fAP-FXIII) prolonged the time to peak (TTP) but did not affect the area under the curve (AUC) or concentration at peak (CP). Addition of recombinant FXIII-B₂ (rFXIII-B₂) restored the FXIIIa generation parameters (AUC, TTP and CP) to those observed for plasma FXIII (FXIII-A₂B₂). FXIII-A₂B₂ reconstituted from rFXIII-A₂ and rFXIII-B₂ showed a similar effect on AUC, TTP and CP in the presence of fAP-FXIII as observed for plasma FXIII-A₂B₂, indicating a role for FXIII-B in this observation. An effect of fAP-FXIII on thrombin, the proteolytic activator of FXIII, was excluded by thrombin generation assays and clotting experiments. In a purified system, fAP-FXIII did not interfere with the FXIIIa activity development of thrombin-cleaved rFXIII-A₂ (rFXIII-A₂'), also excluding direct inhibition of FXIIIa. However, FXIIIa activity development of FXIII-A₂'B₂ was inhibited in a concentration-dependent manner by fAP-FXIII, indicating that an interaction between AP-FXIII and FXIII-B₂ contributes to the overall stability of FXIII-A₂'B₂. In addition to its well-known role, FXIII-B also contributes to FXIII-A₂B₂ stability or dissociation depending on fAP-FXIII and calcium concentrations. © 2015 John Wiley & Sons Ltd.
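
    The three generation-curve readouts used in this study are simple to compute from a sampled activity trace; a sketch with toy data (not the paper's assay values):

```python
# AUC, TTP and CP from a sampled FXIIIa generation curve (toy data).
import numpy as np

t = np.linspace(0.0, 60.0, 121)          # time, min
activity = t * np.exp(-t / 15.0)         # placeholder generation curve

auc = float(np.sum((activity[1:] + activity[:-1]) / 2 * np.diff(t)))  # trapezoid AUC
ttp = float(t[np.argmax(activity)])      # time to peak
cp = float(activity.max())               # activity at peak
print(f"AUC = {auc:.1f}, TTP = {ttp:.1f} min, CP = {cp:.2f}")
```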

  13. Comparison of Shallow Survey 2012 Multibeam Datasets

    NASA Astrophysics Data System (ADS)

    Ramirez, T. M.

    2012-12-01

    The purpose of the Shallow Survey common dataset is a comparison of the different technologies utilized for data acquisition in the shallow-survey marine environment. The common dataset consists of a series of surveys conducted over a common area of seabed using a variety of systems. It provides equipment manufacturers the opportunity to showcase their latest systems while giving hydrographic researchers and scientists a chance to test their latest algorithms on the dataset so that rigorous comparisons can be made. Five companies collected data for the common dataset in the Wellington Harbour area in New Zealand between May 2010 and May 2011: Kongsberg, Reson, R2Sonic, GeoAcoustics, and Applied Acoustics. The Wellington harbour and surrounding coastal area was selected since it has a number of well-defined features, including the HMNZS South Seas and HMNZS Wellington wrecks, an armored seawall constructed of tetrapods and akmons, aquifers, wharves and marinas. The seabed inside the harbour basin is largely fine-grained sediment, with gravel and reefs around the coast. The area outside the harbour on the southern coast is an active environment, with moving sand and exposed reefs. A marine reserve is also in this area. For consistency between datasets, the coastal research vessel R/V Ikatere and crew were used for all surveys conducted for the common dataset. Using Triton's Perspective processing software, the multibeam datasets collected for the Shallow Survey were processed for detailed analysis. Datasets from each sonar manufacturer were processed using the CUBE algorithm developed by the Center for Coastal and Ocean Mapping/Joint Hydrographic Center (CCOM/JHC). Each dataset was gridded at 0.5 and 1.0 meter resolutions for cross-comparison and compliance with International Hydrographic Organization (IHO) requirements. Detailed comparisons were made of equipment specifications (transmit frequency, number of beams, beam width), data density, total uncertainty, and
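
    CUBE estimates uncertainty-weighted depth hypotheses per grid node; the simplest comparable product, useful for a first look at data density and coverage, is a plain bin-averaged grid at the same 0.5 m resolution. A hedged sketch with synthetic soundings:

```python
# Bin-averaged depth grid at 0.5 m resolution (a crude stand-in for CUBE,
# which instead builds uncertainty-weighted depth hypotheses per node).
import numpy as np

def grid_soundings(x, y, z, cell=0.5):
    ix = ((x - x.min()) / cell).astype(int)
    iy = ((y - y.min()) / cell).astype(int)
    sums = np.zeros((ix.max() + 1, iy.max() + 1))
    counts = np.zeros_like(sums)
    np.add.at(sums, (ix, iy), z)       # accumulate depths per cell
    np.add.at(counts, (ix, iy), 1.0)
    with np.errstate(invalid="ignore"):
        return sums / counts           # NaN where a cell received no soundings

rng = np.random.default_rng(2)
x, y = rng.uniform(0.0, 100.0, size=(2, 10_000))     # synthetic positions, m
depth = 20.0 + 0.05 * x + rng.normal(0.0, 0.1, 10_000)
surface = grid_soundings(x, y, depth)
```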

  14. Seeking unique and common biological themes in multiple gene lists or datasets: pathway pattern extraction pipeline for pathway-level comparative analysis.

    PubMed

    Yi, Ming; Mudunuri, Uma; Che, Anney; Stephens, Robert M

    2009-06-29

    One of the challenges in the analysis of microarray data is to integrate and compare the selected (e.g., differential) gene lists from multiple experiments for common or unique underlying biological themes. A common way to approach this problem is to extract common genes from these gene lists and then subject these genes to enrichment analysis to reveal the underlying biology. However, the capacity of this approach is largely restricted by the limited number of common genes shared by datasets from multiple experiments, which could be caused by the complexity of the biological system itself. We now introduce a new Pathway Pattern Extraction Pipeline (PPEP), which extends the existing WPS application by providing a new pathway-level comparative analysis scheme. To facilitate comparing and correlating results from different studies and sources, PPEP contains new interfaces that allow evaluation of the pathway-level enrichment patterns across multiple gene lists. As an exploratory tool, this analysis pipeline may help reveal the underlying biological themes at both the pathway and gene levels. The analysis scheme provided by PPEP begins with multiple gene lists, which may be derived from different studies in terms of the biological contexts, applied technologies, or methodologies. These lists are then subjected to pathway-level comparative analysis for extraction of pathway-level patterns. This analysis pipeline helps to explore the commonality or uniqueness of these lists at the level of pathways or biological processes from different but relevant biological systems using a combination of statistical enrichment measurements, pathway-level pattern extraction, and graphical display of the relationships of genes and their associated pathways as Gene-Term Association Networks (GTANs) within the WPS platform. As a proof of concept, we have used the new method to analyze many datasets from our collaborators as well as some public microarray datasets. This tool provides a
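
    The "statistical enrichment measurements" such pipelines rely on usually reduce to a hypergeometric (one-sided Fisher) test per pathway; a minimal sketch with placeholder counts:

```python
# Hypergeometric enrichment p-value for one pathway (placeholder numbers).
from scipy.stats import hypergeom

N = 20_000   # genes in the background population
K = 150      # genes annotated to the pathway
n = 400      # genes in the selected (e.g. differential) list
k = 12       # pathway genes found in the list

p_value = hypergeom.sf(k - 1, N, K, n)   # P(X >= k)
print(f"enrichment p = {p_value:.3g}")
```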

  15. Raman spectroscopic study of hydrogen ordered ice XIII and of its reversible phase transition to disordered ice V.

    PubMed

    Salzmann, Christoph G; Hallbrucker, Andreas; Finney, John L; Mayer, Erwin

    2006-07-14

    Raman spectra of recovered ordered H₂O (D₂O) ice XIII doped with 0.01 M HCl (DCl), recorded in vacuo at 80 K, are reported in the range 3600-200 cm⁻¹. The bands are assigned to the various types of modes on the basis of isotope ratios. On thermal cycling between 80 and 120 K, the reversible phase transition to disordered ice V is observed. The remarkable effect of HCl (DCl) on orientational ordering in ice V and its phase transition to ordered ice XIII, first reported in a powder neutron diffraction study of DCl-doped D₂O ice V (C. G. Salzmann, P. G. Radaelli, A. Hallbrucker, E. Mayer, J. L. Finney, Science, 2006, 311, 1758), is demonstrated by Raman spectroscopy and discussed. The dopants KOH and HF have only a minor effect on hydrogen ordering in ice V, as shown by the Raman spectra.

  16. Relevancy Ranking of Satellite Dataset Search Results

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2017-01-01

    As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.
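
    A toy version of such a heuristic ranker makes the idea concrete: combine a measurement-keyword match, the fractional time-range overlap, and a novelty bonus for later versions. All field names and weights below are illustrative assumptions, not the Common Metadata Repository algorithm:

```python
# Toy dataset relevancy score: keyword match + time overlap + version novelty.
# Field names and weights are illustrative, not the CMR implementation.
def relevancy(query, dataset):
    terms = set(query["keywords"])
    text = len(terms & set(dataset["measurements"])) / max(len(terms), 1)
    q0, q1 = query["time_range"]
    d0, d1 = dataset["time_range"]
    overlap = max(0.0, min(q1, d1) - max(q0, d0)) / max(q1 - q0, 1e-9)
    novelty = 0.1 * dataset["version"]     # prefer later versions
    return 2.0 * text + overlap + novelty

datasets = [
    {"measurements": ["sea_surface_temperature"], "time_range": (2000, 2010), "version": 2},
    {"measurements": ["sea_surface_temperature"], "time_range": (2005, 2020), "version": 5},
]
query = {"keywords": ["sea_surface_temperature"], "time_range": (2008, 2012)}
best = max(datasets, key=lambda d: relevancy(query, d))
print(best["version"])   # -> 5: fully overlapping time range and newer version
```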

  17. Decibel: The Relational Dataset Branching System

    PubMed Central

    Maddox, Michael; Goehring, David; Elmore, Aaron J.; Madden, Samuel; Parameswaran, Aditya; Deshpande, Amol

    2017-01-01

    As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. PMID:28149668
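
    The core capability, branching a dataset while tracking where each branch came from, can be illustrated with a toy copy-on-branch structure (a sketch of the concept only, not Decibel's versioned storage engine):

```python
# Toy dataset branching with parent provenance; not Decibel's engine.
class VersionedDataset:
    def __init__(self):
        self.branches = {"main": []}    # branch name -> list of records
        self.parent = {"main": None}    # provenance: branch -> parent branch

    def branch(self, src, name):
        self.branches[name] = list(self.branches[src])  # copy on branch
        self.parent[name] = src

    def insert(self, branch, record):
        self.branches[branch].append(record)

ds = VersionedDataset()
ds.insert("main", {"id": 1, "stage": "raw"})
ds.branch("main", "cleaning")               # concurrent cleaning work
ds.insert("cleaning", {"id": 2, "stage": "cleaned"})
assert len(ds.branches["main"]) == 1        # main is unaffected by the branch
```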

  18. Factor XIII deficiency in Iran: a comprehensive review of the literature.

    PubMed

    Dorgalaleh, Akbar; Naderi, Majid; Hosseini, Maryam Sadat; Alizadeh, Shaban; Hosseini, Soudabeh; Tabibian, Shadi; Eshghi, Peyman

    2015-04-01

    Factor XIII deficiency (FXIIID) is a rare bleeding disorder with an estimated prevalence of 1 in 2 million worldwide. In Iran, a Middle Eastern country with a high rate of consanguineous marriages, there are approximately 473 patients afflicted with FXIIID. An approximately 12-fold higher prevalence of FXIIID is estimated in Iran in comparison with the overall worldwide frequency. In this study, we have undertaken a comprehensive review of different aspects of FXIIID in the Iranian population. The distribution of this disease across different regions of Iran reveals that Sistan and Baluchestan Province has not only the highest number of patients with FXIIID in Iran but the highest global incidence of this condition. Among Iranian patients, umbilical cord bleeding, hematoma, and prolonged wound bleeding are the most frequent clinical manifestations. There are several disease-causing mutations in Iranian patients with FXIIID, with Trp187Arg being the most common mutation in FXIIID in Iran. Traditionally, the management of FXIIID in Iran was based only on administration of fresh frozen plasma or cryoprecipitate, until 2009 when FXIII concentrate became available for patient management. Various studies have evaluated the efficacy and safety of prophylactic regimens in different situations, with valuable findings. Although the focus of this study is on Iran, it offers considerable insight into FXIIID that can be applied more extensively to improve the management and quality of life of all affected patients.
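
    The roughly 12-fold figure can be sanity-checked, if one assumes a population of about 80 million for Iran (an assumption; the abstract does not state the population):

```latex
\frac{473}{8\times 10^{7}} \approx \frac{1}{169{,}000},
\qquad
\frac{1/169{,}000}{1/2{,}000{,}000} = \frac{2{,}000{,}000}{169{,}000} \approx 11.8 \approx 12 .
```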

  19. Proceedings of the XIII International Symposium on Biological Control of Weeds; September 11-16, 2011; Waikoloa, Hawaii, USA

    Treesearch

    Yun Wu; Tracy Johnson; Sharlene Sing; S. Raghu; Greg Wheeler; Paul Pratt; Keith Warner; Ted Center; John Goolsby; Richard Reardon

    2013-01-01

    A total of 208 participants from 78 organizations in 19 countries gathered at the Waikoloa Beach Marriott on the Big Island of Hawaii on September 11-16, 2011 for the XIII International Symposium on Biological Control of Weeds. Following a reception on the first evening, Symposium co-chairs Tracy Johnson and Pat Conant formally welcomed the attendees on the morning of...

  20. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    NASA Astrophysics Data System (ADS)

    Lynnes, C.; Quinn, P.; Norton, J.

    2016-12-01

    As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.

  1. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2016-01-01

    As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.

  2. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

    PubMed Central

    Wernisch, Lorenz

    2017-01-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation, etc. Most existing algorithms for integrative clustering assume that there is a shared, consistent set of clusters across all datasets, and that most of the data samples follow this structure. However, in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting, Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to the TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190
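
    The local/global idea can be made concrete with a deliberately simple stand-in: cluster each dataset separately, then define global clusters as the combinations of local labels that actually occur. The real model replaces the k-means below with a hierarchical Dirichlet mixture:

```python
# Toy stand-in for context-dependent integrative clustering: local clusters
# per dataset, global clusters = observed combinations of local labels.
# Clusternomics itself uses a hierarchical Dirichlet mixture, not k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
expression = rng.normal(size=(100, 50))    # samples x features, layer 1
methylation = rng.normal(size=(100, 30))   # same samples, another 'omics layer

local_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(expression)
local_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(methylation)

combos = sorted(set(zip(local_a, local_b)))
global_id = {combo: i for i, combo in enumerate(combos)}
assignments = [global_id[c] for c in zip(local_a, local_b)]
print(f"{len(combos)} global clusters arise from 3 x 2 local clusters")
```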

  3. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

    PubMed

    Gabasova, Evelina; Reid, John; Wernisch, Lorenz

    2017-10-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation, etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and that most of the data samples follow this structure. However, in practice the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplify datasets with varying degrees of common structure. In such a setting, Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to the TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.

  4. Dataset for forensic analysis of B-tree file system.

    PubMed

    Wani, Mohamad Ahtisham; Bhat, Wasim Ahmad

    2018-06-01

    Since the B-tree file system (Btrfs) is set to become the de facto standard file system on Linux (and Linux-based) operating systems, a Btrfs dataset for forensic analysis is of great interest and immense value to the forensic community. This article presents a novel dataset for forensic analysis of Btrfs that was collected using a proposed data-recovery procedure. The dataset identifies various generalized and common file system layouts and operations, the specific node-balancing mechanisms triggered, logical addresses of various data structures, on-disk records, data recovered as directory entries and extent data from leaf and internal nodes, and the percentage of data recovered.

  5. Usefulness of DARPA dataset for intrusion detection system evaluation

    NASA Astrophysics Data System (ADS)

    Thomas, Ciza; Sharma, Vishwas; Balakrishnan, N.

    2008-03-01

    The MIT Lincoln Laboratory IDS evaluation methodology is a practical solution for evaluating the performance of intrusion detection systems, and it has contributed tremendously to research progress in that field. The DARPA IDS evaluation dataset has been criticized and is considered by many to be a very outdated dataset, unable to accommodate the latest trends in attacks. The question then naturally arises as to whether detection systems have improved beyond detecting these older classes of attacks, and if not, whether it is worth regarding this dataset as obsolete. The paper presented here provides supporting facts for the continued use of the DARPA IDS evaluation dataset. Two commonly used signature-based IDSs, Snort and Cisco IDS, and two anomaly detectors, PHAD and ALAD, are used for this evaluation, and the results support the usefulness of the DARPA dataset for IDS evaluation.

  6. Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades.

    PubMed

    Orchard, Garrick; Jayawant, Ajinkya; Cohen, Gregory K; Thakor, Nitish

    2015-01-01

    Creating datasets for Neuromorphic Vision is a challenging task. A lack of available recordings from Neuromorphic Vision sensors means that data must typically be recorded specifically for dataset creation rather than collecting and labeling existing data. The task is further complicated by a desire to simultaneously provide traditional frame-based recordings to allow for direct comparison with traditional Computer Vision algorithms. Here we propose a method for converting existing Computer Vision static image datasets into Neuromorphic Vision datasets using an actuated pan-tilt camera platform. Moving the sensor rather than the scene or image is a more biologically realistic approach to sensing and eliminates timing artifacts introduced by monitor updates when simulating motion on a computer monitor. We present conversion of two popular image datasets (MNIST and Caltech101) which have played important roles in the development of Computer Vision, and we provide performance metrics on these datasets using spike-based recognition algorithms. This work contributes datasets for future use in the field, as well as results from spike-based algorithms against which future works can compare. Furthermore, by converting datasets already popular in Computer Vision, we enable more direct comparison with frame-based approaches.
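
    As a rough illustration of the conversion idea, here is a toy sketch (our construction: it simulates sensor motion by shifting the image and thresholding log-intensity changes, whereas the paper records real events with a neuromorphic sensor on a physical pan-tilt platform):

```python
# Toy sketch of generating DVS-style events from a static image via
# simulated sensor motion. Illustrative only; the paper's method uses
# actual hardware saccades, not frame differencing.
import numpy as np

def events_from_motion(img, shifts, threshold=0.1):
    """Emit (t, x, y, polarity) whenever log intensity changes by more
    than `threshold` between successive shifted views of `img`."""
    log_prev = None
    events = []
    for t, (dx, dy) in enumerate(shifts):
        view = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
        log_now = np.log(view + 1e-3)
        if log_prev is not None:
            d = log_now - log_prev
            ys, xs = np.nonzero(np.abs(d) > threshold)
            events += [(t, x, y, int(np.sign(d[y, x]))) for y, x in zip(ys, xs)]
        log_prev = log_now
    return events

img = np.zeros((28, 28)); img[10:18, 10:18] = 1.0  # toy "digit"
saccade = [(0, 0), (1, 0), (2, 1), (3, 1)]          # simulated micro-motion
print(len(events_from_motion(img, saccade)), "events")
```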

  7. Artificial intelligence (AI) systems for interpreting complex medical datasets.

    PubMed

    Altman, R B

    2017-05-01

    Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications to medical data face several technical challenges: complex, heterogeneous and noisy medical datasets, and the difficulty of explaining their output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability. © 2017 ASCPT.

  8. Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.

    PubMed

    Kohli, Marc D; Summers, Ronald M; Geis, J Raymond

    2017-08-01

    At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.

  9. Correlation between thyroidal and peripheral blood total T cells, CD8+ T cells, and CD8+ T-regulatory cells and T-cell reactivity to calsequestrin and collagen XIII in patients with Graves' ophthalmopathy.

    PubMed

    Al-Ansari, Farah; Lahooti, Hooshang; Stokes, Leanne; Edirimanne, Senarath; Wall, Jack

    2018-05-22

    Purpose/aim of the study: Graves' ophthalmopathy (GO) is closely related to the thyroid autoimmune disorder Graves' disease. Previous studies have suggested roles for thyroidal CD8+ T cells and autoimmunity against calsequestrin-1 (CASQ1) in the link between thyroidal and orbital autoimmune reactions in GO. A role for autoimmunity against CollXIII has also been suggested. In this study, we aimed to investigate correlations between some thyroidal and peripheral blood T-cell subsets and thyroidal T-cell reactivity against CASQ1 and CollXIII in patients with GO. Fresh thyroid tissues were processed by enzyme digestion and density gradient to isolate mononuclear cells (MNCs). Peripheral blood MNCs were also isolated using a density gradient. Flow-cytometric analysis was used to identify the various T-cell subsets. T-cell reactivity to CASQ1 and CollXIII was measured by a 5-day culture of the MNCs and the BrdU uptake method. We found a positive correlation between thyroidal CD8+ T cells and CD8+ T-regulatory (T-reg) cells in patients with GO. Thyroidal T cells from two of the three patients with GO tested (66.7%) showed a positive response to CASQ1, while thyroidal T cells from none of the six Graves' disease patients without ophthalmopathy (GD) tested showed a positive response to this antigen. Thyroidal T cells from these patient groups, however, showed no significant differences in their response to CollXIII. Our observations provide further evidence for a possible role of thyroidal CD8+ T cells, CD8+ T-reg cells and the autoantigen CASQ1 in the link between thyroidal and orbital autoimmune reactions of GO.

  10. [Best time to administer coagulation factor XIII (Fibrogammin P) for postoperative intractable pancreatic fistula following gastrectomy for gastric cancer].

    PubMed

    Shoda, Katsutoshi; Komatsu, Shuhei; Ichikawa, Daisuke; Kubota, Takeshi; Okamoto, Kazuma; Arita, Tomohiro; Konishi, Hirotaka; Murayama, Yasutoshi; Shiozaki, Atsushi; Kuriu, Yoshiaki; Ikoma, Hisashi; Nakanishi, Masayoshi; Fujiwara, Hitoshi; Otsuji, Eigo

    2013-11-01

    Coagulation factor XIII (Fibrogammin P; FXIII) has been used to treat postoperative pancreatic fistulas following gastrectomy for gastric cancer in Japan. However, little is known about the best timing for starting this treatment to achieve early recovery. This study was designed to examine the appropriate time to start Fibrogammin P treatment for pancreatic fistulas, based on nutritional and inflammatory parameters. We retrospectively examined 27 consecutive patients with Grade B or C pancreatic fistulas, as defined by the International Study Group of Pancreatic Fistula (ISGPF) classification, who underwent gastrectomy at our institute between 1997 and 2011. We analyzed data on total protein (TP), albumin (Alb), C-reactive protein (CRP), and hemoglobin (Hb) concentrations and white blood cell (WBC) and lymphocyte counts. We used this information to determine laboratory cut-off values that indicate the most advantageous time to start the administration of Fibrogammin P in order to achieve early recovery within 2 weeks. When Fibrogammin P administration was based on more than two cut-off values, such as Alb >2.6 g/dL, Hb >9.0 g/dL and WBC <9,000/μL (p=0.1563), early cure of pancreatic fistulas was achieved. The use of nutritional and inflammatory parameter values to determine the best time to administer Fibrogammin P may shorten the treatment period.

  11. Constraining the common properties of active region formation using the SDO/HEAR dataset

    NASA Astrophysics Data System (ADS)

    Schunker, H.; Braun, D. C.; Birch, A. C.

    2016-10-01

    Observations from the Solar Dynamics Observatory (SDO) have the potential to allow the helioseismic study of the formation of hundreds of active regions, which enables us to perform statistical analyses. We collated a uniform dataset of emerging active regions (EARs) observed by the SDO/HMI instrument suitable for helioseismic analysis, where each active region can be observed up to 7 days before emergence. We call this dataset the SDO Helioseismic Emerging Active Region (SDO/HEAR) survey. We have used this dataset to understand the nature of active region emergence. The latitudinally averaged line-of-sight magnetic field of all the EARs shows that the leading (trailing) polarity moves in a prograde (retrograde) direction with a speed of 110 ± 15 m/s (-60 ± 10 m/s) relative to the Carrington rotation rate in the first day after emergence. However, relative to the differential rotation of the surface plasma the East-West velocity is symmetric, with a mean of 90 ± 10 m/s. We have also compared the surface flows associated with the EARs at the time of emergence with surface flows from numerical simulations of flux emergence with different rise speeds. We found that the surface flows in simulations of emerging flux with a low rise speed of 70 m/s best match the observations.

  12. Datasets used in Oshida et al. Disruption of STAT5b-Regulated Sexual Dimorphism of the Liver Transcriptome by Diverse Factors Is a Common Event. PLOS ONE 2016 Mar 9;11(3):e0148308

    EPA Pesticide Factsheets

    Includes 1) a list of genes in the STAT5b biomarker and 2) a list of accession numbers for microarray datasets used in the study. This dataset is associated with the following publication: Oshida, K., N. Vasani, D. Waxman, and C. Corton. Disruption of STAT5b-Regulated Sexual Dimorphism of the Liver Transcriptome by Diverse Factors Is a Common Event. PLoS ONE. Public Library of Science, San Francisco, CA, USA, 11(3): NA, (2016).

  13. Intrinsic thermodynamics of ethoxzolamide inhibitor binding to human carbonic anhydrase XIII

    PubMed Central

    2012-01-01

    Background Human carbonic anhydrases (CAs) play a crucial role in various physiological processes, including carbon dioxide and bicarbonate transport, acid homeostasis and biosynthetic reactions, and in various pathological processes, especially tumor progression. Therefore, CAs are interesting targets for pharmaceutical research. The structure-activity relationships (SAR) of designed inhibitors require detailed thermodynamic and structural characterization of the binding reaction. Unfortunately, most publications list only the observed thermodynamic parameters, which are significantly different from the intrinsic parameters. However, only intrinsic parameters can be used in the rational design and SAR of novel compounds. Results Intrinsic binding parameters for several inhibitors, including ethoxzolamide, trifluoromethanesulfonamide, and acetazolamide, binding to the recombinant human CA XIII isozyme were determined. The parameters were the intrinsic Gibbs free energy, enthalpy, entropy, and heat capacity. They were determined by titration calorimetry and a thermal shift assay over a wide pH and temperature range to dissect all linked protonation reaction contributions. Conclusions Precise determination of the inhibitor binding thermodynamics enabled correct intrinsic affinity and enthalpy ranking of the compounds and provided the means for SAR analysis of other rationally designed CA inhibitors. PMID:22676044
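
    For context, observed and intrinsic binding quantities are typically related through the linked protonation equilibria; a generic form of the correction (our sketch of the standard linkage relations, not the paper's exact equations, with f_CA and f_inh denoting the fractions of enzyme and inhibitor in their binding-competent protonation states and n_H+ the net number of protons exchanged with the buffer) is:

$$
K_{\mathrm{intr}} = \frac{K_{\mathrm{obs}}}{f_{\mathrm{CA}}\, f_{\mathrm{inh}}},
\qquad
\Delta G_{\mathrm{intr}} = -RT \ln K_{\mathrm{intr}},
\qquad
\Delta H_{\mathrm{obs}} = \Delta H_{\mathrm{intr}} + n_{H^{+}}\, \Delta H_{\mathrm{ionization}}.
$$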

  14. Evolving hard problems: Generating human genetics datasets with a complex etiology.

    PubMed

    Himmelstein, Daniel S; Greene, Casey S; Moore, Jason H

    2011-07-07

    A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Here we develop and evaluate a model-free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model-free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight hundred Pareto fronts, one for each independent run of our algorithm. In each run the predictiveness of single genetic variants and pairs of genetic variants has been minimized, while the predictiveness of third-, fourth-, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive fourth- or fifth-order interactions and minimized lower-level effects. This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
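
    To sketch the flavor of the fitness being optimized (a simplified stand-in of our own; the paper's evolution strategy, penetrance encoding, and Pareto bookkeeping are not reproduced), one can score a candidate dataset by how predictive its best high-order genotype combination is, minus how predictive its best low-order combination is:

```python
# Minimal illustrative fitness for "evolving hard datasets": reward
# datasets where low-order genotype combinations are uninformative about
# case/control status while a higher-order combination is predictive.
# All names and the scoring heuristic here are our assumptions.
import numpy as np
from itertools import combinations

def predictiveness(genotypes, status, snp_idx):
    """Best-achievable accuracy from the genotype combination at snp_idx:
    classify each multilocus genotype cell by its majority status."""
    keys = [tuple(row) for row in genotypes[:, snp_idx]]
    cells = {}
    for k, s in zip(keys, status):
        cells.setdefault(k, []).append(s)
    correct = sum(max(np.bincount(v, minlength=2)) for v in cells.values())
    return correct / len(status)

def fitness(genotypes, status, high_order=3):
    n_snps = genotypes.shape[1]
    low = max(predictiveness(genotypes, status, list(c))
              for k in (1, 2) for c in combinations(range(n_snps), k))
    high = max(predictiveness(genotypes, status, list(c))
               for c in combinations(range(n_snps), high_order))
    return high - low  # maximize high-order signal, minimize low-order

# toy usage: 200 individuals, 5 SNPs coded 0/1/2, random case/control labels
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(200, 5))
y = rng.integers(0, 2, size=200)
print(fitness(G, y))
```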

  15. Identifying Differentially Abundant Metabolic Pathways in Metagenomic Datasets

    NASA Astrophysics Data System (ADS)

    Liu, Bo; Pop, Mihai

    Enabled by rapid advances in sequencing technology, metagenomic studies aim to characterize entire communities of microbes, bypassing the need to culture individual bacterial members. One major goal of such studies is to identify specific functional adaptations of microbial communities to their habitats. Here we describe a powerful analytical method (MetaPath) that can identify differentially abundant pathways in metagenomic datasets, relying on a combination of metagenomic sequence data and prior metabolic pathway knowledge. We show that MetaPath outperforms other common approaches when evaluated on simulated datasets. We also demonstrate the power of our method in analyzing two publicly available metagenomic datasets: a comparison of the gut microbiome of obese and lean twins, and a comparison of the gut microbiome of infant and adult subjects. We demonstrate that the subpathways identified by our method provide valuable insights into the biological activities of the microbiome.

  16. Statistical Reference Datasets

    National Institute of Standards and Technology Data Gateway

    Statistical Reference Datasets (Web, free access). The Statistical Reference Datasets project is also supported by the Standard Reference Data Program. The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software.

  17. Seven novel mutations in the factor XIII A-subunit gene causing hereditary factor XIII deficiency in 10 unrelated families.

    PubMed

    Vysokovsky, A; Saxena, R; Landau, M; Zivelin, A; Eskaraev, R; Rosenberg, N; Seligsohn, U; Inbal, A

    2004-10-01

    Hereditary factor (F)XIII deficiency is a rare bleeding disorder mostly due to mutations in the FXIII A subunit. We studied the molecular basis of FXIII deficiency in patients from 10 unrelated families originating from Israel, India and Tunisia. Exons 2-15 of genomic DNA, comprising the coding regions and intron/exon boundaries, were amplified and sequenced. Structural analysis of the mutations was undertaken by computer modeling. Seven novel mutations were identified in the FXIIIA gene. The propositus from the Ethiopian-Jewish family was found to be a compound heterozygote for two novel mutations: a 10-bp deletion in exon 12 at nucleotides 1652-1661 (followed by 22 altered amino acids and a termination codon) and an Ala318Val mutation. The propositus of the Tunisian family was homozygous for a C insertion after nucleotide 863 within a stretch of six cytosines in exon 7. This insertion results in the generation of eight altered amino acids followed by a termination codon downstream. The propositus of Indian-Jewish origin was found to be homozygous for a G to T substitution at IVS 11 [+1], resulting in skipping of exons 10 and 11. In addition to the Ala318Val mutation, three of the novel mutations identified are missense mutations: Arg260Leu, Thr398Asn and Gly210Arg, each occurring in a homozygous state in an Israeli-Arab and two Indian families, respectively. Structure-function correlation analysis by computer modeling of the new missense mutations predicted that Gly210Arg will cause protein misfolding, Ala318Val and Thr398Asn will interfere with the catalytic process or protein stability, and Arg260Leu will impair dimerization.

  18. Dataset Lifecycle Policy

    NASA Technical Reports Server (NTRS)

    Armstrong, Edward; Tauer, Eric

    2013-01-01

    The presentation focused on describing a new dataset lifecycle policy that the NASA Physical Oceanography DAAC (PO.DAAC) has implemented for its new and current datasets to foster improved stewardship and consistency across its archive. The overarching goal is to implement this dataset lifecycle policy for all new GHRSST GDS2 datasets and bridge the mission statements from the GHRSST Project Office and PO.DAAC to provide the best quality SST data in a cost-effective, efficient manner, preserving its integrity so that it will be available and usable to a wide audience.

  19. The health care and life sciences community profile for dataset descriptions

    PubMed Central

    Alexiev, Vladimir; Ansell, Peter; Bader, Gary; Baran, Joachim; Bolleman, Jerven T.; Callahan, Alison; Cruz-Toledo, José; Gaudet, Pascale; Gombocz, Erich A.; Gonzalez-Beltran, Alejandra N.; Groth, Paul; Haendel, Melissa; Ito, Maori; Jupp, Simon; Juty, Nick; Katayama, Toshiaki; Kobayashi, Norio; Krishnaswami, Kalpana; Laibe, Camille; Le Novère, Nicolas; Lin, Simon; Malone, James; Miller, Michael; Mungall, Christopher J.; Rietveld, Laurens; Wimalaratne, Sarala M.; Yamaguchi, Atsuko

    2016-01-01

    Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. PMID:27602295
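
    As an illustration of the kind of machine-readable description the guideline targets, the sketch below emits a few DCAT and Dublin Core Terms statements with rdflib; the IRI, literal values, and choice of elements are invented examples, not the profile's normative requirements:

```python
# Minimal sketch of a versioned dataset description using two vocabularies
# the HCLS profile draws on (DCAT, Dublin Core Terms), serialized to Turtle.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("http://example.org/dataset/exampledb/version/1.0")  # hypothetical IRI

g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("ExampleDB gene annotations", lang="en")))
g.add((ds, DCTERMS.description, Literal("Versioned release of example annotations.", lang="en")))
g.add((ds, DCTERMS.identifier, Literal("exampledb-1.0")))
g.add((ds, DCTERMS.issued, Literal("2016-01-01")))
g.add((ds, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))

print(g.serialize(format="turtle"))
```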

  20. Clot retraction is mediated by factor XIII-dependent fibrin-αIIbβ3-myosin axis in platelet sphingomyelin-rich membrane rafts.

    PubMed

    Kasahara, Kohji; Kaneda, Mizuho; Miki, Toshiaki; Iida, Kazuko; Sekino-Suzuki, Naoko; Kawashima, Ikuo; Suzuki, Hidenori; Shimonaka, Motoyuki; Arai, Morio; Ohno-Iwashita, Yoshiko; Kojima, Soichi; Abe, Mitsuhiro; Kobayashi, Toshihide; Okazaki, Toshiro; Souri, Masayoshi; Ichinose, Akitada; Yamamoto, Naomasa

    2013-11-07

    Membrane rafts are spatially and functionally heterogeneous in the cell membrane. We observed that lysenin-positive sphingomyelin (SM)-rich rafts are identified histochemically in the central region of adhered platelets, where fibrin and myosin colocalize on activation by thrombin. The clot retraction of SM-depleted platelets from SM synthase knockout mice was delayed significantly, suggesting that platelet SM-rich rafts are involved in clot retraction. We found that fibrin converted by thrombin translocated immediately into platelet detergent-resistant membrane (DRM) rafts, but fibrin from Glanzmann's thrombasthenic platelets failed to do so. The fibrinogen γ-chain C-terminal (residues 144-411) fusion protein translocated to platelet DRM rafts on thrombin activation, but a mutant in which the factor XIII crosslinking sites Q398Q399 were replaced by A398A399 did not. Furthermore, fibrin translocation to DRM rafts was impaired in factor XIII A subunit-deficient mouse platelets, which show impaired clot retraction. In the cytoplasm, myosin translocated concomitantly with fibrin into the DRM rafts of thrombin-stimulated platelets. Furthermore, disruption of SM-rich rafts by methyl-β-cyclodextrin impaired myosin activation and clot retraction. Thus, we propose that clot retraction takes place in SM-rich rafts, where a fibrin-αIIbβ3-myosin complex is formed as the primary axis to promote platelet contraction.

  1. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses

    PubMed Central

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi

    2018-01-01

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625

  2. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses.

    PubMed

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V; Ma'ayan, Avi

    2018-02-27

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated 'canned' analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools.
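
    Since the abstract mentions an API without documenting it, the following query sketch is purely hypothetical: the endpoint path, parameter names, and response fields are placeholders of our own, not the real Datasets2Tools interface.

```python
# Hypothetical API query sketch. The base path, "search" endpoint,
# parameters, and response fields below are invented placeholders.
import requests

BASE = "http://amp.pharm.mssm.edu/datasets2tools/api"  # hypothetical path

resp = requests.get(f"{BASE}/search",
                    params={"object_type": "dataset", "q": "breast cancer"})
resp.raise_for_status()
for hit in resp.json().get("results", []):
    print(hit.get("accession"), hit.get("title"))
```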

  3. geoknife: Reproducible web-processing of large gridded datasets

    USGS Publications Warehouse

    Read, Jordan S.; Walker, Jordan I.; Appling, Alison P.; Blodgett, David L.; Read, Emily K.; Winslow, Luke A.

    2016-01-01

    Geoprocessing of large gridded data according to overlap with irregular landscape features is common to many large-scale ecological analyses. The geoknife R package was created to facilitate reproducible analyses of gridded datasets found on the U.S. Geological Survey Geo Data Portal web application or elsewhere, using a web-enabled workflow that eliminates the need to download and store large datasets that are reliably hosted on the Internet. The package provides access to several data subset and summarization algorithms that are available on remote web processing servers. Outputs from geoknife include spatial and temporal data subsets, spatially-averaged time series values filtered by user-specified areas of interest, and categorical coverage fractions for various land-use types.

  4. Learning to recognize rat social behavior: Novel dataset and cross-dataset application.

    PubMed

    Lorbach, Malte; Kyriakou, Elisavet I; Poppe, Ronald; van Dam, Elsbeth A; Noldus, Lucas P J J; Veltkamp, Remco C

    2018-04-15

    Social behavior is an important aspect of rodent models. Automated measuring tools that make use of video analysis and machine learning are an increasingly attractive alternative to manual annotation. Because machine learning-based methods need to be trained, it is important that they are validated using data from different experiment settings. To develop and validate automated measuring tools, there is a need for annotated rodent interaction datasets. Currently, the availability of such datasets is limited to two mouse datasets. We introduce the first, publicly available rat social interaction dataset, RatSI. We demonstrate the practical value of the novel dataset by using it as the training set for a rat interaction recognition method. We show that behavior variations induced by the experiment setting can lead to reduced performance, which illustrates the importance of cross-dataset validation. Consequently, we add a simple adaptation step to our method and improve the recognition performance. Most existing methods are trained and evaluated in one experimental setting, which limits the predictive power of the evaluation to that particular setting. We demonstrate that cross-dataset experiments provide more insight in the performance of classifiers. With our novel, public dataset we encourage the development and validation of automated recognition methods. We are convinced that cross-dataset validation enhances our understanding of rodent interactions and facilitates the development of more sophisticated recognition methods. Combining them with adaptation techniques may enable us to apply automated recognition methods to a variety of animals and experiment settings. Copyright © 2017 Elsevier B.V. All rights reserved.

  5. Coated platelets function in platelet-dependent fibrin formation via integrin αIIbβ3 and transglutaminase factor XIII

    PubMed Central

    Mattheij, Nadine J.A.; Swieringa, Frauke; Mastenbroek, Tom G.; Berny-Lang, Michelle A.; May, Frauke; Baaten, Constance C.F.M.J.; van der Meijden, Paola E.J.; Henskens, Yvonne M.C.; Beckers, Erik A.M.; Suylen, Dennis P.L.; Nolte, Marc W.; Hackeng, Tilman M.; McCarty, Owen J.T.; Heemskerk, Johan W.M.; Cosemans, Judith M.E.M.

    2016-01-01

    Coated platelets, formed by collagen and thrombin activation, have been characterized in different ways: i) by the formation of a protein coat of α-granular proteins; ii) by exposure of procoagulant phosphatidylserine; or iii) by high fibrinogen binding. Yet, their functional role has remained unclear. Here we used a novel transglutaminase probe, Rhod-A14, to identify a subpopulation of platelets with a cross-linked protein coat, and compared this with other platelet subpopulations using a panel of functional assays. Platelet stimulation with convulxin/thrombin resulted in initial integrin αIIbβ3 activation, the appearance of a platelet population with high fibrinogen binding, (independently of active integrins, but dependent on the presence of thrombin) followed by phosphatidylserine exposure and binding of coagulation factors Va and Xa. A subpopulation of phosphatidylserine-exposing platelets bound Rhod-A14 both in suspension and in thrombi generated on a collagen surface. In suspension, high fibrinogen and Rhod-A14 binding were antagonized by combined inhibition of transglutaminase activity and integrin αIIbβ3. Markedly, in thrombi from mice deficient in transglutaminase factor XIII, platelet-driven fibrin formation and Rhod-A14 binding were abolished by blockage of integrin αIIbβ3. Vice versa, star-like fibrin formation from platelets of a patient with deficiency in αIIbβ3 (Glanzmann thrombasthenia) was abolished upon blockage of transglutaminase activity. We conclude that coated platelets, with initial αIIbβ3 activation and high fibrinogen binding, form a subpopulation of phosphatidylserine-exposing platelets, and function in platelet-dependent star-like fibrin fiber formation via transglutaminase factor XIII and integrin αIIbβ3. PMID:26721892

  6. Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset

    NASA Technical Reports Server (NTRS)

    Ramasso, Emannuel; Saxena, Abhinav

    2014-01-01

    Benchmarking of prognostic algorithms has been challenging due to the limited availability of common datasets suitable for prognostics. In an attempt to alleviate this problem several benchmarking datasets have been collected by NASA's Prognostics Center of Excellence and made available to the Prognostics and Health Management (PHM) community to allow evaluation and comparison of prognostics algorithms. Among those datasets are five C-MAPSS datasets that have been extremely popular due to their unique characteristics that make them suitable for prognostics. The C-MAPSS datasets pose several challenges that have been tackled by different methods in the PHM literature. In particular, management of high variability due to sensor noise, effects of operating conditions, and the presence of multiple simultaneous fault modes are some factors that have great impact on the generalization capabilities of prognostics algorithms. More than 70 publications have used the C-MAPSS datasets for developing data-driven prognostic algorithms. The C-MAPSS datasets are also shown to be well-suited for development of new machine learning and pattern recognition tools for several key preprocessing steps such as feature extraction and selection, failure mode assessment, operating conditions assessment, health status estimation, uncertainty management, and prognostics performance evaluation. This paper summarizes a comprehensive literature review of publications using C-MAPSS datasets and provides guidelines and references to further usage of these datasets in a manner that allows clear and consistent comparison between different approaches.
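
    As an example of a preprocessing step shared by many of the surveyed publications, a common first move is deriving a remaining-useful-life (RUL) target from the run-to-failure training files. The sketch below assumes the dataset's documented whitespace-separated layout (unit id, cycle, three operational settings, 21 sensor channels) and a locally downloaded file name:

```python
# Common first step with a C-MAPSS training file (a sketch; the column
# layout follows the dataset's readme and the file path is assumed local).
import pandas as pd

cols = (["unit", "cycle", "op1", "op2", "op3"]
        + [f"s{i}" for i in range(1, 22)])
df = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None, names=cols)

# Remaining useful life: cycles left until each engine's final cycle.
df["rul"] = df.groupby("unit")["cycle"].transform("max") - df["cycle"]
print(df[["unit", "cycle", "rul"]].head())
```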

  7. Use of Electronic Health-Related Datasets in Nursing and Health-Related Research.

    PubMed

    Al-Rawajfah, Omar M; Aloush, Sami; Hewitt, Jeanne Beauchamp

    2015-07-01

    Datasets of gigabyte size are common in medical sciences. There is increasing consensus that significant untapped knowledge lies hidden in these large datasets. This review article aims to discuss Electronic Health-Related Datasets (EHRDs) in terms of types, features, advantages, limitations, and possible uses in nursing and health-related research. Major scientific databases, MEDLINE, ScienceDirect, and Scopus, were searched for studies or review articles on the use of EHRDs in research. A total of 442 articles were located. After application of the study inclusion criteria, 113 articles were included in the final review. EHRDs were categorized into Electronic Administrative Health-Related Datasets and Electronic Clinical Health-Related Datasets, and subcategories of each major category were identified. EHRDs are invaluable assets for nursing and health-related research. Advanced research skills, such as using analytical software, advanced statistical procedures, and methods for dealing with missing data and missing variables, will maximize the efficient utilization of EHRDs in research. © The Author(s) 2014.

  8. interPopula: a Python API to access the HapMap Project dataset

    PubMed Central

    2010-01-01

    Background The HapMap project is a publicly available catalogue of common genetic variants that occur in humans, currently including several million SNPs across 1115 individuals spanning 11 different populations. This important database does not provide any programmatic access to the dataset, furthermore no standard relational database interface is provided. Results interPopula is a Python API to access the HapMap dataset. interPopula provides integration facilities with both the Python ecology of software (e.g. Biopython and matplotlib) and other relevant human population datasets (e.g. Ensembl gene annotation and UCSC Known Genes). A set of guidelines and code examples to address possible inconsistencies across heterogeneous data sources is also provided. Conclusions interPopula is a straightforward and flexible Python API that facilitates the construction of scripts and applications that require access to the HapMap dataset. PMID:21210977

  9. Segmentation of Unstructured Datasets

    NASA Technical Reports Server (NTRS)

    Bhat, Smitha

    1996-01-01

    Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.

  10. Fitting Meta-Analytic Structural Equation Models with Complex Datasets

    ERIC Educational Resources Information Center

    Wilson, Sandra Jo; Polanin, Joshua R.; Lipsey, Mark W.

    2016-01-01

    A modification of the first stage of the standard procedure for two-stage meta-analytic structural equation modeling for use with large complex datasets is presented. This modification addresses two common problems that arise in such meta-analyses: (a) primary studies that provide multiple measures of the same construct and (b) the correlation…

  11. Genetic Factors Influencing Coagulation Factor XIII B-Subunit Contribute to Risk of Ischemic Stroke.

    PubMed

    Hanscombe, Ken B; Traylor, Matthew; Hysi, Pirro G; Bevan, Stephen; Dichgans, Martin; Rothwell, Peter M; Worrall, Bradford B; Seshadri, Sudha; Sudlow, Cathie; Williams, Frances M K; Markus, Hugh S; Lewis, Cathryn M

    2015-08-01

    Abnormal coagulation has been implicated in the pathogenesis of ischemic stroke, but how this association is mediated and whether it differs between ischemic stroke subtypes is unknown. We determined the shared genetic risk between 14 coagulation factors and ischemic stroke and its subtypes. Using genome-wide association study results for 14 coagulation factors from the population-based TwinsUK sample (N≈2000 for each factor), meta-analysis results from the METASTROKE consortium ischemic stroke genome-wide association study (12 389 cases, 62 004 controls), and genotype data for 9520 individuals from the WTCCC2 ischemic stroke study (3548 cases, 5972 controls; the largest METASTROKE subsample), we explored shared genetic risk for coagulation and stroke. We performed three analyses: (1) a test for excess concordance (or discordance) in single nucleotide polymorphism effect direction across coagulation and stroke, (2) an estimation of the joint effect of multiple coagulation-associated single nucleotide polymorphisms in stroke, and (3) an evaluation of common genetic risk between coagulation and stroke. One coagulation factor, factor XIII subunit B (FXIIIB), showed consistent effects in the concordance analysis, the estimation of polygenic risk, and the validation with genotype data, with associations specific to the cardioembolic stroke subtype. Effect directions for FXIIIB-associated single nucleotide polymorphisms were significantly discordant with cardioembolic disease (smallest P=5.7×10^-4); the joint effect of FXIIIB-associated single nucleotide polymorphisms was significantly predictive of ischemic stroke (smallest P=1.8×10^-4) and the cardioembolic subtype (smallest P=1.7×10^-4). We found substantial negative genetic covariation between FXIIIB and ischemic stroke (rG=-0.71, P=0.01) and the cardioembolic subtype (rG=-0.80, P=0.03). Genetic markers associated with low FXIIIB levels increase the risk of the cardioembolic subtype of ischemic stroke.

  12. Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories.

    PubMed

    Jong, Victor L; Novianti, Putri W; Roes, Kit C B; Eijkemans, Marinus J C

    2014-12-01

    The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of filtering, namely detection-call and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using the Box's M statistic on permuted samples. We found that correlation structures differ significantly between datasets of the same and/or different etiological disease categories, and that variance filtering eliminates more uncorrelated probesets than detection-call filtering and thus renders the data highly correlated.
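
    For reference, Box's M compares each group's covariance matrix with a pooled estimate; a generic implementation sketch follows (the study's permutation scheme and microarray-specific preprocessing are not reproduced, and the chi-square approximation assumes well-conditioned covariance matrices):

```python
# Generic sketch of Box's M test for homogeneity of covariance matrices.
import numpy as np
from scipy import stats

def box_m(groups):
    """groups: list of (n_i x p) data arrays. Returns (M, chi2 p-value)."""
    k = len(groups)
    p = groups[0].shape[1]
    ns = np.array([g.shape[0] for g in groups])
    covs = [np.cov(g, rowvar=False) for g in groups]
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (ns.sum() - k)
    M = (ns.sum() - k) * np.log(np.linalg.det(pooled)) - sum(
        (n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    # Box's chi-square approximation
    c = ((2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (k - 1))
         * (np.sum(1 / (ns - 1)) - 1 / (ns.sum() - k)))
    dof = p * (p + 1) * (k - 1) / 2
    return M, stats.chi2.sf(M * (1 - c), dof)

# toy usage: three groups of 30 samples over 4 variables
rng = np.random.default_rng(0)
print(box_m([rng.normal(size=(30, 4)) for _ in range(3)]))
```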

  13. Common Structure in Different Physical Properties: Electrical Conductivity and Surface Waves Phase Velocity

    NASA Astrophysics Data System (ADS)

    Mandolesi, E.; Jones, A. G.; Roux, E.; Lebedev, S.

    2009-12-01

    Recently, various studies have been undertaken on the correlation between diverse geophysical datasets. Magnetotelluric (MT) data are used to map the electrical conductivity structure beneath the Earth's surface, but one of the problems with the MT method is its lack of resolution in mapping zones that lie beneath a region of high conductivity. Joint inversion of different datasets in which a common structure is recognizable reduces non-uniqueness and may improve the quality of interpretation when the different datasets are sensitive to different physical properties with an underlying common structure. A common structure is recognized if the changes in physical properties occur at the same spatial locations. Common structure may be recognized in 1D inversion of seismic and MT datasets, and numerous authors have shown that a 2D common structure can also lead to an improvement in inversion quality when the datasets are jointly inverted. In this presentation, a tool to constrain MT 2D inversion with phase velocities of surface-wave seismic data (SW) is proposed and is being developed and tested on synthetic data. The results obtained suggest that such a joint inversion scheme could be applied successfully along a section profile for which the data are compatible with a 2D MT model.

  14. Energy Levels and Radiative Rates for Transitions in F-like Sc XIII and Ne-like Sc XII and Y XXX

    NASA Astrophysics Data System (ADS)

    Aggarwal, Kanti

    2018-05-01

    Energy levels, radiative rates and lifetimes are reported for F-like Sc XIII and Ne-like Sc XII and Y XXX, for which the general-purpose relativistic atomic structure package (GRASP) has been adopted. For all three ions limited data exist in the literature, but comparisons have been made wherever possible to assess the accuracy of the calculations. In the present work the lowest 102, 125 and 139 levels have been considered for the respective ions. Additionally, calculations have also been performed with the flexible atomic code (FAC) to (particularly) confirm the accuracy of energy levels.

  15. ESSG-based global spatial reference frame for datasets interrelation

    NASA Astrophysics Data System (ADS)

    Yu, J. Q.; Wu, L. X.; Jia, Y. J.

    2013-10-01

    To understand the highly complex Earth system, a large volume, as well as a large variety, of datasets on the planet Earth are being obtained, distributed, and shared worldwide every day. However, few existing systems concentrate on the distribution and interrelation of different datasets in a common Global Spatial Reference Frame (GSRF), which poses an invisible obstacle to data sharing and scientific collaboration. The Group on Earth Observations (GEO) has recently established a new GSRF, named the Earth System Spatial Grid (ESSG), for global dataset distribution, sharing and interrelation in its 2012-2015 Work Plan. The ESSG may bridge the gap among different spatial datasets and hence overcome these obstacles. This paper presents the implementation of the ESSG-based GSRF. A reference spheroid, a grid subdivision scheme, and a suitable encoding system are required to implement it. The radius of the ESSG reference spheroid was set to twice the approximate Earth radius so that datasets from the different areas of Earth system science are covered. The same positioning and orientation parameters as the Earth-Centred, Earth-Fixed (ECEF) frame were adopted for the ESSG reference spheroid so that any other GSRF can be freely transformed into the ESSG-based GSRF. The spheroid degenerated octree grid with radius refinement (SDOG-R) and its encoding method were taken as the grid subdivision and encoding scheme for their good performance in many respects. A triple (C, T, A) model is introduced to represent and link different datasets based on the ESSG-based GSRF. Finally, methods of coordinate transformation between the ESSG-based GSRF and other GSRFs are presented to make the ESSG-based GSRF operable and propagable.

  16. Statistical analysis of co-occurrence patterns in microbial presence-absence datasets.

    PubMed

    Mainali, Kumar P; Bewick, Sharon; Thielen, Peter; Mehoke, Thomas; Breitwieser, Florian P; Paudel, Shishir; Adhikari, Arjun; Wolfe, Joshua; Slud, Eric V; Karig, David; Fagan, William F

    2017-01-01

    Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially for whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson's correlation coefficient (r) and Jaccard's index (J), two of the most common metrics for correlation analysis of presence-absence data, can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson's correlation coefficient can result in artificial inflation of positive taxon relationships and how this is a particular problem for microbiome studies. We then illustrate how Jaccard's index of similarity (J) can yield improvements over Pearson's correlation coefficient. However, the standard null model for Jaccard's index is flawed, and thus introduces its own set of spurious conclusions. We thus identify a better null model based on a hypergeometric distribution, which appropriately corrects for
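
    A minimal illustration of the comparison discussed above, including a hypergeometric null for co-occurrence of the kind the authors advocate (our toy vectors and scoring, not the study's data or full testing pipeline):

```python
# Toy comparison of Pearson's r and Jaccard's J on presence-absence data,
# plus a hypergeometric co-occurrence null.
import numpy as np
from scipy import stats

a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])  # presence-absence, taxon A
b = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # presence-absence, taxon B

r, _ = stats.pearsonr(a, b)          # Pearson's r (phi coefficient on 0/1 data)
both = int(np.sum(a & b))            # joint presences
jaccard = both / int(np.sum(a | b))  # Jaccard's index J

# Hypergeometric null: given each taxon's prevalence, the probability of
# observing at least `both` joint presences across n samples by chance.
n, ka, kb = len(a), int(a.sum()), int(b.sum())
p_cooccur = stats.hypergeom.sf(both - 1, n, ka, kb)
print(f"r={r:.2f}, J={jaccard:.2f}, P(>= both | null)={p_cooccur:.3f}")
```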

  17. Statistical analysis of co-occurrence patterns in microbial presence-absence datasets

    PubMed Central

    Bewick, Sharon; Thielen, Peter; Mehoke, Thomas; Breitwieser, Florian P.; Paudel, Shishir; Adhikari, Arjun; Wolfe, Joshua; Slud, Eric V.; Karig, David; Fagan, William F.

    2017-01-01

    Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially for whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson’s correlation coefficient (r) and Jaccard’s index (J)–two of the most common metrics for correlation analysis of presence-absence data–can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson’s correlation coefficient can result in artificial inflation of positive taxon relationships and how this is a particular problem for microbiome studies. We then illustrate how Jaccard’s index of similarity (J) can yield improvements over Pearson’s correlation coefficient. However, the standard null model for Jaccard’s index is flawed, and thus introduces its own set of spurious conclusions. We thus identify a better null model based on a hypergeometric distribution, which appropriately

  18. Systematic analysis of microarray datasets to identify Parkinson's disease‑associated pathways and genes.

    PubMed

    Feng, Yinling; Wang, Xuefeng

    2017-03-01

    In order to investigate commonly disturbed genes and pathways in various brain regions of patients with Parkinson's disease (PD), microarray datasets from previous studies were collected and systematically analyzed. Different normalization methods were applied to microarray datasets from different platforms. A strategy combining gene co‑expression networks and clinical information was adopted, using weighted gene co‑expression network analysis (WGCNA) to screen for commonly disturbed genes in different brain regions of patients with PD. Functional enrichment analysis of commonly disturbed genes was performed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Co‑pathway relationships were identified with Pearson's correlation coefficient tests and a hypergeometric distribution‑based test. Common genes in pathway pairs were selected out and regarded as risk genes. A total of 17 microarray datasets from 7 platforms were retained for further analysis. Five gene coexpression modules were identified, containing 9,745, 736, 233, 101 and 93 genes, respectively. One module was significantly correlated with PD samples and thus the 736 genes it contained were considered to be candidate PD‑associated genes. Functional enrichment analysis demonstrated that these genes were implicated in oxidative phosphorylation and PD. A total of 44 pathway pairs and 52 risk genes were revealed, and a risk gene pathway relationship network was constructed. Eight modules were identified and were revealed to be associated with PD, cancers and metabolism. A number of disturbed pathways and risk genes were unveiled in PD, and these findings may help advance understanding of PD pathogenesis.

  19. Statistical analysis of large simulated yield datasets for studying climate effects

    USDA-ARS?s Scientific Manuscript database

    Ensembles of process-based crop models are now commonly used to simulate crop growth and development for climate scenarios of temperature and/or precipitation changes corresponding to different projections of atmospheric CO2 concentrations. This approach generates large datasets with thousands of de...

  20. Agreement between factor XIII activity and antigen assays in measurement of factor XIII: A French multicenter study of 147 human plasma samples.

    PubMed

    Caron, C; Meley, R; Le Cam Duchez, V; Aillaud, M F; Lavenu-Bombled, C; Dutrillaux, F; Flaujac, C; Ryman, A; Ternisien, C; Lasne, D; Galinat, H; Pouplard, C

    2017-06-01

    Factor XIII (FXIII) deficiency is a rare hemorrhagic disorder whose early diagnosis is crucial for appropriate treatment and prophylactic supplementation in cases of severe deficiency. International guidelines recommend a quantitative FXIII activity assay as first-line screening test. FXIII antigen measurement may be performed to establish the subtype of FXIII deficiency (FXIIID) when activity is decreased. The aim of this multicenter study was to evaluate the analytical and diagnostic levels of performance of a new latex immunoassay, the K-Assay® FXIII reagent from Stago, for first-line measurement of FXIII antigen. Results were compared to those obtained with the Berichrom® FXIII chromogenic assay for measurement of FXIII activity. Of the 147 patient plasma samples, 138 were selected for analysis. The accuracy was very good, with intercenter reproducibility close to 7%. Five groups were defined on FXIII activity level (<5% (n = 5), 5%-30% (n = 23), 30%-60% (n = 17), 60%-120% (n = 69), above 120% (n = 24)), without statistical differences between activity and antigen levels (P value >0.05). Correlation of the K-Assay® with the Berichrom® FXIII activity results was excellent (r = 0.919). Good agreement was established by the Bland and Altman method, with a bias of +9.4% on all samples, and of -1.4% for FXIII levels lower than 30%. One patient with afibrinogenemia showed low levels of Berichrom® FXIII activity but normal antigen level and clot solubility as expected. The measurement of FXIII antigen using the K-Assay® is a reliable first-line tool for detection of FXIII deficiency when an activity assay is not available. © 2017 John Wiley & Sons Ltd.
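
    For readers unfamiliar with the agreement analysis used here, a generic Bland-Altman computation looks like the following sketch (simulated values only; the +9.4% figure is echoed merely as an arbitrary simulation parameter, not a reanalysis of the study's data):

```python
# Generic Bland-Altman agreement sketch: bias and 95% limits of agreement
# between two assays measured on the same samples (simulated data).
import numpy as np

rng = np.random.default_rng(1)
activity = rng.uniform(5, 150, size=50)              # simulated FXIII activity, %
antigen = activity + rng.normal(9.4, 12.0, size=50)  # simulated antigen, %

diff = antigen - activity
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))
print(f"bias = {bias:+.1f}%, 95% limits of agreement = "
      f"({loa[0]:.1f}%, {loa[1]:.1f}%)")
```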

  1. EDITORIAL: XIII Mexican Workshop on Particles and Fields

    NASA Astrophysics Data System (ADS)

    Barranco, Juan; Contreras, Guillermo; Delepine, David; Napsuciale, Mauro

    2012-08-01

    Editors: Juan Barranco (Physics Department, Guanajuato University, Loma del Bosque 103, col. Loma del Campestre, 37150, Leon, Mexico; jbarranc@fisica.ugto.mx), Guillermo Contreras (Departamento de Fisica Aplicada, Centro de Investigacion y de Estudios Avanzados del Instituto Politecnico Nacional, Merida, Mexico; jgcn@mda.cinvestav.mx), David Delepine (Physics Department, Guanajuato University; delepine@fisica.ugto.mx) and Mauro Napsuciale (Physics Department, Guanajuato University; mauro@fisica.ugto.mx). The XIII Mexican Workshop on Particles and Fields (MWPF) took place from 20-26 October 2011 in the city of León, Guanajuato, México. This is a biennial meeting organized by the Division of Particles and Fields of the Mexican Physical Society, designed to gather specialists in different areas of high energy physics to discuss the latest developments in the field. The thirteenth edition of this meeting was hosted by the Department of Cultural Studies of Guanajuato University in a pleasant environment dedicated to the arts and culture. The XIII MWPF was organized by three working groups, each responsible for the sessions on one of three topics. The first was Strings, Cosmology, Astroparticles and Physics Beyond the Standard Model, which included Cosmic Rays, Gamma Ray Bursts, Physics Beyond the Standard Model (theory and experimental searches), Strings and Cosmology; the working group for this topic was formed by Arnulfo Zepeda, Oscar Loaiza, Axel de la Macorra and Myriam Mondragón. The second topic was Hadronic Matter, which included Perturbative QCD, Jets and Diffractive Physics, Hadronic Structure, Soft QCD, Hadron Spectroscopy, Heavy Ion Collisions and Soft Physics at Hadron Colliders, Lattice Results and Instrumentation; the working group for this topic comprised Wolfgang Bietenholz and Mariana Kirchbach. The third topic was

  2. Fixing Dataset Search

    NASA Technical Reports Server (NTRS)

    Lynnes, Chris

    2014-01-01

    Three current search engines are queried for ozone data at the GES DISC. The results range from sub-optimal to counter-intuitive. We propose a method to fix dataset search by implementing a robust relevancy ranking scheme. The relevancy ranking scheme is based on several heuristics culled from more than 20 years of helping users select datasets.
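    The abstract does not spell out the heuristics, so the sketch below is a hypothetical illustration of heuristic relevancy ranking: each rule contributes a weighted score and datasets are sorted by the total. All rule names, fields, and weights here are invented for illustration.

    # Hypothetical heuristic relevancy ranking for dataset search.
    # Rules and weights are illustrative, not the GES DISC implementation.
    def relevance(dataset, query_terms):
        score = 0.0
        for term in query_terms:
            if term in dataset["title"].lower():
                score += 3.0                      # title match outweighs abstract match
            if term in dataset["abstract"].lower():
                score += 1.0
        if dataset.get("is_latest_version"):
            score += 2.0                          # prefer the current version of a product
        return score

    datasets = [
        {"title": "OMI/Aura Ozone Total Column Daily L3", "abstract": "total column ozone",
         "is_latest_version": True},
        {"title": "Ozone zonal means", "abstract": "derived ozone statistics",
         "is_latest_version": False},
    ]
    ranked = sorted(datasets, key=lambda d: relevance(d, ["ozone"]), reverse=True)
    print([d["title"] for d in ranked])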

  3. Extreme Ultraviolet Emission Lines of Iron Fe XI-XIII

    NASA Astrophysics Data System (ADS)

    Lepson, Jaan; Beiersdorfer, P.; Brown, G. V.; Liedahl, D. A.; Brickhouse, N. S.; Dupree, A. K.

    2013-04-01

    The extreme ultraviolet (EUV) spectral region (ca. 20-300 Å) is rich in emission lines from low- to mid-Z ions, particularly from the middle charge states of iron. Many of these emission lines are important diagnostics for astrophysical plasmas, providing information on properties such as elemental abundance, temperature, density, and even magnetic field strength. In recent years, strides have been made to understand the complexity of the atomic levels of the ions that emit the lines that contribute to the richness of the EUV region. Laboratory measurements have been made to verify and benchmark the lines. Here, we present laboratory measurements of Fe XI, Fe XII, and Fe XIII between 40-140 Å. The measurements were made at the Lawrence Livermore electron beam ion trap (EBIT) facility, which has been optimized for laboratory astrophysics and which allows us to select specific charge states of iron to aid line identification. We also present new calculations from the Hebrew University - Lawrence Livermore Atomic Code (HULLAC), which we likewise used for line identification. We found that HULLAC does a creditable job of reproducing the forest of lines we observed in the EBIT spectra, although line positions need adjustment and line intensities often differ from those observed. We identify or confirm a number of new lines for these charge states. This work was supported by the NASA Solar and Heliospheric Program under Contract NNH10AN31I and the DOE General Plasma Science program. Work was performed in part under the auspices of the Department of Energy by Lawrence Livermore National Laboratory under Contract DEAC52-07NA27344.

  4. Recombinant Factor XIII Mitigates Hemorrhagic Shock-Induced Organ Dysfunction

    PubMed Central

    Zaets, Sergey B.; Xu, Da-Zhong; Lu, Qi; Feketova, Eleonora; Berezina, Tamara L.; Malinina, Inga V.; Deitch, Edwin A.; Olsen, Eva H.

    2012-01-01

    Background Plasma factor XIII (FXIII) is responsible for stabilization of fibrin clot at the final stage of blood coagulation. Since FXIII has also been shown to modulate inflammation, endothelial permeability, as well as diminish multiple organ dysfunction (MOD) after gut ischemia-reperfusion injury, we hypothesized that FXIII would reduce MOD caused by trauma-hemorrhagic shock (THS). Materials and methods Rats were subjected to a 90 min THS or trauma sham shock (TSS) and treated with either recombinant human FXIII A2 subunit (rFXIII) or placebo immediately after resuscitation with shed blood or at the end of the TSS period. Lung permeability, lung and gut myeloperoxidase (MPO) activity, gut histology, neutrophil respiratory burst, microvascular blood flow in the liver and muscles, and cytokine levels were measured 3 h after the THS or TSS. FXIII levels were measured before THS or TSS and after the 3-h post-shock period. Results THS-induced lung permeability as well as lung and gut MPO activity was significantly lower in rFXIII-treated than in placebo-treated animals. Similarly, rFXIII-treated rats had lower neutrophil respiratory burst activity and less ileal mucosal injury. rFXIII-treated rats also had a higher liver microvascular blood flow compared with the placebo group. Cytokine response was more favorable in rFXIII-treated animals. Trauma-hemorrhagic shock did not cause a drop in FXIII activity during the study period. Conclusions Administration of rFXIII diminishes THS-induced MOD in rats, presumably by preservation of the gut barrier function, limitation of polymorphonuclear leukocyte (PMN) activation, and modulation of the cytokine response. PMID:21276979

  5. Functional factor XIII-A is exposed on the stimulated platelet surface

    PubMed Central

    Mitchell, Joanne L.; Lionikiene, Ausra S.; Fraser, Steven R.; Whyte, Claire S.; Booth, Nuala A.

    2014-01-01

    Factor XIII (FXIII) stabilizes thrombi against fibrinolysis by cross-linking α2-antiplasmin (α2AP) to fibrin. Cellular FXIII (FXIII-A) is abundant in platelets, but the extracellular functions of this pool are unclear because it is not released by classical secretion mechanisms. We examined the function of platelet FXIII-A using Chandler model thrombi formed from FXIII-depleted plasma. Platelets stabilized FXIII-depleted thrombi in a transglutaminase-dependent manner. FXIII-A activity on activated platelets was unstable and was rapidly lost over 1 hour. Inhibiting platelet activation abrogated the ability of platelets to stabilize thrombi. Incorporating a neutralizing antibody to α2AP into FXIII-depleted thrombi revealed that the stabilizing effect of platelet FXIII-A on lysis was α2AP dependent. Platelet FXIII-A activity and antigen were associated with the cytoplasm and membrane fraction of unstimulated platelets, and these fractions were functional in stabilizing FXIII-depleted thrombi against lysis. Fluorescence confocal microscopy and flow cytometry revealed exposure of FXIII-A on activated membranes, with maximal signal detected with thrombin and collagen stimulation. FXIII-A was evident in protruding caps on the surface of phosphatidylserine-positive platelets. Our data show a functional role for platelet FXIII-A through exposure on the activated platelet membrane where it exerts antifibrinolytic function by cross-linking α2AP to fibrin. PMID:25331118

  6. Decoys Selection in Benchmarking Datasets: Overview and Perspectives

    PubMed Central

    Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu

    2018-01-01

    Virtual Screening (VS) is designed to prospectively help identify potential hits, i.e., compounds capable of interacting with a given target and potentially modulating its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is best adapted to the query/target system under study and that yields the most reliable output. To this end, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds from benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compound subsets is critical to limit biases in the evaluation of VS methods. In this review, we focus on the selection of decoy compounds, which has changed considerably over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoy selection in benchmarking databases, as well as current benchmarking databases that tend to minimize the introduction of biases, and second, we propose recommendations for the selection and design of benchmarking datasets. PMID:29416509
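    A common way to score a VS protocol against such a benchmark is retrieval of actives ahead of decoys, for example via the area under the ROC curve. A minimal sketch with invented scores:

    # Score a virtual-screening run against a benchmark of actives and decoys.
    # Labels and scores are illustrative.
    from sklearn.metrics import roc_auc_score

    labels = [1, 1, 1, 0, 0, 0, 0, 0]                      # 1 = known active, 0 = decoy
    scores = [0.91, 0.85, 0.40, 0.62, 0.30, 0.22, 0.15, 0.05]  # higher = predicted active

    print(f"AUC = {roc_auc_score(labels, scores):.2f}")    # 1.0 = perfect, 0.5 = random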

  7. FLUXNET2015 Dataset: Batteries included

    NASA Astrophysics Data System (ADS)

    Pastorello, G.; Papale, D.; Agarwal, D.; Trotta, C.; Chu, H.; Canfora, E.; Torn, M. S.; Baldocchi, D. D.

    2016-12-01

    The synthesis datasets have become one of the signature products of the FLUXNET global network. They are composed from contributions of individual site teams to regional networks and then compiled into uniform data products, now used in a wide variety of research efforts: from plant-scale microbiology to global-scale climate change. The FLUXNET Marconi Dataset in 2000 was the first in the series, followed by the FLUXNET LaThuile Dataset in 2007, with significant additions of data products and coverage, solidifying the adoption of the datasets as a research tool. The FLUXNET2015 Dataset brings another round of substantial improvements, including extended quality control processes and checks, use of downscaled reanalysis data for filling long gaps in micrometeorological variables, multiple methods for USTAR threshold estimation and flux partitioning, and uncertainty estimates, all of them accompanied by auxiliary flags. This "batteries included" approach provides a lot of information for anyone who wants to explore the data (and the processing methods) in detail, which inevitably leads to a large number of data variables. Although dealing with all these variables might seem overwhelming at first, especially to someone looking at eddy covariance data for the first time, there is method to our madness. In this work we describe the data products and variables that are part of the FLUXNET2015 Dataset and the rationale behind the organization of the dataset, covering the simplified version (labeled SUBSET), the complete version (labeled FULLSET), and the auxiliary products in the dataset.
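    As a hedged illustration of working with the SUBSET product, the sketch below loads a daily SUBSET file with pandas. The file name follows the dataset's naming pattern and the variable names (TA_F, NEE_VUT_REF and its _QC flag) are typical of the product, but both should be checked against the actual release.

    # Sketch: read a FLUXNET2015 SUBSET daily file and keep well gap-filled values.
    # File name and columns are assumptions based on the product's conventions.
    import pandas as pd

    df = pd.read_csv("FLX_US-Ha1_FLUXNET2015_SUBSET_DD_1991-2012_1-3.csv",
                     na_values=-9999)                 # -9999 marks missing values
    good = df[df["NEE_VUT_REF_QC"] >= 0.75]           # QC flag: fraction of good gap-filling
    print(good[["TA_F", "NEE_VUT_REF"]].describe())   # air temperature and NEE summaries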

  8. Isfahan MISP Dataset

    PubMed Central

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled “biosigdata.com.” It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while maintaining their privacy (citation and fee). Commenting was also available for all datasets, and automatic sitemap and semi-automatic SEO indexing have been set for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf). PMID:28487832

  9. Isfahan MISP Dataset.

    PubMed

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled "biosigdata.com." It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while maintaining their privacy (citation and fee). Commenting was also available for all datasets, and automatic sitemap and semi-automatic SEO indexing have been set for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).

  10. Preprocessed Consortium for Neuropsychiatric Phenomics dataset.

    PubMed

    Gorgolewski, Krzysztof J; Durnez, Joke; Poldrack, Russell A

    2017-01-01

    Here we present preprocessed MRI data of 265 participants from the Consortium for Neuropsychiatric Phenomics (CNP) dataset. The preprocessed dataset includes minimally preprocessed data in the native, MNI and surface spaces, accompanied by potential confound regressors, tissue probability masks, brain masks and transformations. In addition, the preprocessed dataset includes unthresholded group-level and single-subject statistical maps from all tasks included in the original dataset. We hope that the availability of this dataset will greatly accelerate research.

  11. Automatic Diabetic Macular Edema Detection in Fundus Images Using Publicly Available Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large-scale screening environment DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on the MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing. Our algorithm is robust to segmentation uncertainties, does not need ground truth at lesion level, and is very fast, generating a diagnosis in an average of 4.4 seconds per image on a 2.6 GHz platform with an unoptimised Matlab implementation.

  12. The ISLSCP initiative I global datasets: Surface boundary conditions and atmospheric forcings for land-atmosphere studies

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sellers, P.J.; Collatz, J.; Koster, R.

    1996-09-01

    A comprehensive series of global datasets for land-atmosphere models has been collected, formatted to a common grid, and released on a set of CD-ROMs. This paper describes the motivation for and the contents of the dataset. In June of 1992, an interdisciplinary earth science workshop was convened in Columbia, Maryland, to assess progress in land-atmosphere research, specifically in the areas of models, satellite data algorithms, and field experiments. At the workshop, representatives of the land-atmosphere modeling community defined a need for global datasets to prescribe boundary conditions, initialize state variables, and provide near-surface meteorological and radiative forcings for their models. The International Satellite Land Surface Climatology Project (ISLSCP), a part of the Global Energy and Water Cycle Experiment, worked with the Distributed Active Archive Center of the National Aeronautics and Space Administration Goddard Space Flight Center to bring the required datasets together in a usable format. The data have since been released on a collection of CD-ROMs. The datasets on the CD-ROMs are grouped under the following headings: vegetation; hydrology and soils; snow, ice, and oceans; radiation and clouds; and near-surface meteorology. All datasets cover the period 1987-88, and all but a few are spatially continuous over the earth's land surface. All have been mapped to a common 1° × 1° equal-angle grid. The temporal frequency for most of the datasets is monthly. A few of the near-surface meteorological parameters are available both as six-hourly values and as monthly means.

  13. Dissecting the space-time structure of tree-ring datasets using the partial triadic analysis.

    PubMed

    Rossi, Jean-Pierre; Nardin, Maxime; Godefroid, Martin; Ruiz-Diaz, Manuela; Sergent, Anne-Sophie; Martinez-Meier, Alejandro; Pâques, Luc; Rozenberg, Philippe

    2014-01-01

    Tree-ring datasets are used in a variety of circumstances, including archeology, climatology, forest ecology, and wood technology. These data are based on microdensity profiles and consist of a set of tree-ring descriptors, such as ring width or early/latewood density, measured for a set of individual trees. Because successive rings correspond to successive years, the resulting dataset is a ring variables × trees × time datacube. Multivariate statistical analyses, such as principal component analysis, have been widely used for extracting worthwhile information from ring datasets, but they typically address two-way matrices, such as ring variables × trees or ring variables × time. Here, we explore the potential of the partial triadic analysis (PTA), a multivariate method dedicated to the analysis of three-way datasets, to apprehend the space-time structure of tree-ring datasets. We analyzed a set of 11 tree-ring descriptors measured in 149 georeferenced individuals of European larch (Larix decidua Miller) during the period of 1967-2007. The processing of densitometry profiles led to a set of ring descriptors for each tree and for each year from 1967-2007. The resulting three-way data table was subjected to two distinct analyses in order to explore i) the temporal evolution of spatial structures and ii) the spatial structure of temporal dynamics. We report the presence of a spatial structure common to the different years, highlighting the inter-individual variability of the ring descriptors at the stand scale. We found a temporal trajectory common to the trees that could be separated into a high and low frequency signal, corresponding to inter-annual variations possibly related to defoliation events and a long-term trend possibly related to climate change. We conclude that PTA is a powerful tool to unravel and hierarchize the different sources of variation within tree-ring datasets.
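    The sketch below is a compact, approximate rendering of the PTA workflow on a years x trees x descriptors datacube: normalise each yearly table, weight the tables by the first eigenvector of their RV-coefficient matrix, and run a PCA (via SVD) of the resulting compromise table. The data are simulated and the dimensions are reduced for brevity; the study's cube is 41 years x 149 trees x 11 descriptors.

    # Approximate partial triadic analysis (PTA) on a simulated datacube.
    import numpy as np

    rng = np.random.default_rng(0)
    n_years, n_trees, n_vars = 5, 30, 11
    cube = rng.normal(size=(n_years, n_trees, n_vars))

    # PTA operates on normalised yearly tables (trees x descriptors)
    slices = [(X - X.mean(0)) / X.std(0) for X in cube]

    def rv(A, B):
        # RV coefficient: matrix analogue of a squared correlation
        SA, SB = A @ A.T, B @ B.T
        return np.trace(SA @ SB) / np.sqrt(np.trace(SA @ SA) * np.trace(SB @ SB))

    # Interstructure: similarity between yearly tables -> table weights
    W = np.array([[rv(a, b) for b in slices] for a in slices])
    w = np.linalg.eigh(W)[1][:, -1]            # eigenvector of the largest eigenvalue
    w = np.abs(w) / np.abs(w).sum()

    # Compromise: weighted average table; its PCA shows the common structure
    compromise = sum(wk * Xk for wk, Xk in zip(w, slices))
    U, s, Vt = np.linalg.svd(compromise, full_matrices=False)
    print("first compromise axis explains",
          round(float(s[0] ** 2 / (s ** 2).sum()), 3), "of the variance")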

  14. Preliminary AirMSPI Datasets

    Atmospheric Science Data Center

    2018-02-26

    The data files available through this web page and ftp links are preliminary AirMSPI datasets from recent campaigns. ... and geometric corrections. Caution should be used for science analysis. At a later date, more qualified versions will be made public.

  15. Factor XIII stiffens fibrin clots by causing fiber compaction.

    PubMed

    Kurniawan, N A; Grimbergen, J; Koopman, J; Koenderink, G H

    2014-10-01

    Factor XIII-induced cross-linking has long been associated with the ability of fibrin blood clots to resist mechanical deformation, but how FXIII can directly modulate clot stiffness is unknown. We hypothesized that FXIII affects the self-assembly of fibrin fibers by altering the lateral association between protofibrils. To test this hypothesis, we studied the cross-linking kinetics and the structural evolution of the fibers and clots during the formation of plasma-derived and recombinant fibrins by using light scattering, and the response of the clots to mechanical stresses by using rheology. We show that the lateral aggregation of fibrin protofibrils initially results in the formation of floppy fibril bundles, which then compact to form tight and more rigid fibers. The first stage is reflected in a fast (10 min) increase in clot stiffness, whereas the compaction phase is characterized by a slow (hours) development of clot stiffness. Inhibition of FXIII completely abrogates the slow compaction. FXIII strongly increases the linear elastic modulus of the clots, but does not affect the non-linear response at large deformations. We propose a multiscale structural model whereby FXIII-mediated cross-linking tightens the coupling between the protofibrils within a fibrin fiber, thus making the fiber stiffer and less porous. At small strains, fiber stiffening enhances clot stiffness, because the clot response is governed by the entropic elasticity of the fibers, but once the clot is sufficiently stressed, the modulus is independent of protofibril coupling, because clot stiffness is governed by individual protofibril stretching. © 2014 International Society on Thrombosis and Haemostasis.

  16. Collaboration tools and techniques for large model datasets

    USGS Publications Warehouse

    Signell, R.P.; Carniel, S.; Chiggiato, J.; Janekovic, I.; Pullen, J.; Sherwood, C.R.

    2008-01-01

    In MREA and many other marine applications, it is common to have multiple models running with different grids, run by different institutions. Techniques and tools are described for low-bandwidth delivery of data from large multidimensional datasets, such as those from meteorological and oceanographic models, directly into generic analysis and visualization tools. Output is stored using the NetCDF CF Metadata Conventions and then delivered to collaborators over the web via OPeNDAP. OPeNDAP datasets served by different institutions are then organized via THREDDS catalogs. Tools and procedures are then used which enable scientists to explore data on the original model grids using tools they are familiar with. The approach is also low-bandwidth, enabling users to extract just the data they require, an important feature for access from ships or remote areas. The entire implementation is simple enough to be handled by modelers working with their webmasters - no advanced programming support is necessary. © 2007 Elsevier B.V. All rights reserved.
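    A minimal sketch of the access pattern described above, using xarray's OPeNDAP support. The endpoint URL and variable names are placeholders, not a real served dataset.

    # Sketch: open a remote CF-compliant model dataset over OPeNDAP and pull
    # only the subset needed; URL and variable names are hypothetical.
    import xarray as xr

    url = "https://example.org/thredds/dodsC/ocean_model/output.nc"
    ds = xr.open_dataset(url)            # lazy: only metadata travels here

    # Only this slice is transferred, not the whole multi-gigabyte file
    sst = ds["temp"].isel(time=-1).sel(lat=slice(40, 46), lon=slice(12, 20))
    print(sst.shape)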

  17. Open University Learning Analytics dataset.

    PubMed

    Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek

    2017-11-28

    Learning Analytics focuses on the collection and analysis of learners' data to improve their learning experience by providing informed guidance and to optimise learning materials. To support research in this area we have developed a dataset containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students' interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.
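    A short sketch of a typical analysis on this dataset, assuming the CSV files (the names studentInfo.csv and studentVle.csv follow the dataset's documentation) have been downloaded from the URL above: aggregate each student's clicks and relate them to the final course outcome.

    # Sketch: join OULAD demographics with aggregated VLE clicks per student.
    import pandas as pd

    info = pd.read_csv("studentInfo.csv")     # demographics + final_result
    vle  = pd.read_csv("studentVle.csv")      # daily click summaries per student

    clicks = vle.groupby("id_student")["sum_click"].sum().rename("total_clicks")
    merged = info.merge(clicks, on="id_student", how="left").fillna({"total_clicks": 0})
    print(merged.groupby("final_result")["total_clicks"].median())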

  18. Open University Learning Analytics dataset

    PubMed Central

    Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek

    2017-01-01

    Learning Analytics focuses on the collection and analysis of learners’ data to improve their learning experience by providing informed guidance and to optimise learning materials. To support research in this area we have developed a dataset containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students’ interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license. PMID:29182599

  19. Large Dataset of Acute Oral Toxicity Data Created for Testing ...

    EPA Pesticide Factsheets

    Acute toxicity data is a common requirement for substance registration in the US. Currently only data derived from animal tests are accepted by regulatory agencies, and the standard in vivo tests use lethality as the endpoint. Non-animal alternatives such as in silico models are being developed due to animal welfare and resource considerations. We compiled a large dataset of oral rat LD50 values to assess the predictive performance of currently available in silico models. Our dataset combines LD50 values from five different sources: literature data provided by The Dow Chemical Company, REACH data from eChemportal, HSDB (Hazardous Substances Data Bank), RTECS data from Leadscope, and the training set underpinning TEST (Toxicity Estimation Software Tool). Combined, these data sources yield 33848 chemical-LD50 pairs (data points), with 23475 unique data points covering 16439 compounds. The entire dataset was loaded into a chemical properties database. All of the compounds were registered in DSSTox and 59.5% have publicly available structures. Compounds without a structure in DSSTox are currently having their structures registered. The structural data will be used to evaluate the predictive performance and applicable chemical domains of three QSAR models (TIMES, PROTOX, and TEST). Future work will combine the dataset with information from ToxCast assays and, using random forest modeling, assess whether ToxCast assays are useful in predicting acute oral toxicity. Pre
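    A hedged sketch of the compilation step described above: concatenate per-source tables, drop duplicate chemical-value pairs, and summarise per compound. The column names are hypothetical.

    # Sketch: merge LD50 tables from several sources and collapse duplicates.
    # Source names and columns are illustrative.
    import pandas as pd

    sources = {
        "dow":  pd.DataFrame({"casrn": ["50-00-0"], "ld50_mg_kg": [100.0]}),
        "hsdb": pd.DataFrame({"casrn": ["50-00-0", "64-17-5"],
                              "ld50_mg_kg": [120.0, 7060.0]}),
    }
    frames = [df.assign(source=name) for name, df in sources.items()]
    combined = pd.concat(frames, ignore_index=True)

    # A chemical may carry several measurements; keep unique chemical-value pairs
    unique_points = combined.drop_duplicates(subset=["casrn", "ld50_mg_kg"])
    per_chemical  = unique_points.groupby("casrn")["ld50_mg_kg"].median()
    print(len(combined), "records ->", len(unique_points), "unique points,",
          per_chemical.size, "compounds")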

  20. [A case of pancreatic and duodenal fistula after total gastrectomy successfully treated with coagulation factor XIII].

    PubMed

    Nishino, Hitoe; Kojima, Kazuhiro; Oshima, Hirokazu; Nakagawa, Koji; Fumura, Masao; Kikuchi, Norio

    2013-11-01

    Pancreatic fistula (PF) is a challenging postoperative complication. We report a case of PF following gastrectomy successfully treated using intravenous coagulation factor XIII (FXIII). A 78-year-old man with early gastric cancer underwent total gastrectomy with Roux-en-Y reconstruction. PF developed postoperatively, following which leakage from the duodenal stump was observed. Percutaneous drainage and re-operative surgery were performed. A somatostatin analogue, antibiotic drugs, and gabexate mesilate were administered along with nutritional support. The pancreatic and duodenal fistula had been producing duodenal juice for over 30 days after the re-operative surgery. As suspected, reduced FXIII activity was confirmed in the patient. After administering FXIII for 5 days, the amount of duodenal juice from the fistula markedly decreased, and the fistula closed immediately afterwards. These results suggest that administration of FXIII could be a reasonable and effective treatment for patients with pancreatic and/or enterocutaneous fistulas that are resistant to standard treatments.

  1. A New Combinatorial Optimization Approach for Integrated Feature Selection Using Different Datasets: A Prostate Cancer Transcriptomic Study

    PubMed Central

    Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo

    2015-01-01

    Background The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics. Methods We propose a new combinatorial optimization problem that addresses the core issue of biomarker detection in integrated datasets. Optimal solutions for this model deliver a feature selection from a panel of prospective biomarkers. The model we propose is a generalised version of the (α,β)-k-Feature Set problem. We illustrate the performance of this new methodology via a challenging meta-analysis task involving six prostate cancer microarray datasets. The results are then compared to the popular RankProd meta-analysis tool and to what can be obtained by analysing the individual datasets by statistical and combinatorial methods alone. Results Application of the integrated method resulted in a more informative signature than the rank-based meta-analysis or individual dataset results, and overcomes problems arising from real world datasets. The set of genes identified is highly significant in the context of prostate cancer. The method used does not rely on homogenisation or transformation of values to a common scale, and at the same time is able to capture markers associated with subgroups of the disease. PMID:26106884
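    To make the combinatorial model concrete, the sketch below implements a greedy toy version of the underlying idea: select features so that every pair of samples from different classes is distinguished by at least alpha of the selected features. The actual (α,β)-k-Feature Set problem is solved exactly as a combinatorial optimisation and also constrains within-class agreement (β); the greedy loop and the simulated binary data here are illustrative only.

    # Greedy toy version of alpha-coverage feature selection (illustrative).
    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(8, 30))      # 8 samples x 30 binarised genes
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    # Every between-class sample pair must be distinguished by >= alpha features
    pairs = [(i, j) for i, j in combinations(range(len(y)), 2) if y[i] != y[j]]
    alpha, selected = 2, []
    cover = {p: 0 for p in pairs}

    while any(c < alpha for c in cover.values()):
        gains = [-1 if f in selected else
                 sum(1 for (i, j) in pairs
                     if cover[(i, j)] < alpha and X[i, f] != X[j, f])
                 for f in range(X.shape[1])]
        f = int(np.argmax(gains))             # feature covering most open pairs
        if gains[f] <= 0:
            break                             # no remaining feature helps
        selected.append(f)
        for (i, j) in pairs:
            if X[i, f] != X[j, f]:
                cover[(i, j)] += 1

    print("selected features:", selected)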

  2. Vehicle Classification Using an Imbalanced Dataset Based on a Single Magnetic Sensor.

    PubMed

    Xu, Chang; Wang, Yingguan; Bao, Xinghe; Li, Fengrong

    2018-05-24

    This paper aims to improve the accuracy of automatic vehicle classifiers for imbalanced datasets. Classification is performed using a single anisotropic magnetoresistive sensor, with vehicles classified into hatchbacks, sedans, buses, and multi-purpose vehicles (MPVs). Using time-domain and frequency-domain features in combination with three common classification algorithms in pattern recognition, we develop a novel feature extraction method for vehicle classification. The three classification algorithms are the k-nearest neighbor, the support vector machine, and the back-propagation neural network. However, the original vehicle magnetic dataset collected is imbalanced, which may lead to inaccurate classification results. With this in mind, we apply the SMOTE over-sampling technique, which can further boost the performance of the classifiers. Experimental results show that the k-nearest neighbor (KNN) classifier with the SMOTE algorithm reaches a classification accuracy of 95.46%, thus minimizing the effect of the imbalance.
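    SMOTE and the KNN classifier are both available off the shelf; a minimal sketch with synthetic imbalanced data (the imbalanced-learn package provides SMOTE):

    # Sketch: rebalance training data with SMOTE, then fit a KNN classifier.
    # The data are synthetic, not the magnetic-sensor measurements.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                               weights=[0.7, 0.2, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # synthesise minority samples
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_bal, y_bal)
    print(f"accuracy on the untouched test set: {knn.score(X_te, y_te):.3f}")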

  3. Exudate-based diabetic macular edema detection in fundus images using publicly available datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large-scale screening environment DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME through the presence of exudation. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on the MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing (e.g., the classifier was trained on an independent dataset and tested on MESSIDOR). Our algorithm obtained an AUC between 0.88 and 0.94 depending on the dataset/features used. Additionally, it does not need ground truth at lesion level to reject false positives and is computationally efficient, as it generates a diagnosis in an average of 4.4 s (9.3 s, considering the optic nerve localization) per image on a 2.6 GHz platform with an unoptimized Matlab implementation.
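    As a hedged illustration of the wavelet-feature idea, the sketch below turns an image's wavelet sub-bands into a fixed-length feature vector using PyWavelets; the input is random and the paper's actual feature set differs.

    # Sketch: summarise wavelet sub-bands of an image into a feature vector.
    import numpy as np
    import pywt

    image = np.random.rand(256, 256)                 # stand-in for a fundus image channel
    coeffs = pywt.wavedec2(image, "haar", level=2)   # [cA2, (cH2,cV2,cD2), (cH1,cV1,cD1)]

    features = []
    for band in [coeffs[0]] + [b for level in coeffs[1:] for b in level]:
        features += [float(np.mean(np.abs(band))), float(np.std(band))]
    print(len(features), "features:", np.round(features[:4], 3))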

  4. Comparing the accuracy of food outlet datasets in an urban environment.

    PubMed

    Wong, Michelle S; Peyton, Jennifer M; Shields, Timothy M; Curriero, Frank C; Gudzune, Kimberly A

    2017-05-11

    Studies that investigate the relationship between the retail food environment and health outcomes often use geospatial datasets. Prior studies have identified challenges of using the most common data sources. Retail food environment datasets created through academic-government partnership present an alternative, but their validity (retail existence, type, location) has not been assessed yet. In our study, we used ground-truth data to compare the validity of two datasets, a 2015 commercial dataset (InfoUSA) and data collected from 2012 to 2014 through the Maryland Food Systems Mapping Project (MFSMP), an academic-government partnership, on the retail food environment in two low-income, inner city neighbourhoods in Baltimore City. We compared sensitivity and positive predictive value (PPV) of the commercial and academic-government partnership data to ground-truth data for two broad categories of unhealthy food retailers: small food retailers and quick-service restaurants. Ground-truth data was collected in 2015 and analysed in 2016. Compared to the ground-truth data, MFSMP and InfoUSA generally had similar sensitivity that was greater than 85%. MFSMP had higher PPV compared to InfoUSA for both small food retailers (MFSMP: 56.3% vs InfoUSA: 40.7%) and quick-service restaurants (MFSMP: 58.6% vs InfoUSA: 36.4%). We conclude that data from academic-government partnerships like MFSMP might be an attractive alternative option and improvement to relying only on commercial data. Other research institutes or cities might consider efforts to create and maintain such an environmental dataset. Even if these datasets cannot be updated on an annual basis, they are likely more accurate than commercial data.
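    The two validation metrics used above are simple to state in code; a minimal sketch with invented match counts:

    # Sketch: validity metrics of a food-outlet dataset against ground truth.
    # Counts are illustrative, not the study's data.
    def sensitivity(tp, fn):          # share of ground-truth outlets found in the dataset
        return tp / (tp + fn)

    def ppv(tp, fp):                  # share of dataset entries confirmed on the ground
        return tp / (tp + fp)

    tp, fp, fn = 90, 70, 12           # hypothetical true positives, false positives, misses
    print(f"sensitivity = {sensitivity(tp, fn):.1%}, PPV = {ppv(tp, fp):.1%}")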

  5. The role of metadata in managing large environmental science datasets. Proceedings

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Melton, R.B.; DeVaney, D.M.; French, J. C.

    1995-06-01

    The purpose of this workshop was to bring together computer science researchers and environmental sciences data management practitioners to consider the role of metadata in managing large environmental sciences datasets. The objectives included: establishing a common definition of metadata; identifying categories of metadata; defining problems in managing metadata; and defining problems related to linking metadata with primary data.

  6. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

    Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except generalised least squares multilevel modelling (ML GH 'xtlogit' in Stata) gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples) there appears to be less need to adjust for clustering.
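    A sketch of one of the compared approaches, a two-level random-intercept model fitted by maximum likelihood (statsmodels' analogue of Stata's 'xtmixed'), on simulated sibling clusters rather than the paediatric datasets:

    # Sketch: random-intercept model for data clustered into small sibling groups.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n_cluster = 200
    sizes = rng.choice([1, 1, 1, 2, 3], size=n_cluster)   # mostly singletons, some multiples
    cluster = np.repeat(np.arange(n_cluster), sizes)
    u = rng.normal(0, 0.5, n_cluster)[cluster]            # shared within-birth effect
    x = rng.normal(size=cluster.size)
    y = 1.0 + 0.3 * x + u + rng.normal(0, 1, cluster.size)

    df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})
    fit = smf.mixedlm("y ~ x", df, groups=df["cluster"]).fit()
    print(fit.params[["Intercept", "x"]])                 # fixed effects near 1.0 and 0.3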

  7. Background qualitative analysis of the European Reference Life Cycle Database (ELCD) energy datasets - part I: fuel datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this study is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) fuel datasets. The revision is based on the data quality indicators described by the ILCD Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD fuel datasets are of very good overall quality; nevertheless, some findings and recommendations for improving the quality of the Life-Cycle Inventories have been derived. Moreover, these results confirm the quality of the fuel-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD fuel datasets is appropriate for the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall DQR of databases.

  8. A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts.

    PubMed

    Nilsson, R Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M; Bengtsson-Palme, Johan; Walker, Donald M; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C; Abarenkov, Kessy

    2015-01-01

    The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric (artificially joined) DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation.

  9. A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts

    PubMed Central

    Nilsson, R. Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M.; Bengtsson-Palme, Johan; Walker, Donald M.; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C.; Abarenkov, Kessy

    2015-01-01

    The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric—artificially joined—DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation. PMID:25786896

  10. JingleBells: A Repository of Immune-Related Single-Cell RNA-Sequencing Datasets.

    PubMed

    Ner-Gaon, Hadas; Melchior, Ariel; Golan, Nili; Ben-Haim, Yael; Shay, Tal

    2017-05-01

    Recent advances in single-cell RNA-sequencing (scRNA-seq) technology increase the understanding of immune differentiation and activation processes, as well as the heterogeneity of immune cell types. Although the number of available immune-related scRNA-seq datasets increases rapidly, their large size and various formats render them hard for the wider immunology community to use, and read-level data are practically inaccessible to the non-computational immunologist. To facilitate datasets reuse, we created the JingleBells repository for immune-related scRNA-seq datasets ready for analysis and visualization of reads at the single-cell level (http://jinglebells.bgu.ac.il/). To this end, we collected the raw data of publicly available immune-related scRNA-seq datasets, aligned the reads to the relevant genome, and saved aligned reads in a uniform format, annotated for cell of origin. We also added scripts and a step-by-step tutorial for visualizing each dataset at the single-cell level, through the commonly used Integrated Genome Viewer (www.broadinstitute.org/igv/). The uniform scRNA-seq format used in JingleBells can facilitate reuse of scRNA-seq data by computational biologists. It also enables immunologists who are interested in a specific gene to visualize the reads aligned to this gene to estimate cell-specific preferences for splicing, mutation load, or alleles. Thus JingleBells is a resource that will extend the usefulness of scRNA-seq datasets outside the programming aficionado realm. Copyright © 2017 by The American Association of Immunologists, Inc.

  11. X-ray computed tomography datasets for forensic analysis of vertebrate fossils.

    PubMed

    Rowe, Timothy B; Luo, Zhe-Xi; Ketcham, Richard A; Maisano, Jessica A; Colbert, Matthew W

    2016-06-07

    We describe X-ray computed tomography (CT) datasets from three specimens recovered from Early Cretaceous lakebeds of China that illustrate the forensic interpretation of CT imagery for paleontology. Fossil vertebrates from thinly bedded sediments often shatter upon discovery and are commonly repaired as amalgamated mosaics grouted to a solid backing slab of rock or plaster. Such methods are prone to inadvertent error and willful forgery, and once required potentially destructive methods to identify mistakes in reconstruction. CT is an efficient, nondestructive alternative that can disclose many clues about how a specimen was handled and repaired. These annotated datasets illustrate the power of CT in documenting specimen integrity and are intended as a reference in applying CT more broadly to evaluating the authenticity of comparable fossils.

  12. X-ray computed tomography datasets for forensic analysis of vertebrate fossils

    PubMed Central

    Rowe, Timothy B.; Luo, Zhe-Xi; Ketcham, Richard A.; Maisano, Jessica A.; Colbert, Matthew W.

    2016-01-01

    We describe X-ray computed tomography (CT) datasets from three specimens recovered from Early Cretaceous lakebeds of China that illustrate the forensic interpretation of CT imagery for paleontology. Fossil vertebrates from thinly bedded sediments often shatter upon discovery and are commonly repaired as amalgamated mosaics grouted to a solid backing slab of rock or plaster. Such methods are prone to inadvertent error and willful forgery, and once required potentially destructive methods to identify mistakes in reconstruction. CT is an efficient, nondestructive alternative that can disclose many clues about how a specimen was handled and repaired. These annotated datasets illustrate the power of CT in documenting specimen integrity and are intended as a reference in applying CT more broadly to evaluating the authenticity of comparable fossils. PMID:27272251

  13. BLOND, a building-level office environment dataset of typical electrical appliances.

    PubMed

    Kriechbaumer, Thomas; Jacobsen, Hans-Arno

    2018-03-27

    Energy metering has gained popularity as conventional meters are replaced by electronic smart meters that promise energy savings and higher comfort levels for occupants. Achieving these goals requires a deeper understanding of consumption patterns to reduce the energy footprint: load profile forecasting, power disaggregation, appliance identification, startup event detection, etc. Publicly available datasets are used to test, verify, and benchmark possible solutions to these problems. For this purpose, we present the BLOND dataset: continuous energy measurements of a typical office environment at high sampling rates with common appliances and load profiles. We provide voltage and current readings for aggregated circuits and matching fully-labeled ground truth data (individual appliance measurements). The dataset contains 53 appliances (16 classes) in a 3-phase power grid. BLOND-50 contains 213 days of measurements sampled at 50kSps (aggregate) and 6.4kSps (individual appliances). BLOND-250 consists of the same setup: 50 days, 250kSps (aggregate), 50kSps (individual appliances). These are, to the best of our knowledge, the longest continuous measurements at such high sampling rates with fully-labeled ground truth.

  14. BLOND, a building-level office environment dataset of typical electrical appliances

    NASA Astrophysics Data System (ADS)

    Kriechbaumer, Thomas; Jacobsen, Hans-Arno

    2018-03-01

    Energy metering has gained popularity as conventional meters are replaced by electronic smart meters that promise energy savings and higher comfort levels for occupants. Achieving these goals requires a deeper understanding of consumption patterns to reduce the energy footprint: load profile forecasting, power disaggregation, appliance identification, startup event detection, etc. Publicly available datasets are used to test, verify, and benchmark possible solutions to these problems. For this purpose, we present the BLOND dataset: continuous energy measurements of a typical office environment at high sampling rates with common appliances and load profiles. We provide voltage and current readings for aggregated circuits and matching fully-labeled ground truth data (individual appliance measurements). The dataset contains 53 appliances (16 classes) in a 3-phase power grid. BLOND-50 contains 213 days of measurements sampled at 50kSps (aggregate) and 6.4kSps (individual appliances). BLOND-250 consists of the same setup: 50 days, 250kSps (aggregate), 50kSps (individual appliances). These are, to the best of our knowledge, the longest continuous measurements at such high sampling rates with fully-labeled ground truth.

  15. BLOND, a building-level office environment dataset of typical electrical appliances

    PubMed Central

    Kriechbaumer, Thomas; Jacobsen, Hans-Arno

    2018-01-01

    Energy metering has gained popularity as conventional meters are replaced by electronic smart meters that promise energy savings and higher comfort levels for occupants. Achieving these goals requires a deeper understanding of consumption patterns to reduce the energy footprint: load profile forecasting, power disaggregation, appliance identification, startup event detection, etc. Publicly available datasets are used to test, verify, and benchmark possible solutions to these problems. For this purpose, we present the BLOND dataset: continuous energy measurements of a typical office environment at high sampling rates with common appliances and load profiles. We provide voltage and current readings for aggregated circuits and matching fully-labeled ground truth data (individual appliance measurements). The dataset contains 53 appliances (16 classes) in a 3-phase power grid. BLOND-50 contains 213 days of measurements sampled at 50kSps (aggregate) and 6.4kSps (individual appliances). BLOND-250 consists of the same setup: 50 days, 250kSps (aggregate), 50kSps (individual appliances). These are, to the best of our knowledge, the longest continuous measurements at such high sampling rates with fully-labeled ground truth. PMID:29583141

  16. Design of an audio advertisement dataset

    NASA Astrophysics Data System (ADS)

    Fu, Yutao; Liu, Jihong; Zhang, Qi; Geng, Yuting

    2015-12-01

    As more and more advertisements crowd radio broadcasts, it is necessary to establish an audio advertising dataset that can be used to analyze and classify advertisements. A method for establishing a complete audio advertising dataset is presented in this paper. The dataset is divided into four different kinds of advertisements. Each advertisement sample is given in *.wav file format and annotated with a txt file that contains its file name, sampling frequency, channel number, broadcasting time and class. The soundness of the classification scheme is demonstrated by clustering the advertisements using Principal Component Analysis (PCA). The experimental results show that this audio advertisement dataset offers a reliable set of samples for related audio advertisement studies.
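    The validation step lends itself to a small sketch: extract a coarse spectral feature vector per sample, project with PCA, and cluster. Here synthetic tones stand in for the *.wav advertisement samples; real use would read the dataset files (e.g., with scipy.io.wavfile) instead.

    # Sketch: PCA projection and clustering of audio samples by spectral features.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def band_energies(samples, n_bands=32):
        # coarse spectrum summary: mean magnitude per frequency band
        spectrum = np.abs(np.fft.rfft(samples))
        return np.array([b.mean() for b in np.array_split(spectrum, n_bands)])

    # Synthetic stand-ins for advertisement samples: two spectral "classes"
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    signals = [np.sin(2 * np.pi * f * t) + 0.1 * rng.normal(size=t.size)
               for f in (440, 460, 2000, 2050)]

    X = np.vstack([band_energies(s) for s in signals])
    X2 = PCA(n_components=2).fit_transform(X)            # first two principal components
    print(KMeans(n_clusters=2, n_init=10).fit_predict(X2))  # expect two coherent groups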

  17. Generation of open biomedical datasets through ontology-driven transformation and integration processes.

    PubMed

    Carmen Legaz-García, María Del; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2016-06-03

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources, which makes the integrated exploitation of such data difficult. The Semantic Web paradigm offers a natural technological space for data integration and exploitation by generating content readable by machines. Linked Open Data is a Semantic Web initiative that promotes the publication and sharing of data in machine-readable semantic formats. We present an approach for the transformation and integration of heterogeneous biomedical data with the objective of generating open biomedical datasets in Semantic Web formats. The transformation of the data is based on mappings between the entities of the data schema and the ontological infrastructure that provides the meaning of the content. Our approach permits different types of mappings and includes the possibility of defining complex transformation patterns. Once the mappings are defined, they can be automatically applied to datasets to generate logically consistent content, and the mappings can be reused in further transformation processes. The results of our research are (1) a common transformation and integration process for heterogeneous biomedical data; (2) the application of Linked Open Data principles to generate interoperable, open biomedical datasets; (3) a software tool, called SWIT, that implements the approach. In this paper we also describe how we have applied SWIT in different biomedical scenarios and some lessons learned. We have presented an approach that is able to generate open biomedical repositories in Semantic Web formats. SWIT is able to apply the Linked Open Data principles in the generation of the datasets, thus allowing their content to be linked to external repositories and creating linked open datasets. SWIT datasets may contain data from multiple sources and schemas, thus becoming integrated datasets.

  18. Background qualitative analysis of the European reference life cycle database (ELCD) energy datasets - part II: electricity datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this paper is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD electricity datasets are of very good overall quality; nevertheless, some findings and recommendations for improving the quality of the Life-Cycle Inventories have been derived. Moreover, these results confirm the quality of the electricity-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate for the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers, in order to improve the overall Data Quality Requirements of databases.

  19. Data Recommender: An Alternative Way to Discover Open Scientific Datasets

    NASA Astrophysics Data System (ADS)

    Klump, J. F.; Devaraju, A.; Williams, G.; Hogan, D.; Davy, R.; Page, J.; Singh, D.; Peterson, N.

    2017-12-01

    Over the past few years, institutions and government agencies have adopted policies to openly release their data, which has resulted in huge amounts of open data becoming available on the web. When trying to discover the data, users face two challenges: an overload of choice and the limitations of the existing data search tools. On the one hand, there are too many datasets to choose from, and therefore users need to spend considerable effort to find the datasets most relevant to their research. On the other hand, data portals commonly offer keyword and faceted search, which depend fully on user queries to search and rank relevant datasets. Consequently, keyword and faceted search may return loosely related or irrelevant results, even though those results contain the query terms. They may also return highly specific results that depend more on how well the metadata was authored, and they do not account well for variance in metadata due to differences in author styles and preferences. The top-ranked results may also come from the same data collection, so users are unlikely to discover new and interesting datasets. These search modes mainly suit users who can express their information needs in terms of the structure and terminology of the data portals, but may pose a challenge otherwise. The above challenges reflect that we need a solution that delivers the most relevant (i.e., similar and serendipitous) datasets to users, beyond the existing search functionalities of the portals. A recommender system is an information filtering system that presents users with relevant and interesting content based on users' context and preferences. Delivering data recommendations to users can make data discovery easier and, as a result, may enhance user engagement with the portal. We developed a hybrid data recommendation approach for the CSIRO Data Access Portal. The approach leverages existing recommendation techniques (e.g., content-based filtering and item co-occurrence) to produce
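    A minimal sketch of the item co-occurrence half of such a hybrid recommender, with invented session logs: datasets accessed in the same user session are counted as co-occurring, and the top co-occurring items are recommended.

    # Sketch: item co-occurrence recommendation from session logs (illustrative).
    from collections import Counter
    from itertools import combinations

    sessions = [
        ["soil_moisture_v2", "rainfall_grid", "ndvi_monthly"],
        ["rainfall_grid", "ndvi_monthly"],
        ["soil_moisture_v2", "rainfall_grid"],
    ]

    cooc = Counter()
    for items in sessions:
        for a, b in combinations(sorted(set(items)), 2):
            cooc[(a, b)] += 1                 # count each unordered pair once per session

    def recommend(item, k=2):
        scores = Counter()
        for (a, b), n in cooc.items():
            if a == item: scores[b] += n
            if b == item: scores[a] += n
        return [d for d, _ in scores.most_common(k)]

    print(recommend("rainfall_grid"))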

  20. Diagnosis, clinical manifestations and management of rare bleeding disorders in Iran.

    PubMed

    Dorgalaleh, Akbar; Alavi, Sayed Ezatolla Rafiee; Tabibian, Shadi; Soori, Shahrzad; Moradi, Es'hagh; Bamedi, Taregh; Asadi, Mansour; Jalalvand, Masumeh; Shamsizadeh, Morteza

    2017-05-01

    Rare bleeding disorders (RBDs) are heterogeneous disorders, mostly inherited in an autosomal recessive pattern. Iran is a Middle Eastern country with a high rate of consanguinity and, consequently, a high prevalence of RBDs. In this study, we present the prevalence, clinical presentation, management and genetic defects of Iranian patients with RBDs. For this study, all relevant publications in Medline up to 2015 were searched. Iran has the highest global incidence of factor XIII deficiency. Factor VII deficiency is also common in Iran, while factor II deficiency, with a prevalence of 1 per ∼3 million, is the rarest form of RBD. Factor activity assays are available for all RBDs except factor XIII deficiency, for which clot solubility remains the diagnostic test. Molecular analysis of Iranian patients with RBDs revealed a few recurrent, common mutations only in patients with factor XIII deficiency, and a considerable number of novel mutations in the other RBDs. Clinical manifestations of these patients are variable: patients with factor XIII, factor X and factor VII deficiencies more commonly presented with severe, life-threatening bleeding, while patients with combined factor V and factor VIII deficiency presented a milder phenotype. Plasma-derived products are the most common therapeutic choice in Iran, used prophylactically or on demand for the management of these patients. Since Iran has a high rate of RBDs with life-threatening bleeding, molecular studies can be used for carrier detection and, therefore, prevention of the further spread of these disorders and their fatal consequences.

  1. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets

    PubMed Central

    Heath, Allison P; Greenway, Matthew; Powell, Raymond; Spring, Jonathan; Suarez, Rafael; Hanley, David; Bandlamudi, Chai; McNerney, Megan E; White, Kevin P; Grossman, Robert L

    2014-01-01

    Background As large genomics and phenotypic datasets are becoming more common, it is increasingly difficult for most researchers to access, manage, and analyze them. One possible approach is to provide the research community with several petabyte-scale cloud-based computing platforms containing these data, along with tools and resources to analyze it. Methods Bionimbus is an open source cloud-computing platform that is based primarily upon OpenStack, which manages on-demand virtual machines that provide the required computational resources, and GlusterFS, which is a high-performance clustered file system. Bionimbus also includes Tukey, which is a portal, and associated middleware that provides a single entry point and a single sign on for the various Bionimbus resources; and Yates, which automates the installation, configuration, and maintenance of the software infrastructure required. Results Bionimbus is used by a variety of projects to process genomics and phenotypic data. For example, it is used by an acute myeloid leukemia resequencing project at the University of Chicago. The project requires several computational pipelines, including pipelines for quality control, alignment, variant calling, and annotation. For each sample, the alignment step requires eight CPUs for about 12 h. BAM file sizes ranged from 5 GB to 10 GB for each sample. Conclusions Most members of the research community have difficulty downloading large genomics datasets and obtaining sufficient storage and computer resources to manage and analyze the data. Cloud computing platforms, such as Bionimbus, with data commons that contain large genomics datasets, are one choice for broadening access to research data in genomics. PMID:24464852
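
    A back-of-envelope estimate built only from the per-sample figures quoted above (eight CPUs for about 12 h per alignment; 5-10 GB per BAM); the cohort size and node count below are hypothetical:

    ```python
    # Resource estimate from the per-sample figures in the abstract
    # (8 CPUs x ~12 h per alignment; up to 10 GB per BAM).
    # The cohort size and node shape are hypothetical.
    samples = 500                        # hypothetical cohort
    cpu_hours = samples * 8 * 12         # alignment step only
    bam_storage_gb = samples * 10        # worst-case BAM size

    nodes = 16                           # hypothetical 8-CPU virtual machines
    days = cpu_hours / (nodes * 8 * 24)
    print(f"{cpu_hours} CPU-h, ~{bam_storage_gb} GB of BAMs, "
          f"~{days:.1f} days on {nodes} nodes")
    ```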

  2. Benchmark Dataset for Whole Genome Sequence Compression.

    PubMed

    C L, Biji; S Nair, Achuthsankar

    2017-01-01

    The research in DNA data compression lacks a standard dataset on which to test compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of a scientifically compiled whole genome sequence dataset, and it proposes a benchmark dataset compiled using a multistage sampling procedure. Taking the genome sequences of organisms available in the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of running three established tools on the newly compiled dataset and shows that their strengths and weaknesses become evident only through comparison on the scientifically compiled benchmark dataset. The sample dataset and the respective links are available at https://sourceforge.net/projects/benchmarkdnacompressiondataset/.
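
    A sketch of drawing one benchmark stratum per organism category in the spirit of the multistage sampling described above; the accession pools use placeholder identifiers, not the paper's actual selections:

    ```python
    # Stratified draw of a benchmark sample per organism category.
    # Pools are placeholder accession IDs; quotas match the abstract.
    import random

    random.seed(42)
    pools = {
        "prokaryote": [f"PK{i:05d}" for i in range(20000)],
        "plasmid":    [f"PL{i:05d}" for i in range(4000)],
        "virus":      [f"VR{i:05d}" for i in range(7000)],
        "eukaryote":  [f"EU{i:05d}" for i in range(900)],
    }
    quota = {"prokaryote": 1105, "plasmid": 200, "virus": 164, "eukaryote": 65}

    benchmark = {cat: random.sample(pool, quota[cat]) for cat, pool in pools.items()}
    print({cat: len(ids) for cat, ids in benchmark.items()})
    ```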

  3. Subsampling for dataset optimisation

    NASA Astrophysics Data System (ADS)

    Ließ, Mareike

    2017-04-01

    Soil-landscapes have formed by the interaction of soil-forming factors and pedogenic processes. Modelling these landscapes in their pedodiversity and the underlying processes requires a representative, unbiased dataset. This concerns model input as well as output data. However, very often big datasets are available which are highly heterogeneous and were gathered for various purposes, but not to model a particular process or data space. As a first step, the overall data space and/or landscape section to be modelled needs to be identified, including considerations regarding scale and resolution. Then the available dataset needs to be optimised via subsampling to represent this n-dimensional data space well. A couple of well-known sampling designs may be adapted to suit this purpose. The overall approach follows three main strategies: (1) the data space may be condensed and de-correlated by a factor analysis to facilitate the subsampling process; (2) different methods of pattern recognition serve to structure the n-dimensional data space to be modelled into units, which then form the basis for the optimisation of an existing dataset through a sensible selection of samples (along the way, data units for which there is currently insufficient soil data may be identified); and (3) random samples from the n-dimensional data space may be replaced by similar samples from the available dataset. While being a prerequisite for developing data-driven statistical models, this approach may also help to develop universal process models and identify limitations in existing models.
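
    A minimal sketch of the three strategies, assuming PCA as a stand-in for the factor analysis and k-means for the pattern recognition step; all sizes and model choices are illustrative:

    ```python
    # (1) de-correlate the data space with PCA, (2) structure it with
    # k-means, (3) for each cluster centre keep the closest available
    # sample. Sizes and the PCA/k-means choices are assumptions.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 12))           # available heterogeneous dataset

    Z = PCA(n_components=4).fit_transform(X)  # condensed, de-correlated space
    km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(Z)

    # One representative per unit of the structured data space:
    idx = [int(np.argmin(np.linalg.norm(Z - c, axis=1)))
           for c in km.cluster_centers_]
    subsample = X[idx]
    print(subsample.shape)                    # (50, 12)
    ```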

  4. Modified Bat Algorithm for Feature Selection with the Wisconsin Diagnosis Breast Cancer (WDBC) Dataset

    PubMed

    Jeyasingh, Suganthi; Veluchamy, Malathi

    2017-05-01

    Early diagnosis of breast cancer is essential to save patients' lives. Medical datasets usually include a large variety of data that can lead to confusion during diagnosis. The Knowledge Discovery in Databases (KDD) process helps to improve efficiency; it requires elimination of inappropriate and repeated data from the dataset before final diagnosis, which can be done using any of the feature selection algorithms available in data mining. Feature selection is considered a vital step in increasing classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection that eliminates irrelevant features from the original dataset. The Bat algorithm was modified using simple random sampling to select random instances from the dataset; ranking against the global best features identifies the predominant features in the dataset. The selected features are used to train a Random Forest (RF) classification algorithm. The MBA feature selection algorithm enhanced the classification accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnosis Breast Cancer (WDBC) dataset was used to estimate the performance of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of the Kappa statistic, Matthews Correlation Coefficient, Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE), and Root Relative Squared Error (RRSE).
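
    A simplified stand-in for the loop described above: random candidate feature subsets are scored by a random forest on simply-random-sampled instances, and a global best subset is kept. This is not the published Modified Bat Algorithm, only its main ingredients, run on scikit-learn's copy of the WDBC data:

    ```python
    # Random-subset feature selection scored by a random forest on
    # randomly sampled instances; a hedged stand-in for MBA + RF.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)      # the WDBC dataset
    rng = np.random.default_rng(1)

    best_mask, best_score = None, -np.inf
    for _ in range(30):                             # iteration budget (assumed)
        mask = rng.random(X.shape[1]) < 0.5         # random candidate subset
        if not mask.any():
            continue
        rows = rng.choice(len(X), size=300, replace=False)  # simple random sampling
        rf = RandomForestClassifier(n_estimators=50, random_state=0)
        score = cross_val_score(rf, X[np.ix_(rows, np.where(mask)[0])],
                                y[rows], cv=3).mean()
        if score > best_score:                      # keep the global best
            best_mask, best_score = mask, score

    print(f"{best_mask.sum()} features selected, CV accuracy {best_score:.3f}")
    ```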

  5. [Parallel virtual reality visualization of extreme large medical datasets].

    PubMed

    Tang, Min

    2010-04-01

    On the basis of a brief description of grid computing, the essence and critical techniques of parallel visualization of extremely large medical datasets are discussed in connection with the intranets and common-configuration computers of hospitals. This paper introduces several kernel techniques, including the hardware structure, software framework, load balancing, and virtual reality visualization. The Maximum Intensity Projection (MIP) algorithm is parallelized on a common PC cluster. In the virtual reality world, three-dimensional models can be rotated, zoomed, translated, and cut interactively and conveniently through a control panel built on the Virtual Reality Modeling Language (VRML). Experimental results demonstrate that this method provides promising real-time results, serving as a good assistant in making clinical diagnoses.
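
    For reference, Maximum Intensity Projection reduces a volume to an image by keeping the brightest voxel along each viewing ray; a serial NumPy version of the projection step (the paper parallelises this across a PC cluster):

    ```python
    # Maximum Intensity Projection along one axis of a volume.
    import numpy as np

    volume = np.random.rand(128, 256, 256)   # placeholder CT/MR volume (z, y, x)
    mip = volume.max(axis=0)                 # project along z: brightest voxel wins
    print(mip.shape)                         # (256, 256)
    ```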

  6. A global distributed basin morphometric dataset

    NASA Astrophysics Data System (ADS)

    Shen, Xinyi; Anagnostou, Emmanouil N.; Mei, Yiwen; Hong, Yang

    2017-01-01

    Basin morphometry is vital information for relating storms to hydrologic hazards, such as landslides and floods. In this paper we present the first comprehensive global dataset of distributed basin morphometry at 30 arc seconds resolution. The dataset includes nine prime morphometric variables; in addition, we present formulas for generating twenty-one additional morphometric variables based on combinations of the prime variables. The dataset can aid different applications, including studies of land-atmosphere interaction and modelling of floods and droughts for sustainable water management. The validity of the dataset has been corroborated by successfully reproducing Hack's law.
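
    Hack's law, used above as the validity check, relates main-stream length L to drainage area A as L = c·A^h, with the exponent h commonly near 0.6; a one-line sketch with illustrative coefficients (not the dataset's fitted values):

    ```python
    # Hack's law: L = c * A**h. The values of c and h are illustrative.
    def hack_length_km(area_km2: float, c: float = 1.4, h: float = 0.6) -> float:
        return c * area_km2 ** h

    print(f"{hack_length_km(1000.0):.1f} km")  # ~88 km for a 1000 km^2 basin
    ```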

  7. Estimation of Missed Statin Prescription Use in an Administrative Claims Dataset.

    PubMed

    Wade, Rolin L; Patel, Jeetvan G; Hill, Jerrold W; De, Ajita P; Harrison, David J

    2017-09-01

    Nonadherence to statin medications is associated with increased risk of cardiovascular disease and poses a challenge to lipid management in patients who are at risk for atherosclerotic cardiovascular disease. Numerous studies have examined statin adherence based on administrative claims data; however, these data may underestimate statin use in patients who participate in generic drug discount programs or who have alternative coverage. To estimate the proportion of patients with missing statin claims in a claims database and determine how missing claims affect commonly used utilization metrics. This retrospective cohort study used pharmacy data from the PharMetrics Plus (P+) claims dataset linked to the IMS longitudinal pharmacy point-of-sale prescription database (LRx) from January 1, 2012, through December 31, 2014. Eligible patients were represented in the P+ and LRx datasets, had ≥1 claim for a statin (index claim) in either database, and had ≥ 24 months of continuous enrollment in P+. Patients were linked between P+ and LRx using a deterministic method. Duplicate claims between LRx and P+ were removed to produce a new dataset comprised of P+ claims augmented with LRx claims. Statin use was then compared between P+ and the augmented P+ dataset. Utilization metrics that were evaluated included percentage of patients with ≥ 1 missing statin claim over 12 months in P+; the number of patients misclassified as new users in P+; the number of patients misclassified as nonstatin users in P+; the change in 12-month medication possession ratio (MPR) and proportion of days covered (PDC) in P+; the comparison between P+ and LRx of classifications of statin treatment patterns (statin intensity and patients with treatment modifications); and the payment status for missing statin claims. Data from 965,785 patients with statin claims in P+ were analyzed (mean age 56.6 years; 57% male). In P+, 20.1% had ≥ 1 missing statin claim post-index; 13.7% were misclassified as
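
    The two utilization metrics compared above can be computed from fill records as below: PDC counts unique covered days in the window, while MPR sums days supplied and can exceed 1 when fills overlap. The dates and supplies are fabricated examples:

    ```python
    # Proportion of days covered (PDC) and medication possession ratio (MPR)
    # over a 365-day window, from (fill_date, days_supply) records.
    from datetime import date, timedelta

    fills = [(date(2013, 1, 1), 30), (date(2013, 1, 25), 30), (date(2013, 4, 1), 90)]
    start, window = date(2013, 1, 1), 365

    covered = set()
    for fill_date, days_supply in fills:
        for d in range(days_supply):
            day = fill_date + timedelta(days=d)
            if 0 <= (day - start).days < window:
                covered.add(day)             # overlapping days counted once

    pdc = len(covered) / window
    mpr = sum(days for _, days in fills) / window
    print(f"PDC={pdc:.2f}, MPR={mpr:.2f}")   # MPR > PDC when fills overlap
    ```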

  8. In-situ AFM measurement of single fibrin fiber stiffness before and after addition of Factor XIII

    NASA Astrophysics Data System (ADS)

    Houser, John; O'Brien, E. Timothy; Lord, Susan T.; Superfine, Richard; Falvo, Michael R.

    2008-10-01

    Fibrin fibers are the main structural component of blood clots. Ligation of fibrin by native Factor XIII (FXIII) serves to fine-tune the mechanical properties of the clot. Mechanical alteration is important because a clot must be stiff enough to resist forces from blood flow but compliant enough to prevent embolism (fracture). Cone-and-plate measurements of fibrin gels, which represent the vast majority of mechanical measurements on fibrin, show that FXIII increases clot stiffness. More recently, measurements on individual fibrin fibers have shown that they exhibit remarkable extensibility, breaking at strains up to 300%. As yet, the origin of this extensibility is not fully understood. The different responses of ligated and unligated fibrin fibers can give us clues as to its mechanism of extension. We use a combined fluorescence/atomic force microscope to stretch individual, isolated fibrin fibers and then compare force-extension curves of the same fiber before and after addition of FXIII. We found up to a 3.5-fold increase in fiber stiffness after addition of FXIII. We also show stiffening of individual fibrin fibers after crosslinking by glutaraldehyde.

  9. Bayesian correlated clustering to integrate multiple datasets

    PubMed Central

    Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.

    2012-01-01

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID

  10. Analysis of plant-derived miRNAs in animal small RNA datasets

    PubMed Central

    2012-01-01

    Background Plants contain significant quantities of small RNAs (sRNAs) derived from various sRNA biogenesis pathways. Many of these sRNAs play regulatory roles in plants. Previous analysis revealed that numerous sRNAs in corn, rice and soybean seeds have high sequence similarity to animal genes. However, exogenous RNA is considered to be unstable within the gastrointestinal tract of many animals, thus limiting potential for any adverse effects from consumption of dietary RNA. A recent paper reported that putative plant miRNAs were detected in animal plasma and serum, presumably acquired through ingestion, and may have a functional impact in the consuming organisms. Results To address the question of how common this phenomenon could be, we searched for plant miRNA sequences in public sRNA datasets from various tissues of mammals, chicken and insects. Our analyses revealed that plant miRNAs were present in the animal sRNA datasets, and, notably, miR168 was extremely over-represented. Furthermore, all or nearly all (>96%) miR168 sequences were monocot derived for most datasets, including datasets for two insects reared on dicot plants in their respective experiments. To investigate if plant-derived miRNAs, including miR168, could accumulate and move systemically in insects, we conducted insect feeding studies for three insects including corn rootworm, which has been shown to be responsive to plant-produced long double-stranded RNAs. Conclusions Our analyses suggest that the observed plant miRNAs in animal sRNA datasets can originate in the process of sequencing, and that accumulation of plant miRNAs via dietary exposure is not universal in animals. PMID:22873950

  11. Chemical elements in the environment: multi-element geochemical datasets from continental to national scale surveys on four continents

    USGS Publications Warehouse

    Caritat, Patrice de; Reimann, Clemens; Smith, David; Wang, Xueqiu

    2017-01-01

    During the last 10-20 years, Geological Surveys around the world have undertaken a major effort towards delivering fully harmonized and tightly quality-controlled low-density multi-element soil geochemical maps and datasets of vast regions including up to whole continents. Concentrations of between 45 and 60 elements commonly have been determined in a variety of different regolith types (e.g., sediment, soil). The multi-element datasets are published as complete geochemical atlases and made available to the general public. Several other geochemical datasets covering smaller areas but generally at a higher spatial density are also available. These datasets may, however, not be found by superficial internet-based searches because the elements are not mentioned individually either in the title or in the keyword lists of the original references. This publication attempts to increase the visibility and discoverability of these fundamental background datasets covering large areas up to whole continents.

  12. The Non-catalytic B Subunit of Coagulation Factor XIII Accelerates Fibrin Cross-linking*

    PubMed Central

    Souri, Masayoshi; Osaki, Tsukasa; Ichinose, Akitada

    2015-01-01

    Covalent cross-linking of fibrin chains is required for stable blood clot formation, which is catalyzed by coagulation factor XIII (FXIII), a proenzyme of plasma transglutaminase consisting of catalytic A (FXIII-A) and non-catalytic B subunits (FXIII-B). Herein, we demonstrate that FXIII-B accelerates fibrin cross-linking. Depletion of FXIII-B from normal plasma supplemented with a physiological level of recombinant FXIII-A resulted in delayed fibrin cross-linking, reduced incorporation of FXIII-A into fibrin clots, and impaired activation peptide cleavage by thrombin; the addition of recombinant FXIII-B restored normal fibrin cross-linking, FXIII-A incorporation into fibrin clots, and activation peptide cleavage by thrombin. Immunoprecipitation with an anti-fibrinogen antibody revealed an interaction between the FXIII heterotetramer and fibrinogen mediated by FXIII-B and not FXIII-A. FXIII-B probably binds the γ-chain of fibrinogen with its D-domain, which is near the fibrin polymerization pockets, and dissociates from fibrin during or after cross-linking between γ-chains. Thus, FXIII-B plays important roles in the formation of a ternary complex between proenzyme FXIII, prosubstrate fibrinogen, and activator thrombin. Accordingly, congenital or acquired FXIII-B deficiency may result in increased bleeding tendency through impaired fibrin stabilization due to decreased FXIII-A activation by thrombin and secondary FXIII-A deficiency arising from enhanced circulatory clearance. PMID:25809477

  13. Handwritten mathematical symbols dataset.

    PubMed

    Chajri, Yassine; Bouikhalene, Belaid

    2016-06-01

    Due to the technological advances in recent years, paper scientific documents are used less and less. Thus, the trend in the scientific community to use digital documents has increased considerably. Among these documents, there are scientific documents and more specifically mathematics documents. In this context, we present our own dataset of handwritten mathematical symbols composed of 10,379 images. This dataset gathers Arabic characters, Latin characters, Arabic numerals, Latin numerals, arithmetic operators, set-symbols, comparison symbols, delimiters, etc.

  14. 77 FR 15052 - Dataset Workshop-U.S. Billion Dollar Disasters Dataset (1980-2011): Assessing Dataset Strengths...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-03-14

    ... and related methodology. Emphasis will be placed on dataset accuracy and time-dependent biases. Pathways to overcome accuracy and bias issues will be an important focus. Participants will consider: • Guidance for improving these methods. • Recommendations for rectifying any known time-dependent biases...

  15. Processing and population genetic analysis of multigenic datasets with ProSeq3 software.

    PubMed

    Filatov, Dmitry A

    2009-12-01

    The current tendency in molecular population genetics is to use increasing numbers of genes in the analysis. Here I describe a program for handling and population genetic analysis of DNA polymorphism data collected from multiple genes. The program includes a sequence/alignment editor and an internal relational database that simplify the preparation and manipulation of multigenic DNA polymorphism datasets. The most commonly used DNA polymorphism analyses are implemented in ProSeq3, facilitating population genetic analysis of large multigenic datasets. Extensive input/output options make ProSeq3 a convenient hub for sequence data processing and analysis. The program is available free of charge from http://dps.plants.ox.ac.uk/sequencing/proseq.htm.

  16. NP-PAH Interaction Dataset

    EPA Pesticide Factsheets

    Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration were allowed to equilibrate with a known mass of nanoparticles. The mixture was then ultracentrifuged and sampled for analysis. This dataset is associated with the following publication: Sahle-Demessie, E., A. Zhao, C. Han, B. Hann, and H. Grecsek. Interaction of engineered nanomaterials with hydrophobic organic pollutants. Journal of Nanotechnology. Hindawi Publishing Corporation, New York, NY, USA, 27(28): 284003, (2016).

  17. Handwritten mathematical symbols dataset

    PubMed Central

    Chajri, Yassine; Bouikhalene, Belaid

    2016-01-01

    Due to the technological advances in recent years, paper scientific documents are used less and less. Thus, the trend in the scientific community to use digital documents has increased considerably. Among these documents, there are scientific documents and more specifically mathematics documents. In this context, we present our own dataset of handwritten mathematical symbols composed of 10,379 images. This dataset gathers Arabic characters, Latin characters, Arabic numerals, Latin numerals, arithmetic operators, set-symbols, comparison symbols, delimiters, etc. PMID:27006975

  18. Quantitative super-resolution single molecule microscopy dataset of YFP-tagged growth factor receptors.

    PubMed

    Lukeš, Tomáš; Pospíšil, Jakub; Fliegel, Karel; Lasser, Theo; Hagen, Guy M

    2018-03-01

    Super-resolution single molecule localization microscopy (SMLM) is a method for achieving resolution beyond the classical limit in optical microscopes (approx. 200 nm laterally). Yellow fluorescent protein (YFP) has been used for super-resolution single molecule localization microscopy, but less frequently than other fluorescent probes. Working with YFP in SMLM is a challenge because a lower number of photons are emitted per molecule compared with organic dyes, which are more commonly used. Publicly available experimental data can facilitate development of new data analysis algorithms. Four complete, freely available single molecule super-resolution microscopy datasets on YFP-tagged growth factor receptors expressed in a human cell line are presented, including both raw and analyzed data. We report methods for sample preparation, for data acquisition, and for data analysis, as well as examples of the acquired images. We also analyzed the SMLM datasets using a different method: super-resolution optical fluctuation imaging (SOFI). The two modes of analysis offer complementary information about the sample. A fifth single molecule super-resolution microscopy dataset acquired with the dye Alexa 532 is included for comparison purposes. This dataset has potential for extensive reuse. Complete raw data from SMLM experiments have typically not been published. The YFP data exhibit low signal-to-noise ratios, making data analysis a challenge. These datasets will be useful to investigators developing their own algorithms for SMLM, SOFI, and related methods. The data will also be useful for researchers investigating growth factor receptors such as ErbB3.

  19. An Improved TA-SVM Method Without Matrix Inversion and Its Fast Implementation for Nonstationary Datasets.

    PubMed

    Shi, Yingzhong; Chung, Fu-Lai; Wang, Shitong

    2015-09-01

    Recently, a time-adaptive support vector machine (TA-SVM) was proposed for handling nonstationary datasets. While attractive performance has been reported, and the new classifier is distinctive in simultaneously solving several SVM subclassifiers locally and globally by using an elegant SVM formulation in an alternative kernel space, the coupling of subclassifiers brings in the computation of a matrix inversion, and thus a high computational burden in large nonstationary dataset applications. To overcome this shortcoming, an improved TA-SVM (ITA-SVM) is proposed using a common vector shared by all the SVM subclassifiers involved. ITA-SVM not only keeps an SVM formulation, but also avoids the computation of matrix inversion. Thus, we can realize its fast version, that is, the improved time-adaptive core vector machine (ITA-CVM), for large nonstationary datasets by using the CVM technique. ITA-CVM has the merit of asymptotically linear time complexity for large nonstationary datasets and inherits the advantages of TA-SVM. The effectiveness of the proposed classifiers ITA-SVM and ITA-CVM is also experimentally confirmed.

  20. TRI Preliminary Dataset

    EPA Pesticide Factsheets

    The TRI preliminary dataset includes the most current TRI data available and reflects toxic chemical releases and pollution prevention activities that occurred at TRI facilities during each calendar year.

  1. Immersive Interaction, Manipulation and Analysis of Large 3D Datasets for Planetary and Earth Sciences

    NASA Astrophysics Data System (ADS)

    Pariser, O.; Calef, F.; Manning, E. M.; Ardulov, V.

    2017-12-01

    We will present an implementation and study of several use cases of utilizing Virtual Reality (VR) for immersive display, interaction, and analysis of large and complex 3D datasets. These datasets have been acquired by instruments across several Earth, planetary, and solar space robotics missions. First, we will describe the architecture of the common application framework that was developed to input data, interface with VR display devices, and program input controllers in various computing environments. Tethered and portable VR technologies will be contrasted and the advantages of each highlighted. We will proceed to present experimental immersive-analytics visual constructs that enable augmentation of 3D datasets with 2D ones such as images and statistical and abstract data. We will conclude by presenting a comparative analysis with traditional visualization applications and share the feedback provided by our users: scientists and engineers.

  2. Comparison of recent SnIa datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez, J.C. Bueno; Perivolaropoulos, L.; Nesseris, S., E-mail: jbueno@cc.uoi.gr, E-mail: nesseris@nbi.ku.dk, E-mail: leandros@uoi.gr

    2009-11-01

    We rank the six latest Type Ia supernova (SnIa) datasets (Constitution (C), Union (U), ESSENCE (Davis) (E), Gold06 (G), SNLS 1yr (S) and SDSS-II (D)) in the context of the Chevallier-Polarski-Linder (CPL) parametrization w(a) = w0 + w1(1 − a), according to their Figure of Merit (FoM), their consistency with the cosmological constant (ΛCDM), their consistency with standard rulers (Cosmic Microwave Background (CMB) and Baryon Acoustic Oscillations (BAO)) and their mutual consistency. We find a significant improvement of the FoM (defined as the inverse area of the 95.4% parameter contour) with the number of SnIa of these datasets ((C) highest FoM, (U), (G), (D), (E), (S) lowest FoM). Standard rulers (CMB+BAO) have a better FoM by about a factor of 3, compared to the highest FoM SnIa dataset (C). We also find that the ranking sequence based on consistency with ΛCDM is identical with the corresponding ranking based on consistency with standard rulers ((S) most consistent, (D), (C), (E), (U), (G) least consistent). The ranking sequence of the datasets however changes when we consider the consistency with an expansion history corresponding to evolving dark energy (w0, w1) = (−1.4, 2) crossing the phantom divide line w = −1 (it is practically reversed to (G), (U), (E), (S), (D), (C)). The SALT2 and MLCS2k2 fitters are also compared and some peculiar features of the SDSS-II dataset when standardized with the MLCS2k2 fitter are pointed out. Finally, we construct a statistic to estimate the internal consistency of a collection of SnIa datasets. We find that even though there is good consistency among most samples taken from the above datasets, this consistency decreases significantly when the Gold06 (G) dataset is included in the sample.
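
    The CPL parametrization used throughout the ranking, evaluated at the phantom-crossing example quoted above, (w0, w1) = (−1.4, 2), with a = 1/(1 + z) the scale factor:

    ```python
    # CPL dark-energy equation of state: w(a) = w0 + w1 * (1 - a).
    def w_cpl(a: float, w0: float = -1.4, w1: float = 2.0) -> float:
        return w0 + w1 * (1.0 - a)

    for z in (0.0, 0.5, 1.0):
        a = 1.0 / (1.0 + z)
        print(f"z={z}: w={w_cpl(a):+.2f}")   # crosses the divide w = -1
    ```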

  3. Common Variable Immunodeficiency Non-Infectious Disease Endotypes Redefined Using Unbiased Network Clustering in Large Electronic Datasets.

    PubMed

    Farmer, Jocelyn R; Ong, Mei-Sing; Barmettler, Sara; Yonker, Lael M; Fuleihan, Ramsay; Sullivan, Kathleen E; Cunningham-Rundles, Charlotte; Walter, Jolan E

    2017-01-01

    Common variable immunodeficiency (CVID) is increasingly recognized for its association with autoimmune and inflammatory complications. Despite recent advances in immunophenotypic and genetic discovery, clinical care of CVID remains limited by our inability to accurately model risk for non-infectious disease development. Herein, we demonstrate the utility of unbiased network clustering as a novel method to analyze inter-relationships between non-infectious disease outcomes in CVID using databases at the United States Immunodeficiency Network (USIDNET), the centralized immunodeficiency registry of the United States, and Partners, a tertiary care network in Boston, MA, USA, with a shared electronic medical record amenable to natural language processing. Immunophenotypes were comparable in terms of native antibody deficiencies, low titer response to pneumococcus, and B cell maturation arrest. However, recorded non-infectious disease outcomes were more substantial in the Partners cohort across the spectrum of lymphoproliferation, cytopenias, autoimmunity, atopy, and malignancy. Using unbiased network clustering to analyze 34 non-infectious disease outcomes in the Partners cohort, we further identified unique patterns of lymphoproliferative (two clusters), autoimmune (two clusters), and atopic (one cluster) disease that were defined as CVID non-infectious endotypes according to discrete and non-overlapping immunophenotypes. Markers were both previously described {high serum IgE in the atopic cluster [odds ratio (OR) 6.5] and low class-switched memory B cells in the total lymphoproliferative cluster (OR 9.2)} and novel [low serum C3 in the total lymphoproliferative cluster (OR 5.1)]. Mortality risk in the Partners cohort was significantly associated with individual non-infectious disease outcomes as well as lymphoproliferative cluster 2, specifically (OR 5.9). In contrast, unbiased network clustering failed to associate known comorbidities in the adult USIDNET cohort

  4. A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge

    PubMed Central

    Gururaj, Anupama E.; Chen, Xiaoling; Pournejati, Saeid; Alter, George; Hersh, William R.; Demner-Fushman, Dina; Ohno-Machado, Lucila

    2017-01-01

    Abstract The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval. Database URL: https://biocaddie.org/benchmark-data PMID:29220453

  5. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets.

    PubMed

    Heath, Allison P; Greenway, Matthew; Powell, Raymond; Spring, Jonathan; Suarez, Rafael; Hanley, David; Bandlamudi, Chai; McNerney, Megan E; White, Kevin P; Grossman, Robert L

    2014-01-01

    As large genomics and phenotypic datasets are becoming more common, it is increasingly difficult for most researchers to access, manage, and analyze them. One possible approach is to provide the research community with several petabyte-scale cloud-based computing platforms containing these data, along with tools and resources to analyze it. Bionimbus is an open source cloud-computing platform that is based primarily upon OpenStack, which manages on-demand virtual machines that provide the required computational resources, and GlusterFS, which is a high-performance clustered file system. Bionimbus also includes Tukey, which is a portal, and associated middleware that provides a single entry point and a single sign on for the various Bionimbus resources; and Yates, which automates the installation, configuration, and maintenance of the software infrastructure required. Bionimbus is used by a variety of projects to process genomics and phenotypic data. For example, it is used by an acute myeloid leukemia resequencing project at the University of Chicago. The project requires several computational pipelines, including pipelines for quality control, alignment, variant calling, and annotation. For each sample, the alignment step requires eight CPUs for about 12 h. BAM file sizes ranged from 5 GB to 10 GB for each sample. Most members of the research community have difficulty downloading large genomics datasets and obtaining sufficient storage and computer resources to manage and analyze the data. Cloud computing platforms, such as Bionimbus, with data commons that contain large genomics datasets, are one choice for broadening access to research data in genomics. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.

  6. [Spatial domain display for interference image dataset].

    PubMed

    Wang, Cai-Ling; Li, Yu-Shan; Liu, Xue-Bin; Hu, Bing-Liang; Jing, Juan-Juan; Wen, Jia

    2011-11-01

    The requirements of imaging interferometer visualization is imminent for the user of image interpretation and information extraction. However, the conventional researches on visualization only focus on the spectral image dataset in spectral domain. Hence, the quick show of interference spectral image dataset display is one of the nodes in interference image processing. The conventional visualization of interference dataset chooses classical spectral image dataset display method after Fourier transformation. In the present paper, the problem of quick view of interferometer imager in image domain is addressed and the algorithm is proposed which simplifies the matter. The Fourier transformation is an obstacle since its computation time is very large and the complexion would be even deteriorated with the size of dataset increasing. The algorithm proposed, named interference weighted envelopes, makes the dataset divorced from transformation. The authors choose three interference weighted envelopes respectively based on the Fourier transformation, features of interference data and human visual system. After comparing the proposed with the conventional methods, the results show the huge difference in display time.

  7. Comparison of CORA and EN4 in-situ datasets validation methods, toward a better quality merged dataset.

    NASA Astrophysics Data System (ADS)

    Szekely, Tanguy; Killick, Rachel; Gourrion, Jerome; Reverdin, Gilles

    2017-04-01

    CORA and EN4 are both global delayed time mode validated in-situ ocean temperature and salinity datasets distributed by the Met Office (http://www.metoffice.gov.uk/) and Copernicus (www.marine.copernicus.eu). A large part of the profiles distributed by CORA and EN4 in recent years are Argo profiles from the ARGO DAC, but profiles are also extracted from the World Ocean Database and TESAC profiles from GTSPP. In the case of CORA, data coming from the EUROGOOS Regional operationnal oserving system( ROOS) operated by European institutes no managed by National Data Centres and other datasets of profiles povided by scientific sources can also be found (Sea mammals profiles from MEOP, XBT datasets from cruises ...). (EN4 also takes data from the ASBO dataset to supplement observations in the Arctic). First advantage of this new merge product is to enhance the space and time coverage at global and european scales for the period covering 1950 till a year before the current year. This product is updated once a year and T&S gridded fields are alos generated for the period 1990-year n-1. The enhancement compared to the revious CORA product will be presented Despite the fact that the profiles distributed by both datasets are mostly the same, the quality control procedures developed by the Met Office and Copernicus teams differ, sometimes leading to different quality control flags for the same profile. Started in 2016 a new study started that aims to compare both validation procedures to move towards a Copernicus Marine Service dataset with the best features of CORA and EN4 validation.A reference data set composed of the full set of in-situ temperature and salinity measurements collected by Coriolis during 2015 is used. These measurements have been made thanks to wide range of instruments (XBTs, CTDs, Argo floats, Instrumented sea mammals,...), covering the global ocean. The reference dataset has been validated simultaneously by both teams.An exhaustive comparison of the

  8. Secondary analysis of national survey datasets.

    PubMed

    Boo, Sunjoo; Froelicher, Erika Sivarajan

    2013-06-01

    This paper describes the methodological issues associated with secondary analysis of large national survey datasets. Issues about survey sampling, data collection, and non-response and missing data in terms of methodological validity and reliability are discussed. Although reanalyzing large national survey datasets is an expedient and cost-efficient way of producing nursing knowledge, successful investigations require a methodological consideration of the intrinsic limitations of secondary survey analysis. Nursing researchers using existing national survey datasets should understand potential sources of error associated with survey sampling, data collection, and non-response and missing data. Although it is impossible to eliminate all potential errors, researchers using existing national survey datasets must be aware of the possible influence of errors on the results of the analyses. © 2012 The Authors. Japan Journal of Nursing Science © 2012 Japan Academy of Nursing Science.

  9. Meta-Analysis in Genome-Wide Association Datasets: Strategies and Application in Parkinson Disease

    PubMed Central

    Evangelou, Evangelos; Maraganore, Demetrius M.; Ioannidis, John P.A.

    2007-01-01

    Background Genome-wide association studies hold substantial promise for identifying common genetic variants that regulate susceptibility to complex diseases. However, for the detection of small genetic effects, single studies may be underpowered. Power may be improved by combining genome-wide datasets with meta-analytic techniques. Methodology/Principal Findings Both single and two-stage genome-wide data may be combined and there are several possible strategies. In the two-stage framework, we considered the options of (1) enhancement of replication data and (2) enhancement of first-stage data, and then, we also considered (3) joint meta-analyses including all first-stage and second-stage data. These strategies were examined empirically using data from two genome-wide association studies (three datasets) on Parkinson disease. In the three strategies, we derived 12, 5, and 49 single nucleotide polymorphisms that show significant associations at conventional levels of statistical significance. None of these remained significant after conservative adjustment for the number of performed analyses in each strategy. However, some may warrant further consideration: 6 SNPs were identified with at least 2 of the 3 strategies and 3 SNPs [rs1000291 on chromosome 3, rs2241743 on chromosome 4 and rs3018626 on chromosome 11] were identified with all 3 strategies and had no or minimal between-dataset heterogeneity (I2 = 0, 0 and 15%, respectively). Analyses were primarily limited by the suboptimal overlap of tested polymorphisms across different datasets (e.g., only 31,192 shared polymorphisms between the two tier 1 datasets). Conclusions/Significance Meta-analysis may be used to improve the power and examine the between-dataset heterogeneity of genome-wide association studies. Prospective designs may be most efficient, if they try to maximize the overlap of genotyping platforms and anticipate the combination of data across many genome-wide association studies. PMID:17332845
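
    The between-dataset heterogeneity quoted above (I² = 0, 0, and 15%) follows the standard definition I² = max(0, (Q − df)/Q) × 100, computed from Cochran's Q; the Q value below is hypothetical:

    ```python
    # I-squared heterogeneity from Cochran's Q across n datasets.
    def i_squared(q: float, n_datasets: int) -> float:
        df = n_datasets - 1
        return max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    print(f"{i_squared(2.35, 3):.0f}%")  # e.g. Q=2.35 over 3 datasets -> ~15%
    ```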

  10. Revisiting the mechanism of coagulation factor XIII activation and regulation from a structure/functional perspective

    PubMed Central

    Gupta, Sneha; Biswas, Arijit; Akhter, Mohammad Suhail; Krettler, Christoph; Reinhart, Christoph; Dodt, Johannes; Reuter, Andreas; Philippou, Helen; Ivaskevicius, Vytautas; Oldenburg, Johannes

    2016-01-01

    The activation and regulation of the coagulation Factor XIII (FXIII) protein have been the subject of active research for the past three decades. Although discrete evidence exists on various aspects of FXIII activation and regulation, a combined structure/function view has been lacking. In this study, we present results of a structure/function study of the functional chain of events for FXIII. Our study shows how subtle chronological submolecular changes within calcium binding sites can bring about the detailed transformation of the zymogenic FXIII to its activated form, especially in the context of FXIIIA and FXIIIB subunit interactions. We demonstrate which aspects of FXIII are important for the stabilization of its zymogenic form (the first calcium binding site) and the possible modes of deactivation of the activated form (thrombin-mediated secondary cleavage). Our study for the first time provides a structural outlook on the FXIIIA2B2 heterotetramer assembly, its association, and its dissociation. The regulatory role of the FXIIIB subunits in the overall process is also elaborated upon. In summary, this study provides detailed structural insight into the mechanisms of FXIII activation and regulation that can be used as a template for the development of future highly specific therapeutic inhibitors targeting FXIII in pathological conditions such as thrombosis. PMID:27453290

  11. U.S. Datasets

    Cancer.gov

    Datasets for U.S. mortality, U.S. populations, standard populations, county attributes, and expected survival. Plus SEER-linked databases (SEER-Medicare, SEER-Medicare Health Outcomes Survey [SEER-MHOS], SEER-Consumer Assessment of Healthcare Providers and Systems [SEER-CAHPS]).

  12. Uvf - Unified Volume Format: A General System for Efficient Handling of Large Volumetric Datasets.

    PubMed

    Krüger, Jens; Potter, Kristin; Macleod, Rob S; Johnson, Christopher

    2008-01-01

    With the continual increase in computing power, volumetric datasets with sizes ranging from only a few megabytes to petascale are generated thousands of times per day. Such data may come from an ordinary source such as simple everyday medical imaging procedures, while larger datasets may be generated from cluster-based scientific simulations or measurements of large scale experiments. In computer science an incredible amount of work worldwide is put into the efficient visualization of these datasets. As researchers in the field of scientific visualization, we often have to face the task of handling very large data from various sources. This data usually comes in many different data formats. In medical imaging, the DICOM standard is well established, however, most research labs use their own data formats to store and process data. To simplify the task of reading the many different formats used with all of the different visualization programs, we present a system for the efficient handling of many types of large scientific datasets (see Figure 1 for just a few examples). While primarily targeted at structured volumetric data, UVF can store just about any type of structured and unstructured data. The system is composed of a file format specification with a reference implementation of a reader. It is not only a common, easy to implement format but also allows for efficient rendering of most datasets without the need to convert the data in memory.

  13. Dataset of Scientific Inquiry Learning Environment

    ERIC Educational Resources Information Center

    Ting, Choo-Yee; Ho, Chiung Ching

    2015-01-01

    This paper presents the dataset collected from student interactions with INQPRO, a computer-based scientific inquiry learning environment. The dataset contains records of 100 students and is divided into two portions. The first portion comprises (1) "raw log data", capturing the student's name, interfaces visited, the interface…

  14. Simulation of Smart Home Activity Datasets

    PubMed Central

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-01-01

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation. PMID:26087371

  15. Simulation of Smart Home Activity Datasets.

    PubMed

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  16. UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets.

    PubMed

    Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K

    2015-06-04

    Collective analysis of the rapidly accumulating gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify, in a tuneable manner, the subsets of genes which are consistently co-expressed in all of the provided datasets. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is common practice to test methods by applying them to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method can identify the subsets of genes consistently co-expressed in one subset of datasets while being poorly co-expressed in another, as well as the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data with the realistic expression values of real data, and therefore overcomes the problem of the faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets, as it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few
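
    A toy sketch of the core idea: check whether genes share a cluster with a reference gene in every dataset versus only in a chosen subset of datasets. The labels are fabricated, and the real method fuses partition matrices (Bi-CoPaM) rather than comparing hard labels:

    ```python
    # Consistent co-clustering of genes with a reference gene across
    # datasets; a toy stand-in for the UNCLES idea, not the method itself.
    labels = {                       # dataset -> {gene: cluster id}
        "d1": {"g1": 0, "g2": 0, "g3": 1, "g4": 0},
        "d2": {"g1": 2, "g2": 2, "g3": 2, "g4": 2},
        "d3": {"g1": 1, "g2": 1, "g3": 0, "g4": 0},
    }

    def co_clustered(ds, a, b):
        return labels[ds][a] == labels[ds][b]

    ref, genes = "g1", ("g2", "g3", "g4")
    always = [g for g in genes
              if all(co_clustered(d, ref, g) for d in labels)]
    only_d1_d2 = [g for g in genes
                  if all(co_clustered(d, ref, g) for d in ("d1", "d2"))
                  and not co_clustered("d3", ref, g)]
    print(always, only_d1_d2)        # ['g2'] ['g4']
    ```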

  17. Intracranial Hemorrhage: A Devastating Outcome of Congenital Bleeding Disorders-Prevalence, Diagnosis, and Management, with a Special Focus on Congenital Factor XIII Deficiency.

    PubMed

    Alavi, Seyed Ezatolla Rafiee; Jalalvand, Masumeh; Assadollahi, Vahideh; Tabibian, Shadi; Dorgalaleh, Akbar

    2018-04-01

    Intracranial hemorrhage (ICH) is a medical emergency. In congenital bleeding disorders, ICH is a devastating presentation accompanied by a high rate of morbidity and mortality. The prevalence of ICH is highly variable among congenital bleeding disorders, with the highest incidence observed in factor (F) XIII deficiency (FXIIID) (∼30%). This life-threatening presentation is less common in afibrinogenemia and FVIII, FIX, FVII, and FX deficiencies, and is rare in severe FV and FII deficiencies, type 3 von Willebrand disease, and inherited platelet function disorders (IPFDs). In FXIIID, this diathesis most often occurs after trauma in children, whereas spontaneous ICH is more frequent in adults. About 15% of patients with FXIIID and ICH die; the bleeding causes 80% of deaths in this coagulopathy. Although in FXIIID the bleed is most commonly intraparenchymal (>90%), epidural, subdural, and subarachnoid hemorrhages have also been reported, albeit rarely. As this life-threatening bleeding causes neurological complications, early diagnosis can prevent further expansion of the hematoma and secondary damage. Neuroimaging plays a crucial role in the diagnosis of ICH, but signs and symptoms in patients with severe FXIIID should trigger replacement therapy even before establishment of the diagnosis. Although a high dose of FXIII concentrate can reduce the rate of morbidity and mortality of ICH in FXIIID, it may occasionally trigger inhibitor development, thus complicating ICH management and future prophylaxis. Nevertheless, replacement therapy is the mainstay of treatment for ICH in FXIIID. Neurosurgery is performed in patients with FXIIID who have an epidural hematoma with a diameter exceeding 2 cm or an ICH volume of more than 30 cm³. Contact sports are not recommended in people with FXIIID as they can elicit ICH. However, a considerable number of safe sports and activities have been suggested to have more benefits than dangers for patients with congenital bleeding

  18. A multi-dataset data-collection strategy produces better diffraction data

    PubMed Central

    Liu, Zhi-Jie; Chen, Lirong; Wu, Dong; Ding, Wei; Zhang, Hua; Zhou, Weihong; Fu, Zheng-Qing; Wang, Bi-Cheng

    2011-01-01

    A multi-dataset (MDS) data-collection strategy is proposed and analyzed for macromolecular crystal diffraction data acquisition. The theoretical analysis indicated that the MDS strategy can reduce the standard deviation (background noise) of diffraction data compared with the commonly used single-dataset strategy for a fixed X-ray dose. In order to validate the hypothesis experimentally, a data-quality evaluation process, termed a readiness test of the X-ray data-collection system, was developed. The anomalous signals of sulfur atoms in zinc-free insulin crystals were used as the probe to differentiate the quality of data collected using different data-collection strategies. The data-collection results using home-laboratory-based rotating-anode X-ray and synchrotron X-ray systems indicate that the diffraction data collected with the MDS strategy contain more accurate anomalous signals from sulfur atoms than the data collected with a regular data-collection strategy. In addition, the MDS strategy offered more advantages with respect to radiation-damage-sensitive crystals and better usage of rotating-anode as well as synchrotron X-rays. PMID:22011470
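
    One statistical intuition behind the MDS strategy, assuming independent, read-noise-like background in each pass (an illustration only, not the paper's full analysis): merging n short passes averages the noise down by roughly √n:

    ```python
    # Averaging n independent noisy passes shrinks noise by ~sqrt(n).
    # Purely illustrative; real diffraction noise models are more involved.
    import numpy as np

    rng = np.random.default_rng(0)
    true_intensity, n_passes = 100.0, 4
    single = true_intensity + rng.normal(0, 10, size=100_000)
    merged = (true_intensity + rng.normal(0, 10, size=(n_passes, 100_000))).mean(axis=0)

    print(f"single-pass noise {single.std():.2f}, merged noise {merged.std():.2f}")
    # merged noise ~ 10 / sqrt(4) = 5
    ```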

  19. Providing Geographic Datasets as Linked Data in Sdi

    NASA Astrophysics Data System (ADS)

    Hietanen, E.; Lehto, L.; Latvala, P.

    2016-06-01

    In this study, a prototype service to provide data from a Web Feature Service (WFS) as linked data is implemented. First, persistent and unique Uniform Resource Identifiers (URIs) are created for all spatial objects in the dataset. The objects are available from those URIs in the Resource Description Framework (RDF) data format. Next, a Web Ontology Language (OWL) ontology is created to describe the dataset's information content using the Open Geospatial Consortium's (OGC) GeoSPARQL vocabulary. The existing data model is modified to take the linked data principles into account. The implemented service produces an HTTP response dynamically. The data for the response are first fetched from the existing WFS; the Geography Markup Language (GML) output of the WFS is then transformed on the fly into the RDF format. Content negotiation is used to serve the data in different RDF serialization formats. This solution facilitates the use of a dataset in different applications without replicating the whole dataset. In addition, individual spatial objects in the dataset can be referred to with URIs, and the needed information content of the objects can be easily extracted from the RDF serializations available from those URIs. A solution for linking data objects to the dataset URI is also introduced by using the Vocabulary of Interlinked Datasets (VoID). The dataset is divided into subsets and each subset is given its own persistent and unique URI. This enables the whole dataset to be explored with a web browser and all individual objects to be indexed by search engines.
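
    A minimal sketch of publishing one spatial object under a persistent URI with a GeoSPARQL geometry, using rdflib; the namespace, identifiers, and coordinates are invented for illustration:

    ```python
    # One feature as linked data: persistent URI + GeoSPARQL WKT geometry.
    from rdflib import Graph, Literal, Namespace, RDF

    GEO = Namespace("http://www.opengis.net/ont/geosparql#")
    EX = Namespace("http://data.example.org/feature/")   # hypothetical URI base

    g = Graph()
    g.bind("geo", GEO)
    feature = EX["building-42"]                          # persistent object URI
    geom = EX["building-42/geometry"]

    g.add((feature, RDF.type, GEO.Feature))
    g.add((feature, GEO.hasGeometry, geom))
    g.add((geom, RDF.type, GEO.Geometry))
    g.add((geom, GEO.asWKT, Literal("POINT(24.94 60.17)", datatype=GEO.wktLiteral)))

    print(g.serialize(format="turtle"))                  # one offered RDF serialization
    ```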

  20. NATIONAL HYDROGRAPHY DATASET

    EPA Science Inventory

    Resource Purpose:The National Hydrography Dataset (NHD) is a comprehensive set of digital spatial data that contains information about surface water features such as lakes, ponds, streams, rivers, springs and wells. Within the NHD, surface water features are combined to fo...

  1. The Optimum Dataset method - examples of the application

    NASA Astrophysics Data System (ADS)

    Błaszczak-Bąk, Wioleta; Sobieraj-Żłobińska, Anna; Wieczorek, Beata

    2018-01-01

    Data reduction is a procedure for decreasing the size of a dataset in order to make its analysis more effective and easier. Reduction of a dataset is an issue that requires proper planning, so that the reduced dataset still meets all the user's expectations. Evidently, it is better if the result is an optimal solution in terms of the adopted criteria. Among the reduction methods that provide an optimal solution is the Optimum Dataset (OptD) method proposed by Błaszczak-Bąk (2016). The paper presents the application of this method to different datasets from LiDAR and the possibility of using the method for various purposes of study. The following reduced datasets were presented: (a) measurement of Sielska street in Olsztyn (Airborne Laser Scanning data - ALS data), (b) measurement of the bas-relief on a building in Gdańsk (Terrestrial Laser Scanning data - TLS data), (c) a dataset from a measurement of the Biebrza river (TLS data).

  2. The Role of Datasets on Scientific Influence within Conflict Research.

    PubMed

    Van Holt, Tracy; Johnson, Jeffery C; Moates, Shiloh; Carley, Kathleen M

    2016-01-01

    We inductively tested whether a coherent field of inquiry in human conflict research emerged in an analysis of published research involving "conflict" in the Web of Science (WoS) over a 66-year period (1945-2011). We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA), a specialized social network analysis, on this citation network (~1.5 million works) to highlight the main contributions in conflict research and to test whether research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA, suggesting a coherent field of inquiry in which researchers acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed, such as interpersonal conflict or conflict among pharmaceuticals, did not form their own CP. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This is in contrast to a main path analysis of conflict from 1957-1971, where ideas did not persist: multiple paths existed and died or emerged, reflecting a lack of scientific coherence (Carley, Hummon, and Harty, 1993). The critical path had a number of key features: 1) Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s-1990s and continued on until 2011. More recent intrastate studies that focused on inequalities emerged from earlier interstate studies on the democracy of peace on the path. 2) Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3) We used keyword analysis to independently show how the CP was topically linked (i.e., through democracy, modeling, resources, and geography). Publicly available conflict datasets developed early on helped shape the
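
    For readers unfamiliar with path-based citation analysis, the toy sketch below extracts the longest chain through a small citation DAG with networkx. This is only a crude stand-in for the weighted critical path analysis the authors actually used, and the node labels are invented.

        # Illustrative only: a crude "main path" through a citation DAG.
        # Real CPA uses traversal weights (e.g., search path counts), not
        # just the longest chain.
        import networkx as nx

        G = nx.DiGraph()  # edge u -> v: work u is cited by the later work v
        G.add_edges_from([("A1945", "B1960"), ("B1960", "C1975"),
                          ("B1960", "D1980"), ("C1975", "E2011")])

        # The longest citation chain approximates a backbone of cumulative ideas.
        path = nx.dag_longest_path(G)
        print(path)  # ['A1945', 'B1960', 'C1975', 'E2011']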

  3. A Combined Surface Temperature Dataset for the Arctic from MODIS and AVHRR

    NASA Astrophysics Data System (ADS)

    Dodd, E.; Veal, K. L.; Ghent, D.; Corlett, G. K.; Remedios, J. J.

    2017-12-01

    Surface Temperature (ST) changes in the Polar Regions are predicted to be more rapid than either global averages or responses at lower latitudes. Observations of STs and other changes associated with climate change increasingly confirm these predictions in the Arctic. Furthermore, recent high-profile events of anomalously warm temperatures have increased interest in Arctic surface temperatures. It is, therefore, particularly important to monitor Arctic climate change. Satellite observations are particularly relevant here, as the Polar Regions are well served by low-Earth orbiting satellites. Whilst clouds often cause problems for satellite observations of the surface, in situ observations of STs are much sparser. Previous work at the University of Leicester has produced a combined land, ocean and ice ST dataset for the Arctic using ATSR data (AAST), which covers the period 1995 to 2012. In order to facilitate investigation of more recent changes in the Arctic (2010 to 2016), we have produced another combined surface temperature dataset using MODIS and AVHRR: the Metop-A AVHRR and MODIS Arctic Surface Temperature dataset (AMAST). The method of cloud-clearing, the use of auxiliary data for ice classification, and the ST retrievals used for each surface type in AMAST will be described. AAST and AMAST were compared in the time period common to both datasets. We will provide results from this intercomparison, as well as an assessment of the impact of utilising data from wide- and narrow-swath sensors. Time series of ST anomalies over the Arctic region produced from AMAST will be presented.

  4. Development of a global historic monthly mean precipitation dataset

    NASA Astrophysics Data System (ADS)

    Yang, Su; Xu, Wenhui; Xu, Yan; Li, Qingxiang

    2016-04-01

    A global historic precipitation dataset is the basis for climate and water-cycle research. Several global historic land-surface precipitation datasets have been developed by international data centers such as the US National Climatic Data Center (NCDC), the European Climate Assessment & Dataset project team, the Met Office, etc., but so far no such dataset has been developed by a research institute in China. In addition, each dataset has its own regional focus, and the existing global precipitation datasets contain only sparse observational stations over China, which may result in uncertainties in East Asian precipitation studies. In order to take into account comprehensive historic information, users might need to employ two or more datasets. However, the non-uniform data formats, data units, station IDs, and so on add extra difficulties for users exploiting these datasets. For this reason, a complete historic precipitation dataset that takes advantage of the various datasets has been developed and produced at the National Meteorological Information Center of China. Precipitation observations from 12 sources are aggregated, and the data formats, data units, and station IDs are unified. Duplicated stations with the same ID are identified, and duplicated observations are removed. A consistency test, a correlation coefficient test, a significance t-test at the 95% confidence level, and a significance F-test at the 95% confidence level are conducted first to ensure data reliability. Only those data that satisfy all four criteria are integrated to produce the China Meteorological Administration global precipitation (CGP) historic precipitation dataset version 1.0. It contains observations at 31 thousand stations with 1.87 × 10⁷ data records, among which 4152 precipitation time series are longer than 100 yr. This dataset plays a critical role in climate research due to its advantages in large data volume and high density of station network, compared to
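
    The aggregation step described here (unifying formats, units and station IDs, then removing duplicated observations) can be sketched in pandas. The frames, unit conversion and column names below are synthetic assumptions, not the CGP production workflow.

        # Hedged sketch of multi-source aggregation with synthetic data.
        import pandas as pd

        # Two source datasets with different units and ID formatting.
        a = pd.DataFrame({"station_id": ["CHN001 ", "CHN002"],
                          "date": ["2000-01-01", "2000-01-01"],
                          "precip": [1.2, 0.0]})            # centimetres (assumed)
        b = pd.DataFrame({"station_id": ["CHN001", "CHN003"],
                          "date": ["2000-01-01", "2000-01-01"],
                          "precip": [12.0, 3.5]})           # millimetres

        a["precip_mm"] = a["precip"] * 10.0                 # unify units to mm
        b["precip_mm"] = b["precip"]
        for df in (a, b):
            df["station_id"] = df["station_id"].str.strip() # unify station IDs

        merged = pd.concat([a[["station_id", "date", "precip_mm"]],
                            b[["station_id", "date", "precip_mm"]]],
                           ignore_index=True)
        # Remove duplicated observations for stations shared between sources.
        merged = merged.drop_duplicates(subset=["station_id", "date"], keep="first")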

  5. A dataset on tail risk of commodities markets.

    PubMed

    Powell, Robert J; Vo, Duc H; Pham, Thach N; Singh, Abhay K

    2017-12-01

    This article contains the datasets related to the research article "The long and short of commodity tails and their relationship to Asian equity markets" (Powell et al., 2017) [1]. The datasets contain the daily prices (and price movements) of 24 different commodities decomposed from the S&P GSCI index and the daily prices (and price movements) of three share market indices covering the World, Asia, and South East Asia for the period 2004-2015. The dataset is then divided into annual periods, showing the worst 5% of price movements for each year. The datasets are convenient for examining the tail risk of different commodities, as measured by Conditional Value at Risk (CVaR), as well as its changes over time. The datasets can also be used to investigate the association between commodity markets and share markets.
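
    A minimal worked example of the tail-risk measure the dataset supports, assuming the standard lower-tail definition of CVaR and synthetic daily returns:

        # CVaR at the 5% level, matching the "worst 5% of price movements" idea.
        import numpy as np

        rng = np.random.default_rng(0)
        returns = rng.normal(0.0, 0.02, 250)   # one year of daily movements (synthetic)

        alpha = 0.05
        var = np.quantile(returns, alpha)       # Value at Risk: 5% quantile
        cvar = returns[returns <= var].mean()   # CVaR: mean of the worst-5% tail
        print(f"VaR={var:.4f}, CVaR={cvar:.4f}")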

  6. Approximation-based common principal component for feature extraction in multi-class brain-computer interfaces.

    PubMed

    Hoang, Tuan; Tran, Dat; Huang, Xu

    2013-01-01

    Common Spatial Pattern (CSP) is a state-of-the-art method for feature extraction in brain-computer interface (BCI) systems. However, it is designed for two-class BCI classification problems. Current extensions of this method to multiple classes, based on subspace union and covariance-matrix similarity, do not provide high performance. This paper presents a new approach to solving multi-class BCI classification problems by forming a subspace assembled from the original subspaces; the proposed method is called Approximation-based Common Principal Component (ACPC). We performed experiments on Dataset 2a of BCI Competition IV to evaluate the proposed method. This dataset was designed for motor imagery classification with four classes. Preliminary experiments show that the proposed ACPC feature extraction method, when combined with Support Vector Machines, outperforms CSP-based feature extraction methods on the experimental dataset.
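
    Since the paper builds on CSP, a minimal two-class CSP sketch may help. It uses the standard generalized-eigendecomposition formulation on synthetic trials; it is not the proposed ACPC method itself.

        # Baseline two-class CSP on synthetic trials (channels x samples).
        import numpy as np
        from scipy.linalg import eigh

        def covariance(trials):
            # Average trace-normalized spatial covariance over trials.
            covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
            return np.mean(covs, axis=0)

        rng = np.random.default_rng(1)
        X1 = rng.normal(size=(20, 8, 250))   # class-1 trials (synthetic)
        X2 = rng.normal(size=(20, 8, 250))   # class-2 trials (synthetic)

        C1, C2 = covariance(X1), covariance(X2)
        # Solve C1 w = lambda (C1 + C2) w; extreme eigenvectors are CSP filters.
        vals, vecs = eigh(C1, C1 + C2)
        W = np.hstack([vecs[:, :2], vecs[:, -2:]])   # two filters per class

        # Classic CSP features: log-variance of spatially filtered trials.
        features = np.log([np.var(W.T @ x, axis=1) for x in X1])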

  7. Publishing datasets with eSciDoc and panMetaDocs

    NASA Astrophysics Data System (ADS)

    Ulbricht, D.; Klump, J.; Bertelmann, R.

    2012-04-01

    publishing scientific datasets as electronic data supplements to research papers. Publication of research manuscripts already has a well-established workflow that shares junctures with other processes and involves several parties; the activities of the author, the reviewer, the print publisher and the data publisher have to be coordinated into a common data publication workflow. The case of data publication at GFZ Potsdam displays some specifics, e.g. the DOIDB webservice. The DOIDB is a proxy service at GFZ for the DataCite [4] DOI registration and its metadata store. DOIDB provides a local summary of the dataset DOIs registered through GFZ as a publication agent. An additional use case for the DOIDB is its ability to enrich the DataCite metadata with additional custom attributes, like a geographic reference in a DIF record. These attributes are at the moment not available in the DataCite metadata schema but would be valuable elements for the compilation of data catalogues in the earth sciences and for the dissemination of catalogue data via OAI-PMH. [1] http://www.escidoc.org , eSciDoc, FIZ Karlsruhe, Germany [2] http://panmetadocs.sf.net , panMetaDocs, GFZ Potsdam, Germany [3] http://metaworks.pangaea.de , panMetaWorks, Dr. R. Huber, MARUM, Univ. Bremen, Germany [4] http://www.datacite.org

  8. Wind Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov Websites

    The Wind Integration National Dataset (WIND) Toolkit is an update and expansion of the Eastern Wind Integration Data Set; it includes meteorological conditions and turbine power data.

  9. Control Measure Dataset

    EPA Pesticide Factsheets

    The EPA Control Measure Dataset is a collection of documents describing air pollution control measures available to regulated facilities for the control and abatement of air pollution emissions from a range of regulated source types, whether directly through the use of technical measures, or indirectly through economic or other measures.

  10. Evaluating Factor XIII Specificity for Glutamine-Containing Substrates Using a MALDI-TOF Mass Spectrometry Assay

    PubMed Central

    Doiphode, Prakash G.; Malovichko, Marina V.; Mouapi, Kelly Njine; Maurer, Muriel C.

    2014-01-01

    Activated Factor XIII (FXIIIa) catalyzes the formation of γ-glutamyl-ε-lysyl cross-links within the fibrin blood clot network. Although several cross-linking targets have been identified, the characteristic features that define FXIIIa substrate specificity are not well understood. To learn more about how FXIIIa selects its targets, a matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) based assay was developed that could directly follow the consumption of a glutamine-containing substrate and the formation of a cross-linked product with glycine ethyl ester. This FXIIIa kinetics assay is no longer reliant on a secondary coupled reaction, on substrate labeling, or on detecting the final deacylation portion of the transglutaminase reaction. With the MALDI-TOF MS assay, glutamine-containing peptides derived from α2-antiplasmin, S. aureus fibronectin binding protein A, and thrombin activatable fibrinolysis inhibitor were examined directly. Results suggest that the FXIIIa active site surface responds to changes in substrate residues following the reactive glutamine. The P-1 substrate position is sensitive to charge character, and the P-2 and P-3 positions to the broad FXIIIa substrate specificity pockets. The more distant P-8 to P-11 region serves as a secondary substrate anchoring point. New knowledge on FXIIIa specificity may be used to design better substrates or inhibitors of this transglutaminase. PMID:24751466

  11. Two ultraviolet radiation datasets that cover China

    NASA Astrophysics Data System (ADS)

    Liu, Hui; Hu, Bo; Wang, Yuesi; Liu, Guangren; Tang, Liqin; Ji, Dongsheng; Bai, Yongfei; Bao, Weikai; Chen, Xin; Chen, Yunming; Ding, Weixin; Han, Xiaozeng; He, Fei; Huang, Hui; Huang, Zhenying; Li, Xinrong; Li, Yan; Liu, Wenzhao; Lin, Luxiang; Ouyang, Zhu; Qin, Boqiang; Shen, Weijun; Shen, Yanjun; Su, Hongxin; Song, Changchun; Sun, Bo; Sun, Song; Wang, Anzhi; Wang, Genxu; Wang, Huimin; Wang, Silong; Wang, Youshao; Wei, Wenxue; Xie, Ping; Xie, Zongqiang; Yan, Xiaoyuan; Zeng, Fanjiang; Zhang, Fawei; Zhang, Yangjian; Zhang, Yiping; Zhao, Chengyi; Zhao, Wenzhi; Zhao, Xueyong; Zhou, Guoyi; Zhu, Bo

    2017-07-01

    Ultraviolet (UV) radiation has significant effects on ecosystems, environments, and human health, as well as atmospheric processes and climate change. Two ultraviolet radiation datasets are described in this paper. One contains hourly observations of UV radiation measured at 40 Chinese Ecosystem Research Network stations from 2005 to 2015. CUV3 broadband radiometers were used to observe the UV radiation, with an accuracy of 5%, which meets the World Meteorological Organization's measurement standards. The extremum method was used to control the quality of the measured datasets. The other dataset contains daily cumulative UV radiation estimates that were calculated using an all-sky estimation model combined with a hybrid model. The reconstructed daily UV radiation data span from 1961 to 2014. The mean absolute bias error and root-mean-square error are smaller than 30% at most stations, and most of the mean bias error values are negative, which indicates underestimation of the UV radiation intensity. These datasets can improve our basic knowledge of the spatial and temporal variations in UV radiation. Additionally, these datasets can be used in studies of potential ozone formation and atmospheric oxidation, as well as simulations of ecological processes.

  12. BayesMotif: de novo protein sorting motif discovery from impure datasets.

    PubMed

    Hu, Jianjun; Zhang, Fan

    2010-01-18

    Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals consists of amino acid sub-sequences usually located at the N- or C-termini of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly, so effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian-classifier-based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motif in which a highly conserved anchor is present along with a less conserved motif region. A false-positive removal procedure was developed to iteratively remove sequences that are unlikely to contain true motifs, so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted-motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence datasets. They also show that the false-positive removal procedure can help to identify true motifs even when only 20% of the input sequences contain true motif instances. We proposed BayesMotif, a novel Bayesian-classification-based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short, highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs, which may help to overcome the limitations of
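
    As a toy illustration of scoring candidate motif windows against a probabilistic sequence model (not the published BayesMotif implementation, whose model and features differ), consider a per-position log-odds model with pseudocounts:

        # Toy position-specific scoring of motif windows; purely illustrative.
        import math
        from collections import Counter

        AMINO = "ACDEFGHIKLMNPQRSTVWY"

        def train_model(motif_seqs, pseudo=1.0):
            """Per-position log-odds against a uniform amino acid background."""
            model = []
            for i in range(len(motif_seqs[0])):
                counts = Counter(seq[i] for seq in motif_seqs)
                total = len(motif_seqs) + pseudo * len(AMINO)
                model.append({a: math.log(((counts[a] + pseudo) / total)
                                          / (1 / len(AMINO)))
                              for a in AMINO})
            return model

        def score(window, model):
            return sum(pos[a] for pos, a in zip(model, window))

        model = train_model(["KRRA", "KRKA", "RRRA"])        # toy "anchored" motifs
        print(score("KRRA", model) > score("GAVL", model))   # True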

  13. The Harvard organic photovoltaic dataset

    DOE PAGES

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; ...

    2016-09-27

    Presented in this work is the Harvard Organic Photovoltaic Dataset (HOPV15), a collation of experimental photovoltaic data from the literature and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum-chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use both in relating electronic structure calculations to experimental observations through the generation of calibration schemes, and in the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  14. The Harvard organic photovoltaic dataset.

    PubMed

    Lopez, Steven A; Pyzer-Knapp, Edward O; Simm, Gregor N; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-09-27

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum-chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use both in relating electronic structure calculations to experimental observations through the generation of calibration schemes, and in the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  15. Impact of developing a multidisciplinary coded dataset standard on administrative data accuracy for septoplasty, septorhinoplasty and nasal trauma surgery.

    PubMed

    Nouraei, S A R; Hudovsky, A; Virk, J S; Saleh, H A

    2017-04-01

    This study aimed to develop a multidisciplinary coded dataset standard for nasal surgery and to assess its impact on data accuracy. An audit of 528 patients undergoing septal and/or inferior turbinate surgery, rhinoplasty and/or septorhinoplasty, and nasal fracture surgery was undertaken. A total of 200 septoplasties, 109 septorhinoplasties, 57 complex septorhinoplasties and 116 nasal fractures were analysed. There were 76 (14.4 per cent) changes to the primary diagnosis. Septorhinoplasties were the most commonly amended procedures. The overall audit-related income change for nasal surgery was £8.78 per patient. Use of a multidisciplinary coded dataset standard revealed that nasal diagnoses were under-coded; a significant proportion of patients received more precise diagnoses following the audit. There was also significant under-coding of both morbidities and revision surgery. The multidisciplinary coded dataset standard approach can improve the accuracy of both data capture and information flow, and thus ultimately create a more reliable dataset for use in outcomes assessment and health planning.

  16. Development of South Australian-Victorian Prostate Cancer Health Outcomes Research Dataset.

    PubMed

    Ruseckaite, Rasa; Beckmann, Kerri; O'Callaghan, Michael; Roder, David; Moretti, Kim; Zalcberg, John; Millar, Jeremy; Evans, Sue

    2016-01-22

    Prostate cancer is the most commonly diagnosed and prevalent malignancy reported to Australian cancer registries, with numerous studies from single institutions summarizing patient outcomes at individual hospitals or States. In order to provide an overview of patterns of care of men with prostate cancer across multiple institutions in Australia, a specialized dataset was developed. This dataset, containing amalgamated data from South Australian and Victorian prostate cancer registries, is called the South Australian-Victorian Prostate Cancer Health Outcomes Research Dataset (SA-VIC PCHORD). A total of 13,598 de-identified records of men with prostate cancer diagnosed and consented between 2008 and 2013 in South Australia and Victoria were merged into the SA-VIC PCHORD. SA-VIC PCHORD contains detailed information about socio-demographic, diagnostic and treatment characteristics of patients with prostate cancer in South Australia and Victoria. Data from individual registries are available to researchers and can be accessed under individual data access policies in each State. The SA-VIC PCHORD will be used for numerous studies summarizing trends in diagnostic characteristics, survival and patterns of care in men with prostate cancer in Victoria and South Australia. It is expected that in the future the SA-VIC PCHORD will become a principal component of the recently developed bi-national Australian and New Zealand Prostate Cancer Outcomes Registry to collect and report patterns of care and standardised patient reported outcome measures of men nation-wide in Australia and New Zealand.

  17. Genomics dataset of unidentified disclosed isolates.

    PubMed

    Rekadwad, Bhagwan N

    2016-09-01

    Analysis of DNA sequences is necessary for the higher hierarchical classification of organisms; it gives clues about the characteristics of organisms and their taxonomic position. This dataset was chosen to explore the complexities of unidentified DNA sequences in disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated, and the AT/GC content of the DNA sequences was analyzed. The QR codes are helpful for quick identification of isolates, and the AT/GC content is helpful for studying sequence stability at different temperatures. Additionally, a dataset of cleavage codes and enzyme codes from a restriction digestion study, which is helpful for performing studies using short DNA sequences, is reported. The dataset disclosed here is new revelatory data for the exploration of unique DNA sequences for evaluation, identification, comparison and analysis.

  18. Key Lessons in Building "Data Commons": The Open Science Data Cloud Ecosystem

    NASA Astrophysics Data System (ADS)

    Patterson, M.; Grossman, R.; Heath, A.; Murphy, M.; Wells, W.

    2015-12-01

    Cloud computing technology has created a shift around data and data analysis by allowing researchers to push computation to data as opposed to having to pull data to an individual researcher's computer. Subsequently, cloud-based resources can provide unique opportunities to capture computing environments used both to access raw data in its original form and also to create analysis products which may be the source of data for tables and figures presented in research publications. Since 2008, the Open Cloud Consortium (OCC) has operated the Open Science Data Cloud (OSDC), which provides scientific researchers with computational resources for storing, sharing, and analyzing large (terabyte and petabyte-scale) scientific datasets. OSDC has provided compute and storage services to over 750 researchers in a wide variety of data intensive disciplines. Recently, internal users have logged about 2 million core hours each month. The OSDC also serves the research community by colocating these resources with access to nearly a petabyte of public scientific datasets in a variety of fields also accessible for download externally by the public. In our experience operating these resources, researchers are well served by "data commons," meaning cyberinfrastructure that colocates data archives, computing, and storage infrastructure and supports essential tools and services for working with scientific data. In addition to the OSDC public data commons, the OCC operates a data commons in collaboration with NASA and is developing a data commons for NOAA datasets. As cloud-based infrastructures for distributing and computing over data become more pervasive, we ask, "What does it mean to publish data in a data commons?" Here we present the OSDC perspective and discuss several services that are key in architecting data commons, including digital identifier services.

  19. Application of Huang-Hilbert Transforms to Geophysical Datasets

    NASA Technical Reports Server (NTRS)

    Duffy, Dean G.

    2003-01-01

    The Huang-Hilbert transform is a promising new method for analyzing nonstationary and nonlinear datasets. In this talk I will apply this technique to several important geophysical datasets. To understand the strengths and weaknesses of this method, multi-year, hourly datasets of sea level heights and solar radiation will be analyzed. Then we will apply this transform to the analysis of gravity waves observed in a mesoscale observational network.
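
    The Hilbert step of the Huang-Hilbert transform can be sketched as follows. The empirical mode decomposition that would normally precede it (e.g., via the third-party PyEMD package) is omitted here, and the input mode is a synthetic nonstationary signal rather than geophysical data.

        # Instantaneous amplitude/frequency of one mode via the Hilbert transform.
        import numpy as np
        from scipy.signal import hilbert

        fs = 100.0                                      # sampling rate (assumed)
        t = np.arange(0, 10, 1 / fs)
        imf = np.cos(2 * np.pi * (1.0 + 0.1 * t) * t)   # toy nonstationary mode

        analytic = hilbert(imf)
        amplitude = np.abs(analytic)                    # instantaneous amplitude
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) / (2 * np.pi) * fs   # instantaneous frequency (Hz)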

  20. The Harvard organic photovoltaic dataset

    PubMed Central

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-01-01

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum-chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use both in relating electronic structure calculations to experimental observations through the generation of calibration schemes, and in the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312

  1. The Role of Datasets on Scientific Influence within Conflict Research

    PubMed Central

    Van Holt, Tracy; Johnson, Jeffery C.; Moates, Shiloh; Carley, Kathleen M.

    2016-01-01

    We inductively tested whether a coherent field of inquiry in human conflict research emerged in an analysis of published research involving “conflict” in the Web of Science (WoS) over a 66-year period (1945–2011). We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA), a specialized social network analysis, on this citation network (~1.5 million works) to highlight the main contributions in conflict research and to test whether research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA, suggesting a coherent field of inquiry in which researchers acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed, such as interpersonal conflict or conflict among pharmaceuticals, did not form their own CP. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This is in contrast to a main path analysis of conflict from 1957–1971, where ideas did not persist: multiple paths existed and died or emerged, reflecting a lack of scientific coherence (Carley, Hummon, and Harty, 1993). The critical path had a number of key features: 1) Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s-1990s and continued on until 2011. More recent intrastate studies that focused on inequalities emerged from earlier interstate studies on the democracy of peace on the path. 2) Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3) We used keyword analysis to independently show how the CP was topically linked (i.e., through democracy, modeling, resources, and geography). Publicly available conflict datasets developed early on helped

  2. Interpolation of diffusion weighted imaging datasets.

    PubMed

    Dyrby, Tim B; Lundell, Henrik; Burke, Mark W; Reislev, Nina L; Paulson, Olaf B; Ptito, Maurice; Siebner, Hartwig R

    2014-12-01

    Diffusion weighted imaging (DWI) is used to study white-matter fibre organisation, orientation and structural connectivity by means of fibre reconstruction algorithms and tractography. In clinical settings, limited scan time compromises the possibility of achieving high image resolution for finer anatomical details and sufficient signal-to-noise ratio for reliable fibre reconstruction. We assessed the potential benefits of interpolating DWI datasets to a higher image resolution before fibre reconstruction using a diffusion tensor model. Simulations of straight and curved crossing tracts smaller than or equal to the voxel size showed that conventional higher-order interpolation methods improved the geometrical representation of white-matter tracts with reduced partial-volume effect (PVE), except at tract boundaries. Simulations and interpolation of ex vivo monkey brain DWI datasets revealed that conventional interpolation methods fail to disentangle fine anatomical details if PVE is too pronounced in the original data. For validation, we used ex vivo DWI datasets acquired at various image resolutions as well as Nissl-stained sections. Increasing the image resolution by a factor of eight yielded finer geometrical resolution and more anatomical details in complex regions such as tract boundaries and cortical layers, which are normally only visualized at higher image resolutions. Similar results were found with a typical clinical human DWI dataset. However, a possible bias in quantitative values imposed by the interpolation method used should be considered. The results indicate that conventional interpolation methods can be successfully applied to DWI datasets for mining anatomical details that are normally seen only at higher resolutions, which will aid in tractography and microstructural mapping of tissue compartments. Copyright © 2014. Published by Elsevier Inc.
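
    A minimal sketch of the kind of higher-order interpolation evaluated in the paper, assuming cubic b-spline interpolation via scipy and a synthetic volume; a factor-of-eight increase in voxel count corresponds to doubling each axis.

        # Upsample one diffusion-weighted volume before fibre reconstruction.
        import numpy as np
        from scipy.ndimage import zoom

        dwi = np.random.rand(64, 64, 40)           # synthetic DWI volume
        upsampled = zoom(dwi, zoom=2, order=3)     # cubic b-spline interpolation
        print(dwi.shape, "->", upsampled.shape)    # (64, 64, 40) -> (128, 128, 80)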

  3. [German national consensus on wound documentation of leg ulcer : Part 1: Routine care - standard dataset and minimum dataset].

    PubMed

    Heyer, K; Herberger, K; Protz, K; Mayer, A; Dissemond, J; Debus, S; Augustin, M

    2017-09-01

    Standards for basic documentation and the course of treatment increase quality assurance and efficiency in health care. To date, no such standards are available for patients with leg ulcers in Germany. The aim of the study was to develop standards for the routine documentation of patients with leg ulcers. This article presents the recommended variables of a "standard dataset" and a "minimum dataset". Consensus building took place among experts from 38 scientific societies, professional associations, insurance and supply networks (n = 68 experts). After a systematic international literature search, available standards were reviewed and supplemented with the expert group's own considerations. From 2012 to 2015, documentation standards were defined in multistage online rounds and personal meetings. Consensus was achieved on 18 variables for the minimum dataset and 48 variables for the standard dataset in a total of seven meetings and nine online Delphi rounds. The datasets cover patient baseline data, data on general health status, wound characteristics, diagnostic and therapeutic interventions, patient-reported outcomes, nutrition, and education status. Based on a multistage continuous decision-making process, a standard for the measurement of events in the routine care of patients with leg ulcers was developed.

  4. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling.

    PubMed

    Mansouri, K; Grulke, C M; Richard, A M; Judson, R S; Williams, A J

    2016-11-01

    The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and the associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals, using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRN, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats and identifiers, and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine-learning procedure was applied to evaluate the impact of this curation process: the performance of QSAR models built on only the highest-quality subset of the original dataset was compared with that of models built on the larger curated and corrected dataset, and the latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets and is being made publicly available for further usage and integration by the scientific community.
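
    The paper's workflow is built in KNIME; as a hedged Python analogue of one curation step (structure validation and canonicalization), an RDKit sketch might look like the following, with the example records invented.

        # Validate SMILES and canonicalize structures for deduplication/matching.
        from rdkit import Chem

        records = [("aspirin", "CC(=O)Oc1ccccc1C(=O)O"),
                   ("broken", "C1CC")]                # unclosed ring: invalid SMILES

        for name, smiles in records:
            mol = Chem.MolFromSmiles(smiles)          # returns None on parse failure
            if mol is None:
                print(f"{name}: structure failed validation")
            else:
                canonical = Chem.MolToSmiles(mol)     # canonical form for matching
                print(f"{name}: {canonical}")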

  5. EEG datasets for motor imagery brain-computer interface.

    PubMed

    Cho, Hohyun; Ahn, Minkyu; Ahn, Sangtae; Kwon, Moonyoung; Jun, Sung Chan

    2017-07-01

    Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not by evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in that the user can generate induced activity by imagining motor movements. However, variations in performance across sessions and subjects are too severe to overcome easily; therefore, a basic understanding and investigation of BCI performance variation is necessary. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. We validated our EEG datasets using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we observed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of the datasets (38 subjects) included reasonably discriminative information. Our EEG datasets include the information necessary to determine statistical significance; they consist of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also support subject-to-subject transfer by using metadata, including the questionnaire, EEG coordinates, and EEGs for non-task-related states. © The Authors 2017. Published by Oxford University Press.
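
    The ERD/ERS validation mentioned above can be sketched as a relative band-power change against a pre-cue baseline. The sampling rate, epoch layout and mu-band limits below are assumptions for illustration, not the dataset's exact parameters.

        # Relative mu-band power change (ERD/ERS) for one synthetic epoch.
        import numpy as np
        from scipy.signal import butter, filtfilt

        fs = 512                                    # sampling rate (assumed)
        epoch = np.random.randn(64, 7 * fs)         # channels x samples, one trial

        b, a = butter(4, [8, 13], btype="bandpass", fs=fs)   # mu band (assumed)
        power = filtfilt(b, a, epoch, axis=1) ** 2

        baseline = power[:, :2 * fs].mean(axis=1, keepdims=True)  # first 2 s pre-cue
        erd = (power - baseline) / baseline * 100   # % change; negative values = ERD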

  6. A high-resolution European dataset for hydrologic modeling

    NASA Astrophysics Data System (ADS)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large-scale hydrological models, not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large-scale models need to be calibrated and verified against large amounts of observations in order to judge their capability to predict the future. However, the creation of large-scale datasets is challenging, for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan-European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim of driving a large-scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of these provides similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains radiation, calculated using a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, as well as evapotranspiration (potential evapotranspiration, bare-soil and open-water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated throughout the last years. The dataset variables are used as

  7. ASSISTments Dataset from Multiple Randomized Controlled Experiments

    ERIC Educational Resources Information Center

    Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil

    2016-01-01

    In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…

  8. Open source platform for collaborative construction of wearable sensor datasets for human motion analysis and an application for gait analysis.

    PubMed

    Llamas, César; González, Manuel A; Hernández, Carmen; Vegas, Jesús

    2016-10-01

    Nearly every practical improvement in modeling human motion is well founded in a properly designed collection of data or datasets. These datasets must be made publicly available so that the community can validate and accept them. It is reasonable to concede that a collective, guided enterprise could devise solid and substantial datasets as a result of a collaborative effort, in the same sense as the open-source software community does. In this way datasets could be complemented, extended and expanded in size with, for example, more individuals, samples and human actions. For this to be possible, some commitments must be made by the collaborators, one of which is sharing the same data acquisition platform. In this paper, we offer an affordable open-source hardware and software platform based on inertial wearable sensors, designed so that several groups can cooperate in the construction of datasets through common software suitable for collaboration. Some experimental results on the throughput of the overall system are reported, showing the feasibility of acquiring data from up to 6 sensors with a sampling frequency of no less than 118 Hz. Also, a proof-of-concept dataset is provided comprising sampled data from 12 subjects suitable for gait analysis. Copyright © 2016 Elsevier Inc. All rights reserved.

  9. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets.

    PubMed

    McKinney, Bill; Meyer, Peter A; Crosas, Mercè; Sliz, Piotr

    2017-01-01

    Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension: functionality supporting preservation of file system structure within Dataverse, which is essential for both in-place computation and supporting non-HTTP data transfers. © 2016 New York Academy of Sciences.

  10. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets

    PubMed Central

    McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr

    2016-01-01

    Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of filesystem structure within Dataverse—which is essential for both in-place computation and supporting non-HTTP data transfers. PMID:27862010

  11. Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    PubMed

    Brown, Adrian P; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Boyd, James H

    2017-07-10

    Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages, and linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm that allows linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique to real-world data. Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher
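
    A minimal sketch of the Bloom-filter field encoding that underlies such privacy-preserved comparisons, using bigram hashing and a Dice coefficient. The filter size and hash count are illustrative choices, and the EM-based parameter estimation itself is not shown.

        # Encode a field value into a Bloom filter and compare filters.
        import hashlib

        SIZE, HASHES = 1000, 2                     # illustrative parameters

        def bloom(value):
            """Hash character bigrams into a set of bit positions."""
            bits = set()
            for i in range(len(value) - 1):
                gram = value[i:i + 2]
                for seed in range(HASHES):
                    digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
                    bits.add(int(digest, 16) % SIZE)
            return bits

        def dice(a, b):
            # Dice coefficient on set bits approximates string similarity.
            return 2 * len(a & b) / (len(a) + len(b))

        print(dice(bloom("catherine"), bloom("katherine")))  # high similarity
        print(dice(bloom("catherine"), bloom("jones")))      # low similarity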

  12. Viking Seismometer PDS Archive Dataset

    NASA Astrophysics Data System (ADS)

    Lorenz, R. D.

    2016-12-01

    The Viking Lander 2 seismometer operated successfully for over 500 Sols on the Martian surface, recording at least one likely candidate Marsquake. The Viking mission, flown in an era when data-handling hardware (both on board and on the ground) was limited in capability, predated modern planetary data archiving; the ad hoc repositories of the data, and the very low-level record at NSSDC, were neither convenient to process nor well known. In an effort supported by the NASA Mars Data Analysis Program, we have converted the bulk of the Viking dataset (namely the 49,000 and 270,000 records made in High and Event modes at 20 and 1 Hz, respectively) into a simple ASCII table format. Additionally, since wind-generated lander motion is a major component of the signal, contemporaneous meteorological data are included in summary records to facilitate correlation. These datasets are being archived at the PDS Geosciences Node. In addition to brief instrument and dataset descriptions, the archive includes code snippets in the freely available language 'R' to demonstrate plotting and analysis. Further, we present examples of lander-generated noise associated with the sampler arm, instrument dumps and other mechanical operations.

  13. National Elevation Dataset

    USGS Publications Warehouse

    ,

    1999-01-01

    The National Elevation Dataset (NED) is a new raster product assembled by the U.S. Geological Survey (USGS). The NED is designed to provide national elevation data in a seamless form with a consistent datum, elevation unit, and projection. Data corrections were made in the NED assembly process to minimize artifacts, permit edge matching, and fill sliver areas of missing data.

  14. A Benchmark Dataset for SSVEP-Based Brain-Computer Interfaces.

    PubMed

    Wang, Yijun; Chen, Xiaogang; Gao, Xiaorong; Gao, Shangkai

    2017-10-01

    This paper presents a benchmark steady-state visual evoked potential (SSVEP) dataset acquired with a 40-target brain-computer interface (BCI) speller. The dataset consists of 64-channel electroencephalogram (EEG) data from 35 healthy subjects (8 experienced and 27 naïve) recorded while they performed a cue-guided target selection task. The virtual keyboard of the speller was composed of 40 visual flickers, which were coded using a joint frequency and phase modulation (JFPM) approach. The stimulation frequencies ranged from 8 Hz to 15.8 Hz with an interval of 0.2 Hz. The phase difference between two adjacent frequencies was 0.5 π. For each subject, the data include six blocks of 40 trials corresponding to all 40 flickers, indicated by a visual cue in a random order. The stimulation duration in each trial was five seconds. The dataset can be used as a benchmark to compare methods for stimulus coding and target identification in SSVEP-based BCIs. Through offline simulation, the dataset can be used to design new system diagrams and evaluate their BCI performance without collecting any new data. The dataset also provides high-quality data for computational modeling of SSVEPs. The dataset is freely available from http://bci.med.tsinghua.edu.cn/download.html.
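
    The JFPM coding described above can be reproduced in a few lines. The sketch below generates the 40-target frequency/phase grid and a sinusoidal flicker waveform; the 60 Hz monitor refresh rate is an assumption for illustration.

        # 40-target JFPM stimulus grid: 8-15.8 Hz in 0.2 Hz steps, 0.5*pi phase steps.
        import numpy as np

        freqs = 8.0 + 0.2 * np.arange(40)               # 8.0, 8.2, ..., 15.8 Hz
        phases = (0.5 * np.pi * np.arange(40)) % (2 * np.pi)

        def flicker(target, t):
            """Luminance of one flicker at time t (s), sinusoidal JFPM-style."""
            return 0.5 * (1 + np.sin(2 * np.pi * freqs[target] * t + phases[target]))

        t = np.arange(0, 5, 1 / 60)                     # 5 s trial at 60 Hz refresh
        signal = flicker(0, t)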

  15. Dataset-Driven Research to Support Learning and Knowledge Analytics

    ERIC Educational Resources Information Center

    Verbert, Katrien; Manouselis, Nikos; Drachsler, Hendrik; Duval, Erik

    2012-01-01

    In various research areas, the availability of open datasets is considered as key for research and application purposes. These datasets are used as benchmarks to develop new algorithms and to compare them to other algorithms in given settings. Finding such available datasets for experimentation can be a challenging task in technology enhanced…

  16. Method of generating features optimal to a dataset and classifier

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bruillard, Paul J.; Gosink, Luke J.; Jarman, Kenneth D.

    A method of generating features optimal to a particular dataset and classifier is disclosed. A dataset of messages is inputted and a classifier is selected. An algebra of features is encoded. Computable features that are capable of describing the dataset from the algebra of features are selected. Irredundant features that are optimal for the classifier and the dataset are selected.

  17. Optimized Varian aSi portal dosimetry: development of datasets for collective use.

    PubMed

    Van Esch, Ann; Huyskens, Dominique P; Hirschi, Lukas; Baltes, Christof

    2013-11-04

    Although much literature has been devoted to portal dosimetry with the Varian amorphous silicon (aSi) portal imager, the majority of the described methods are not routinely adopted because implementation procedures are cumbersome and not within easy reach of most radiotherapy centers. To make improved portal dosimetry solutions more generally available, we have investigated the possibility of converting optimized configurations into ready-to-use standardized datasets. Firstly, for all commonly used photon energies (6, 10, 15, 18, and 20 MV), basic beam data acquired on 20 aSi panels were used to assess the interpanel reproducibility. Secondly, a standardized portal dose image prediction (PDIP) algorithm configuration was created for every energy, using a three-step process to optimize the aSi dose-response function and profile correction files for the dosimetric calibration of the imager panel. An approximate correction for the backscatter of the Exact arm was also incorporated. Thirdly, a set of validation fields was assembled to assess the accuracy of the standardized configuration. Variations in the basic beam data measured on different aSi panels very rarely exceeded 2% (2 mm) and are of the same order of magnitude as variations between different Clinacs when measuring in reference conditions in water. All studied aSi panels can hence be regarded as nearly identical. Standardized datasets were successfully created and implemented. The test package proved useful in highlighting possible problems and illustrating remaining limitations, but also in demonstrating the good overall results (95% pass rate for 3%, 3 mm) that can be obtained. The dosimetric behavior of all tested aSi panels was found to be nearly identical for all tested energies. The approach of using standardized datasets was then successfully tested through the creation and evaluation of preconfigured PDIP datasets that can be used within the Varian portal dosimetry solution.

  18. Developing a regional retrospective ensemble precipitation dataset for watershed hydrology modeling, Idaho, USA

    NASA Astrophysics Data System (ADS)

    Flores, A. N.; Smith, K.; LaPorte, P.

    2011-12-01

    Applications like flood forecasting, military trafficability assessment, and slope stability analysis necessitate the use of models capable of resolving hydrologic states and fluxes at the spatial scales of hillslopes (e.g., 10s to 100s of m). These models typically require precipitation forcings at spatial scales of kilometers or better and time intervals of hours. Yet in the especially rugged terrain that typifies much of the western US, and throughout much of the developing world, precipitation data at these spatiotemporal resolutions are difficult to come by. Ground-based weather radars have significant problems in high-relief settings and are sparsely located, leaving significant gaps in coverage and high uncertainties. Precipitation gages provide accurate data at points but are very sparsely located, and their placement is often not representative, yielding significant coverage gaps in a spatial and physiographic sense. Numerical weather prediction efforts have made precipitation data, including critically important information on precipitation phase, available globally and in near real time. However, these datasets present watershed modelers with two problems: (1) the spatial scales of many of these datasets are tens of kilometers or coarser, and (2) the numerical weather models used to generate these datasets include a land surface parameterization that in some circumstances can significantly affect precipitation predictions. We report on the development of a regional precipitation dataset for Idaho that leverages: (1) a dataset derived from a numerical weather prediction model, (2) gages within Idaho that report hourly precipitation data, and (3) a long-term precipitation climatology dataset. Hourly precipitation estimates from the Modern Era Retrospective-analysis for Research and Applications (MERRA) are stochastically downscaled using a hybrid orographic and statistical model from their native resolution (1/2 x 2/3 degrees) to a resolution of approximately 1 km. Downscaled

  19. Querying Patterns in High-Dimensional Heterogenous Datasets

    ERIC Educational Resources Information Center

    Singh, Vishwakarma

    2012-01-01

    The recent technological advancements have led to the availability of a plethora of heterogeneous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…

  20. A Compilation of Spatial Datasets to Support a Preliminary Assessment of Pesticides and Pesticide Use on Tribal Lands in Oklahoma

    USGS Publications Warehouse

    Mashburn, Shana L.; Winton, Kimberly T.

    2010-01-01

    This CD-ROM contains spatial datasets that describe natural and anthropogenic features, county-level estimates of agricultural pesticide use, and pesticide data for surface-water, groundwater, and biological specimens in the state of Oklahoma. County-level estimates of pesticide use were compiled from the Pesticide National Synthesis Project of the U.S. Geological Survey National Water-Quality Assessment Program. Pesticide data for surface water, groundwater, and biological specimens were compiled from the U.S. Geological Survey National Water Information System database. The spatial datasets describing natural and manmade features were compiled from several agencies and contain information collected by the U.S. Geological Survey. The U.S. Geological Survey datasets were not collected specifically for this compilation, but for earlier projects with various objectives. The spatial datasets were created by different agencies from sources of varied quality; as a result, features common to multiple layers may not overlay exactly. Users should check the metadata to determine proper use of these spatial datasets. These data were not checked for accuracy or completeness. If a question of accuracy or completeness arises, the user should contact the originator cited in the metadata.

  1. A hybrid organic-inorganic perovskite dataset

    NASA Astrophysics Data System (ADS)

    Kim, Chiho; Huan, Tran Doan; Krishnan, Sridevi; Ramprasad, Rampi

    2017-05-01

    Hybrid organic-inorganic perovskites (HOIPs) have been attracting a great deal of attention due to their versatility of electronic properties and fabrication methods. We prepare a dataset of 1,346 HOIPs, which features 16 organic cations, 3 group-IV cations and 4 halide anions. Using a combination of an atomic structure search method and density functional theory calculations, the optimized structures, the bandgap, the dielectric constant, and the relative energies of the HOIPs are uniformly prepared and validated by comparing with relevant experimental and/or theoretical data. We make the dataset available at Dryad Digital Repository, NoMaD Repository, and Khazana Repository (http://khazana.uconn.edu/), hoping that it could be useful for future data-mining efforts that can explore possible structure-property relationships and phenomenological models. Progressive extension of the dataset is expected as new organic cations become appropriate within the HOIP framework, and as additional properties are calculated for the new compounds found.

  2. The LANDFIRE Refresh strategy: updating the national dataset

    USGS Publications Warehouse

    Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley

    2013-01-01

    The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote-sensing-based disturbance detection methods, field-collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.

  3. Enabling Data Fusion via a Common Data Model and Programming Interface

    NASA Astrophysics Data System (ADS)

    Lindholm, D. M.; Wilson, A.

    2011-12-01

    Much progress has been made in scientific data interoperability, especially in the areas of metadata and discovery. However, while a data user may have improved techniques for finding data, there is often a large chasm to span when it comes to acquiring the desired subsets of various datasets and integrating them into a data processing environment. Some tools such as OPeNDAP servers and the Unidata Common Data Model (CDM) have introduced improved abstractions for accessing data via a common interface, but they alone do not go far enough to enable fusion of data from multidisciplinary sources. Although data from various scientific disciplines may represent semantically similar concepts (e.g. time series), the user may face widely varying structural representations of the data (e.g. row versus column oriented), not to mention radically different storage formats. It is not enough to convert data to a common format. The key to fusing scientific data is to represent each dataset with consistent sampling. This can best be done by using a data model that expresses the functional relationship that each dataset represents. The domain of those functions determines how the data can be combined. The Visualization for Algorithm Development (VisAD) Java API has provided a sophisticated data model for representing the functional nature of scientific datasets for well over a decade. Because VisAD is largely designed for its visualization capabilities, the data model can be cumbersome to use for numerical computation, especially for those not comfortable with Java. Although both VisAD and the implementation of the CDM are written in Java, neither defines a pure Java interface that others could implement and program to, further limiting potential for interoperability. In this talk, we will present a solution for data integration based on a simple discipline-agnostic scientific data model and programming interface that enables a dataset to be defined in terms of three variable types
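
    The functional view described here is easy to make concrete: treat each dataset as a function from a domain (time, say) to its range, and fuse by evaluating every dataset over one common domain. The sketch below is a minimal illustration of that idea; the class and names are hypothetical and are not the VisAD or CDM APIs.

      from dataclasses import dataclass
      import numpy as np

      @dataclass
      class FunctionalDataset:
          """A dataset modeled as a function: domain (sample times) -> range."""
          domain: np.ndarray
          values: np.ndarray

          def __call__(self, t):
              # Evaluate at arbitrary domain points by interpolation.
              return np.interp(t, self.domain, self.values)

      def fuse(datasets, t):
          """Resample every dataset onto a common domain -- the step that makes
          structurally different sources directly comparable."""
          return np.column_stack([ds(t) for ds in datasets])

      # Two sources with different sampling become one consistently sampled table.
      a = FunctionalDataset(np.array([0., 10., 20.]), np.array([1., 2., 3.]))
      b = FunctionalDataset(np.array([0., 5., 15., 20.]), np.array([10., 11., 13., 14.]))
      print(fuse([a, b], np.linspace(0., 20., 5)))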

  4. Comparing species tree estimation with large anchored phylogenomic and small Sanger-sequenced molecular datasets: an empirical study on Malagasy pseudoxyrhophiine snakes.

    PubMed

    Ruane, Sara; Raxworthy, Christopher J; Lemmon, Alan R; Lemmon, Emily Moriarty; Burbrink, Frank T

    2015-10-12

    Using molecular data generated by high throughput next generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86% of Madagascar's serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species
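
    The topology comparison step mentioned above is mechanical once trees are in hand. Assuming the DendroPy library and toy Newick strings (not the study's data), the unweighted Robinson-Foulds distance can be computed as follows.

      import dendropy
      from dendropy.calculate import treecompare

      # Both trees must share one TaxonNamespace for a valid comparison.
      tns = dendropy.TaxonNamespace()
      sanger_tree = dendropy.Tree.get(
          data="((A,B),(C,(D,E)));", schema="newick", taxon_namespace=tns)
      anchored_tree = dendropy.Tree.get(
          data="((A,(B,C)),(D,E));", schema="newick", taxon_namespace=tns)

      # Robinson-Foulds (symmetric difference): the number of bipartitions
      # found in one tree but not the other.
      print("RF distance:", treecompare.symmetric_difference(sanger_tree, anchored_tree))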

  5. Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants.

    PubMed

    Smith, Stephen A; Moore, Michael J; Brown, Joseph W; Yang, Ya

    2015-08-05

    The use of transcriptomic and genomic datasets for phylogenetic reconstruction has become increasingly common as researchers attempt to resolve recalcitrant nodes with increasing amounts of data. The large size and complexity of these datasets introduce significant phylogenetic noise and conflict into subsequent analyses. The sources of conflict may include hybridization, incomplete lineage sorting, or horizontal gene transfer, and may vary across the phylogeny. For phylogenetic analysis, this noise and conflict has been accommodated in one of several ways: by binning gene regions into subsets to isolate consistent phylogenetic signal; by using gene-tree methods for reconstruction, where conflict is presumed to be explained by incomplete lineage sorting (ILS); or through concatenation, where noise is presumed to be the dominant source of conflict. The results provided herein emphasize that analysis of individual homologous gene regions can greatly improve our understanding of the underlying conflict within these datasets. Here we examined two published transcriptomic datasets, the angiosperm group Caryophyllales and the aculeate Hymenoptera, for the presence of conflict, concordance, and gene duplications in individual homologs across the phylogeny. We found significant conflict throughout the phylogeny in both datasets and in particular along the backbone. While some nodes in each phylogeny showed patterns of conflict similar to what might be expected with ILS alone, the backbone nodes also exhibited low levels of phylogenetic signal. In addition, certain nodes, especially in the Caryophyllales, had highly elevated levels of strongly supported conflict that cannot be explained by ILS alone. This study demonstrates that phylogenetic signal is highly variable in phylogenomic data sampled across related species and poses challenges when conducting species tree analyses on large genomic and transcriptomic datasets. Further insight into the conflict and processes

  6. Omicseq: a web-based search engine for exploring omics datasets

    PubMed Central

    Sun, Xiaobo; Pittard, William S.; Xu, Tianlei; Chen, Li; Zwick, Michael E.; Jiang, Xiaoqian; Wang, Fusheng

    2017-01-01

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve ‘findability’ of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. PMID:28402462

  7. Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications

    NASA Astrophysics Data System (ADS)

    Maskey, M.; Ramachandran, R.; Miller, J.

    2017-12-01

    Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as the ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep learning based applications.

  8. Developing a Resource for Implementing ArcSWAT Using Global Datasets

    NASA Astrophysics Data System (ADS)

    Taggart, M.; Caraballo Álvarez, I. O.; Mueller, C.; Palacios, S. L.; Schmidt, C.; Milesi, C.; Palmer-Moloney, L. J.

    2015-12-01

    This project developed a comprehensive user manual outlining methods for adapting and implementing global datasets for use within ArcSWAT for international and worldwide applications. The Soil and Water Assessment Tool (SWAT) is a hydrologic model that looks at a number of hydrologic variables, including runoff and the chemical makeup of water at a given location on the Earth's surface, using Digital Elevation Models (DEM), land cover, soil, and weather data. However, the application of ArcSWAT for projects outside of the United States is challenging, as there is no standard framework for inputting global datasets into ArcSWAT. This project aims to remove this obstacle by outlining methods for adapting and implementing these global datasets via the user manual. The manual takes the user through the processes of data conditioning while providing solutions and suggestions for common errors. The efficacy of the manual was explored using examples from watersheds located in Puerto Rico, Mexico, and Western Africa. Each run explored the various options for setting up an ArcSWAT project as well as a range of satellite data products and soil databases. Future work will incorporate in-situ data for validation and calibration of the model and outline additional resources to assist future users in efficiently implementing the model for worldwide applications. The capacity to manage and monitor freshwater availability is of critical importance in both developed and developing countries. As populations grow and climate changes, both the quality and quantity of freshwater are affected, resulting in negative impacts on the health of the surrounding population. The use of hydrologic models such as ArcSWAT can help stakeholders and decision makers understand the future impacts of these changes, enabling informed and substantiated decisions.

  9. Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets

    NASA Astrophysics Data System (ADS)

    Day-Lewis, F. D.; Slater, L. D.; Johnson, T.

    2012-12-01

    Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
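
    As a small, generic example of the first two approaches (not code from the cited work), the function below screens a geophysical time series against a hydrologic signal using normalized cross-correlation over a range of lags; the peak lag suggests a travel time between forcing and response. The synthetic series are assumptions for illustration.

      import numpy as np

      def lagged_correlation(geophys, hydro, max_lag):
          """Normalized cross-correlation at lags -max_lag..max_lag."""
          g = (geophys - geophys.mean()) / geophys.std()
          h = (hydro - hydro.mean()) / hydro.std()
          n = len(g)
          lags = np.arange(-max_lag, max_lag + 1)
          corr = []
          for lag in lags:
              if lag < 0:
                  c = np.dot(g[:lag], h[-lag:]) / (n + lag)
              elif lag > 0:
                  c = np.dot(g[lag:], h[:-lag]) / (n - lag)
              else:
                  c = np.dot(g, h) / n
              corr.append(c)
          return lags, np.array(corr)

      t = np.arange(500)
      stage = np.sin(2 * np.pi * t / 50)                      # hydrologic forcing
      rng = np.random.default_rng(1)
      conductivity = np.roll(stage, 7) + 0.2 * rng.standard_normal(500)
      lags, corr = lagged_correlation(conductivity, stage, 20)
      print("best lag:", lags[np.argmax(corr)])               # ~7 samples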

  10. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    NASA Astrophysics Data System (ADS)

    Lary, D. J.

    2013-12-01

    A BigData case study is described where multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster using an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data and machine learning. To greatly reduce development time and enhance functionality, a high-level language capable of parallel processing (Matlab) has been used. Key considerations for the system are high-speed access due to the large data volume, persistence of the large data volumes, and a precise process-time scheduling capability.

  11. BigNeuron dataset V.0.0

    DOE Data Explorer

    Ramanathan, Arvind

    2016-01-01

    The cleaned bench-testing reconstructions for the gold166 datasets have been put online at GitHub: https://github.com/BigNeuron/Events-and-News/wiki/BigNeuron-Events-and-News and https://github.com/BigNeuron/Data/releases/tag/gold166_bt_v1.0. The respective image datasets were released earlier from other sites (the main pointer is also available at GitHub: https://github.com/BigNeuron/Data/releases/tag/Gold166_v1), but because the files were large, the actual downloading was distributed across three continents.

  12. Validating Variational Bayes Linear Regression Method With Multi-Central Datasets.

    PubMed

    Murata, Hiroshi; Zangwill, Linda M; Fujino, Yuri; Matsuura, Masato; Miki, Atsuya; Hirasawa, Kazunori; Tanito, Masaki; Mizoue, Shiro; Mori, Kazuhiko; Suzuki, Katsuyoshi; Yamashita, Takehiro; Kashiwagi, Kenji; Shoji, Nobuyuki; Asaoka, Ryo

    2018-04-01

    To validate the prediction accuracy of variational Bayes linear regression (VBLR) with two datasets external to the training dataset. The training dataset consisted of 7268 eyes of 4278 subjects from the University of Tokyo Hospital. The Japanese Archive of Multicentral Databases in Glaucoma (JAMDIG) dataset consisted of 271 eyes of 177 patients, and the Diagnostic Innovations in Glaucoma Study (DIGS) dataset consisted of 248 eyes of 173 patients; both were used for validation. Prediction accuracy was compared between VBLR and ordinary least squares linear regression (OLSLR). First, OLSLR and VBLR were carried out using total deviation (TD) values at each of the 52 test points, from the second through fourth visual fields (VF2-4) up to the second through tenth visual fields (VF2-10), of each patient in the JAMDIG and DIGS datasets, and the TD values of the 11th VF test were predicted each time. The predictive accuracy of each method was compared through the root mean squared error (RMSE) statistic. OLSLR RMSEs with the JAMDIG and DIGS datasets were between 4.3 and 31 dB, and between 3.9 and 19.5 dB, respectively. On the other hand, VBLR RMSEs with the JAMDIG and DIGS datasets were between 3.7 and 5.0 dB, and between 3.6 and 4.6 dB. There was a statistically significant difference between VBLR and OLSLR for both datasets at every series (VF2-4 to VF2-10) (P < 0.01 for all tests). However, there was no statistically significant difference in VBLR RMSEs between the JAMDIG and DIGS datasets at any series of VFs (VF2-4 to VF2-10) (P > 0.05). VBLR outperformed OLSLR in predicting future VF progression, and has the potential to be a helpful tool in clinical settings.
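
    A minimal way to reproduce the flavor of this comparison (not the authors' VBLR implementation) is to fit, per eye, an ordinary least-squares line and a Bayesian shrinkage regression to a few noisy visits and compare extrapolation RMSEs; scikit-learn's BayesianRidge is used below as an accessible stand-in for a variational Bayes linear model, on synthetic data.

      import numpy as np
      from sklearn.linear_model import LinearRegression, BayesianRidge

      rng = np.random.default_rng(0)
      n_eyes, n_visits = 200, 4
      t = np.arange(n_visits, dtype=float).reshape(-1, 1)   # visit index
      slopes = rng.normal(-0.5, 0.3, n_eyes)                # per-eye progression
      y = slopes[:, None] * t.T + rng.normal(0., 2., (n_eyes, n_visits))
      t_future = np.array([[10.]])                          # extrapolation target
      y_future = slopes * 10.                               # noise-free truth

      def rmse(make_model):
          errs = [make_model().fit(t, y[i]).predict(t_future)[0] - y_future[i]
                  for i in range(n_eyes)]
          return float(np.sqrt(np.mean(np.square(errs))))

      # Shrinkage stabilizes extrapolation from short series, mirroring the
      # VBLR-vs-OLSLR pattern reported above.
      print("OLS  RMSE:", rmse(LinearRegression))
      print("Bayes RMSE:", rmse(BayesianRidge))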

  13. Identification of Common Differentially Expressed Genes in Urinary Bladder Cancer

    PubMed Central

    Zaravinos, Apostolos; Lambrou, George I.; Boulalas, Ioannis; Delakas, Dimitris; Spandidos, Demetrios A.

    2011-01-01

    Background Current diagnosis and treatment of urinary bladder cancer (BC) have shown great progress with the utilization of microarrays. Purpose Our goal was to identify common differentially expressed (DE) genes among clinically relevant subclasses of BC using microarrays. Methodology/Principal Findings BC samples and controls, both experimental and publicly available datasets, were analyzed by whole genome microarrays. We grouped the samples according to their histology and defined the DE genes in each sample individually, as well as in each tumor group. A dual analysis strategy was followed. First, experimental samples were analyzed and conclusions were formulated; and second, experimental sets were combined with publicly available microarray datasets and were further analyzed in search of common DE genes. The experimental dataset identified 831 genes that were DE in all tumor samples, simultaneously. Moreover, 33 genes were up-regulated and 85 genes were down-regulated in all 10 BC samples compared to the 5 normal tissues, simultaneously. Hierarchical clustering partitioned tumor groups in accordance with their histology. K-means clustering of all genes and all samples, as well as clustering of tumor groups, presented 49 clusters. K-means clustering of common DE genes in all samples revealed 24 clusters. Genes manifested various differential patterns of expression, based on PCA. YY1 and NFκB were among the most common transcription factors that regulated the expression of the identified DE genes. Chromosome 1 contained 32 DE genes, followed by chromosomes 2 and 11, which contained 25 and 23 DE genes, respectively. Chromosome 21 had the fewest DE genes. GO analysis revealed the prevalence of transport and binding genes in the common down-regulated DE genes; the prevalence of RNA metabolism and processing genes in the up-regulated DE genes; as well as the prevalence of genes responsible for cell communication and signal transduction in the DE genes that

  14. National Elevation Dataset

    USGS Publications Warehouse

    ,

    2002-01-01

    The National Elevation Dataset (NED) is a new raster product assembled by the U.S. Geological Survey. NED is designed to provide National elevation data in a seamless form with a consistent datum, elevation unit, and projection. Data corrections were made in the NED assembly process to minimize artifacts, perform edge matching, and fill sliver areas of missing data. NED has a resolution of one arc-second (approximately 30 meters) for the conterminous United States, Hawaii, Puerto Rico and the island territories and a resolution of two arc-seconds for Alaska. NED data sources have a variety of elevation units, horizontal datums, and map projections. In the NED assembly process the elevation values are converted to decimal meters as a consistent unit of measure, NAD83 is consistently used as the horizontal datum, and all the data are recast in a geographic projection. Older DEM's produced by methods that are now obsolete have been filtered during the NED assembly process to minimize artifacts that are commonly found in data produced by these methods. Artifact removal greatly improves the quality of the slope, shaded-relief, and synthetic drainage information that can be derived from the elevation data. Figure 2 illustrates the results of this artifact removal filtering. NED processing also includes steps to adjust values where adjacent DEM's do not match well, and to fill sliver areas of missing data between DEM's. These processing steps ensure that NED has no void areas and artificial discontinuities have been minimized. The artifact removal filtering process does not eliminate all of the artifacts. In areas where the only available DEM was produced by older methods, "striping" may still occur.

  15. Squish: Near-Optimal Compression for Archival of Relational Datasets

    PubMed Central

    Gao, Yihan; Parameswaran, Aditya

    2017-01-01

    Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: Squish achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets. PMID:28180028
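
    The core intuition is that a conditional model lowers the entropy an arithmetic coder must pay per value. A tiny sketch (illustrative; not the Squish implementation) with a one-parent dependency, as in a simple Bayesian network:

      import math
      from collections import Counter

      def entropy(values):
          """Empirical Shannon entropy in bits per value."""
          n = len(values)
          return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

      def conditional_entropy(values, parents):
          """H(X | parent): entropy of X within each parent group, weighted."""
          n = len(values)
          groups = {}
          for v, p in zip(values, parents):
              groups.setdefault(p, []).append(v)
          return sum(len(g) / n * entropy(g) for g in groups.values())

      # Toy relational columns: 'city' nearly determines 'state'.
      city  = ["NYC", "NYC", "LA", "LA", "SF", "SF", "NYC", "LA"]
      state = ["NY",  "NY",  "CA", "CA", "CA", "CA", "NY",  "CA"]
      print("H(state)      =", round(entropy(state), 3), "bits/row")
      print("H(state|city) =", round(conditional_entropy(state, city), 3), "bits/row")
      # An arithmetic coder driven by the conditional model approaches the
      # lower figure -- the source of the gains over byte-level gzip.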

  16. Omicseq: a web-based search engine for exploring omics datasets.

    PubMed

    Sun, Xiaobo; Pittard, William S; Xu, Tianlei; Chen, Li; Zwick, Michael E; Jiang, Xiaoqian; Wang, Fusheng; Qin, Zhaohui S

    2017-07-03

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic, NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  17. ISRUC-Sleep: A comprehensive public dataset for sleep researchers.

    PubMed

    Khalighi, Sirvan; Sousa, Teresa; Santos, José Moutinho; Nunes, Urbano

    2016-02-01

    To facilitate the performance comparison of new methods for sleep pattern analysis, publicly available datasets with quality content are very important and useful. We introduce an open-access comprehensive sleep dataset, called ISRUC-Sleep. The data were obtained from human adults, including healthy subjects, subjects with sleep disorders, and subjects under the effect of sleep medication. Each recording was randomly selected from among PSG recordings that were acquired by the Sleep Medicine Centre of the Hospital of Coimbra University (CHUC). The dataset comprises three groups of data: (1) data concerning 100 subjects, with one recording session per subject; (2) data gathered from 8 subjects, with two recording sessions per subject; and (3) data collected from one recording session related to 10 healthy subjects. The polysomnography (PSG) recordings associated with each subject were visually scored by two human experts. Compared with existing sleep-related public datasets, ISRUC-Sleep provides data from a reasonable number of subjects with different characteristics, such as: data useful for studies involving changes in the PSG signals over time; and data of healthy subjects useful for studies comparing healthy subjects with patients suffering from sleep disorders. This dataset was created to complement existing datasets by providing easy-to-apply data collection with some characteristics not yet covered. ISRUC-Sleep can be useful for analysis of new contributions: (i) in biomedical signal processing; (ii) in development of ASSC methods; and (iii) on sleep physiology studies. To evaluate and compare new contributions that use this dataset as a benchmark, results of applying a subject-independent automatic sleep stage classification (ASSC) method to the ISRUC-Sleep dataset are presented. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  18. Quantifying uncertainty in observational rainfall datasets

    NASA Astrophysics Data System (ADS)

    Lennard, Chris; Dosio, Alessandro; Nikulin, Grigory; Pinto, Izidine; Seid, Hussen

    2015-04-01

    The CO-ordinated Regional Downscaling Experiment (CORDEX) has to date seen the publication of at least ten journal papers that examine the African domain during 2012 and 2013. Five of these papers consider Africa generally (Nikulin et al. 2012, Kim et al. 2013, Hernandes-Dias et al. 2013, Laprise et al. 2013, Panitz et al. 2013) and five have regional foci: Tramblay et al. (2013) on Northern Africa, Mariotti et al. (2014) and Gbobaniyi et al. (2013) on West Africa, Endris et al. (2013) on East Africa and Kalagnoumou et al. (2013) on southern Africa. A further three papers known to the authors are under review. These papers all use observed rainfall and/or temperature data to evaluate/validate the regional model output and often proceed to assess projected changes in these variables due to climate change in the context of these observations. The most popular reference rainfall data used are the CRU, GPCP, GPCC, TRMM and UDEL datasets. However, as Kalagnoumou et al. (2013) point out, there are many other rainfall datasets available for consideration, for example, CMORPH, FEWS, TAMSAT & RIANNAA, TAMORA and the WATCH & WATCH-DEI data. They, with others (Nikulin et al. 2012, Sylla et al. 2012), show that the observed datasets can have a very wide spread at a particular space-time coordinate. As more ground-, space- and reanalysis-based rainfall products become available, all of which use different methods to produce precipitation data, the selection of reference data is becoming an important factor in model evaluation. A number of factors can contribute to uncertainty in the reliability and validity of the datasets, such as radiance conversion algorithms, the quantity and quality of available station data, interpolation techniques, and blending methods used to combine satellite- and gauge-based products. However, to date no comprehensive study has been performed to evaluate the uncertainty in these observational datasets. We assess 18 gridded
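
    One simple way to quantify the spread the authors describe is to regrid the products to a common mesh and compute per-pixel statistics across them. The sketch below uses synthetic stand-ins for the regridded products.

      import numpy as np

      rng = np.random.default_rng(2)
      truth = rng.gamma(2., 2., size=(12, 40, 40))            # "true" monthly rain
      # Five hypothetical products (e.g., CRU-, GPCP-, GPCC-, TRMM-, UDEL-like)
      # with multiplicative observational error:
      products = truth[None] * rng.lognormal(0., 0.25, (5, 12, 40, 40))

      mean = products.mean(axis=0)
      spread = products.std(axis=0)
      # Coefficient of variation across products: a per-pixel measure of
      # observational uncertainty at each space-time coordinate.
      cv = np.where(mean > 0, spread / mean, np.nan)
      print("median inter-product CV:", float(np.nanmedian(cv)))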

  19. Collection, processing, and quality assurance of time-series electromagnetic-induction log datasets, 1995–2016, south Florida

    USGS Publications Warehouse

    Prinos, Scott T.; Valderrama, Robert

    2016-12-13

    Time-series electromagnetic-induction log (TSEMIL) datasets are collected from polyvinyl-chloride cased or uncased monitoring wells to evaluate changes in water conductivity over time. TSEMIL datasets consist of a series of individual electromagnetic-induction logs, generally collected at a frequency of once per month or once per year, that have been compiled into a dataset by eliminating small uniform offsets in bulk conductivity between logs, probably caused by minor variations in calibration. These offsets are removed by selecting a depth at which no changes are apparent from year to year, and by adjusting individual logs to the median of all logs at the selected depth. Generally, the selected depths are within the freshwater saturated part of the aquifer, well below the water table. TSEMIL datasets can be used to monitor changes in water conductivity throughout the full thickness of an aquifer, without the need for long open-interval wells which have, in some instances, allowed vertical water flow within the well bore that has biased water conductivity profiles. The TSEMIL dataset compilation process enhances the ability to identify small differences between logs that were otherwise obscured by the offsets. As a result of TSEMIL dataset compilation, the root mean squared error of the linear regression between bulk conductivity of the electromagnetic-induction log measurements and the chloride concentration of water samples decreased from 17.4 to 1.7 millisiemens per meter in well G–3611 and from 3.7 to 2.2 millisiemens per meter in well G–3609. The primary use of the TSEMIL datasets in south Florida is to detect temporal changes in bulk conductivity associated with saltwater intrusion in the aquifer; however, other commonly observed changes include (1) variations in bulk conductivity near the water table where water saturation of pore spaces might vary and water temperature might be more variable, and (2) dissipation of conductive water in high-porosity rock
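
    The offset-removal procedure described above is straightforward to express in code. A minimal sketch, assuming the logs share a common depth grid and using a synthetic example:

      import numpy as np

      def compile_tsemil(logs, ref_depth_idx):
          """Shift each log so its value at a depth where no real change is
          expected matches the median of all logs at that depth.
          logs: array of shape (n_logs, n_depths)."""
          ref = logs[:, ref_depth_idx]            # each log at the stable depth
          offsets = ref - np.median(ref)          # per-log calibration offset
          return logs - offsets[:, None]          # uniform shift per log

      depths = np.linspace(0., 30., 7)
      base = 50. + depths                         # bulk conductivity profile
      logs = np.vstack([base, base + 2.5, base - 1.5])    # constant offsets
      print(compile_tsemil(logs, ref_depth_idx=2)[:, 2])  # reconciled values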

  20. Internal Consistency of the NVAP Water Vapor Dataset

    NASA Technical Reports Server (NTRS)

    Suggs, Ronnie J.; Jedlovec, Gary J.; Arnold, James E. (Technical Monitor)

    2001-01-01

    The NVAP (NASA Water Vapor Project) dataset is a global dataset at 1 x 1 degree spatial resolution consisting of daily, pentad, and monthly atmospheric precipitable water (PW) products. The analysis blends measurements from the Television and Infrared Operational Satellite (TIROS) Operational Vertical Sounder (TOVS), the Special Sensor Microwave/Imager (SSM/I), and radiosonde observations into a daily collage of PW. The original dataset consisted of five years of data from 1988 to 1992. Recent updates have added three additional years (1993-1995) and incorporated procedural and algorithm changes from the original methodology. Because no single PW source (TOVS, SSM/I, or radiosonde) provides global coverage, the sources complement one another by providing spatial coverage over regions and during times where the others are not available. For this type of spatial and temporal blending to be successful, each of the source components should have similar or compatible accuracies. If this is not the case, regional and time-varying biases may be manifested in the NVAP dataset. This study examines the consistency of the NVAP source data by comparing daily collocated TOVS and SSM/I PW retrievals with collocated radiosonde PW observations. The daily PW intercomparisons are performed over the time period of the dataset and for various regions.
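
    A consistency check of this kind reduces to bias and RMS statistics over collocated pairs. A minimal sketch with hypothetical collocated precipitable-water values (mm):

      import numpy as np

      rng = np.random.default_rng(5)
      raob = rng.uniform(5., 60., 300)                    # radiosonde PW ("truth")
      tovs = raob + rng.normal(1.0, 3.0, 300)             # biased, noisier source
      ssmi = raob + rng.normal(-0.3, 2.0, 300)

      for name, sat in [("TOVS", tovs), ("SSM/I", ssmi)]:
          d = sat - raob
          print(f"{name}: bias={d.mean():+.2f} mm, rms={np.sqrt((d**2).mean()):.2f} mm")
      # Differing biases between blended sources would appear as regional or
      # time-varying artifacts in the merged product, the concern raised above.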

  1. Atlantic small-mammal: a dataset of communities of rodents and marsupials of the Atlantic forests of South America.

    PubMed

    Bovendorp, Ricardo S; Villar, Nacho; de Abreu-Junior, Edson F; Bello, Carolina; Regolin, André L; Percequillo, Alexandre R; Galetti, Mauro

    2017-08-01

    The contribution of small mammal ecology to the understanding of macroecological patterns of biodiversity, population dynamics, and community assembly has been hindered by the absence of large datasets of small mammal communities from tropical regions. Here we compile the largest dataset of inventories of small mammal communities for the Neotropical region. The dataset reviews small mammal communities from the Atlantic forest of South America, one of the regions with the highest diversity of small mammals and a global biodiversity hotspot, though currently covering less than 12% of its original area due to anthropogenic pressures. The dataset comprises 136 references from 300 locations covering seven vegetation types of tropical and subtropical Atlantic forests of South America, and presents data on species composition, richness, and relative abundance (captures/trap-nights). One paper was published more than 70 yr ago, but 80% of them were published after 2000. The dataset comprises 53,518 individuals of 124 species of small mammals, including 30 species of marsupials and 94 species of rodents. Species richness averaged 8.2 species (1-21) per site. Only two species occurred in more than 50% of the sites (the common opossum, Didelphis aurita and black-footed pigmy rice rat Oligoryzomys nigripes). Mean species abundance varied 430-fold, from 4.3 to 0.01 individuals/trap-night. The dataset also revealed a hyper-dominance of 22 species that comprised 78.29% of all individuals captured, with only seven species representing 44% of all captures. The information contained on this dataset can be applied in the study of macroecological patterns of biodiversity, communities, and populations, but also to evaluate the ecological consequences of fragmentation and defaunation, and predict disease outbreaks, trophic interactions and community dynamics in this biodiversity hotspot. © 2017 by the Ecological Society of America.

  2. Dataset for Reporting of Malignant Mesothelioma of the Pleura or Peritoneum: Recommendations From the International Collaboration on Cancer Reporting (ICCR).

    PubMed

    Churg, Andrew; Attanoos, Richard; Borczuk, Alain C; Chirieac, Lucian R; Galateau-Sallé, Françoise; Gibbs, Allen; Henderson, Douglas; Roggli, Victor; Rusch, Valerie; Judge, Meagan J; Srigley, John R

    2016-10-01

    The International Collaboration on Cancer Reporting is a not-for-profit organization formed by the Royal Colleges of Pathologists of Australasia and the United Kingdom; the College of American Pathologists; the Canadian Association of Pathologists-Association Canadienne des Pathologists, in association with the Canadian Partnership Against Cancer; and the European Society of Pathology. Its goal is to produce common, internationally agreed upon, evidence-based datasets for use throughout the world. This dataset, developed by the Expert Panel of the International Collaboration on Cancer Reporting for reporting malignant mesothelioma of both the pleura and peritoneum, is composed of "required" (mandatory) and "recommended" (nonmandatory) elements, based on a review of the most recent evidence and supported by explanatory commentary. Eight required elements and 7 recommended elements were agreed upon by the Expert Panel to represent the essential information for reporting malignant mesothelioma of the pleura and peritoneum. In time, the widespread use of an internationally agreed upon, structured pathology dataset for mesothelioma will not only lead to improved patient management but also provide valuable data for research and international benchmarks.

  3. Topic modeling for cluster analysis of large biological and medical datasets

    PubMed Central

    2014-01-01

    Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than
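
    The first of the three strategies, highest probable topic assignment, reduces to fitting a topic model to a count matrix and taking each sample's argmax topic as its cluster. A generic scikit-learn sketch on synthetic counts (not the study's datasets):

      import numpy as np
      from sklearn.decomposition import LatentDirichletAllocation

      rng = np.random.default_rng(3)
      # Two latent groups of 30 samples with different feature-count profiles,
      # standing in for, e.g., PFGE band counts.
      X = np.vstack([rng.poisson(np.array([5.] * 10 + [1.] * 10), (30, 20)),
                     rng.poisson(np.array([1.] * 10 + [5.] * 10), (30, 20))])

      lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
      doc_topic = lda.transform(X)              # per-sample topic proportions
      clusters = doc_topic.argmax(axis=1)       # highest probable topic
      print(np.bincount(clusters))              # two recovered groups of ~30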

  4. Topic modeling for cluster analysis of large biological and medical datasets.

    PubMed

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting

  5. Food Recognition: A New Dataset, Experiments, and Results.

    PubMed

    Ciocca, Gianluigi; Napoletano, Paolo; Schettini, Raimondo

    2017-05-01

    We propose a new dataset for the evaluation of food recognition algorithms that can be used in dietary monitoring applications. Each image depicts a real canteen tray with dishes and foods arranged in different ways. Each tray contains multiple instances of food classes. The dataset contains 1027 canteen trays for a total of 3616 food instances belonging to 73 food classes. The food on the tray images has been manually segmented using carefully drawn polygonal boundaries. We have benchmarked the dataset by designing an automatic tray analysis pipeline that takes a tray image as input, finds the regions of interest, and predicts for each region the corresponding food class. We have experimented with three different classification strategies, also using several visual descriptors. We achieve about 79% food and tray recognition accuracy using convolutional-neural-network-based features. The dataset, as well as the benchmark framework, is available to the research community.

  6. A reanalysis dataset of the South China Sea.

    PubMed

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992-2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability.
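
    At its core, three-dimensional variational (3D-Var) assimilation picks the analysis state x that minimizes J(x) = (x - xb)' B^-1 (x - xb) + (Hx - y)' R^-1 (Hx - y). A toy linear-Gaussian sketch (illustrative only, not the cited multi-scale incremental system):

      import numpy as np

      n, m = 4, 2
      xb = np.array([20., 21., 22., 23.])      # background (model) state
      H = np.array([[1., 0., 0., 0.],          # observe state points 0 and 2
                    [0., 0., 1., 0.]])
      y = np.array([20.8, 21.5])               # observations
      B = 0.5 * np.eye(n)                      # background error covariance
      R = 0.1 * np.eye(m)                      # observation error covariance

      # Analytic minimizer: xa = xb + K (y - H xb), K = B H' (H B H' + R)^-1.
      K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
      xa = xb + K @ (y - H @ xb)
      print("analysis:", xa)                   # pulled toward the observations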

  7. Dataset definition for CMS operations and physics analyses

    NASA Astrophysics Data System (ADS)

    Franzoni, Giovanni; Compact Muon Solenoid Collaboration

    2016-04-01

    Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets and secondary datasets/dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows have been added to this canonical scheme to best exploit the flexibility of the CMS trigger and data acquisition systems. The concepts of data parking and data scouting have been introduced to extend the physics reach of CMS, offering the opportunity of defining physics triggers with extremely loose selections (e.g., a dijet resonance trigger collecting data at 1 kHz). In this presentation, we review the evolution of the dataset definition during LHC run I, and we discuss the plans for run II.

  8. Compaction of fibrin clots reveals the antifibrinolytic effect of factor XIII.

    PubMed

    Rijken, D C; Abdul, S; Malfliet, J J M C; Leebeek, F W G; Uitte de Willige, S

    2016-07-01

    Essentials Factor XIIIa inhibits fibrinolysis by forming fibrin-fibrin and fibrin-inhibitor cross-links. Conflicting studies about magnitude and mechanisms of inhibition have been reported. Factor XIIIa most strongly inhibits lysis of mechanically compacted or retracted plasma clots. Cross-links of α2-antiplasmin to fibrin prevent the inhibitor from being expelled from the clot. Background Although insights into the underlying mechanisms of the effect of factor XIII on fibrinolysis have improved considerably in the last few decades, in particular with the discovery that activated FXIII (FXIIIa) cross-links α2-antiplasmin to fibrin, the topic remains a matter of debate. Objective To elucidate the mechanisms of the antifibrinolytic effect of FXIII. Methods and Results Platelet-poor plasma clot lysis, induced by the addition of tissue-type plasminogen activator, was measured in the presence or absence of a specific FXIIIa inhibitor. Both in a turbidity assay and in a fluorescence assay, the FXIIIa inhibitor had only a small inhibitory effect: 1.6-fold less tissue-type plasminogen activator was required for 50% clot lysis in the presence of the FXIIIa inhibitor. However, when the plasma clot was compacted by centrifugation, the FXIIIa inhibitor had a strong inhibitory effect, with 7.7-fold less tissue-type plasminogen activator being required for 50% clot lysis in the presence of the FXIIIa inhibitor. In both experiments, the effects of the FXIIIa inhibitor were entirely dependent on the cross-linking of α2-antiplasmin to fibrin. The FXIIIa inhibitor reduced the amount of α2-antiplasmin present in the compacted clots from approximately 30% to < 4%. The results were confirmed with experiments in which compaction was achieved by platelet-mediated clot retraction. Conclusions Compaction or retraction of fibrin clots reveals the strong antifibrinolytic effect of FXIII. This is explained by the cross-linking of α2-antiplasmin to fibrin by FXIIIa, which prevents the

  9. Network Intrusion Dataset Assessment

    DTIC Science & Technology

    2013-03-01

    [Citation fragments only; the indexed abstract is incomplete:] …Security, 6(1):173–180, October 2009. abs/0911.0787. Jungsuk Song, Hiroki Takakura, Yasuo Okabe, and Koji Nakao, "Toward a more practical…"; …Inoue, and Koji Nakao, "Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation," BADGERS '11: Proceedings of…

  10. National Hydrography Dataset Plus (NHDPlus)

    EPA Pesticide Factsheets

    The NHDPlus Version 1.0 is an integrated suite of application-ready geospatial data sets that incorporate many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD), improved networking, naming, and value-added attributes (VAA's). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage-enforcement technique first broadly applied in New England, and thus dubbed The New-England Method. This technique involves burning-in the 1:100,000-scale NHD and, when available, building walls using the national Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. An interdisciplinary team from the U.S. Geological Survey (USGS), U.S. Environmental Protection Agency (USEPA), and contractors, over the last two years, has found this method to produce the best quality NHD catchments using an automated process. The VAAs include greatly enhanced capabilities for upstream and downstream navigation, analysis and modeling. Examples include: retrieve all flowlines (predominantly confluence-to-confluence stream segments) and catchments upstream of a given flowline using queries rather than by slower flowline-by-flowline navigation; retrieve flowlines by stream order; subset a stream level path sorted in hydrologic order for st
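
    The value-added attributes turn upstream queries into set operations rather than stepwise walks. A minimal illustration of the idea using a directed flowline graph in networkx (hypothetical identifiers, not real NHDPlus COMIDs):

      import networkx as nx

      G = nx.DiGraph()                 # edges point downstream
      G.add_edges_from([
          (101, 103), (102, 103),      # two headwaters join at 103
          (103, 104), (105, 104),      # a tributary joins at 104
          (104, 106),                  # outlet
      ])

      # All flowlines upstream of flowline 104 in one query, instead of
      # navigating confluence to confluence:
      print(sorted(nx.ancestors(G, 104)))   # [101, 102, 103, 105]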

  11. Visualization of conserved structures by fusing highly variable datasets.

    PubMed

    Silverstein, Jonathan C; Chhadia, Ankur; Dech, Fred

    2002-01-01

    Skill, effort, and time are required to identify and visualize anatomic structures in three-dimensions from radiological data. Fundamentally, automating these processes requires a technique that uses symbolic information not in the dynamic range of the voxel data. We were developing such a technique based on mutual information for automatic multi-modality image fusion (MIAMI Fuse, University of Michigan). This system previously demonstrated facility at fusing one voxel dataset with integrated symbolic structure information to a CT dataset (different scale and resolution) from the same person. The next step of development of our technique was aimed at accommodating the variability of anatomy from patient to patient by using warping to fuse our standard dataset to arbitrary patient CT datasets. A standard symbolic information dataset was created from the full color Visible Human Female by segmenting the liver parenchyma, portal veins, and hepatic veins and overwriting each set of voxels with a fixed color. Two arbitrarily selected patient CT scans of the abdomen were used for reference datasets. We used the warping functions in MIAMI Fuse to align the standard structure data to each patient scan. The key to successful fusion was the focused use of multiple warping control points that place themselves around the structure of interest automatically. The user assigns only a few initial control points to align the scans. Fusion 1 and 2 transformed the atlas with 27 points around the liver to CT1 and CT2 respectively. Fusion 3 transformed the atlas with 45 control points around the liver to CT1 and Fusion 4 transformed the atlas with 5 control points around the portal vein. The CT dataset is augmented with the transformed standard structure dataset, such that the warped structure masks are visualized in combination with the original patient dataset. This combined volume visualization is then rendered interactively in stereo on the ImmersaDesk in an immersive Virtual
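
    The similarity measure that drives this kind of fusion is mutual information between the images' intensity distributions; registration searches warp parameters (e.g., control-point placements) that maximize it. A generic sketch (not the MIAMI Fuse implementation):

      import numpy as np

      def mutual_information(img_a, img_b, bins=32):
          """MI between two images' intensities via a joint histogram."""
          joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
          pxy = joint / joint.sum()
          px = pxy.sum(axis=1, keepdims=True)
          py = pxy.sum(axis=0, keepdims=True)
          nz = pxy > 0                                  # avoid log(0)
          return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

      rng = np.random.default_rng(4)
      a = rng.random((64, 64))
      b = np.roll(a, 3, axis=0)                         # misaligned copy
      print("aligned:", round(mutual_information(a, a), 3),
            "shifted:", round(mutual_information(a, b), 3))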

  12. Energy levels, lifetimes, and transition rates for the selenium isoelectronic sequence Pd XIII-Te XIX, Xe XXI-Nd XXVII, W XLI

    NASA Astrophysics Data System (ADS)

    Wang, K.; Yang, X.; Chen, Z. B.; Si, R.; Chen, C. Y.; Yan, J.; Zhao, X. H.; Dang, W.

    2017-09-01

    Energy levels, wavelengths, lifetimes, oscillator strengths, and electric dipole (E1), magnetic dipole (M1), electric quadrupole (E2), and magnetic quadrupole (M2) transition rates among the 46 fine-structure levels belonging to the ([Ar]3d10)4s2 4p4, ([Ar]3d10)4s2 4p3 4d, and ([Ar]3d10)4s 4p5 configurations for the selenium isoelectronic sequence Pd XIII-Te XIX, Xe XXI-Nd XXVII, and W XLI are reported. These data are determined with the multi-configuration Dirac-Fock (MCDF) approach, in which relativistic effects, main electron correlations within the n = 7 complex, the Breit interaction (BI), and quantum electrodynamic (QED) corrections are included. The many-body perturbation theory (MBPT) method is also employed as an independent calculation to confirm the present accuracy, taking W XLI as an example. Comparisons and analysis are made between the present results and available experimental and theoretical ones, and good agreement is obtained. These accurate data are expected to be useful in nuclear fusion research and astrophysical applications.

  13. Inadequate Reference Datasets Biased toward Short Non-epitopes Confound B-cell Epitope Prediction*

    PubMed Central

    Rahman, Kh. Shamsur; Chowdhury, Erfan Ullah; Sachse, Konrad; Kaltenboeck, Bernhard

    2016-01-01

    X-ray crystallography has shown that an antibody paratope typically binds 15–22 amino acids (aa) of an epitope, of which 2–5 randomly distributed amino acids contribute most of the binding energy. In contrast, researchers typically choose for B-cell epitope mapping short peptide antigens in antibody binding assays. Furthermore, short 6–11-aa epitopes, and in particular non-epitopes, are over-represented in published B-cell epitope datasets that are commonly used for development of B-cell epitope prediction approaches from protein antigen sequences. We hypothesized that such suboptimal length peptides result in weak antibody binding and cause false-negative results. We tested the influence of peptide antigen length on antibody binding by analyzing data on more than 900 peptides used for B-cell epitope mapping of immunodominant proteins of Chlamydia spp. We demonstrate that short 7–12-aa peptides of B-cell epitopes bind antibodies poorly; thus, epitope mapping with short peptide antigens falsely classifies many B-cell epitopes as non-epitopes. We also show in published datasets of confirmed epitopes and non-epitopes a direct correlation between length of peptide antigens and antibody binding. Elimination of short, ≤11-aa epitope/non-epitope sequences improved datasets for evaluation of in silico B-cell epitope prediction. Achieving up to 86% accuracy, protein disorder tendency is the best indicator of B-cell epitope regions for chlamydial and published datasets. For B-cell epitope prediction, the most effective approach is plotting disorder of protein sequences with the IUPred-L scale, followed by antibody reactivity testing of 16–30-aa peptides from peak regions. This strategy overcomes the well known inaccuracy of in silico B-cell epitope prediction from primary protein sequences. PMID:27189949

  14. Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset

    DTIC Science & Technology

    2014-12-23

    …publications for benchmarking prognostics algorithms. The turbofan degradation datasets have received over seven thousand unique downloads in the last five… approaches that researchers have taken to implement prognostics using these turbofan datasets. Some unique characteristics of these datasets are also… [Table fragment: description of the five turbofan degradation datasets available from the NASA repository, with columns: Datasets, #Fault Modes, #Conditions, #Train Units, #Test Units.]

  15. Safety of recombinant human factor XIII in a cynomolgus monkey model of extracorporeal blood circulation.

    PubMed

    Ponce, R; Armstrong, K; Andrews, K; Hensler, J; Waggie, K; Heffernan, J; Reynolds, T; Rogge, M

    2005-01-01

    Factor XIII (FXIII) is a thrombin-activated plasma coagulation factor critical for blood clot stabilization and longevity. Administration of exogenous FXIII to replenish depleted stores after major surgery, including cardiopulmonary bypass, may reduce bleeding complications and transfusion requirements. Thus, a model of extracorporeal circulation (ECC) was developed in adult male cynomolgus monkeys (Macaca fascicularis) to evaluate the nonclinical safety of recombinant human FXIII (rFXIII). The hematological and coagulation profile in study animals during and after 2 h of ECC was similar to that reported for humans during and after cardiopulmonary bypass, including observations of anemia, thrombocytopenia, and activation of coagulation and platelets. Intravenous slow bolus injection of 300 U/kg (2.1 mg/kg) or 1000 U/kg (7 mg/kg) rFXIII after 2 h of ECC was well tolerated in study animals, and was associated with a dose-dependent increase in FXIII activity. No clinically significant effects in respiration, ECG, heart rate, blood pressure, body temperature, clinical chemistry, hematology (including platelet counts), or indicators of thrombosis (thrombin:anti-thrombin complex and D-Dimer) or platelet activation (platelet factor 4 and beta-thromboglobulin) were related to rFXIII administration. Specific examination of brain, heart, lung, liver, and kidney from rFXIII-treated animals provided no evidence of histopathological alterations suggestive of subclinical hemorrhage or thrombosis. Taken as a whole, the results demonstrate the ECC model suitably replicated the clinical presentation reported for humans during and after cardiopulmonary bypass surgery, and do not suggest significant concerns regarding use of rFXIII in replacement therapy after extracorporeal circulation.

  16. Recombinant factor XIII diminishes multiple organ dysfunction in rats caused by gut ischemia-reperfusion injury

    PubMed Central

    Zaets, Sergey B.; Xu, Da-Zhong; Lu, Qi; Feketova, Eleonora; Berezina, Tamara L.; Gruda, Maryann; Malinina, Inga V.; Deitch, Edwin A.; Olsen, Eva H. N.

    2010-01-01

    Plasma factor XIII (FXIII) is responsible for stabilization of fibrin clot at the final stage of blood coagulation. Because FXIII has also been shown to modulate inflammation and endothelial permeability, we hypothesized that FXIII diminishes multiple organ dysfunction caused by gut I/R injury. A model of superior mesenteric artery occlusion (SMAO) was used to induce gut I/R injury. Rats were subjected to 45-min SMAO or sham SMAO and treated with recombinant human FXIII A2 subunit (rFXIII) or placebo at the beginning of the reperfusion period. Lung permeability, lung and gut myeloperoxidase activity, gut histology, neutrophil respiratory burst, and microvascular blood flow in the liver and muscles were measured after a 3-h reperfusion period. The effect of activated rFXIII on transendothelial resistance of human umbilical vein endothelial cells was tested in vitro. Superior mesenteric artery occlusion–induced lung permeability as well as lung and gut myeloperoxidase activity was significantly lower in rFXIII-treated versus untreated animals. Similarly, rFXIII-treated rats had lower neutrophil respiratory burst activity and ileal mucosal injury. Rats treated with rFXIII also had higher liver microvascular blood flow compared with the placebo group. Superior mesenteric artery occlusion did not cause FXIII consumption during the study period. In vitro, activated rFXIII caused a dose-dependent increase in human umbilical vein endothelial cell monolayer resistance to thrombin-induced injury. Thus, administration of rFXIII diminishes SMAO-induced multiple organ dysfunction in rats, presumably by preservation of endothelial barrier function and the limitation of polymorphonuclear leukocyte activation. PMID:18948851

  17. Implementation of Common Core State Standards: Voices, Positions, and Frames

    ERIC Educational Resources Information Center

    Pense, Seburn L.; Freeburg, Beth Winfrey; Clemons, Christopher A.

    2015-01-01

    The purpose of this study was to describe the voices heard, positions portrayed, and frames of newspaper messages regarding the implementation of Common Core State Standards (CCSS). The dataset contained 69 articles from 38 community newspapers in 24 states (n = 62) and from three national newspapers (n = 7). Researchers identified five voices…

  18. SisFall: A Fall and Movement Dataset

    PubMed Central

    Sucerquia, Angela; López, José David; Vargas-Bonilla, Jesús Francisco

    2017-01-01

    Research on fall and movement detection with wearable devices has witnessed promising growth. However, few publicly available datasets exist, all recorded with smartphones, and they are insufficient for testing new proposals because of the absence of a clearly targeted population, the limited range of performed activities, and scarce accompanying information. Here, we present a dataset of falls and activities of daily living (ADLs) acquired with a self-developed device composed of two types of accelerometer and one gyroscope. It consists of 19 ADLs and 15 fall types performed by 23 young adults, 15 ADL types performed by 14 healthy and independent participants over 62 years old, and data from one 60-year-old participant who performed all ADLs and falls. These activities were selected based on a survey and a literature analysis. We test the dataset with widely used feature extraction and a simple-to-implement threshold-based classification, achieving up to 96% accuracy in fall detection. An individual activity analysis demonstrates that most errors are concentrated in a small number of activities, on which new approaches could focus. Finally, validation tests with elderly people significantly reduced the fall detection performance of the tested features. This validates findings of other authors and encourages developing new strategies with this new dataset as the benchmark. PMID:28117691

  19. SisFall: A Fall and Movement Dataset.

    PubMed

    Sucerquia, Angela; López, José David; Vargas-Bonilla, Jesús Francisco

    2017-01-20

    Research on fall and movement detection with wearable devices has witnessed promising growth. However, few publicly available datasets exist, all recorded with smartphones, and they are insufficient for testing new proposals because of the absence of a clearly targeted population, the limited range of performed activities, and scarce accompanying information. Here, we present a dataset of falls and activities of daily living (ADLs) acquired with a self-developed device composed of two types of accelerometer and one gyroscope. It consists of 19 ADLs and 15 fall types performed by 23 young adults, 15 ADL types performed by 14 healthy and independent participants over 62 years old, and data from one 60-year-old participant who performed all ADLs and falls. These activities were selected based on a survey and a literature analysis. We test the dataset with widely used feature extraction and a simple-to-implement threshold-based classification, achieving up to 96% accuracy in fall detection. An individual activity analysis demonstrates that most errors are concentrated in a small number of activities, on which new approaches could focus. Finally, validation tests with elderly people significantly reduced the fall detection performance of the tested features. This validates findings of other authors and encourages developing new strategies with this new dataset as the benchmark.
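
    A minimal sketch of the threshold-based classification the authors evaluate, assuming triaxial accelerometer samples in units of g; the magnitude feature and the 2.5 g threshold are illustrative, not the paper's exact feature set:

        import numpy as np

        def detect_fall(ax, ay, az, threshold_g=2.5):
            """Flag a fall when the acceleration magnitude exceeds the threshold."""
            magnitude = np.sqrt(ax**2 + ay**2 + az**2)
            return bool(np.any(magnitude > threshold_g))

        # Usage with a synthetic trace: quiet standing, then an impact spike.
        ax, ay, az = np.zeros(100), np.zeros(100), np.ones(100)
        az[60] = 3.1  # impact
        print(detect_fall(ax, ay, az))  # True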

  20. Seismic data enhancement and regularization using finite offset Common Diffraction Surface (CDS) stack

    NASA Astrophysics Data System (ADS)

    Garabito, German; Cruz, João Carlos Ribeiro; Oliva, Pedro Andrés Chira; Söllner, Walter

    2017-01-01

    The Common-Reflection-Surface (CRS) stack is a robust method for simulating zero-offset and common-offset sections with high accuracy from multi-coverage seismic data. For simulating common-offset sections, the CRS stack method uses a hyperbolic traveltime approximation that depends on five kinematic parameters for each selected sample point of the common-offset section to be simulated. The main challenge of this method is to find a computationally efficient data-driven optimization strategy for accurately determining the five kinematic stacking parameters on which each sample of the stacked common-offset section depends. Several authors have applied multi-step strategies to obtain the optimal parameters by combining different pre-stack data configurations. Recently, other authors have used one-step data-driven strategies based on global optimization to estimate the five parameters simultaneously from multi-midpoint and multi-offset gathers. To increase the computational efficiency of the global optimization process, we use in this paper a reduced form of the CRS traveltime approximation that depends on only four parameters, the so-called Common-Diffraction-Surface (CDS) traveltime approximation. By analyzing the convergence of both objective functions and the data enhancement effect after applying the two traveltime approximations to the Marmousi synthetic dataset and a real land dataset, we conclude that the CDS approximation is more efficient within certain aperture limits while preserving high image accuracy. The preserved image quality is also observed in a direct comparison after applying both approximations to simulate common-offset sections on noisy pre-stack data.

  1. Viability of Controlling Prosthetic Hand Utilizing Electroencephalograph (EEG) Dataset Signal

    NASA Astrophysics Data System (ADS)

    Miskon, Azizi; A/L Thanakodi, Suresh; Raihan Mazlan, Mohd; Mohd Haziq Azhar, Satria; Nooraya Mohd Tawil, Siti

    2016-11-01

    This project presents the development of an artificial hand controlled by electroencephalograph (EEG) signal datasets for prosthetic applications. The EEG signal datasets were used to improve control of the prosthetic hand relative to electromyograph (EMG) control. EMG has disadvantages for a person who has not used the relevant muscles for a long time and for persons with age-related degenerative issues; the EEG datasets were therefore found to be an alternative to EMG. The datasets used in this work were taken from a Brain Computer Interface (BCI) project and were already classified for open, close, and combined movement operations. They served as input to control the prosthetic hand through an interface between Microsoft Visual Studio and Arduino. The obtained results reveal that the prosthetic hand is more efficient and faster in response to the EEG datasets with an additional LiPo (lithium polymer) battery attached to the prosthetic. Some limitations were also identified in terms of hand movements and the weight of the prosthetic, and suggestions for improvement are given. Overall, the objectives of this paper were achieved, as the prosthetic hand was found to be feasible in operation utilizing the EEG datasets.
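
    A minimal sketch of the dataset-to-Arduino dispatch described above, assuming the classified movement labels are forwarded over a serial link with pyserial; the port name, baud rate, and single-character command protocol are invented for illustration, since the paper's Visual Studio/Arduino interface is not detailed here:

        import serial  # pyserial

        COMMANDS = {"open": b"O", "close": b"C", "combined": b"B"}

        def send_movements(labels, port="/dev/ttyUSB0", baud=9600):
            """Forward classified EEG movement labels to the prosthetic controller."""
            with serial.Serial(port, baud, timeout=1) as arduino:
                for label in labels:
                    arduino.write(COMMANDS[label])

        # send_movements(["open", "close", "combined"])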

  2. CIFAR10-DVS: An Event-Stream Dataset for Object Classification

    PubMed Central

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task that requires recording with neuromorphic cameras, and only a limited number of event-stream datasets are currently available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named “CIFAR10-DVS.” The conversion was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversions that move the camera over static images, moving the images themselves is more realistic with respect to practical applications: the repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark for event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification. PMID:28611582

  3. CIFAR10-DVS: An Event-Stream Dataset for Object Classification.

    PubMed

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task that requires recording with neuromorphic cameras, and only a limited number of event-stream datasets are currently available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named "CIFAR10-DVS." The conversion was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversions that move the camera over static images, moving the images themselves is more realistic with respect to practical applications: the repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark for event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification.
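
    A minimal sketch of how such an event stream can be consumed, assuming events are (timestamp, x, y, polarity) tuples with polarity in {+1, -1} and a 128x128 sensor; both are common DVS conventions but are assumptions here, not the dataset's documented layout:

        import numpy as np

        def events_to_frame(events, width=128, height=128):
            """Accumulate polarity events into a single 2D frame."""
            frame = np.zeros((height, width), dtype=np.int32)
            for t, x, y, polarity in events:
                frame[y, x] += polarity
            return frame

        frame = events_to_frame([(0.001, 5, 7, +1), (0.002, 5, 7, +1), (0.003, 9, 2, -1)])
        print(frame[7, 5])  # 2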

  4. Finding Spatio-Temporal Patterns in Large Sensor Datasets

    ERIC Educational Resources Information Center

    McGuire, Michael Patrick

    2010-01-01

    Spatial or temporal data mining tasks are performed in the context of the relevant space, defined by a spatial neighborhood, and the relevant time period, defined by a specific time interval. Furthermore, when mining large spatio-temporal datasets, interesting patterns typically emerge where the dataset is most dynamic. This dissertation is…

  5. Sea Surface Temperature for Climate Applications: A New Dataset from the European Space Agency Climate Change Initiative

    NASA Astrophysics Data System (ADS)

    Merchant, C. J.; Hulley, G. C.

    2013-12-01

    There are many datasets describing the evolution of global sea surface temperature (SST) over recent decades -- so why make another one? Answer: to provide observations of SST that have particular qualities relevant to climate applications: independence, accuracy and stability. This has been done within the European Space Agency (ESA) Climate Change Initiative (CCI) project on SST. Independence refers to the fact that the new SST CCI dataset is not derived from or tuned to in situ observations. This matters for climate because the in situ observing network used to assess marine climate change (1) was not designed to monitor small changes over decadal timescales, and (2) has evolved significantly in its technology and mix of types of observation, even during the past 40 years. The potential for significant artefacts in our picture of global ocean surface warming is clear. Only by having an independent record can we confirm (or refute) that the work done to remove biases/trend artefacts in in-situ datasets has been successful. Accuracy is the degree to which SSTs are unbiased. For climate applications, a common accuracy target is 0.1 K for all regions of the ocean. Stability is the degree to which the bias, if any, in a dataset is constant over time. Long-term instability introduces trend artefacts. To observe trends of the magnitude of 'global warming', SST datasets need to be stable to <5 mK/year. The SST CCI project has produced a satellite-based dataset that addresses these characteristics relevant to climate applications. Satellite radiances (brightness temperatures) have been harmonised exploiting periods of overlapping observations between sensors. Less well-characterised sensors have had their calibration tuned to that of better characterised sensors (at radiance level). Non-conventional retrieval methods (optimal estimation) have been employed to reduce regional biases to the 0.1 K level, a target violated in most satellite SST datasets. Models for
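
    A minimal sketch of how the stability target quoted above (<5 mK/year) can be checked, assuming a time series of satellite-minus-reference SST bias; the synthetic series here is purely illustrative:

        import numpy as np

        def stability_mk_per_year(years, bias_k):
            """Linear drift of the bias, converted from K/yr to mK/yr."""
            return np.polyfit(years, bias_k, 1)[0] * 1000.0

        years = np.arange(1992, 2011)
        bias_k = 0.002 * (years - years.mean()) + 0.05  # synthetic 2 mK/yr drift
        print(stability_mk_per_year(years, bias_k))  # ~2.0, within the target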

  6. The National Hydrography Dataset

    USGS Publications Warehouse

    1999-01-01

    The National Hydrography Dataset (NHD) is a newly combined dataset that provides hydrographic data for the United States. The NHD is the culmination of recent cooperative efforts of the U.S. Environmental Protection Agency (USEPA) and the U.S. Geological Survey (USGS). It combines elements of USGS digital line graph (DLG) hydrography files and the USEPA Reach File (RF3). The NHD supersedes RF3 and DLG files by incorporating them, not by replacing them. Users of RF3 or DLG files will find the same data in a new, more flexible format. They will find that the NHD is familiar but greatly expanded and refined. The DLG files contribute a national coverage of millions of features, including water bodies such as lakes and ponds, linear water features such as streams and rivers, and also point features such as springs and wells. These files provide standardized feature types, delineation, and spatial accuracy. From RF3, the NHD acquires hydrographic sequencing, upstream and downstream navigation for modeling applications, and reach codes. The reach codes provide a way to integrate data from organizations at all levels by linking the data to this nationally consistent hydrographic network. The feature names are from the Geographic Names Information System (GNIS). The NHD provides comprehensive coverage of hydrographic data for the United States. Some of the anticipated end-user applications of the NHD are multiuse hydrographic modeling and water-quality studies of fish habitats. Although based on 1:100,000-scale data, the NHD is planned so that it can incorporate and encourage the development of the higher resolution data that many users require. The NHD can be used to promote the exchange of data between users at the national, State, and local levels. Many users will benefit from the NHD and will want to contribute to the dataset as well.
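
    A minimal sketch of the upstream navigation that reach codes make possible, using an invented four-reach network; the real NHD encodes flow relationships in standardized tables rather than a Python dict:

        # Each reach points to the reach immediately downstream (None = outlet).
        downstream = {"R1": "R3", "R2": "R3", "R3": "R4", "R4": None}

        def upstream_of(reach, table=downstream):
            """Return every reach that eventually drains into `reach`."""
            found, frontier = set(), [reach]
            while frontier:
                current = frontier.pop()
                for r, d in table.items():
                    if d == current and r not in found:
                        found.add(r)
                        frontier.append(r)
            return found

        print(upstream_of("R4"))  # {'R1', 'R2', 'R3'}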

  7. Harmonization of forest disturbance datasets of the conterminous USA from 1986 to 2011

    USGS Publications Warehouse

    Soulard, Christopher E.; Acevedo, William; Cohen, Warren B.; Yang, Zhiqiang; Stehman, Stephen V.; Taylor, Janis L.

    2017-01-01

    Several spatial forest disturbance datasets exist for the conterminous USA. The major problem with forest disturbance mapping is that variability between map products leads to uncertainty regarding the actual rate of disturbance. In this article, harmonized maps were produced from multiple data sources (i.e., Global Forest Change, LANDFIRE Vegetation Disturbance, National Land Cover Database, Vegetation Change Tracker, and Web-Enabled Landsat Data). The harmonization process involved fitting common class ontologies and determining spatial congruency to produce forest disturbance maps for four time intervals (1986–1992, 1992–2001, 2001–2006, and 2006–2011). Pixels mapped as disturbed for two or more datasets were labeled as disturbed in the harmonized maps. The primary advantage gained by harmonization was improvement in commission error rates relative to the individual disturbance products. Disturbance omission errors were high for both harmonized and individual forest disturbance maps due to underlying limitations in mapping subtle disturbances with Landsat classification algorithms. To enhance the value of the harmonized disturbance products, we used fire perimeter maps to add information on the cause of disturbance.

  8. PHOXTRACK-a tool for interpreting comprehensive datasets of post-translational modifications of proteins.

    PubMed

    Weidner, Christopher; Fischer, Cornelius; Sauer, Sascha

    2014-12-01

    We introduce PHOXTRACK (PHOsphosite-X-TRacing Analysis of Causal Kinases), a user-friendly, freely available software tool for analyzing large datasets of post-translational modifications of proteins, such as phosphorylation, which are commonly obtained by mass spectrometry. In contrast to other currently applied data analysis approaches, PHOXTRACK uses full sets of quantitative proteomics data and applies non-parametric statistics to assess whether defined kinase-specific sets of phosphosite sequences indicate statistically significant concordant differences between various biological conditions. PHOXTRACK is an efficient tool for extracting post-translational information from comprehensive proteomics datasets to decipher key regulatory proteins and to infer biologically relevant molecular pathways. PHOXTRACK will be maintained over the next years and is freely available as an online tool for non-commercial use at http://phoxtrack.molgen.mpg.de. Users will also find a tutorial at this Web site and can additionally give feedback at https://groups.google.com/d/forum/phoxtrack-discuss. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
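
    A sketch of the non-parametric idea described above: test whether the phosphosites attributed to one kinase shift concordantly between conditions relative to all other sites. Using a Mann-Whitney U test here is an assumption for illustration; it is not necessarily the exact statistic PHOXTRACK applies:

        from scipy.stats import mannwhitneyu

        def kinase_set_shift(fold_changes, kinase_sites):
            """fold_changes: dict phosphosite -> log2 fold change between conditions.
            kinase_sites: phosphosites attributed to the kinase of interest."""
            in_set = [fc for s, fc in fold_changes.items() if s in kinase_sites]
            rest = [fc for s, fc in fold_changes.items() if s not in kinase_sites]
            stat, p = mannwhitneyu(in_set, rest, alternative="two-sided")
            return p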

  9. Harmonization of forest disturbance datasets of the conterminous USA from 1986 to 2011.

    PubMed

    Soulard, Christopher E; Acevedo, William; Cohen, Warren B; Yang, Zhiqiang; Stehman, Stephen V; Taylor, Janis L

    2017-04-01

    Several spatial forest disturbance datasets exist for the conterminous USA. The major problem with forest disturbance mapping is that variability between map products leads to uncertainty regarding the actual rate of disturbance. In this article, harmonized maps were produced from multiple data sources (i.e., Global Forest Change, LANDFIRE Vegetation Disturbance, National Land Cover Database, Vegetation Change Tracker, and Web-Enabled Landsat Data). The harmonization process involved fitting common class ontologies and determining spatial congruency to produce forest disturbance maps for four time intervals (1986-1992, 1992-2001, 2001-2006, and 2006-2011). Pixels mapped as disturbed for two or more datasets were labeled as disturbed in the harmonized maps. The primary advantage gained by harmonization was improvement in commission error rates relative to the individual disturbance products. Disturbance omission errors were high for both harmonized and individual forest disturbance maps due to underlying limitations in mapping subtle disturbances with Landsat classification algorithms. To enhance the value of the harmonized disturbance products, we used fire perimeter maps to add information on the cause of disturbance.
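
    A minimal sketch of the agreement rule described above: a pixel is labeled disturbed when two or more input maps flag it. The tiny grids stand in for co-registered Landsat-based disturbance maps:

        import numpy as np

        maps = np.array([
            [[1, 0, 0], [0, 1, 0], [0, 0, 0]],  # dataset A
            [[1, 0, 0], [0, 1, 1], [0, 0, 0]],  # dataset B
            [[0, 0, 0], [0, 1, 0], [1, 0, 0]],  # dataset C
        ])

        harmonized = (maps.sum(axis=0) >= 2).astype(np.uint8)
        print(harmonized)  # [[1 0 0] [0 1 0] [0 0 0]]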

  10. Towards interoperable and reproducible QSAR analyses: Exchange of datasets.

    PubMed

    Spjuth, Ola; Willighagen, Egon L; Guha, Rajarshi; Eklund, Martin; Wikberg, Jarl Es

    2010-06-30

    QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises the addition of chemical structures as well as the selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies the setup of QSAR datasets and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates the addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problem of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but

  11. Towards interoperable and reproducible QSAR analyses: Exchange of datasets

    PubMed Central

    2010-01-01

    Background QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises the addition of chemical structures as well as the selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. Results We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies the setup of QSAR datasets and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates the addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Conclusions Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problem of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets
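
    A sketch of the idea behind such an exchange format: record not only descriptor values but also which implementation and version produced them, so the dataset setup is reproducible. The element and attribute names below are invented for illustration and do not reproduce the actual QSAR-ML schema:

        import xml.etree.ElementTree as ET

        dataset = ET.Element("dataset")
        structure = ET.SubElement(dataset, "structure", id="mol1", smiles="CCO")
        ET.SubElement(structure, "descriptorValue",
                      ontologyRef="descriptor:XLogP",
                      implementation="CDK", version="1.4.19", value="-0.14")
        print(ET.tostring(dataset, encoding="unicode"))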

  12. Genomic Datasets for Cancer Research

    Cancer.gov

    A variety of datasets from genome-wide association studies of cancer and other genotype-phenotype studies, including sequencing and molecular diagnostic assays, are available to approved investigators through the Extramural National Cancer Institute Data Access Committee.

  13. NHDPlus (National Hydrography Dataset Plus)

    EPA Pesticide Factsheets

    NHDPlus is a geospatial, hydrologic framework dataset intended for use by geospatial analysts and modelers to support water resources applications. NHDPlus was developed by the USEPA in partnership with the U.S. Geological Survey

  14. VideoWeb Dataset for Multi-camera Activities and Non-verbal Communication

    NASA Astrophysics Data System (ADS)

    Denina, Giovanni; Bhanu, Bir; Nguyen, Hoang Thanh; Ding, Chong; Kamal, Ahmed; Ravishankar, Chinya; Roy-Chowdhury, Amit; Ivers, Allen; Varda, Brenda

    Human-activity recognition is one of the most challenging problems in computer vision. Researchers from around the world have tried to solve this problem and have come a long way in recognizing simple motions and atomic activities. As the computer vision community heads toward fully recognizing human activities, a challenging and labeled dataset is needed. To respond to that need, we collected a dataset of realistic scenarios in a multi-camera network environment (VideoWeb) involving multiple persons performing dozens of different repetitive and non-repetitive activities. This chapter describes the details of the dataset. We believe that this VideoWeb Activities dataset is unique and it is one of the most challenging datasets available today. The dataset is publicly available online at http://vwdata.ee.ucr.edu/ along with the data annotation.

  15. Toward Computational Cumulative Biology by Combining Models of Biological Datasets

    PubMed Central

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176

  16. Toward computational cumulative biology by combining models of biological datasets.

    PubMed

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations-for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.
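
    A minimal sketch of the decomposition idea, assuming each earlier model can be summarized as a non-negative signature vector over genes, which is a deliberate simplification of the combination model described above:

        import numpy as np
        from scipy.optimize import nnls

        rng = np.random.default_rng(0)
        signatures = rng.random((1000, 5))  # 5 earlier models, 1000 genes
        new_data = signatures @ np.array([0.7, 0.0, 0.3, 0.0, 0.0])

        # Non-negative least squares: which earlier models explain the new data?
        weights, residual = nnls(signatures, new_data)
        print(np.round(weights, 2))  # large weights flag the relevant models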

  17. Improving the discoverability, accessibility, and citability of omics datasets: a case report.

    PubMed

    Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J

    2017-03-01

    Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  18. Scalable and Interactive Segmentation and Visualization of Neural Processes in EM Datasets

    PubMed Central

    Jeong, Won-Ki; Beyer, Johanna; Hadwiger, Markus; Vazquez, Amelio; Pfister, Hanspeter; Whitaker, Ross T.

    2011-01-01

    Recent advances in scanning technology provide high resolution EM (Electron Microscopy) datasets that allow neuroscientists to reconstruct complex neural connections in a nervous system. However, due to the enormous size and complexity of the resulting data, segmentation and visualization of neural processes in EM data is usually a difficult and very time-consuming task. In this paper, we present NeuroTrace, a novel EM volume segmentation and visualization system that consists of two parts: a semi-automatic multiphase level set segmentation with 3D tracking for reconstruction of neural processes, and a specialized volume rendering approach for visualization of EM volumes. It employs view-dependent on-demand filtering and evaluation of a local histogram edge metric, as well as on-the-fly interpolation and ray-casting of implicit surfaces for segmented neural structures. Both methods are implemented on the GPU for interactive performance. NeuroTrace is designed to be scalable to large datasets and data-parallel hardware architectures. A comparison of NeuroTrace with a commonly used manual EM segmentation tool shows that our interactive workflow is faster and easier to use for the reconstruction of complex neural processes. PMID:19834227

  19. A dataset of forest biomass structure for Eurasia.

    PubMed

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-16

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below-ground); green forest floor (above- and below-ground); and coarse woody debris (snags, logs, dead branches of living trees, and dead roots). The compilation consists of 10,351 unique records of sample plots and 9,613 sample trees from ca. 1,200 experiments for the period 1930-2014, with some overlap between the plot and tree records. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, and growing stock volume, when available. Such a dataset can be used for the development of models of biomass structure, biomass extension factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  20. A reanalysis dataset of the South China Sea

    PubMed Central

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992–2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability. PMID:25977803
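
    A toy sketch of a single three-dimensional variational (3D-Var) analysis step, which minimizes J(x) = (x-xb)' B^-1 (x-xb) + (Hx-y)' R^-1 (Hx-y); the actual system uses a multi-scale incremental scheme, and the three-variable state, covariances, and observations below are invented:

        import numpy as np

        xb = np.array([20.0, 21.0, 22.0])          # background state (e.g., SST)
        B = np.eye(3) * 0.5                        # background-error covariance
        H = np.array([[1.0, 0, 0], [0, 0, 1.0]])   # observation operator
        y = np.array([20.6, 21.5])                 # observations
        R = np.eye(2) * 0.2                        # observation-error covariance

        # Closed-form minimizer: xa = xb + B H' (H B H' + R)^-1 (y - H xb)
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        xa = xb + K @ (y - H @ xb)
        print(xa)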

  1. A dataset of forest biomass structure for Eurasia

    NASA Astrophysics Data System (ADS)

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-01

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below-ground); green forest floor (above- and below-ground); and coarse woody debris (snags, logs, dead branches of living trees, and dead roots). The compilation consists of 10,351 unique records of sample plots and 9,613 sample trees from ca. 1,200 experiments for the period 1930-2014, with some overlap between the plot and tree records. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, and growing stock volume, when available. Such a dataset can be used for the development of models of biomass structure, biomass extension factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  2. Optimizing tertiary storage organization and access for spatio-temporal datasets

    NASA Technical Reports Server (NTRS)

    Chen, Ling Tony; Rotem, Doron; Shoshani, Arie; Drach, Bob; Louis, Steve; Keating, Meridith

    1994-01-01

    We address in this paper data management techniques for efficiently retrieving requested subsets of large datasets stored on mass storage devices. This problem represents a major bottleneck that can negate the benefits of fast networks, because the time to access a subset from a large dataset stored on a mass storage system is much greater than the time to transmit that subset over a network. This paper focuses on very large spatial and temporal datasets generated by simulation programs in the area of climate modeling, but the techniques developed can be applied to other applications that deal with large multidimensional datasets. The main requirement we have addressed is the efficient access to subsets of information contained within much larger datasets, for the purpose of analysis and interactive visualization. We have developed data partitioning techniques that partition datasets into 'clusters' based on analysis of data access patterns and storage device characteristics. The goal is to minimize the number of clusters read from mass storage systems when subsets are requested. We emphasize in this paper proposed enhancements to current storage server protocols to permit control over the physical placement of data on storage devices. We also discuss in some detail the interface between the application programs and the mass storage system, as well as a workbench to help scientists design the best reorganization of a dataset for anticipated access patterns.

  3. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    PubMed Central

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    SUMMARY In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival; this model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach, which has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. A simulation study shows that it outperforms existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach. PMID:23938111
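
    For reference, the minimax concave penalty on which the marker-selection step builds can be written down directly; the regularization values below are illustrative:

        import numpy as np

        def mcp(t, lam=1.0, gamma=3.0):
            """MCP: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam,
            else gamma*lam^2/2, so large coefficients escape shrinkage."""
            t = np.abs(t)
            return np.where(t <= gamma * lam,
                            lam * t - t**2 / (2 * gamma),
                            gamma * lam**2 / 2)

        print(mcp(np.array([0.5, 1.0, 5.0])))  # penalty flattens at 1.5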

  4. Assessment of the NASA-USGS Global Land Survey (GLS) Datasets

    USGS Publications Warehouse

    Gutman, Garik; Huang, Chengquan; Chander, Gyanesh; Noojipady, Praveen; Masek, Jeffery G.

    2013-01-01

    The Global Land Survey (GLS) datasets are a collection of orthorectified, cloud-minimized Landsat-type satellite images, providing near complete coverage of the global land area decadally since the early 1970s. The global mosaics are centered on 1975, 1990, 2000, 2005, and 2010, and consist of data acquired from four sensors: Enhanced Thematic Mapper Plus, Thematic Mapper, Multispectral Scanner, and Advanced Land Imager. The GLS datasets have been widely used in land-cover and land-use change studies at local, regional, and global scales. This study evaluates the GLS datasets with respect to their spatial coverage, temporal consistency, geodetic accuracy, radiometric calibration consistency, image completeness, extent of cloud contamination, and residual gaps. In general, the three latest GLS datasets are of a better quality than the GLS-1990 and GLS-1975 datasets, with most of the imagery (85%) having cloud cover of less than 10%, the acquisition years clustered much more tightly around their target years, better co-registration relative to GLS-2000, and better radiometric absolute calibration. Probably, the most significant impediment to scientific use of the datasets is the variability of image phenology (i.e., acquisition day of year). This paper provides end-users with an assessment of the quality of the GLS datasets for specific applications, and where possible, suggestions for mitigating their deficiencies.

  5. Brown CA et al 2016 Dataset

    EPA Pesticide Factsheets

    This dataset contains the research described in the following publication: Brown, C.A., D. Sharp, and T. Mochon Collura. 2016. Effect of Climate Change on Water Temperature and Attainment of Water Temperature Criteria in the Yaquina Estuary, Oregon (USA). Estuarine, Coastal and Shelf Science 169:136-146, doi: 10.1016/j.ecss.2015.11.006.

  6. Conducting high-value secondary dataset analysis: an introductory guide and resources.

    PubMed

    Smith, Alexander K; Ayanian, John Z; Covinsky, Kenneth E; Landon, Bruce E; McCarthy, Ellen P; Wee, Christina C; Steinman, Michael A

    2011-08-01

    Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium (www.sgim.org/go/datasets). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.

  7. Generation of openEHR Test Datasets for Benchmarking.

    PubMed

    El Helou, Samar; Karvonen, Tuukka; Yamamoto, Goshiro; Kume, Naoto; Kobayashi, Shinji; Kondo, Eiji; Hiragi, Shusuke; Okamoto, Kazuya; Tamura, Hiroshi; Kuroda, Tomohiro

    2017-01-01

    openEHR is a widely used EHR specification. Given its technology-independent nature, different approaches for implementing openEHR data repositories exist. Public openEHR datasets are needed to conduct benchmark analyses over different implementations. To address their current unavailability, we propose a method for generating openEHR test datasets that can be publicly shared and used.

  8. Parallel task processing of very large datasets

    NASA Astrophysics Data System (ADS)

    Romig, Phillip Richardson, III

    This research concerns the use of distributed computer technologies for the analysis and management of very large datasets. Improvements in sensor technology, an emphasis on global change research, and greater access to data warehouses are all increasing the number of non-traditional users of remotely sensed data. We present a framework for distributed solutions to the challenges of datasets which exceed the online storage capacity of individual workstations. This framework, called parallel task processing (PTP), incorporates both the task- and data-level parallelism exemplified by many image processing operations. An implementation based on the principles of PTP, called Tricky, is also presented. Additionally, we describe the challenges and practical issues in modeling the performance of parallel task processing with large datasets. We present a mechanism for estimating the running time of each unit of work within a system and an algorithm that uses these estimates to simulate the execution environment and produce estimated runtimes. Finally, we describe and discuss experimental results which validate the design. Specifically, the system (a) is able to perform computation on datasets which exceed the capacity of any one disk, (b) provides reduction of overall computation time as a result of the task distribution even with the additional cost of data transfer and management, and (c) in the simulation mode accurately predicts the performance of the real execution environment.

  9. Wind and wave dataset for Matara, Sri Lanka

    NASA Astrophysics Data System (ADS)

    Luo, Yao; Wang, Dongxiao; Priyadarshana Gamage, Tilak; Zhou, Fenghua; Madusanka Widanage, Charith; Liu, Taiwei

    2018-01-01

    We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset as comprehensive and complete a description as possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).
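
    A minimal sketch of the kind of comparison used to evaluate the reanalysis against the buoy records, with synthetic significant-wave-height series standing in for the real data:

        import numpy as np

        obs = np.array([1.2, 1.5, 2.1, 1.8, 1.4])   # buoy Hs (m)
        rean = np.array([1.1, 1.6, 1.9, 1.9, 1.3])  # reanalysis Hs (m)

        bias = np.mean(rean - obs)
        rmse = np.sqrt(np.mean((rean - obs) ** 2))
        print(f"bias={bias:.2f} m, rmse={rmse:.2f} m")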

  10. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance.

    PubMed

    Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S

    2017-01-01

    As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens ( Listeria monocytogenes , Salmonella enterica , Escherichia coli , and Campylobacter jejuni ) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross

  11. Using Graph Indices for the Analysis and Comparison of Chemical Datasets.

    PubMed

    Fourches, Denis; Tropsha, Alexander

    2013-10-01

    In cheminformatics, compounds are represented as points in multidimensional space of chemical descriptors. When all pairs of points found within certain distance threshold in the original high dimensional chemistry space are connected by distance-labeled edges, the resulting data structure can be defined as Dataset Graph (DG). We show that, similarly to the conventional description of organic molecules, many graph indices can be computed for DGs as well. We demonstrate that chemical datasets can be effectively characterized and compared by computing simple graph indices such as the average vertex degree or Randic connectivity index. This approach is used to characterize and quantify the similarity between different datasets or subsets of the same dataset (e.g., training, test, and external validation sets used in QSAR modeling). The freely available ADDAGRA program has been implemented to build and visualize DGs. The approach proposed and discussed in this report could be further explored and utilized for different cheminformatics applications such as dataset diversification by acquiring external compounds, dataset processing prior to QSAR modeling, or (dis)similarity modeling of multiple datasets studied in chemical genomics applications. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
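
    A minimal sketch of a Dataset Graph and two of the indices mentioned above, with random descriptor vectors and an illustrative distance threshold:

        import numpy as np

        rng = np.random.default_rng(1)
        X = rng.random((20, 5))  # 20 compounds, 5 descriptors
        dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
        adj = (dist < 0.6) & ~np.eye(len(X), dtype=bool)

        degree = adj.sum(axis=1)
        edges = np.argwhere(np.triu(adj))
        # Randic connectivity index: sum over edges of 1/sqrt(deg(u)*deg(v))
        randic = sum(1 / np.sqrt(degree[u] * degree[v]) for u, v in edges)

        print("average vertex degree:", degree.mean())
        print("Randic index:", randic)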

  12. Analysis of Public Datasets for Wearable Fall Detection Systems.

    PubMed

    Casilari, Eduardo; Santoyo-Ramón, José-Antonio; Cano-García, José-Manuel

    2017-06-27

    Due to the boom of wireless handheld devices such as smartwatches and smartphones, wearable Fall Detection Systems (FDSs) have become a major focus of attention among the research community during the last years. The effectiveness of a wearable FDS must be contrasted against a wide variety of measurements obtained from inertial sensors during the occurrence of falls and Activities of Daily Living (ADLs). In this regard, the access to public databases constitutes the basis for an open and systematic assessment of fall detection techniques. This paper reviews and appraises twelve existing available data repositories containing measurements of ADLs and emulated falls envisaged for the evaluation of fall detection algorithms in wearable FDSs. The analysis of the found datasets is performed in a comprehensive way, taking into account the multiple factors involved in the definition of the testbeds deployed for the generation of the mobility samples. The study of the traces brings to light the lack of a common experimental benchmarking procedure and, consequently, the large heterogeneity of the datasets from a number of perspectives (length and number of samples, typology of the emulated falls and ADLs, characteristics of the test subjects, features and positions of the sensors, etc.). Concerning this, the statistical analysis of the samples reveals the impact of the sensor range on the reliability of the traces. In addition, the study evidences the importance of the selection of the ADLs and the need of categorizing the ADLs depending on the intensity of the movements in order to evaluate the capability of a certain detection algorithm to discriminate falls from ADLs.

  13. Interactive visualization and analysis of multimodal datasets for surgical applications.

    PubMed

    Kirmizibayrak, Can; Yim, Yeny; Wakid, Mike; Hahn, James

    2012-12-01

    Surgeons use information from multiple sources when making surgical decisions. These include volumetric datasets (such as CT, PET, MRI, and their variants), 2D datasets (such as endoscopic videos), and vector-valued datasets (such as computer simulations). Presenting all the information to the user in an effective manner is a challenging problem. In this paper, we present a visualization approach that displays the information from various sources in a single coherent view. The system allows the user to explore and manipulate volumetric datasets, display analysis of dataset values in local regions, combine 2D and 3D imaging modalities and display results of vector-based computer simulations. Several interaction methods are discussed: in addition to traditional interfaces including mouse and trackers, gesture-based natural interaction methods are shown to control these visualizations with real-time performance. An example of a medical application (medialization laryngoplasty) is presented to demonstrate how the combination of different modalities can be used in a surgical setting with our approach.

  14. Five year global dataset: NMC operational analyses (1978 to 1982)

    NASA Technical Reports Server (NTRS)

    Straus, David; Ardizzone, Joseph

    1987-01-01

    This document describes procedures used in assembling a five year dataset (1978 to 1982) using NMC Operational Analysis data. These procedures entailed replacing missing and unacceptable data in order to arrive at a complete dataset that is continuous in time. In addition, a subjective assessment of the integrity of all data (both preliminary and final) is presented. Documentation on the tapes comprising the Five Year Global Dataset is also included.

  15. Exploring patterns enriched in a dataset with contrastive principal component analysis.

    PubMed

    Abid, Abubakar; Zhang, Martin J; Bagaria, Vivek K; Zou, James

    2018-05-30

    Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.
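
    For concreteness, a minimal sketch of the core cPCA computation, for a single contrast strength alpha (the published method explores a range of alphas and ships its own implementation):

        # Contrastive PCA sketch: directions that maximize target variance
        # while penalizing background variance, via the eigenvectors of
        # the contrastive covariance Ct - alpha * Cb.
        import numpy as np

        def cpca(target: np.ndarray, background: np.ndarray,
                 alpha: float = 1.0, n_components: int = 2) -> np.ndarray:
            Ct = np.cov(target, rowvar=False)
            Cb = np.cov(background, rowvar=False)
            vals, vecs = np.linalg.eigh(Ct - alpha * Cb)
            # Keep the leading eigenvectors of the contrastive covariance.
            top = vecs[:, np.argsort(vals)[::-1][:n_components]]
            return (target - target.mean(0)) @ top   # projected target data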

  16. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.

    PubMed

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-07-02

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces the time and effort of experts and knowledge engineers in creating unified datasets by 94.1%.
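
    A toy pandas sketch of priority-based resolution of overlapping attributes in the spirit of GUDM (not the authors' tool); the key and column names are hypothetical, and sources listed first take priority:

        # Unify several datasets on a shared key; earlier sources win when
        # the same attribute is present in more than one source.
        import pandas as pd

        def unify(sources: list[pd.DataFrame], key: str = "patient_id") -> pd.DataFrame:
            unified = sources[0].set_index(key)
            for df in sources[1:]:
                # combine_first keeps existing (higher-priority) values and
                # fills gaps and new columns from lower-priority sources.
                unified = unified.combine_first(df.set_index(key))
            return unified.reset_index()

        clinical = pd.DataFrame({"patient_id": [1, 2], "glucose": [110, None]})
        sensor = pd.DataFrame({"patient_id": [1, 2], "glucose": [115, 98],
                               "steps": [4200, 8100]})
        print(unify([clinical, sensor]))   # clinical glucose kept where present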

  17. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare

    PubMed Central

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-01-01

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a “data modeler” tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces the time and effort of experts and knowledge engineers in creating unified datasets by 94.1%. PMID:26147731

  18. Geoseq: a tool for dissecting deep-sequencing datasets.

    PubMed

    Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi

    2010-10-12

    Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq http://geoseq.mssm.edu provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify their mature and star sequences, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
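
    A toy Python version of the tiling idea: slice a reference into fixed-length tiles and count how many reads contain each tile exactly. Geoseq itself uses suffix arrays for efficiency; the brute-force scan below is only to make the logic concrete.

        # Tile a reference sequence and measure per-tile coverage in a
        # read library by exact substring matching (illustrative only).
        from collections import Counter

        def tile_coverage(reference: str, reads: list[str], tile: int = 20):
            coverage = Counter()
            tiles = [reference[i:i + tile]
                     for i in range(len(reference) - tile + 1)]
            for read in reads:
                for t in tiles:
                    if t in read:
                        coverage[t] += 1
            # One (position, count) pair per tile along the reference.
            return [(i, coverage[t]) for i, t in enumerate(tiles)]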

  19. A Research Graph dataset for connecting research data repositories using RD-Switchboard.

    PubMed

    Aryani, Amir; Poblet, Marta; Unsworth, Kathryn; Wang, Jingbo; Evans, Ben; Devaraju, Anusuriya; Hausstein, Brigitte; Klas, Claus-Peter; Zapilko, Benjamin; Kaplun, Samuele

    2018-05-29

    This paper describes the open access graph dataset that shows the connections from Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim of discovering and connecting related research datasets based on publication co-authorship or jointly funded grants. The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation. Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.

  20. Process mining in oncology using the MIMIC-III dataset

    NASA Astrophysics Data System (ADS)

    Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen

    2018-03-01

    Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option, and this paper describes the potential to use MIMIC-III for process mining in oncology. MIMIC-III is a large open access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining, and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
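
    As a sketch of a typical first process-mining step on event tables of this kind, the following pandas snippet derives a directly-follows relation per admission; the table and column names are illustrative rather than the exact MIMIC-III schema.

        # Count directly-follows transitions (activity A -> activity B)
        # within each hospital admission of an event log.
        import pandas as pd

        def directly_follows(events: pd.DataFrame) -> pd.Series:
            events = events.sort_values(["hadm_id", "charttime"])
            # Pair each activity with the next one in the same admission.
            nxt = events.groupby("hadm_id")["activity"].shift(-1)
            pairs = pd.DataFrame({"from": events["activity"], "to": nxt}).dropna()
            return pairs.value_counts()

        log = pd.DataFrame({
            "hadm_id": [1, 1, 1, 2, 2],
            # MIMIC-style de-identified (date-shifted) timestamps.
            "charttime": pd.to_datetime(
                ["2130-01-01", "2130-01-02", "2130-01-05",
                 "2130-03-10", "2130-03-12"]),
            "activity": ["admission", "chemotherapy", "discharge",
                         "admission", "discharge"],
        })
        print(directly_follows(log))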

  1. Microarray Analysis Dataset

    EPA Pesticide Factsheets

    This file contains a link to the Gene Expression Omnibus and the GSE designations for the publicly available gene expression data used in the study and reflected in Figures 6 and 7 for the Das et al., 2016 paper. This dataset is associated with the following publication: Das, K., C. Wood, M. Lin, A.A. Starkov, C. Lau, K.B. Wallace, C. Corton, and B. Abbott. Perfluoroalkyl acids-induced liver steatosis: Effects on genes controlling lipid homeostasis. TOXICOLOGY. Elsevier Science Ltd, New York, NY, USA, 378: 32-52, (2017).

  2. Metabarcoding of marine nematodes - evaluation of reference datasets used in tree-based taxonomy assignment approach.

    PubMed

    Holovachov, Oleksandr

    2016-01-01

    Metabarcoding is becoming a common tool used to assess and compare the diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process, and several taxonomy assignment methods have been proposed to accomplish this task. This publication evaluates the quality of reference datasets, along with several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. It therefore requires a high-quality reference dataset. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. First, curated collections of genetic information do include erroneous sequences; these have a detrimental effect on the resolution of the cladograms used in the tree-based approach and must be identified and excluded from the reference dataset beforehand. Second, various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support; these combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing these preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach.
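
    A minimal Biopython sketch of the assignment rule described above, labelling an OTU only when it falls in a well-supported clade whose identified reference leaves carry a single taxon label; the leaf-naming convention, toy tree, and support threshold are assumptions.

        # Walk from an OTU leaf toward the root and return the taxon label
        # of the smallest well-supported clade with unanimous references.
        from io import StringIO
        from Bio import Phylo

        def assign_otu(tree, otu_name, references, min_support=90):
            leaf = next(t for t in tree.get_terminals() if t.name == otu_name)
            for clade in tree.get_path(leaf)[::-1]:   # leaf toward root
                if clade.confidence is None or clade.confidence < min_support:
                    continue
                labels = {references[t.name] for t in clade.get_terminals()
                          if t.name in references}
                if len(labels) == 1:
                    return labels.pop()               # supported, unanimous
            return "unassigned"

        tree = Phylo.read(StringIO("((OTU1,RefA)95,(RefB,RefC)99);"), "newick")
        print(assign_otu(tree, "OTU1",
                         {"RefA": "Genus_A", "RefB": "Genus_B", "RefC": "Genus_B"}))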

  3. A comparison of public datasets for acceleration-based fall detection.

    PubMed

    Igual, Raul; Medrano, Carlos; Plaza, Inmaculada

    2015-09-01

    Falls are one of the leading causes of mortality among the older population, and the rapid detection of a fall is a key factor in mitigating its main adverse health consequences. In this context, several authors have conducted studies on acceleration-based fall detection using external accelerometers or smartphones. The published detection rates are diverse, sometimes close to a perfect detector. This divergence may be explained by the difficulties in comparing different fall detection studies on a fair basis, since each study uses its own dataset obtained under different conditions. In this regard, several datasets have been made publicly available recently. This paper presents a comparison, to the best of our knowledge for the first time, of these public fall detection datasets in order to determine whether they have an influence on the declared performances. Using two different detection algorithms, the study shows that the performances of the fall detection techniques are affected, to a greater or lesser extent, by the specific datasets used to validate them. We have also found large differences in the generalization capability of a fall detector depending on the dataset used for training. In fact, the performance decreases dramatically when the algorithms are tested on a dataset different from the one used for training. Other characteristics of the datasets, such as the number of training samples, also have an influence on the performance, while the algorithms seem less sensitive to the sampling frequency or the acceleration range. Copyright © 2015 IPEM. Published by Elsevier Ltd. All rights reserved.
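
    The cross-dataset protocol the study highlights can be sketched in a few lines of scikit-learn: train a detector on one public dataset and score it on another, so any loss of generalization becomes visible. The single magnitude-statistic feature set here is a deliberate simplification, not the features used in the paper.

        # Train on dataset A, evaluate on dataset B (cross-dataset test).
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import accuracy_score

        def features(windows: np.ndarray) -> np.ndarray:
            # windows: (n_samples, n_timesteps, 3) tri-axial acceleration.
            mag = np.linalg.norm(windows, axis=2)   # acceleration magnitude
            return np.c_[mag.max(1), mag.min(1), mag.std(1)]

        def cross_dataset_score(Xa, ya, Xb, yb) -> float:
            clf = RandomForestClassifier(n_estimators=200, random_state=0)
            clf.fit(features(Xa), ya)               # fit on dataset A
            return accuracy_score(yb, clf.predict(features(Xb)))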

  4. Neurodata Without Borders: Creating a Common Data Format for Neurophysiology.

    PubMed

    Teeters, Jeffery L; Godfrey, Keith; Young, Rob; Dang, Chinh; Friedsam, Claudia; Wark, Barry; Asari, Hiroki; Peron, Simon; Li, Nuo; Peyrache, Adrien; Denisov, Gennady; Siegle, Joshua H; Olsen, Shawn R; Martin, Christopher; Chun, Miyoung; Tripathy, Shreejoy; Blanche, Timothy J; Harris, Kenneth; Buzsáki, György; Koch, Christof; Meister, Markus; Svoboda, Karel; Sommer, Friedrich T

    2015-11-18

    The Neurodata Without Borders (NWB) initiative promotes data standardization in neuroscience to increase research reproducibility and opportunities. In the first NWB pilot project, neurophysiologists and software developers produced a common data format for recordings and metadata of cellular electrophysiology and optical imaging experiments. The format specification, application programming interfaces, and sample datasets have been released. Copyright © 2015 Elsevier Inc. All rights reserved.

  5. SAR image classification based on CNN in real and simulation datasets

    NASA Astrophysics Data System (ADS)

    Peng, Lijiang; Liu, Ming; Liu, Xiaohua; Dong, Liquan; Hui, Mei; Zhao, Yuejin

    2018-04-01

    Convolutional neural networks (CNNs) have achieved great success in image classification tasks. Even in the field of synthetic aperture radar automatic target recognition (SAR-ATR), state-of-the-art results have been obtained by learning deep feature representations on the MSTAR benchmark. However, the raw MSTAR data have a shortcoming for training a SAR-ATR model: the high similarity in background among the SAR images of each class. This means that a CNN would learn the feature hierarchies of the backgrounds as well as of the targets. To assess the influence of the background, additional SAR image datasets were created containing simulated SAR images of 10 manufactured targets, such as tanks and fighter aircraft, with the backgrounds of the simulated images sampled from the original MSTAR data. The simulated datasets include one in which the backgrounds of each image class correspond to a single class of MSTAR target or clutter backgrounds, and one in which each image is given a random background drawn from all MSTAR targets or clutter. In addition, mixed datasets combining MSTAR and simulated images were created for use in the experiments. The CNN architecture proposed in this paper is trained on all the datasets mentioned above. The experimental results show that the architecture achieves high performance on all datasets even when the image backgrounds are miscellaneous, which indicates that it can learn a good representation of the targets despite drastic changes in background.
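
    A minimal Keras sketch of a CNN of the kind used for 10-class MSTAR-style SAR target classification; the layer sizes and input shape are illustrative assumptions, not the architecture from the paper.

        # Small CNN for 10-class SAR target classification (illustrative).
        from tensorflow import keras
        from tensorflow.keras import layers

        def build_cnn(input_shape=(128, 128, 1), n_classes=10) -> keras.Model:
            model = keras.Sequential([
                keras.Input(shape=input_shape),
                layers.Conv2D(16, 5, activation="relu"),
                layers.MaxPooling2D(2),
                layers.Conv2D(32, 5, activation="relu"),
                layers.MaxPooling2D(2),
                layers.Flatten(),
                layers.Dense(128, activation="relu"),
                layers.Dense(n_classes, activation="softmax"),
            ])
            model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
            return model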

  6. On sample size and different interpretations of snow stability datasets

    NASA Astrophysics Data System (ADS)

    Schirmer, M.; Mitterer, C.; Schweizer, J.

    2009-04-01

    Interpretations of snow stability variations need an assessment of the stability itself, independent of the scale investigated in the study. Studies on stability variations at a regional scale have often chosen stability tests such as the Rutschblock test or combinations of various tests in order to detect differences in aspect and elevation. The question arose of how capable such stability interpretations are in supporting conclusions. There are at least three possible error sources: (i) the variance of the stability test itself; (ii) the stability variance at the underlying slope scale; and (iii) the possibility that the stability interpretation is not directly related to the probability of skier triggering. Various stability interpretations have been proposed in the past that provide partly different results. We compared a subjective one based on expert knowledge with a more objective one based on a measure derived from comparing skier-triggered slopes vs. slopes that have been skied but not triggered. In this study, the uncertainties are discussed and their effects on regional scale stability variations are quantified in a pragmatic way. An existing dataset with very large sample sizes was revisited. This dataset contained the variance of stability at a regional scale for several situations. The stability in this dataset was determined using the subjective interpretation scheme based on expert knowledge. The question to be answered was how many measurements are needed to obtain similar results (mainly stability differences in aspect or elevation) as with the complete dataset. The optimal sample size was obtained in several ways: (i) assuming a nominal data scale, the sample size was determined for a given test, significance level and power, using the mean and standard deviation of the complete dataset; with this method it can also be determined whether the complete dataset itself has an appropriate sample size. (ii) Smaller subsets were created with similar
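
    Approach (i) above corresponds to a standard power analysis; a sketch using statsmodels, with the effect size estimated from the complete dataset's mean stability difference and standard deviation (the numbers shown are placeholders, not values from the study):

        # Sample size per group for a two-sample t-test at a given
        # significance level and power.
        from statsmodels.stats.power import TTestIndPower

        def required_n(mean_diff: float, sd: float,
                       alpha: float = 0.05, power: float = 0.8) -> float:
            effect_size = mean_diff / sd   # Cohen's d from the full dataset
            return TTestIndPower().solve_power(effect_size=effect_size,
                                               alpha=alpha, power=power)

        # e.g., a one-step stability difference with sd of 1.5 score units:
        print(required_n(mean_diff=1.0, sd=1.5))   # approx. n per group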

  7. Really big data: Processing and analysis of large datasets

    USDA-ARS?s Scientific Manuscript database

    Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidl...

  8. Domino: Extracting, Comparing, and Manipulating Subsets across Multiple Tabular Datasets

    PubMed Central

    Gratzl, Samuel; Gehlenborg, Nils; Lex, Alexander; Pfister, Hanspeter; Streit, Marc

    2016-01-01

    Answering questions about complex issues often requires analysts to take into account information contained in multiple interconnected datasets. A common strategy in analyzing and visualizing large and heterogeneous data is dividing it into meaningful subsets. Interesting subsets can then be selected and the associated data and the relationships between the subsets visualized. However, neither the extraction and manipulation nor the comparison of subsets is well supported by state-of-the-art techniques. In this paper we present Domino, a novel multiform visualization technique for effectively representing subsets and the relationships between them. By providing comprehensive tools to arrange, combine, and extract subsets, Domino allows users to create both common visualization techniques and advanced visualizations tailored to specific use cases. In addition to the novel technique, we present an implementation that enables analysts to manage the wide range of options that our approach offers. Innovative interactive features such as placeholders and live previews support rapid creation of complex analysis setups. We introduce the technique and the implementation using a simple example and demonstrate scalability and effectiveness in a use case from the field of cancer genomics. PMID:26356916

  9. A polymer dataset for accelerated property prediction and design.

    PubMed

    Huan, Tran Doan; Mannodi-Kanakkithodi, Arun; Kim, Chiho; Sharma, Vinit; Pilania, Ghanshyam; Ramprasad, Rampi

    2016-03-01

    Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a dataset of sufficiently large number of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating the materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high dielectric constant polymers, it is initially designed to include the optimized structures, atomization energies, band gaps, and dielectric constants. It will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.

  10. A polymer dataset for accelerated property prediction and design

    DOE PAGES

    Huan, Tran Doan; Mannodi-Kanakkithodi, Arun; Kim, Chiho; ...

    2016-03-01

    Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a dataset of sufficiently large number of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating the materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high dielectric constant polymers, it is initially designed to include the optimized structures, atomization energies, band gaps, and dielectric constants. As a result, it will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.

  11. A robust dataset-agnostic heart disease classifier from Phonocardiogram.

    PubMed

    Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M

    2017-07-01

    Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demography is a challenge. In this paper we have analysed a wide range of Phonocardiogram (PCG) features in the time and frequency domains, along with morphological and statistical features, to construct a robust and discriminative feature set for dataset-agnostic classification of normal and cardiac patients. The large and open access database made available in the PhysioNet 2016 challenge was used for feature selection, internal validation and creation of training models. A second dataset of 41 PCG segments, collected using our in-house smartphone-based digital stethoscope from an Indian hospital, was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75 respectively on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior art approaches when applied to the same dataset.

  12. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

    PubMed Central

    Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.

    2017-01-01

    Background As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. Methods We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Results Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. Discussion These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations.

  13. Determining Scale-dependent Patterns in Spatial and Temporal Datasets

    NASA Astrophysics Data System (ADS)

    Roy, A.; Perfect, E.; Mukerji, T.; Sylvester, L.

    2016-12-01

    Spatial and temporal datasets of interest to Earth scientists often contain plots of one variable against another, e.g., rainfall magnitude vs. time or fracture aperture vs. spacing. Such data, comprised of distributions of events along a transect/timeline along with their magnitudes, can display persistent or antipersistent trends, as well as random behavior, that may contain signatures of underlying physical processes. Lacunarity is a technique that was originally developed for multiscale analysis of data. In a recent study we showed that lacunarity can be used for revealing changes in scale-dependent patterns in fracture spacing data. Here we present a further improvement in our technique, with lacunarity applied to various non-binary datasets comprised of event spacings and magnitudes. We test our technique on a set of four synthetic datasets, three of which are based on an autoregressive model and have magnitudes at every point along the "timeline", thus representing antipersistent, persistent, and random trends. The fourth dataset is made up of five clusters of events, each containing a set of random magnitudes. The concept of the lacunarity ratio, LR, is introduced; this is the lacunarity of a given dataset normalized to the lacunarity of its random counterpart. It is demonstrated that LR can successfully delineate scale-dependent changes in terms of antipersistence and persistence in the synthetic datasets. This technique is then applied to three different types of data: a hundred-year rainfall record from Knoxville, TN, USA, a set of varved sediments from the Marca Shale, and a set of fracture aperture and spacing data from NE Mexico. While the rainfall data and varved sediments both appear to be persistent at small scales, at larger scales they both become random. On the other hand, the fracture data shows antipersistence at small scales (within clusters) and random behavior at large scales. Such differences in behavior with respect to scale-dependent changes in
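
    A minimal NumPy sketch of the lacunarity ratio for a 1-D magnitude series: gliding-box lacunarity normalized by the mean lacunarity of shuffled (random) counterparts. This is a generic reading of LR as defined above, not the authors' code.

        # Gliding-box lacunarity and the lacunarity ratio LR.
        import numpy as np

        def lacunarity(series: np.ndarray, box: int) -> float:
            # Box masses from a gliding window of length `box`;
            # lacunarity = <M^2>/<M>^2 = var/mean^2 + 1.
            masses = np.convolve(series, np.ones(box), mode="valid")
            return masses.var() / masses.mean() ** 2 + 1.0

        def lacunarity_ratio(series: np.ndarray, box: int,
                             n_shuffles: int = 100, seed: int = 0) -> float:
            rng = np.random.default_rng(seed)
            rand = np.mean([lacunarity(rng.permutation(series), box)
                            for _ in range(n_shuffles)])
            return lacunarity(series, box) / rand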

  14. An assessment of differences in gridded precipitation datasets in complex terrain

    NASA Astrophysics Data System (ADS)

    Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.

    2018-01-01

    Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare the long-term average spatial patterns, and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr-1 on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.

  15. Accuracy assessment of the U.S. Geological Survey National Elevation Dataset, and comparison with other large-area elevation datasets: SRTM and ASTER

    USGS Publications Warehouse

    Gesch, Dean B.; Oimoen, Michael J.; Evans, Gayla A.

    2014-01-01

    The National Elevation Dataset (NED) is the primary elevation data product produced and distributed by the U.S. Geological Survey. The NED provides seamless raster elevation data of the conterminous United States, Alaska, Hawaii, U.S. island territories, Mexico, and Canada. The NED is derived from diverse source datasets that are processed to a specification with consistent resolutions, coordinate system, elevation units, and horizontal and vertical datums. The NED serves as the elevation layer of The National Map, and it provides basic elevation information for earth science studies and mapping applications in the United States and most of North America. An important part of supporting scientific and operational use of the NED is provision of thorough dataset documentation including data quality and accuracy metrics. The focus of this report is on the vertical accuracy of the NED and on comparison of the NED with other similar large-area elevation datasets, namely data from the Shuttle Radar Topography Mission (SRTM) and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER).

  16. Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems

    NASA Astrophysics Data System (ADS)

    Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille

    2016-04-01

    Accurate analysis of meteorological and pyranometric data for long-term analysis is the basis of decision-making for banks and investors regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Year (TMY) datasets. The most widely used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978), considering a specific weighted combination of different meteorological variables, notably global, diffuse horizontal and direct normal irradiances, air temperature, wind speed and relative humidity. In 2012, a new approach was proposed in the framework of the European project FP7 ENDORSE. It introduced the concept of a "driver", defined by the user as an explicit function of the relevant pyranometric and meteorological variables to improve the representativeness of the TMY datasets with respect to the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time-series of high quality meteorological and pyranometric ground measurements, three types of TMY datasets were generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable, and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not
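
    At the core of the Sandia method is the Finkelstein-Schafer (FS) statistic, which compares a candidate month's empirical CDF with the long-term CDF of the same calendar month; a weighted sum of FS values over variables then ranks candidate months. A minimal sketch follows; the evaluation grid and the weighting scheme are simplifications.

        # Finkelstein-Schafer statistic and a weighted combination of
        # per-variable FS values for ranking candidate months.
        import numpy as np

        def fs_statistic(candidate: np.ndarray, longterm: np.ndarray) -> float:
            grid = np.sort(longterm)
            # Empirical CDFs of candidate and long-term data on one grid.
            cdf_c = np.searchsorted(np.sort(candidate), grid,
                                    side="right") / len(candidate)
            cdf_l = np.arange(1, len(grid) + 1) / len(grid)
            return float(np.mean(np.abs(cdf_c - cdf_l)))

        def weighted_fs(fs_values: dict[str, float],
                        weights: dict[str, float]) -> float:
            return sum(weights[v] * fs_values[v] for v in fs_values)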

  17. NCAR's Research Data Archive: OPeNDAP Access for Complex Datasets

    NASA Astrophysics Data System (ADS)

    Dattore, R.; Worley, S. J.

    2014-12-01

    Many datasets have complex structures including hundreds of parameters and numerous vertical levels, grid resolutions, and temporal products. Making these data accessible is a challenge for a data provider. OPeNDAP is a powerful protocol for delivering multi-file datasets in real time so that they can be ingested by many analysis and visualization tools, but for such datasets there are too many choices about how to aggregate. Simple aggregation schemes can fail to support, or at least make very challenging, many potential studies based on complex datasets. We address this issue by using a rich file content metadata collection to create a real-time customized OPeNDAP service that matches the full suite of access possibilities for complex datasets. The Climate Forecast System Reanalysis (CFSR) and its extension, the Climate Forecast System Version 2 (CFSv2), datasets produced by the National Centers for Environmental Prediction (NCEP) and hosted by the Research Data Archive (RDA) at the Computational and Information Systems Laboratory (CISL) at NCAR are examples of complex datasets that are difficult to aggregate with existing data server software. CFSR and CFSv2 contain 141 distinct parameters on 152 vertical levels, six grid resolutions and 36 products (analyses, n-hour forecasts, multi-hour averages, etc.), where not all parameter/level combinations are available at all grid resolution/product combinations. These data are archived in the RDA with the data structure provided by the producer; no additional re-organization or aggregation has been applied. Since 2011, users have been able to request customized subsets (e.g., temporal, parameter, spatial) from the CFSR/CFSv2, which are processed in delayed-mode and then downloaded to a user's system. Until now, the complexity has made it difficult to provide real-time OPeNDAP access to the data. We have developed a service that leverages the already-existing subsetting interface and allows users to create a virtual dataset
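
    From the client side, such a virtual dataset behaves like any OPeNDAP endpoint: with xarray, opening the URL reads only metadata, and a subset request transfers only the selected slab. The URL and variable name below are placeholders, not the actual RDA endpoint.

        # Lazy OPeNDAP access and server-side subsetting with xarray.
        import xarray as xr

        URL = "https://example.org/opendap/cfsr_aggregation"   # placeholder
        ds = xr.open_dataset(URL)                  # reads metadata only
        subset = ds["t2m"].sel(time="2010-01",     # hypothetical variable
                               lat=slice(60, 20), lon=slice(230, 300))
        subset.to_netcdf("cfsr_jan2010_na.nc")     # pulls just this slab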

  18. Toxics Release Inventory Chemical Hazard Information Profiles (TRI-CHIP) Dataset

    EPA Pesticide Factsheets

    The Toxics Release Inventory (TRI) Chemical Hazard Information Profiles (TRI-CHIP) dataset contains hazard information about the chemicals reported in TRI. Users can use this XML-format dataset to create their own databases and hazard analyses of TRI chemicals. The hazard information is compiled from a series of authoritative sources including the Integrated Risk Information System (IRIS). The dataset is provided as a downloadable .zip file that when extracted provides XML files and schemas for the hazard information tables.

  19. Progress report on new antiepileptic drugs: A summary of the Thirteenth Eilat Conference on New Antiepileptic Drugs and Devices (EILAT XIII).

    PubMed

    Bialer, Meir; Johannessen, Svein I; Levy, René H; Perucca, Emilio; Tomson, Torbjörn; White, H Steve

    2017-02-01

    The Thirteenth Eilat Conference on New Antiepileptic Drugs and Devices (EILAT XIII) took place in Madrid, Spain, on June 26-29, 2016, and was attended by >200 delegates from 31 countries. The present Progress Report provides an update on experimental and clinical results for drugs presented at the Conference. Compounds for which summary data are presented include an AED approved in 2016 (brivaracetam), 12 drugs in phase I-III clinical development (adenosine, allopregnanolone, bumetanide, cannabidiol, cannabidivarin, 2-deoxy-d-glucose, everolimus, fenfluramine, huperzine A, minocycline, SAGE-217, and valnoctamide) and 6 compounds or classes of compounds for which only preclinical data are available (bumetanide derivatives, sec-butylpropylacetamide, FV-082, 1OP-2198, NAX 810-2, and SAGE-689). Overall, the results presented at the Conference show that considerable efforts are ongoing into discovery and development of AEDs with potentially improved therapeutic profiles compared with existing agents. Many of the drugs discussed in this report show innovative mechanisms of action and many have shown promising results in patients with pharmacoresistant epilepsies, including previously neglected rare and severe epilepsy syndromes. Wiley Periodicals, Inc. © 2017 International League Against Epilepsy.

  20. Atlas-Guided Cluster Analysis of Large Tractography Datasets

    PubMed Central

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment. PMID:24386292

  1. Atlas-guided cluster analysis of large tractography datasets.

    PubMed

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.

  2. Management and assimilation of diverse, distributed watershed datasets

    NASA Astrophysics Data System (ADS)

    Varadharajan, C.; Faybishenko, B.; Versteeg, R.; Agarwal, D.; Hubbard, S. S.; Hendrix, V.

    2016-12-01

    The U.S. Department of Energy's (DOE) Watershed Function Scientific Focus Area (SFA) seeks to determine how perturbations to mountainous watersheds (e.g., floods, drought, early snowmelt) impact the downstream delivery of water, nutrients, carbon, and metals over seasonal to decadal timescales. We are building a software platform that enables integration of diverse and disparate field, laboratory, and simulation datasets, of various types including hydrological, geological, meteorological, geophysical, geochemical, ecological and genomic datasets across a range of spatial and temporal scales within the Rifle floodplain and the East River watershed, Colorado. We are using agile data management and assimilation approaches to enable web-based integration of heterogeneous, multi-scale data. Sensor-based observations of water-level, vadose zone and groundwater temperature, water quality, meteorology as well as biogeochemical analyses of soil and groundwater samples have been curated and archived in federated databases. Quality Assurance and Quality Control (QA/QC) are performed on priority datasets needed for on-going scientific analyses, and hydrological and geochemical modeling. Automated QA/QC methods are used to identify and flag issues in the datasets. Data integration is achieved via a brokering service that dynamically integrates data from distributed databases via web services, based on user queries. The integrated results are presented to users in a portal that enables intuitive search, interactive visualization and download of integrated datasets. The concepts, approaches and codes being used are shared across various data science components of various large DOE-funded projects such as the Watershed Function SFA, Next Generation Ecosystem Experiment (NGEE) Tropics, Ameriflux/FLUXNET, and Advanced Simulation Capability for Environmental Management (ASCEM), and together contribute towards DOE's cyberinfrastructure for data management and model-data integration.

  3. Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification.

    PubMed

    Li, Jinyan; Fong, Simon; Sung, Yunsick; Cho, Kyungeun; Wong, Raymond; Wong, Kelvin K L

    2016-01-01

    An imbalanced dataset is defined as a training dataset that has imbalanced proportions of data in both interesting and uninteresting classes. Often in biomedical applications, samples from the stimulating class are rare in a population, such as medical anomalies, positive clinical tests, and particular diseases. Although the target samples in the primitive dataset are small in number, the induction of a classification model over such training data leads to poor prediction performance due to insufficient training from the minority class. In this paper, we use a novel class-balancing method named adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique (ASCB_DmSMOTE) to solve this imbalanced dataset problem, which is common in biomedical applications. The proposed method combines under-sampling and over-sampling into a swarm optimisation algorithm. It adaptively selects suitable parameters for the rebalancing algorithm to find the best solution. Compared with the other versions of the SMOTE algorithm, significant improvements, which include higher accuracy and credibility, are observed with ASCB_DmSMOTE. Our proposed method tactfully combines two rebalancing techniques. It re-allocates the majority class in detail and dynamically optimises the two parameters of SMOTE to synthesise a reasonable number of minority-class samples for each clustered sub-imbalanced dataset. The proposed method ultimately outperforms other conventional methods, attaining higher credibility and even greater accuracy of the classification model.
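
    For reference, the core SMOTE step that ASCB_DmSMOTE tunes and builds on can be sketched as interpolation between a minority sample and one of its k nearest minority neighbours; the swarm-based parameter optimisation of the paper is not reproduced here.

        # Core SMOTE step: synthesize minority samples by interpolating
        # between a minority point and a random one of its k neighbours.
        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def smote(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0):
            rng = np.random.default_rng(seed)
            nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
            _, idx = nn.kneighbors(minority)   # idx[:, 0] is the point itself
            synth = []
            for _ in range(n_new):
                i = rng.integers(len(minority))
                j = idx[i, rng.integers(1, k + 1)]   # a random true neighbour
                gap = rng.random()
                synth.append(minority[i] + gap * (minority[j] - minority[i]))
            return np.array(synth)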

  4. NP_PAH_interaction dataset

    EPA Pesticide Factsheets

    Concentrations of different polyaromatic hydrocarbons in water before and after interaction with nanomaterials. The results show the capacity of engineered nanomaterials for adsorbing different organic pollutants. This dataset is associated with the following publication: Sahle-Demessie, E., A. Zhao, C. Han, B. Hann, and H. Grecsek. Interaction of engineered nanomaterials with hydrophobic organic pollutants. Journal of Nanotechnology. Hindawi Publishing Corporation, New York, NY, USA, 27(28): 284003, (2016).

  5. BanglaLekha-Isolated: A multi-purpose comprehensive dataset of Handwritten Bangla Isolated characters.

    PubMed

    Biswas, Mithun; Islam, Rafiqul; Shom, Gautam Kumar; Shopon, Md; Mohammed, Nabeel; Momen, Sifat; Abedin, Anowarul

    2017-06-01

    BanglaLekha-Isolated, a Bangla handwritten isolated character dataset, is presented in this article. This dataset contains 84 different characters comprising 50 Bangla basic characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting samples for each of the 84 characters were collected, digitized and pre-processed. After discarding mistakes and scribbles, 166,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and the gender of the subjects from whom the samples were collected. This dataset could be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.

  6. Scalable persistent identifier systems for dynamic datasets

    NASA Astrophysics Data System (ADS)

    Golodoniuc, P.; Cox, S. J. D.; Klump, J. F.

    2016-12-01

    Reliable and persistent identification of objects, whether tangible or not, is essential in information management. Many Internet-based systems have been developed to identify digital data objects, e.g., PURL, LSID, Handle, ARK. These were largely designed for the identification of static digital objects. The amount of data made available online has grown exponentially over the last two decades, and fine-grained identification of dynamically generated data objects within large datasets using conventional systems (e.g., PURL) has become impractical. We have compared the capabilities of various technological solutions to enable resolvability of data objects in dynamic datasets, and developed a dataset-centric approach to resolution of identifiers. This is particularly important in Semantic Linked Data environments where dynamic, frequently changing data is delivered live via web services, so registration of individual data objects to obtain identifiers is impractical. We use identifier patterns and pattern hierarchies for identification of data objects, which allows relationships between identifiers to be expressed, and also provides means for resolving a single identifier into multiple forms (i.e. views or representations of an object). The latter can be implemented through (a) HTTP content negotiation, or (b) use of URI querystring parameters. The pattern and hierarchy approach has been implemented in the Linked Data API supporting the United Nations Spatial Data Infrastructure (UNSDI) initiative and later in the implementation of geoscientific data delivery for the Capricorn Distal Footprints project using International Geo Sample Numbers (IGSN). This enables flexible resolution of multi-view persistent identifiers and provides a scalable solution for large heterogeneous datasets.
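
    A minimal sketch of pattern-based resolution with content negotiation: a regular-expression pattern stands for a whole family of identifiers, and the requested media type selects the representation. All patterns and endpoints below are illustrative, not the UNSDI or IGSN services.

        # Resolve an identifier family via a pattern, choosing the view
        # (representation) from an HTTP-style Accept media type.
        import re

        PATTERNS = [
            (re.compile(r"^IGSN:(?P<sample>[A-Z0-9.]+)$"),
             {"text/html": "https://example.org/sample/{sample}",
              "application/json": "https://example.org/api/sample/{sample}"}),
        ]

        def resolve(identifier: str, accept: str = "text/html") -> str:
            for pattern, views in PATTERNS:
                m = pattern.match(identifier)
                if m:
                    return views[accept].format(**m.groupdict())
            raise KeyError(identifier)

        print(resolve("IGSN:CSRWA0001", accept="application/json"))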

  7. An Intercomparison of Large-Extent Tree Canopy Cover Geospatial Datasets

    NASA Astrophysics Data System (ADS)

    Bender, S.; Liknes, G.; Ruefenacht, B.; Reynolds, J.; Miller, W. P.

    2017-12-01

    As a member of the Multi-Resolution Land Characteristics Consortium (MRLC), the U.S. Forest Service (USFS) is responsible for producing and maintaining the tree canopy cover (TCC) component of the National Land Cover Database (NLCD). The NLCD-TCC data are available for the conterminous United States (CONUS), coastal Alaska, Hawai'i, Puerto Rico, and the U.S. Virgin Islands. The most recent official version of the NLCD-TCC data is based primarily on reference data from 2010-2011 and is part of the multi-component 2011 version of the NLCD. NLCD data are updated on a five-year cycle. The USFS is currently producing the next official version (2016) of the NLCD-TCC data for the United States, and it will be made publicly-available in early 2018. In this presentation, we describe the model inputs, modeling methods, and tools used to produce the 30-m NLCD-TCC data. Several tree cover datasets at 30-m, as well as datasets at finer resolution, have become available in recent years due to advancements in earth observation data and their availability, computing, and sensors. We compare multiple tree cover datasets that have similar resolution to the NLCD-TCC data. We also aggregate the tree class from fine-resolution land cover datasets to a percent canopy value on a 30-m pixel, in order to compare the fine-resolution datasets to the datasets created directly from 30-m Landsat data. The extent of the tree canopy cover datasets included in the study ranges from global and national to the state level. Preliminary investigation of multiple tree cover datasets over the CONUS indicates a high amount of spatial variability. For example, in a comparison of the NLCD-TCC and the Global Land Cover Facility's Landsat Tree Cover Continuous Fields (2010) data by MRLC mapping zones, the zone-level root mean-square deviation ranges from 2% to 39% (mean=17%, median=15%). The analysis outcomes are expected to inform USFS decisions with regard to the next cycle (2021) of NLCD-TCC production.

  8. Scalable Visual Analytics of Massive Textual Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Krishnan, Manoj Kumar; Bohn, Shawn J.; Cowley, Wendy E.

    2007-04-01

    This paper describes the first scalable implementation of a text processing engine used in Visual Analytics tools. These tools aid information analysts in interacting with and understanding large textual information content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte data sets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.

  9. Harvard Aging Brain Study: Dataset and accessibility.

    PubMed

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G; Chatwal, Jasmeer P; Papp, Kathryn V; Amariglio, Rebecca E; Blacker, Deborah; Rentz, Dorene M; Johnson, Keith A; Sperling, Reisa A; Schultz, Aaron P

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging. To promote more extensive analyses, imaging data was designed to be compatible with other publicly available datasets. A cloud-based system enables access to interested researchers with blinded data available contingent upon completion of a data usage agreement and administrative approval. Data collection is ongoing and currently in its fifth year. Copyright © 2015 Elsevier Inc. All rights reserved.

  10. Sensitivity of a numerical wave model on wind re-analysis datasets

    NASA Astrophysics Data System (ADS)

    Lavidas, George; Venugopal, Vengatesan; Friedrich, Daniel

    2017-03-01

    Wind is the dominant process for wave generation. Detailed evaluation of metocean conditions strengthens our understanding of issues concerning potential offshore applications. However, the scarcity of buoys and the high cost of monitoring systems pose a barrier to properly defining offshore conditions. Through use of numerical wave models, metocean conditions can be hindcasted and forecasted, providing reliable characterisations. This study reports the sensitivity of a numerical wave model to wind inputs for the Scottish region. Two re-analysis wind datasets with different spatio-temporal characteristics are used: the ERA-Interim Re-Analysis and the CFSR-NCEP Re-Analysis dataset. Different wind products alter results, affecting the accuracy obtained. The scope of this study is to assess the different available wind databases and provide information concerning the most appropriate wind dataset for the specific region, based on temporal, spatial and geographic terms, for wave modelling and offshore applications. Both wind input datasets delivered results from the numerical wave model with good correlation. Wave results from the 1-h dataset have higher peaks and lower biases, at the expense of a higher scatter index. On the other hand, the 6-h dataset has lower scatter but higher biases. The study shows how the wind dataset affects numerical wave modelling performance and that, depending on location and study needs, different wind inputs should be considered.
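
    The comparison metrics mentioned above (bias and scatter index, alongside RMSE) are straightforward to compute; a small NumPy sketch for modelled versus buoy-observed significant wave heights:

        # Bias, RMSE and scatter index (SI) for model-observation pairs.
        import numpy as np

        def validation_metrics(model: np.ndarray, obs: np.ndarray) -> dict:
            bias = float(np.mean(model - obs))
            rmse = float(np.sqrt(np.mean((model - obs) ** 2)))
            scatter_index = rmse / float(np.mean(obs))   # dimensionless
            return {"bias": bias, "rmse": rmse, "si": scatter_index}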

  11. Querying Large Biological Network Datasets

    ERIC Educational Resources Information Center

    Gulsoy, Gunhan

    2013-01-01

    New experimental methods have resulted in increasing amounts of genetic interaction data being generated every day. Biological networks are used to store the genetic interaction data gathered. The increasing amount of available data requires fast, large-scale analysis methods. Therefore, we address the problem of querying large biological network datasets…

  12. Dataset used to improve liquid water absorption models in the microwave

    DOE Data Explorer

    Turner, David

    2015-12-14

    Two datasets, one a compilation of laboratory data and one a compilation from three field sites, are provided here. These datasets provide measurements of the real and imaginary refractive indices and absorption as a function of cloud temperature. These datasets were used in the development of the new liquid water absorption model that was published in Turner et al. 2015.

  13. Primary Datasets for Case Studies of River-Water Quality

    ERIC Educational Resources Information Center

    Goulder, Raymond

    2008-01-01

    Level 6 (final-year BSc) students undertook case studies on between-site and temporal variation in river-water quality. They used professionally-collected datasets supplied by the Environment Agency. The exercise gave students the experience of working with large, real-world datasets and led to their understanding of how the quality of river water is…

  14. A dataset of human decision-making in teamwork management.

    PubMed

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-17

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  15. A dataset of human decision-making in teamwork management

    PubMed Central

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members’ capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches. PMID:28094787

  16. A global experimental dataset for assessing grain legume production

    PubMed Central

    Cernay, Charles; Pelzer, Elise; Makowski, David

    2016-01-01

    Grain legume crops are a significant component of the human diet and animal feed and have an important role in the environment, but the global diversity of agricultural legume species is currently underexploited. Experimental assessments of grain legume performances are required, to identify potential species with high yields. Here, we introduce a dataset including results of field experiments published in 173 articles. The selected experiments were carried out over five continents on 39 grain legume species. The dataset includes measurements of grain yield, aerial biomass, crop nitrogen content, residual soil nitrogen content and water use. When available, yields for cereals and oilseeds grown after grain legumes in the crop sequence are also included. The dataset is arranged into a relational database with nine structured tables and 198 standardized attributes. Tillage, fertilization, pest and irrigation management are systematically recorded for each of the 8,581 crop*field site*growing season*treatment combinations. The dataset is freely reusable and easy to update. We anticipate that it will provide valuable information for assessing grain legume production worldwide. PMID:27676125

  17. A dataset of human decision-making in teamwork management

    NASA Astrophysics Data System (ADS)

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  18. Discovering Cortical Folding Patterns in Neonatal Cortical Surfaces Using Large-Scale Dataset

    PubMed Central

    Meng, Yu; Li, Gang; Wang, Li; Lin, Weili; Gilmore, John H.

    2017-01-01

    The cortical folding of the human brain is highly complex and variable across individuals. Mining the major patterns of cortical folding from modern large-scale neuroimaging datasets is of great importance in advancing techniques for neuroimaging analysis and understanding the inter-individual variations of cortical folding and its relationship with cognitive function and disorders. As the primary cortical folding is genetically influenced and has been established at term birth, neonates with minimal exposure to the complicated postnatal environmental influence are the ideal candidates for understanding the major patterns of cortical folding. In this paper, for the first time, we propose a novel method for discovering the major patterns of cortical folding in a large-scale dataset of neonatal brain MR images (N = 677). In our method, first, cortical folding is characterized by the distribution of sulcal pits, which are the locally deepest points in cortical sulci. Because deep sulcal pits are genetically related, relatively consistent across individuals, and also stable during brain development, they are well suited to representing and characterizing cortical folding. Then, the similarities between the sulcal pit distributions of any two subjects are measured from spatial, geometrical, and topological points of view. Next, these different measurements are adaptively fused together using a similarity network fusion technique, to preserve their common information and also capture their complementary information. Finally, leveraging the fused similarity measurements, a hierarchical affinity propagation algorithm is used to group similar sulcal folding patterns together. The proposed method has been applied to 677 neonatal brains (the largest neonatal dataset to our knowledge) in the central sulcus, superior temporal sulcus, and cingulate sulcus, and revealed multiple distinct and meaningful folding patterns in each region. PMID:28229131
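
    As a rough illustration of the clustering machinery named in this abstract, the Python sketch below fuses three per-view similarity matrices and clusters subjects with affinity propagation. A plain average stands in for the paper's adaptive similarity network fusion step, and all sizes and names are illustrative assumptions, not the study's actual code.

        # Sketch: cluster subjects by a fused similarity of sulcal pit distributions.
        # Plain averaging stands in for similarity network fusion (SNF).
        import numpy as np
        from sklearn.cluster import AffinityPropagation

        rng = np.random.default_rng(0)
        n = 40                              # subjects (the study used 677)
        views = []                          # spatial, geometrical, topological views
        for _ in range(3):
            a = rng.random((n, n))
            s = (a + a.T) / 2               # random symmetric stand-in similarity
            np.fill_diagonal(s, 1.0)
            views.append(s)

        fused = np.mean(views, axis=0)      # crude stand-in for the SNF step

        ap = AffinityPropagation(affinity="precomputed", random_state=0)
        labels = ap.fit_predict(fused)      # one folding-pattern cluster per subject
        print("discovered patterns:", len(set(labels)))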

  19. Analysis of Public Datasets for Wearable Fall Detection Systems

    PubMed Central

    Santoyo-Ramón, José-Antonio; Cano-García, José-Manuel

    2017-01-01

    Due to the boom of wireless handheld devices such as smartwatches and smartphones, wearable Fall Detection Systems (FDSs) have become a major focus of attention among the research community in recent years. The effectiveness of a wearable FDS must be contrasted against a wide variety of measurements obtained from inertial sensors during the occurrence of falls and Activities of Daily Living (ADLs). In this regard, access to public databases constitutes the basis for an open and systematic assessment of fall detection techniques. This paper reviews and appraises twelve existing publicly available data repositories containing measurements of ADLs and emulated falls envisaged for the evaluation of fall detection algorithms in wearable FDSs. The datasets found are analysed comprehensively, taking into account the multiple factors involved in the definition of the testbeds deployed for the generation of the mobility samples. The study of the traces brings to light the lack of a common experimental benchmarking procedure and, consequently, the large heterogeneity of the datasets from a number of perspectives (length and number of samples, typology of the emulated falls and ADLs, characteristics of the test subjects, features and positions of the sensors, etc.). Concerning this, the statistical analysis of the samples reveals the impact of the sensor range on the reliability of the traces. In addition, the study demonstrates the importance of the selection of the ADLs and the need to categorize the ADLs depending on the intensity of the movements in order to evaluate the capability of a certain detection algorithm to discriminate falls from ADLs. PMID:28653991
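
    The detection task these repositories support can be made concrete with a minimal sketch: a threshold on the accelerometer signal magnitude vector (SMV), the simplest family of wearable fall detectors that such datasets are used to benchmark. The 2.5 g threshold and 50 Hz rate below are assumptions for illustration, not values taken from the surveyed repositories.

        # Sketch: a baseline threshold fall detector over an inertial trace.
        import numpy as np

        def smv(acc):
            """Signal magnitude vector of an (N, 3) accelerometer trace in g."""
            return np.linalg.norm(acc, axis=1)

        def detect_fall(acc, threshold_g=2.5):
            """Flag a fall candidate if any sample exceeds the impact threshold."""
            return bool(np.any(smv(acc) > threshold_g))

        fs = 50                                    # sampling rate (Hz), assumed
        trace = np.tile([0.0, 0.0, 1.0], (fs, 1))  # ~1 g gravity at rest
        trace[30] = [1.8, 1.2, 2.4]                # brief high-magnitude impact
        print(detect_fall(trace))                  # True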

  20. Study of the Integration of LIDAR and Photogrammetric Datasets by in Situ Camera Calibration and Integrated Sensor Orientation

    NASA Astrophysics Data System (ADS)

    Mitishita, E.; Costa, F.; Martins, M.

    2017-05-01

    Photogrammetric and Lidar datasets should be in the same mapping or geodetic frame to be used simultaneously in an engineering project. Nowadays direct sensor orientation is a common procedure used in simultaneous photogrammetric and Lidar surveys. Although direct sensor orientation technologies provide a high degree of automation due to GNSS/INS technologies, the accuracies of the results obtained from the photogrammetric and Lidar surveys depend on the quality of a group of parameters that accurately models the conditions of the system at the moment the job is performed. This paper presents a study performed to verify the importance of in situ camera calibration and Integrated Sensor Orientation without control points for increasing the accuracy of the integration of photogrammetric and Lidar datasets. The horizontal and vertical accuracies of the integration of photogrammetric and Lidar datasets by photogrammetric procedure improved significantly when the Integrated Sensor Orientation (ISO) approach was performed using Interior Orientation Parameter (IOP) values estimated from the in situ camera calibration. The horizontal and vertical accuracies, estimated by the Root Mean Square Error (RMSE) of the 3D discrepancies from the Lidar check points, improved by around 37% and 198%, respectively.

  1. A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie.

    PubMed

    Hanke, Michael; Baumgartner, Florian J; Ibe, Pierre; Kaule, Falko R; Pollmann, Stefan; Speck, Oliver; Zinke, Wolf; Stadler, Jörg

    2014-01-01

    Here we present a high-resolution functional magnetic resonance imaging (fMRI) dataset - 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film ("Forrest Gump"). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, and social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures - from stimulus creation to data analysis. In order to facilitate replicative and derived works, only free and open-source software was utilized.

  2. Reference datasets for bioequivalence trials in a two-group parallel design.

    PubMed

    Fuglsang, Anders; Schütz, Helmut; Labes, Detlew

    2015-03-01

    In order to help companies qualify and validate the software used to evaluate bioequivalence trials with two parallel treatment groups, this work aims to define datasets with known results. This paper puts a total of 11 datasets into the public domain, along with a proposed consensus obtained via evaluations from six different software packages (R, SAS, WinNonlin, OpenOffice Calc, Kinetica, EquivTest). Insofar as possible, datasets were evaluated with and without the assumption of equal variances for the construction of a 90% confidence interval. Not all software packages provide functionality for the assumption of unequal variances (EquivTest, Kinetica), and not all packages can handle datasets with more than 1000 subjects per group (WinNonlin). Where results could be obtained across all packages, one showed questionable results when datasets contained unequal group sizes (Kinetica). A proposal is made for the results that should be used as validation targets.
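
    For readers validating their own software against such reference datasets, the following Python sketch computes the usual 90% confidence interval for the ratio of geometric means in a two-group parallel design without assuming equal variances (Welch-Satterthwaite). The input numbers are invented for illustration and are not one of the paper's 11 reference datasets.

        # Sketch: 90% CI for the geometric mean ratio, unequal variances.
        import numpy as np
        from scipy import stats

        def be_ci_parallel(test, ref, alpha=0.10):
            lt, lr = np.log(test), np.log(ref)
            n1, n2 = len(lt), len(lr)
            v1, v2 = lt.var(ddof=1) / n1, lr.var(ddof=1) / n2
            se = np.sqrt(v1 + v2)
            # Welch-Satterthwaite degrees of freedom
            df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
            diff = lt.mean() - lr.mean()
            t = stats.t.ppf(1 - alpha / 2, df)
            return np.exp(diff - t * se), np.exp(diff + t * se)

        rng = np.random.default_rng(1)
        test = rng.lognormal(mean=4.0, sigma=0.3, size=24)   # made-up arm
        ref = rng.lognormal(mean=4.05, sigma=0.35, size=24)  # made-up arm
        lo, hi = be_ci_parallel(test, ref)
        print(f"90% CI for GMR: {lo:.3f} - {hi:.3f}")  # compare to 0.80-1.25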

  3. Development of a SPARK Training Dataset

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sayre, Amanda M.; Olson, Jarrod R.

    2015-03-01

    In its first five years, the National Nuclear Security Administration's (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). In the past seven years, the NGSI program has produced, and continues to produce, a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. Not only does the NGSI program carry out activities across multiple disciplines, but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no incorporated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed as a knowledge storage, retrieval, and analysis capability that captures safeguards knowledge so that it exists beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications, and evaluated the science-policy interface at PNNL as a practical demonstration of SPARK's intended analysis capability. The analysis demonstration sought to answer…

  4. Validation of the Hospital Episode Statistics Outpatient Dataset in England.

    PubMed

    Thorn, Joanna C; Turner, Emma; Hounsome, Luke; Walsh, Eleanor; Donovan, Jenny L; Verne, Julia; Neal, David E; Hamdy, Freddie C; Martin, Richard M; Noble, Sian M

    2016-02-01

    The Hospital Episode Statistics (HES) dataset is a source of administrative 'big data' with potential for costing purposes in economic evaluations alongside clinical trials. This study assesses the validity of coverage in the HES outpatient dataset. Men who died of, or with, prostate cancer were selected from a prostate-cancer screening trial (CAP, Cluster randomised triAl of PSA testing for Prostate cancer). Details of visits that took place after 1/4/2003 to hospital outpatient departments for conditions related to prostate cancer were extracted from medical records (MR); these appointments were sought in the HES outpatient dataset based on date. The matching procedure was repeated for periods before and after 1/4/2008, when the HES outpatient dataset was accredited as a national statistic. 4922 outpatient appointments were extracted from MR for 370 men. 4088 appointments recorded in MR were identified in the HES outpatient dataset (83.1%; 95% confidence interval [CI] 82.0-84.1). For appointments occurring prior to 1/4/2008, 2195/2755 (79.7%; 95% CI 78.2-81.2) matches were observed, while 1893/2167 (87.4%; 95% CI 86.0-88.9) appointments occurring after 1/4/2008 were identified (p for difference <0.001). 215/370 men (58.1%) had at least one appointment in the MR review that was unmatched in HES, 155 men (41.9%) had all their appointments identified, and 20 men (5.4%) had no appointments identified in HES. The HES outpatient dataset appears reasonably valid for research, particularly following accreditation. The dataset may be a suitable alternative to collecting MR data from hospital notes within a trial, although caution should be exercised with data collected prior to accreditation.

  5. ClimateNet: A Machine Learning dataset for Climate Science Research

    NASA Astrophysics Data System (ADS)

    Prabhat, M.; Biard, J.; Ganguly, S.; Ames, S.; Kashinath, K.; Kim, S. K.; Kahou, S.; Maharaj, T.; Beckham, C.; O'Brien, T. A.; Wehner, M. F.; Williams, D. N.; Kunkel, K.; Collins, W. D.

    2017-12-01

    Deep Learning techniques have revolutionized commercial applications in computer vision, speech recognition and control systems. The key to all of these developments was the creation of ImageNet, a curated, labeled dataset that enabled multiple research groups around the world to develop methods, benchmark performance and compete with each other. The success of Deep Learning can be largely attributed to the broad availability of this dataset. Our empirical investigations have revealed that Deep Learning is similarly poised to benefit the task of pattern detection in climate science. Unfortunately, labeled datasets, a key prerequisite for training, are hard to find. Individual research groups are typically interested in specialized weather patterns, making it hard to unify and share datasets across groups and institutions. In this work, we propose ClimateNet: a labeled dataset that provides labeled instances of extreme weather patterns, as well as associated raw fields in model and observational output. We develop a schema in NetCDF to enumerate weather pattern classes/types, store bounding boxes, and pixel-masks. We are also working on a TensorFlow implementation to natively import such NetCDF datasets, and are providing a reference convolutional architecture for binary classification tasks. Our hope is that researchers in Climate Science, as well as ML/DL, will be able to use (and extend) ClimateNet to make rapid progress in the application of Deep Learning for Climate Science research.
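
    A minimal sketch of the kind of NetCDF label schema described here, using the netCDF4 Python library. The dimension, variable, and class-code names are invented for illustration and may well differ from the actual ClimateNet schema.

        # Sketch: storing labeled extreme-weather instances in NetCDF.
        import numpy as np
        from netCDF4 import Dataset

        with Dataset("climatenet_example.nc", "w") as nc:
            nc.createDimension("lat", 768)
            nc.createDimension("lon", 1152)
            nc.createDimension("event", None)     # unlimited: one entry per label
            nc.createDimension("corner", 4)       # (ymin, xmin, ymax, xmax)

            mask = nc.createVariable("pixel_mask", "i1", ("lat", "lon"), zlib=True)
            boxes = nc.createVariable("bounding_box", "i4", ("event", "corner"))
            cls = nc.createVariable("event_class", "i4", ("event",))
            cls.description = "0 = tropical cyclone, 1 = atmospheric river (assumed codes)"

            mask[:] = np.zeros((768, 1152), dtype=np.int8)  # per-pixel labels
            boxes[0, :] = [100, 200, 180, 320]              # one labeled event
            cls[0] = 0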

  6. Regional climate change study requires new temperature datasets

    NASA Astrophysics Data System (ADS)

    Wang, K.; Zhou, C.

    2016-12-01

    Analyses of global mean air temperature (Ta), i.e., NCDC GHCN, GISS, and CRUTEM4, are the fundamental datasets for climate change study and provide key evidence for global warming. All of the global temperature analyses over land are primarily based on meteorological observations of the daily maximum and minimum temperatures (Tmax and Tmin) and their averages (T2), because in most weather stations the measurements of Tmax and Tmin may be the only choice for a homogenous century-long analysis of mean temperature. Our studies show that these datasets are suitable for long-term global warming studies. However, they may introduce substantial bias in quantifying local and regional warming rates, i.e., with a root mean square error of more than 25% at 5° x 5° grids. From 1973 to 1997, the current datasets tend to significantly underestimate the warming rate over the central U.S. and overestimate the warming rate over the northern high latitudes. Similar results for the period 1998-2013, the warming hiatus period, indicate that the use of T2 enlarges the spatial contrast of temperature trends. This is because T2 over land samples air temperature only twice daily and cannot accurately reflect land-atmosphere and incoming radiation variations in the temperature diurnal cycle. For better regional climate change detection and attribution, we suggest creating new global mean air temperature datasets based on the recently available high spatiotemporal resolution meteorological observations, i.e., weather stations with four daily observations available since the 1960s. These datasets will not only help investigate dynamical processes underlying temperature variances but also help better evaluate reanalyzed and modeled simulations of temperature, and enable substantial improvements for other related climate variables in models, especially over regional and seasonal aspects.

  7. Regional climate change study requires new temperature datasets

    NASA Astrophysics Data System (ADS)

    Wang, Kaicun; Zhou, Chunlüe

    2017-04-01

    Analyses of global mean air temperature (Ta), i.e., NCDC GHCN, GISS, and CRUTEM4, are the fundamental datasets for climate change study and provide key evidence for global warming. All of the global temperature analyses over land are primarily based on meteorological observations of the daily maximum and minimum temperatures (Tmax and Tmin) and their averages (T2), because in most weather stations the measurements of Tmax and Tmin may be the only choice for a homogenous century-long analysis of mean temperature. Our studies show that these datasets are suitable for long-term global warming studies. However, they may have substantial biases in quantifying local and regional warming rates, i.e., with a root mean square error of more than 25% at 5° grids. From 1973 to 1997, the current datasets tend to significantly underestimate the warming rate over the central U.S. and overestimate the warming rate over the northern high latitudes. Similar results for the period 1998-2013, the warming hiatus period, indicate that the use of T2 enlarges the spatial contrast of temperature trends. This is because T2 over land samples air temperature only twice daily and cannot accurately reflect land-atmosphere and incoming radiation variations in the temperature diurnal cycle. For better regional climate change detection and attribution, we suggest creating new global mean air temperature datasets based on the recently available high spatiotemporal resolution meteorological observations, i.e., weather stations with four daily observations available since the 1960s. These datasets will not only help investigate dynamical processes underlying temperature variances but also help better evaluate reanalyzed and modeled simulations of temperature, and enable substantial improvements for other related climate variables in models, especially over regional and seasonal aspects.
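
    The T2 limitation discussed in the two records above can be illustrated with a small worked example comparing the Tmax/Tmin average against a four-observation daily mean on a synthetic, asymmetric diurnal cycle; the resulting bias is an artifact of sampling, not of the instrument.

        # Sketch: why T2 = (Tmax + Tmin) / 2 can misstate the daily mean that
        # four fixed-time observations would capture. The diurnal cycle here
        # is synthetic and asymmetric purely for illustration.
        import numpy as np

        hours = np.arange(24)
        # Asymmetric cycle (deg C): slow drift plus a sharp afternoon peak.
        temp = 15 + 8 * np.exp(-((hours - 15) ** 2) / 18.0) - 0.15 * hours

        t2 = (temp.max() + temp.min()) / 2        # Tmax/Tmin average
        t4 = temp[[0, 6, 12, 18]].mean()          # four synoptic observations
        print(f"T2 = {t2:.2f} C, four-obs mean = {t4:.2f} C, "
              f"bias = {t2 - t4:.2f} C")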

  8. Open and scalable analytics of large Earth observation datasets: From scenes to multidimensional arrays using SciDB and GDAL

    NASA Astrophysics Data System (ADS)

    Appel, Marius; Lahn, Florian; Buytaert, Wouter; Pebesma, Edzer

    2018-04-01

    Earth observation (EO) datasets are commonly provided as collections of scenes, where individual scenes represent a temporal snapshot and cover a particular region on the Earth's surface. Using these data in complex spatiotemporal modeling becomes difficult as soon as data volumes exceed a certain capacity or analyses include many scenes, which may spatially overlap and may have been recorded at different dates. In order to facilitate analytics on large EO datasets, we combine and extend the geospatial data abstraction library (GDAL) and the array-based data management and analytics system SciDB. We present an approach to automatically convert collections of scenes to multidimensional arrays and use SciDB to scale computationally intensive analytics. We evaluate the approach in three case studies: national-scale land use change monitoring with Landsat imagery, global empirical orthogonal function analysis of daily precipitation, and combining historical climate model projections with satellite-based observations. Results indicate that the approach can be used to represent various EO datasets and that analyses in SciDB scale well with available computational resources. To simplify analyses of higher-dimensional datasets, such as climate model output, however, a generalization of the GDAL data model might be needed. All parts of this work have been implemented as open-source software and we discuss how this may facilitate open and reproducible EO analyses.
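
    The scene-to-array conversion at the heart of this approach can be sketched with GDAL's Python bindings: read each scene as a 2D slice and stack the slices along time. File names are placeholders, and the sketch assumes all scenes already share one grid; the paper's pipeline additionally handles overlap and alignment before loading into SciDB.

        # Sketch: stacking single-band scenes into a (time, y, x) array.
        import numpy as np
        from osgeo import gdal

        scene_paths = ["scene_2018_01.tif", "scene_2018_02.tif", "scene_2018_03.tif"]

        slices = []
        for path in scene_paths:
            ds = gdal.Open(path)                   # one temporal snapshot
            slices.append(ds.GetRasterBand(1).ReadAsArray())
            ds = None                              # close the dataset

        cube = np.stack(slices)                    # dims: (time, y, x)
        print(cube.shape)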

  9. Outcomes of a NASA Workshop to Develop a Portfolio of Low Latency Datasets for Time-Sensitive Applications

    NASA Technical Reports Server (NTRS)

    Davies, Diane K.; Brown, Molly E.; Green, David S.; Michael, Karen A.; Murray, John J.; Justice, Christopher O.; Soja, Amber J.

    2016-01-01

    It is widely accepted that time-sensitive remote sensing data serve the needs of decision makers in the applications communities, and yet to date a comprehensive portfolio of NASA low latency datasets has not been available. This paper describes the NASA low latency, or Near-Real Time (NRT), portfolio, how it was developed, and plans to make it available online through a portal that leverages existing EOSDIS capabilities such as the Earthdata Search Client (https://search.earthdata.nasa.gov), the Common Metadata Repository (CMR) and the Global Imagery Browse Service (GIBS). This paper reports on the outcomes of a NASA Workshop to Develop a Portfolio of Low Latency Datasets for Time-Sensitive Applications (27-29 September 2016 at NASA Langley Research Center, Hampton, VA). The paper also summarizes findings and recommendations from the meeting, outlining perceived shortfalls and opportunities for low latency research and application science.

  10. Image segmentation evaluation for very-large datasets

    NASA Astrophysics Data System (ADS)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes is achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.
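
    A minimal sketch of the quantitative-triage idea: measure agreement between two automated segmentations with the Dice coefficient and send only disagreements to visual review. The 0.98 acceptance threshold is an assumed illustration, not a value from the paper.

        # Sketch: screen segmentation pairs; review only the disagreements.
        import numpy as np

        def dice(a, b):
            """Dice coefficient of two boolean masks."""
            inter = np.logical_and(a, b).sum()
            return 2.0 * inter / (a.sum() + b.sum())

        def triage(mask_alg1, mask_alg2, accept=0.98):
            d = dice(mask_alg1, mask_alg2)
            return ("accept" if d >= accept else "visual review", d)

        a = np.zeros((64, 64), bool); a[10:40, 10:40] = True
        b = np.zeros((64, 64), bool); b[11:41, 10:40] = True
        print(triage(a, b))   # ('visual review', ~0.97)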

  11. A critical evaluation of ecological indices for the comparative analysis of microbial communities based on molecular datasets.

    PubMed

    Lucas, Rico; Groeneveld, Jürgen; Harms, Hauke; Johst, Karin; Frank, Karin; Kleinsteuber, Sabine

    2017-01-01

    In times of global change and intensified resource exploitation, advanced knowledge of ecophysiological processes in natural and engineered systems driven by complex microbial communities is crucial for both safeguarding environmental processes and optimising rational control of biotechnological processes. To gain such knowledge, high-throughput molecular techniques are routinely employed to investigate microbial community composition and dynamics within a wide range of natural or engineered environments. However, for molecular dataset analyses no consensus about a generally applicable alpha diversity concept and no appropriate benchmarking of corresponding statistical indices exist yet. To overcome this, we listed criteria for the appropriateness of an index for such analyses and systematically scrutinised commonly employed ecological indices describing diversity, evenness and richness based on artificial and real molecular datasets. We identified appropriate indices warranting interstudy comparability and intuitive interpretability. The unified diversity concept based on 'effective numbers of types' provides the mathematical framework for describing community composition. Additionally, the Bray-Curtis dissimilarity as a beta-diversity index was found to reflect compositional changes. The employed statistical procedure is presented comprising commented R-scripts and example datasets for user-friendly trial application. © FEMS 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
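
    The two index families named here are compact to implement. The sketch below gives the 'effective number of types' (Hill numbers) for a relative-abundance vector and the Bray-Curtis dissimilarity between two communities; it is a minimal NumPy illustration, not the authors' commented R scripts.

        # Sketch: Hill numbers ^qD and Bray-Curtis dissimilarity.
        import numpy as np

        def hill_number(p, q):
            """Effective number of types ^qD for relative abundances p."""
            p = p[p > 0]
            if q == 1:                        # limit case: exp(Shannon entropy)
                return np.exp(-np.sum(p * np.log(p)))
            return np.sum(p ** q) ** (1.0 / (1.0 - q))

        def bray_curtis(x, y):
            """Bray-Curtis dissimilarity between two abundance vectors."""
            return np.abs(x - y).sum() / (x + y).sum()

        p = np.array([0.5, 0.3, 0.1, 0.1])
        print(hill_number(p, 0))              # richness: 4 effective types
        print(hill_number(p, 1))              # Shannon diversity as effective types
        print(hill_number(p, 2))              # inverse Simpson
        print(bray_curtis(np.array([10, 5, 0.]), np.array([6, 5, 4.])))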

  12. Synthesizing plant phenological indicators from multispecies datasets

    NASA Astrophysics Data System (ADS)

    Rutishauser, This; Peñuelas, Josep; Filella, Iolanda; Gehrig, Regula; Scherrer, Simon C.; Röthlisberger, Christian

    2014-05-01

    Changes in the seasonality of plant life cycles from phenological observations are traditionally analysed at the species level. Trends and correlations with the main environmental driving variables show a coherent picture across the globe. The question arises whether there is an integrated phenological signal across species that describes common interannual variability. Is there a way to derive synthetic phenological indicators from multispecies datasets that serve decision makers as useful tools? Can these indicators be derived in such a robust way that systematic updates yield the information necessary for adaptation measures? We address these questions by analysing multi-species phenological data sets with leaf-unfolding and flowering observations from 30 sites across Europe between 40° and 63°N, including data from PEP725, the Swiss Plant Phenological Observation Network and one legacy data set. Starting in 1951, the data sets were synthesized by multivariate analysis (Principal Component Analysis). The representativeness of the site-specific indicator was tested against subsets including only leaf-unfolding or flowering phases, and by a comparison with a 50% random sample of the available phenophases for 500 time steps. Results show that a synthetic indicator explains up to 79% of the variance at each site - usually 40-50% or more. Robust linear trends over the common period 1971-2000 indicate an overall change of the indicator of -0.32 days/year with lower uncertainty than previous studies. Advances were more pronounced in southern and northern Europe. The indicator-based analysis provides a promising tool for synthesizing site-based plant phenological records and is a companion to, and validating data for, an increasing number of phenological measurements derived from phenological models and satellite sensors.
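
    A minimal sketch of the synthesis step: standardize a years-by-phenophases anomaly matrix and take the first principal component as the site indicator. The data here are synthetic, with one shared interannual signal, so the printed explained variance plays the role of the 40-79% figures reported above.

        # Sketch: a site-level phenological indicator as the first PC.
        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(42)
        years, phases = 50, 12
        common = rng.normal(size=(years, 1))       # shared interannual signal
        X = common + 0.6 * rng.normal(size=(years, phases))

        Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each phenophase
        pca = PCA(n_components=1)
        indicator = pca.fit_transform(Xs).ravel()  # synthetic indicator series
        print(f"variance explained: {pca.explained_variance_ratio_[0]:.0%}")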

  13. The CMS dataset bookkeeping service

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Afaq, Anzar; Dolgert, Andrew

    2007-10-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, Command Line, and a Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  14. The CMS dataset bookkeeping service

    NASA Astrophysics Data System (ADS)

    Afaq, A.; Dolgert, A.; Guo, Y.; Jones, C.; Kosyakov, S.; Kuznetsov, V.; Lueking, L.; Riley, D.; Sekhri, V.

    2008-07-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, Command Line, and a Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  15. How robust is the pre-1931 National Climatic Data Center—climate divisional dataset? Examples from Georgia and Louisiana

    NASA Astrophysics Data System (ADS)

    Allard, Jason; Thompson, Clint; Keim, Barry D.

    2015-04-01

    The National Climatic Data Center's climate divisional dataset (CDD) is commonly used in climate change analyses. This dataset is a spatially continuous dataset for the conterminous USA from 1895 to the present. The CDD since 1931 is computed by averaging all available representative cooperative weather station data into a single monthly value for each of the 344 climate divisions of the conterminous USA, while pre-1931 data for climate divisions are derived from statewide averages using regression equations. This study examines the veracity of these pre-1931 data. All available Cooperative Observer Program (COOP) stations within each climate division in Georgia and Louisiana were averaged into a single monthly value for each month and each climate division from 1897 to 1930 to generate a divisional dataset (COOP DD), using methods similar to those used by the National Climatic Data Center to generate the post-1931 CDD. The reliability of the official CDD (derived from statewide averages) in reproducing temperature and precipitation means and trends prior to 1931 is then evaluated by comparing that dataset with the COOP DD using difference-of-means tests, correlations, and linear regression techniques. The CDD and the COOP DD are also compared with a divisional dataset derived from United States Historical Climatology Network (USHCN) data (USHCN DD), with difference-of-means and correlation techniques, to demonstrate potential impacts of inhomogeneities within the CDD and the COOP DD. The statistical results, taken as a whole, not only indicate broad similarities between the CDD and COOP DD but also show that the CDD does not adequately portray pre-1931 temperature and precipitation in certain climate divisions within Georgia and Louisiana. In comparison with the USHCN DD, both the CDD and the COOP DD appear to be subject to biases that probably result from changing stations within climate divisions. As such, the CDD should be used judiciously for long-term studies.
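
    The COOP DD construction step described above is essentially a grouped average: all available station observations within a climate division collapse to one monthly value. A pandas sketch with assumed column names:

        # Sketch: average station values into divisional monthly values.
        import pandas as pd

        obs = pd.DataFrame({
            "division": [1, 1, 1, 2],
            "year":     [1897, 1897, 1897, 1897],
            "month":    [1, 1, 1, 1],
            "station":  ["A", "B", "C", "D"],
            "tavg_f":   [44.1, 45.3, 43.8, 51.0],
        })

        coop_dd = (obs.groupby(["division", "year", "month"])["tavg_f"]
                      .mean()
                      .rename("division_tavg_f")
                      .reset_index())
        print(coop_dd)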

  16. A Merged Dataset for Solar Probe Plus FIELDS Magnetometers

    NASA Astrophysics Data System (ADS)

    Bowen, T. A.; Dudok de Wit, T.; Bale, S. D.; Revillet, C.; MacDowall, R. J.; Sheppard, D.

    2016-12-01

    The Solar Probe Plus FIELDS experiment will observe turbulent magnetic fluctuations deep in the inner heliosphere. The FIELDS magnetometer suite implements a set of three magnetometers: two vector DC fluxgate magnetometers (MAGs), sensitive from DC to 100 Hz, as well as a vector search coil magnetometer (SCM), sensitive from 10 Hz to 50 kHz. Single-axis measurements are additionally made up to 1 MHz. To study the full range of observations, we propose merging data from the individual magnetometers into a single dataset. A merged dataset will improve the quality of observations in the range of frequencies observed by both magnetometers (roughly 10-100 Hz). Here we present updates on the individual MAG and SCM calibrations as well as our results on generating a cross-calibrated and merged dataset.
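
    A merged product of this kind is often built by summing a low-passed fluxgate signal and a high-passed search coil signal around a crossover frequency. The Python sketch below shows the idea with an approximately complementary Butterworth filter pair; the 30 Hz crossover, filter order, and common sample rate are assumptions for illustration, not FIELDS calibration values.

        # Sketch: crossover merge of two magnetometer channels.
        import numpy as np
        from scipy import signal

        fs = 256.0                                  # common sample rate (Hz), assumed
        t = np.arange(0, 4, 1 / fs)
        truth = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)
        mag = truth + 0.05 * np.random.default_rng(0).normal(size=t.size)
        scm = truth + 0.05 * np.random.default_rng(1).normal(size=t.size)

        b_lo, a_lo = signal.butter(4, 30.0, "low", fs=fs)    # keep MAG below 30 Hz
        b_hi, a_hi = signal.butter(4, 30.0, "high", fs=fs)   # keep SCM above 30 Hz
        merged = signal.filtfilt(b_lo, a_lo, mag) + signal.filtfilt(b_hi, a_hi, scm)
        print(np.abs(merged - truth).mean())        # small residual error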

  17. A cross-country Exchange Market Pressure (EMP) dataset.

    PubMed

    Desai, Mohit; Patnaik, Ila; Felman, Joshua; Shah, Ajay

    2017-06-01

    The data presented in this article are related to the research article titled "An exchange market pressure measure for cross country analysis" (Patnaik et al. [1]). In this article, we present the dataset of Exchange Market Pressure (EMP) values for 139 countries along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed as a percentage change in the exchange rate, measures the change in the exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can be interpreted as the change in the exchange rate associated with $1 billion of intervention. Estimates of the conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence interval (high and low values) for the point estimates of ρ. Using the standard errors of the estimates of ρ, we obtain one-sigma intervals around the mean estimates of the EMP values. These values are also reported in the dataset.
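
    Under the definition given above, a monthly EMP value is the observed percent change in the exchange rate plus ρ times the intervention in billions of dollars. A one-line sketch follows; sign conventions and numbers are illustrative assumptions, and the source article defines them precisely.

        # Sketch: exchange market pressure for one month.
        def emp(pct_change_fx, intervention_bn, rho):
            """Monthly EMP in percent; rho is % change per $1bn of intervention."""
            return pct_change_fx + rho * intervention_bn

        # Currency moved -0.4% while the bank intervened with $2bn, rho = 0.25:
        print(emp(-0.4, 2.0, 0.25))   # -> 0.1 (percent)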

  18. Central nervous system bleeding in pediatric patients with factor XIII deficiency: a study on 23 new cases.

    PubMed

    Naderi, Majid; Alizadeh, Shaban; Kazemi, Ahmad; Tabibian, Shadi; Zaker, Farhad; Bamedi, Taregh; Kashani Khatib, Zahra; Dorgalaleh, Akbar

    2015-03-01

    Factor XIII (FXIII) deficiency is an extremely rare bleeding disorder, which has its highest incidence in Sistan and Baluchistan Province in Iran, compared to its overall incidence around the world. This disorder has different clinical manifestations ranging from a mild bleeding tendency to lethal bleeding episodes, including central nervous system (CNS) hemorrhage. The aim of this study was to evaluate the demographic data, pattern of CNS bleeding, and the role of plasminogen activator inhibitor-1 (PAI-1) 4G/5G and thrombin activatable fibrinolysis inhibitor (TAFI) Thr325Ile polymorphisms in intracranial and extracranial hemorrhages in 23 new cases of FXIII-deficient subjects. This case-control study was conducted on 23 FXIII-deficient patients with CNS bleeding episodes and 23 patients as the control group with FXIII deficiency but without any history of CNS bleeding. Initially, to confirm the molecular defect, both groups were evaluated for the most frequently reported mutation of FXIII (Trp187Arg) identified in a previous study in Sistan and Baluchistan Province. Then, demographic data, clinical manifestations, and the pattern of CNS bleeding were determined. Eventually, the patients were assessed for PAI-1 4G/5G and TAFI Thr325Ile polymorphisms. The results of this study revealed that all the subjects (including the case and control groups) were homozygous for the Trp187Arg mutation. Nineteen patients (82.6%) had intracranial hemorrhage (ICH) and four patients (17.4%) had extracranial hemorrhage (ECH). Intraparenchymal hemorrhage was the most common form of ICH (89.5%), and epidural hemorrhage was observed in two patients (10.5%). Anatomic regions in patients with intraparenchymal hemorrhage were temporal in six (35.3%), occipital in four (23.5%), diffuse intraparenchymal in four (23.5%), temporal-occipital in two (11.8%), and subdural with temporal in one (5.9%) patient. We found that in the case group, 14 patients (60.8%) were homozygous for TAFI Thr325Ile…

  19. The NASA Subsonic Jet Particle Image Velocimetry (PIV) Dataset

    NASA Technical Reports Server (NTRS)

    Bridges, James; Wernet, Mark P.

    2011-01-01

    Many tasks in fluids engineering require prediction of the turbulence of jet flows. This report documents the single-point statistics (mean and variance) of velocity for cold and hot jet flows. The jet velocities ranged from 0.5 to 1.4 times the ambient speed of sound, and temperatures ranged from unheated to a static temperature ratio of 2.7. Further, the report assesses the accuracies of the data, e.g., establishing uncertainties for the data. This paper covers the following five tasks: (1) Document the acquisition and processing procedures used to create the particle image velocimetry (PIV) datasets. (2) Compare the PIV data with hotwire and laser Doppler velocimetry (LDV) data published in the open literature. (3) Compare different datasets acquired at the same flow conditions in multiple tests to establish uncertainties. (4) Create a consensus dataset for a range of hot jet flows, including uncertainty bands. (5) Analyze this consensus dataset for self-consistency and compare jet characteristics to those in the open literature. The final objective was fulfilled by using the potential core length and the spread rate of the half-velocity radius to collapse the mean and turbulent velocity fields over the first 20 jet diameters.

  20. A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video

    DTIC Science & Technology

    2011-06-01

    …orders of magnitude larger than existing datasets such as CAVIAR [7]. The TRECVID 2008 airport dataset [16] contains 100 hours of video, but it provides only… the entire human figure (e.g., above shoulder)… and diversity in both collection sites and viewpoints. In comparison to surveillance datasets such as CAVIAR [7] and TRECVID [16]…

  1. Animal Viruses Probe dataset (AVPDS) for microarray-based diagnosis and identification of viruses.

    PubMed

    Yadav, Brijesh S; Pokhriyal, Mayank; Vasishtha, Dinesh P; Sharma, Bhaskar

    2014-03-01

    AVPDS (Animal Viruses Probe dataset) is a dataset of virus-specific and conserved oligonucleotides for the identification and diagnosis of viruses infecting animals. The current dataset contains 20,619 virus-specific probes for 833 viruses and their subtypes and 3,988 conserved probes for 146 viral genera. The virus-specific probe dataset has two fields, virus name and probe sequence. Similarly, the table of conserved probes for viral genera has fields for genus, subgroup within genus, and probe sequence. The subgroups within a genus are artificial divisions with no taxonomic significance; each contains probes that identify viruses in that specific subgroup of the genus. Using this dataset we have successfully diagnosed the first case of Newcastle disease virus in sheep and reported a mixed infection of Bovine viral diarrhea virus and Bovine herpesvirus in cattle. The dataset also contains probes that cross-react across species experimentally, even though computationally they meet specifications; these probes have been marked. We hope that this dataset will be useful in microarray-based detection of viruses. The dataset can be accessed through the link https://dl.dropboxusercontent.com/u/94060831/avpds/HOME.html.

  2. Dataset from chemical gas sensor array in turbulent wind tunnel.

    PubMed

    Fonollosa, Jordi; Rodríguez-Luján, Irene; Trincavelli, Marco; Huerta, Ramón

    2015-06-01

    The dataset includes the acquired time series of a chemical detection platform exposed to different gas conditions in a turbulent wind tunnel. The chemo-sensory elements were sampling directly the environment. In contrast to traditional approaches that include measurement chambers, open sampling systems are sensitive to dispersion mechanisms of gaseous chemical analytes, namely diffusion, turbulence, and advection, making the identification and monitoring of chemical substances more challenging. The sensing platform included 72 metal-oxide gas sensors that were positioned at 6 different locations of the wind tunnel. At each location, 10 distinct chemical gases were released in the wind tunnel, the sensors were evaluated at 5 different operating temperatures, and 3 different wind speeds were generated in the wind tunnel to induce different levels of turbulence. Moreover, each configuration was repeated 20 times, yielding a dataset of 18,000 measurements. The dataset was collected over a period of 16 months. The data is related to "On the performance of gas sensor arrays in open sampling systems using Inhibitory Support Vector Machines", by Vergara et al.[1]. The dataset can be accessed publicly at the UCI repository upon citation of [1]: http://archive.ics.uci.edu/ml/datasets/Gas+sensor+arrays+in+open+sampling+settings.

  3. Knowledge mining from clinical datasets using rough sets and backpropagation neural network.

    PubMed

    Nahato, Kindie Biredagn; Harichandran, Khanna Nehemiah; Arputharaj, Kannan

    2015-01-01

    The availability of clinical datasets and knowledge mining methodologies encourages researchers to pursue research in extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that will predict the presence or absence of a disease by learning from a minimal set of attributes extracted from the clinical dataset. In this work, a rough set indiscernibility relation method with a backpropagation neural network (RS-BPNN) is used. This work has two stages. The first stage is the handling of missing values to obtain a smooth dataset and the selection of appropriate attributes from the clinical dataset by the indiscernibility relation method. The second stage is classification using a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.
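
    The two-stage idea can be sketched compactly: greedily drop attributes whose removal keeps objects' indiscernibility classes decision-consistent, then train a backpropagation network on the surviving reduct. This toy Python version is a simplified stand-in for the paper's RS-BPNN method, on invented data.

        # Sketch: indiscernibility-based attribute reduction, then a BPNN.
        import numpy as np
        from sklearn.neural_network import MLPClassifier

        def consistent(X, y, cols):
            """True if objects indiscernible on `cols` always share a decision."""
            seen = {}
            for row, label in zip(map(tuple, X[:, cols]), y):
                if seen.setdefault(row, label) != label:
                    return False
            return True

        def greedy_reduct(X, y):
            cols = list(range(X.shape[1]))
            for c in list(cols):
                trial = [k for k in cols if k != c]
                if trial and consistent(X, y, trial):
                    cols = trial              # attribute was redundant
            return cols

        X = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 1]])
        y = np.array([0, 0, 1, 1, 1])
        reduct = greedy_reduct(X, y)
        clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
        clf.fit(X[:, reduct], y)
        print(reduct, clf.score(X[:, reduct], y))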

  4. A photogrammetric technique for generation of an accurate multispectral optical flow dataset

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.

    2017-06-01

    The availability of an accurate dataset is a key requirement for the successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets were developed in recent years and gave rise to many powerful algorithms. However, most of the datasets include only images captured in the visible spectrum. This paper is focused on the creation of a multispectral optical flow dataset with an accurate ground truth. The generation of an accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement has been developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling or laser scanning. Such techniques either work only with synthetic optical flow or provide only a sparse ground truth. In this paper, a new photogrammetric method for the generation of an accurate ground truth optical flow is proposed. The method combines the accuracy and density of synthetic optical flow datasets with the flexibility of laser-scanning-based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.
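
    The photogrammetric principle behind a dense ground truth can be sketched in a few lines: with per-pixel depth, camera intrinsics and the relative pose between two views, a pixel's flow is the difference between its reprojected and original coordinates. The intrinsics, pose, and sign conventions below are illustrative assumptions, not the paper's calibration.

        # Sketch: ground truth flow from depth, intrinsics and relative pose.
        import numpy as np

        K = np.array([[800., 0, 320.], [0, 800., 240.], [0, 0, 1]])  # intrinsics
        R = np.eye(3)                        # relative rotation (none, for clarity)
        t = np.array([0.1, 0.0, 0.0])        # 10 cm sideways translation, assumed

        def ground_truth_flow(u, v, depth):
            """Flow (du, dv) of pixel (u, v) with known depth and pose (R, t)."""
            ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
            X = depth * ray                  # back-project to 3D in camera 1
            x2 = K @ (R @ X + t)             # project into camera 2
            u2, v2 = x2[0] / x2[2], x2[1] / x2[2]
            return u2 - u, v2 - v

        print(ground_truth_flow(320.0, 240.0, depth=2.0))  # -> (40.0, 0.0)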

  5. Development of large scale riverine terrain-bathymetry dataset by integrating NHDPlus HR with NED,CoNED and HAND data

    NASA Astrophysics Data System (ADS)

    Li, Z.; Clark, E. P.

    2017-12-01

    Large scale and fine resolution riverine bathymetry data is critical for flood inundation modeling but not available over the continental United States (CONUS). Previously we implemented bankfull hydraulic geometry based approaches to simulate bathymetry for individual rivers using NHDPlus v2.1 data and the 10 m National Elevation Dataset (NED). USGS has recently developed High Resolution NHD data (NHDPlus HR Beta) (USGS, 2017), and this enhanced dataset has a significant improvement in its spatial correspondence with the 10 m DEM. In this study, we used this high resolution data, specifically NHDFlowline and NHDArea, to create bathymetry/terrain for CONUS river channels and floodplains. A software package, NHDPlus Inundation Modeler v5.0 Beta, was developed for this project as an Esri ArcGIS hydrological analysis extension. With the updated tools, the raw 10 m DEM was first hydrologically treated to remove artificial blockages (e.g., overpasses, bridges and even roadways) using low pass moving window filters. Cross sections were then automatically constructed along each flowline to extract elevation from the hydrologically treated DEM. In this study, river channel shapes were approximated using quadratic curves to reduce uncertainties from commonly used trapezoids. We calculated underwater channel elevation at each cross section sampling point using bankfull channel dimensions that were estimated from physiographic province/division based regression equations (Bieger et al. 2015). These elevation points were then interpolated to generate a bathymetry raster. The simulated bathymetry raster was integrated with the USGS NED and the Coastal National Elevation Database (CoNED) (wherever available) to make a seamless terrain-bathymetry dataset. Channel bathymetry was also integrated into the HAND (Height above Nearest Drainage) dataset to improve large scale inundation modeling. The generated terrain-bathymetry was processed at the Watershed Boundary Dataset Hydrologic Unit 4 (WBDHU4) level.
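
    The quadratic channel approximation mentioned above amounts to a parabola spanning the bankfull width with maximum depth at the channel center. A small sketch with illustrative width and depth follows; in practice these dimensions come from the regression equations of Bieger et al. 2015.

        # Sketch: parabolic channel cross-section from bankfull dimensions.
        import numpy as np

        def quadratic_bed_elevation(x, bank_z, width, max_depth):
            """Bed elevation at offset x in [-width/2, width/2] from center."""
            depth = max_depth * (1.0 - (2.0 * x / width) ** 2)
            return bank_z - depth

        x = np.linspace(-15, 15, 7)           # offsets across a 30 m channel
        print(quadratic_bed_elevation(x, bank_z=100.0, width=30.0, max_depth=2.5))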

  6. A novel Sarcocystis neurona genotype XIII is associated with severe encephalitis in an unexpectedly broad range of marine mammals from the northeastern Pacific Ocean.

    PubMed

    Barbosa, Lorraine; Johnson, Christine K; Lambourn, Dyanna M; Gibson, Amanda K; Haman, Katherine H; Huggins, Jessica L; Sweeny, Amy R; Sundar, Natarajan; Raverty, Stephen A; Grigg, Michael E

    2015-08-01

    Sarcocystis neurona is an important cause of protozoal encephalitis among marine mammals in the northeastern Pacific Ocean. To characterise the genetic type of S. neurona in this region, samples from 227 stranded marine mammals, most with clinical or pathological evidence of protozoal disease, were tested for the presence of coccidian parasites using a nested PCR assay. The frequency of S. neurona infection was 60% (136/227) among pinnipeds and cetaceans, including seven marine mammal species not previously known to be susceptible to infection by this parasite. Eight S. neurona fetal infections identified this coccidian parasite as capable of being transmitted transplacentally. Thirty-seven S. neurona-positive samples were multilocus sequence genotyped using three genetic markers: SnSAG1-5-6, SnSAG3 and SnSAG4. A novel genotype, referred to as Type XIII within the S. neurona population genetic structure, has emerged recently in the northeastern Pacific Ocean and is significantly associated with an increased severity of protozoal encephalitis and mortality among multiple stranded marine mammal species. Published by Elsevier Ltd.

  7. A novel Sarcocystis neurona genotype XIII is associated with severe encephalitis in an unexpectedly broad range of marine mammals from the northeastern Pacific Ocean

    PubMed Central

    Barbosa, Lorraine; Johnson, Christine K.; Lambourn, Dyanna M.; Gibson, Amanda K.; Haman, Katherine H.; Huggins, Jessica L.; Sweeny, Amy R.; Sundar, Natarajan; Raverty, Stephen A.; Grigg, Michael E.

    2015-01-01

    Sarcocystis neurona is an important cause of protozoal encephalitis among marine mammals in the northeastern Pacific Ocean. To characterize the genetic type of S. neurona in this region, samples from 227 stranded marine mammals, most with clinical or pathological evidence of protozoal disease, were tested for the presence of coccidian parasites using a nested PCR assay. The frequency of S. neurona infection was 60% (136/227) among pinnipeds and cetaceans, including seven marine mammal species not previously known to be susceptible to infection by this parasite. Eight S. neurona fetal infections identified this coccidian parasite as capable of being transmitted transplacentally. Thirty-seven S. neurona-positive samples were multilocus sequence genotyped using three genetic markers: SnSAG1-5-6, SnSAG3 and SnSAG4. A novel genotype, referred to as Type XIII within the S. neurona population genetic structure, has emerged recently in the northeastern Pacific Ocean and is significantly associated with an increased severity of protozoal encephalitis and mortality among multiple stranded marine mammal species. PMID:25997588

  8. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Gray, A.

    2014-04-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors. This is likely of particular interest to the radio astronomy community given, for example, that survey projects contain groups dedicated to this topic. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.

  9. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Canadian Astronomy Data Centre

    2014-01-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms: kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors, and the local outlier factor. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.
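
    The outlier search described in these two records can be illustrated with standard tools. Below is a minimal sketch of nearest-neighbor-based outlier scoring (the local outlier factor) on a toy three-band photometric catalog; it is not the proprietary Skytree Server implementation, and the column semantics are assumptions for illustration only.

      # Nearest-neighbor outlier scoring on a toy photometric catalog,
      # illustrating the technique named in the abstracts (local outlier
      # factor); NOT the Skytree Server implementation.
      import numpy as np
      from sklearn.neighbors import LocalOutlierFactor

      rng = np.random.default_rng(0)
      # Stand-in for J, H, Ks photometry: a three-column feature matrix.
      catalog = rng.normal(loc=[16.0, 15.5, 15.2], scale=0.5, size=(10_000, 3))

      lof = LocalOutlierFactor(n_neighbors=20)   # contrast each source with its 20 nearest neighbors
      lof.fit_predict(catalog)                   # -1 marks outliers, 1 marks inliers
      scores = -lof.negative_outlier_factor_     # larger score = more anomalous

      most_anomalous = np.argsort(scores)[-10:]  # indices of the ten most anomalous sources
      print(catalog[most_anomalous])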

  10. Wide-Open: Accelerating public data release by automating detection of overdue datasets

    PubMed Central

    Grechkin, Maxim; Poon, Hoifung; Howe, Bill

    2017-01-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819

  11. Wide-Open: Accelerating public data release by automating detection of overdue datasets.

    PubMed

    Grechkin, Maxim; Poon, Hoifung; Howe, Bill

    2017-06-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
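
    The two-step pipeline these records describe, mining accession numbers from article text and probing the repository for their status, can be sketched as follows. The accession regex and the "still private" heuristic are assumptions for illustration, not Wide-Open's actual code; only the GEO query URL is a real endpoint.

      # Sketch of the Wide-Open idea: mine GEO series accessions from article
      # text, then probe the repository to guess whether each one is public.
      import re
      import urllib.request

      ACCESSION = re.compile(r"\bGSE\d{3,8}\b")

      def mine_accessions(article_text: str) -> set[str]:
          """Collect candidate GEO series accessions cited in an article."""
          return set(ACCESSION.findall(article_text))

      def looks_public(accession: str) -> bool:
          """Fetch the GEO accession display page and guess at its status."""
          url = f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={accession}"
          with urllib.request.urlopen(url) as resp:
              page = resp.read().decode("utf-8", errors="replace")
          # Assumed heuristic: unresolvable accessions return an error banner.
          return "Could not find a public or private accession" not in page

      text = "Data are available under accession GSE48213 (GEO)."
      for acc in mine_accessions(text):
          print(acc, "public" if looks_public(acc) else "possibly overdue")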

  12. Evaluation of bulk heat fluxes from atmospheric datasets

    NASA Astrophysics Data System (ADS)

    Farmer, Benton

    Heat fluxes at the air-sea interface are an important component of the Earth's heat budget. In addition, they are an integral factor in determining the sea surface temperature (SST) evolution of the oceans. Different representations of these fluxes are used in both the atmospheric and oceanic communities for the purpose of heat budget studies and, in particular, for forcing oceanic models. It is currently difficult to quantify the potential impact varying heat flux representations have on the ocean response. In this study, a diagnostic tool is presented that allows for a straightforward comparison of surface heat flux formulations and atmospheric datasets. Two variables, relaxation time (RT) and the apparent temperature (T*), are derived from the linearization of the bulk formulas. They are then calculated to compare three bulk formulas and five atmospheric datasets. Additionally, the linearization is expanded to the second order to compare the amount of residual flux present. It is found that the use of a bulk formula employing a constant heat transfer coefficient produces longer relaxation times and contains a greater amount of residual flux in the higher-order terms of the linearization. Depending on the temperature difference, the residual flux remaining in the second-order and higher terms can reach as much as 40-50% of the total residual on a monthly time scale. This is certainly a non-negligible residual flux. In contrast, a bulk formula using a stability- and wind-dependent transfer coefficient retains much of the total flux in the first-order term, as only a few percent remain in the residual flux. Most of the difference displayed among the bulk formulas stems from the sensitivity to wind speed and the choice of a constant or spatially varying transfer coefficient. Comparing the representation of RT and T* provides insight into the differences among various atmospheric datasets. In particular, the representations of the western boundary current, upwelling
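
    The two diagnostic variables can be made concrete with a generic first-order linearization of a net bulk heat flux Q(T) about a reference SST T0. The notation below (mixed-layer depth h, reference density rho_0, heat capacity c_p) is assumed for illustration and is not necessarily the thesis's own.

      % Generic first-order linearization of a net surface heat flux Q(T)
      % about a reference SST T0 (notation assumed for illustration).
      \begin{align*}
      Q(T) &\approx Q(T_0) + \left.\frac{\partial Q}{\partial T}\right|_{T_0}(T - T_0)
            = -\,\frac{\rho_0 c_p h}{\tau}\,(T - T^{*}), \\
      \tau &= \rho_0 c_p h \left(-\left.\frac{\partial Q}{\partial T}\right|_{T_0}\right)^{-1}
       && \text{(relaxation time, RT)}, \\
      T^{*} &= T_0 + Q(T_0)\left(-\left.\frac{\partial Q}{\partial T}\right|_{T_0}\right)^{-1}
       && \text{(apparent temperature).}
      \end{align*}

    In this notation, the second-order comparison described in the abstract amounts to examining the residual Q(T) minus the first-order expansion above.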

  13. Antibody-protein interactions: benchmark datasets and prediction tools evaluation

    PubMed Central

    Ponomarenko, Julia V; Bourne, Philip E

    2007-01-01

    Background The ability to predict antibody binding sites (also known as antigenic determinants or B-cell epitopes) for a given protein is a precursor to new vaccine design and diagnostics. Among the various methods of B-cell epitope identification, X-ray crystallography is one of the most reliable. Using these experimental data, computational methods exist for B-cell epitope prediction. As the number of structures of antibody-protein complexes grows, further interest in prediction methods using 3D structure is anticipated. This work aims to establish a benchmark for 3D structure-based epitope prediction methods. Results Two B-cell epitope benchmark datasets inferred from the 3D structures of antibody-protein complexes were defined. The first is a dataset of 62 representative 3D structures of protein antigens with inferred structural epitopes. The second is a dataset of 82 structures of antibody-protein complexes containing different structural epitopes. Using these datasets, eight web-servers developed for antibody and protein binding site prediction have been evaluated. No method exceeded 40% precision and 46% recall. The values of the area under the receiver operating characteristic curve for the evaluated methods were about 0.6 for the ConSurf, DiscoTope, and PPI-PRED methods and above 0.65 but not exceeding 0.70 for protein-protein docking methods when the best of the top ten models for the bound docking were considered; the remaining methods performed close to random. The benchmark datasets are included as a supplement to this paper. Conclusion It may be possible to improve epitope prediction methods through training on datasets which include only immune epitopes and through utilizing more features characterizing epitopes, for example, the evolutionary conservation score. Notwithstanding, overall poor performance may reflect the generality of antigenicity and hence the inability to decipher B-cell epitopes as an intrinsic feature of the protein. It
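
    The headline numbers above (precision, recall, area under the ROC curve) are standard binary-classification metrics, here over per-residue epitope labels. A minimal sketch with toy data, not the benchmark's evaluation code:

      # Precision, recall, and ROC AUC for a hypothetical per-residue
      # epitope predictor; labels and scores are toy data.
      import numpy as np
      from sklearn.metrics import precision_score, recall_score, roc_auc_score

      # 1 = residue belongs to a structural epitope, 0 = it does not.
      y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
      # Per-residue scores from a hypothetical predictor.
      y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.55, 0.3, 0.15, 0.6, 0.45])
      y_pred  = (y_score >= 0.5).astype(int)     # binarize at an arbitrary cutoff

      print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
      print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
      print("ROC AUC:  ", roc_auc_score(y_true, y_score))   # threshold-free ranking quality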

  14. A daily global mesoscale ocean eddy dataset from satellite altimetry.

    PubMed

    Faghmous, James H; Frenger, Ivy; Yao, Yuanshun; Warmka, Robert; Lindell, Aron; Kumar, Vipin

    2015-01-01

    Mesoscale ocean eddies are ubiquitous coherent rotating structures of water with radial scales on the order of 100 kilometers. Eddies play a key role in the transport and mixing of momentum and tracers across the World Ocean. We present a global daily mesoscale ocean eddy dataset that contains ~45 million mesoscale features and 3.3 million eddy trajectories that persist for at least two days, as identified in the AVISO dataset over the period 1993-2014. This dataset, along with the open-source eddy identification software, allows users to extract eddies with any chosen parameters (minimum size, lifetime, etc.), to study global eddy properties and dynamics, and to empirically estimate the impact eddies have on mass or heat transport. Furthermore, our open-source software may be used to identify mesoscale features in model simulations and compare them to observed features. Finally, this dataset can be used to study the interaction between mesoscale ocean eddies and other components of the Earth System.

  15. A daily global mesoscale ocean eddy dataset from satellite altimetry

    PubMed Central

    Faghmous, James H.; Frenger, Ivy; Yao, Yuanshun; Warmka, Robert; Lindell, Aron; Kumar, Vipin

    2015-01-01

    Mesoscale ocean eddies are ubiquitous coherent rotating structures of water with radial scales on the order of 100 kilometers. Eddies play a key role in the transport and mixing of momentum and tracers across the World Ocean. We present a global daily mesoscale ocean eddy dataset that contains ~45 million mesoscale features and 3.3 million eddy trajectories that persist for at least two days, as identified in the AVISO dataset over the period 1993–2014. This dataset, along with the open-source eddy identification software, allows users to extract eddies with any chosen parameters (minimum size, lifetime, etc.), to study global eddy properties and dynamics, and to empirically estimate the impact eddies have on mass or heat transport. Furthermore, our open-source software may be used to identify mesoscale features in model simulations and compare them to observed features. Finally, this dataset can be used to study the interaction between mesoscale ocean eddies and other components of the Earth System. PMID:26097744

  16. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    PubMed

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using the Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that the atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus, and Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution
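
    The per-cell agreement measure used here, the Jaccard similarity index, is simple to compute for two presence/absence grids. A minimal sketch with toy occupancy rasters (the study works per species on a 50-km grid; the arrays below are stand-ins):

      # Jaccard similarity between two toy presence/absence atlases on a
      # common grid; not the study's actual analysis code.
      import numpy as np

      rng = np.random.default_rng(1)
      afe   = rng.random((60, 80)) > 0.6   # Atlas Florae Europaeae occupancy (toy)
      anevp = rng.random((60, 80)) > 0.6   # Atlas of North European Vascular Plants (toy)

      intersection = np.logical_and(afe, anevp).sum()
      union        = np.logical_or(afe, anevp).sum()
      jaccard = intersection / union       # 1.0 = identical distributions, 0.0 = disjoint
      print(f"Jaccard similarity: {jaccard:.3f}")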

  17. Comparison and validation of gridded precipitation datasets for Spain

    NASA Astrophysics Data System (ADS)

    Quintana-Seguí, Pere; Turco, Marco; Míguez-Macho, Gonzalo

    2016-04-01

    In this study, two gridded precipitation datasets are compared and validated in Spain: the recently developed SAFRAN dataset and the Spain02 dataset. These are validated using rain gauges and they are also compared to the low-resolution ERA-Interim reanalysis. The SAFRAN precipitation dataset has been recently produced, using the SAFRAN meteorological analysis, which is extensively used in France (Durand et al. 1993, 1999; Quintana-Seguí et al. 2008; Vidal et al. 2010) and which has recently been applied to Spain (Quintana-Seguí et al. 2015). SAFRAN uses an optimal interpolation (OI) algorithm and uses all available rain gauges from the Spanish State Meteorological Agency (Agencia Estatal de Meteorología, AEMET). The product has a spatial resolution of 5 km and it spans from September 1979 to August 2014. This dataset has been produced mainly to be used in large-scale hydrological applications. Spain02 (Herrera et al. 2012, 2015) is another high-quality precipitation dataset for Spain based on a dense network of quality-controlled stations and it has different versions at different resolutions. In this study we used the version with a resolution of 0.11°. The product spans from 1971 to 2010. Spain02 is well tested and widely used, mainly, but not exclusively, for RCM model validation and statistical downscaling. ERA-Interim is a well-known global reanalysis with a spatial resolution of ~79 km. It has been included in the comparison because it is a widely used product for continental and global scale studies and also in smaller scale studies in data-poor countries. Thus, its comparison with higher resolution products of a data-rich country, such as Spain, allows us to quantify the errors made when using such datasets for national scale studies, in line with some of the objectives of the EU-FP7 eartH2Observe project. The comparison shows that SAFRAN and Spain02 perform similarly, even though their underlying principles are different. Both products are largely
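
    Gauge-based validation of a gridded product of this kind typically reduces to co-located error statistics. A minimal sketch of bias, RMSE, and correlation with toy data, not the study's actual validation code:

      # Co-located validation statistics between a gridded precipitation
      # estimate and a rain gauge; daily values are toy data.
      import numpy as np

      rng = np.random.default_rng(2)
      gauge = rng.gamma(shape=2.0, scale=1.5, size=365)      # daily gauge totals (mm)
      grid  = gauge * 0.9 + rng.normal(0.0, 0.8, size=365)   # co-located grid estimate

      bias = np.mean(grid - gauge)
      rmse = np.sqrt(np.mean((grid - gauge) ** 2))
      corr = np.corrcoef(grid, gauge)[0, 1]
      print(f"bias={bias:+.2f} mm  rmse={rmse:.2f} mm  r={corr:.2f}")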

  18. The Global Precipitation Climatology Project (GPCP) Combined Precipitation Dataset

    NASA Technical Reports Server (NTRS)

    Huffman, George J.; Adler, Robert F.; Arkin, Philip; Chang, Alfred; Ferraro, Ralph; Gruber, Arnold; Janowiak, John; McNab, Alan; Rudolf, Bruno; Schneider, Udo

    1997-01-01

    The Global Precipitation Climatology Project (GPCP) has released the GPCP Version 1 Combined Precipitation Data Set, a global, monthly precipitation dataset covering the period July 1987 through December 1995. The primary product in the dataset is a merged analysis incorporating precipitation estimates from low-orbit-satellite microwave data, geosynchronous-orbit-satellite infrared data, and rain gauge observations. The dataset also contains the individual input fields, a combination of the microwave and infrared satellite estimates, and error estimates for each field. The data are provided on 2.5 deg x 2.5 deg latitude-longitude global grids. Preliminary analyses show general agreement with prior studies of global precipitation and extend prior studies of El Nino-Southern Oscillation precipitation patterns. At the regional scale there are systematic differences with standard climatologies.

  19. Geospatial datasets for watershed delineation and characterization used in the Hawaii StreamStats web application

    USGS Publications Warehouse

    Rea, Alan; Skinner, Kenneth D.

    2012-01-01

    The U.S. Geological Survey Hawaii StreamStats application uses an integrated suite of raster and vector geospatial datasets to delineate and characterize watersheds. The geospatial datasets used to delineate and characterize watersheds on the StreamStats website, and the methods used to develop the datasets are described in this report. The datasets for Hawaii were derived primarily from 10 meter resolution National Elevation Dataset (NED) elevation models, and the National Hydrography Dataset (NHD), using a set of procedures designed to enforce the drainage pattern from the NHD into the NED, resulting in an integrated suite of elevation-derived datasets. Additional sources of data used for computing basin characteristics include precipitation, land cover, soil permeability, and elevation-derivative datasets. The report also includes links for metadata and downloads of the geospatial datasets.

  20. XPAT: a toolkit to conduct cross-platform association studies with heterogeneous sequencing datasets.

    PubMed

    Yu, Yao; Hu, Hao; Bohlender, Ryan J; Hu, Fulan; Chen, Jiun-Sheng; Holt, Carson; Fowler, Jerry; Guthery, Stephen L; Scheet, Paul; Hildebrandt, Michelle A T; Yandell, Mark; Huff, Chad D

    2018-04-06

    High-throughput sequencing data are increasingly being made available to the research community for secondary analyses, providing new opportunities for large-scale association studies. However, heterogeneity in target capture and sequencing technologies often introduce strong technological stratification biases that overwhelm subtle signals of association in studies of complex traits. Here, we introduce the Cross-Platform Association Toolkit, XPAT, which provides a suite of tools designed to support and conduct large-scale association studies with heterogeneous sequencing datasets. XPAT includes tools to support cross-platform aware variant calling, quality control filtering, gene-based association testing and rare variant effect size estimation. To evaluate the performance of XPAT, we conducted case-control association studies for three diseases, including 783 breast cancer cases, 272 ovarian cancer cases, 205 Crohn disease cases and 3507 shared controls (including 1722 females) using sequencing data from multiple sources. XPAT greatly reduced Type I error inflation in the case-control analyses, while replicating many previously identified disease-gene associations. We also show that association tests conducted with XPAT using cross-platform data have comparable performance to tests using matched platform data. XPAT enables new association studies that combine existing sequencing datasets to identify genetic loci associated with common diseases and other complex traits.
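
    The abstract does not spell out XPAT's test statistics. As a generic illustration of a gene-based case-control test of the kind such toolkits support, here is a textbook collapsing (burden) test using Fisher's exact test; the counts are hypothetical and this is not XPAT's implementation.

      # Generic gene-based collapsing (burden) test: compare carriers of
      # qualifying rare variants in one gene between cases and controls.
      from scipy.stats import fisher_exact

      cases_carriers, cases_total       = 23, 783    # hypothetical counts
      controls_carriers, controls_total = 31, 3507

      table = [
          [cases_carriers,    cases_total - cases_carriers],
          [controls_carriers, controls_total - controls_carriers],
      ]
      odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
      print(f"OR={odds_ratio:.2f}  p={p_value:.2e}")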

  1. The Wind Integration National Dataset (WIND) toolkit (Presentation)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Draxl, Caroline (NREL)

    2014-01-01

    Regional wind integration studies require detailed wind power output data at many locations to perform simulations of how the power system will operate under high penetration scenarios. The wind datasets that serve as inputs into the study must realistically reflect the ramping characteristics, spatial and temporal correlations, and capacity factors of the simulated wind plants, as well as being time synchronized with available load profiles. As described in this presentation, the WIND Toolkit fulfills these requirements by providing a state-of-the-art national (US) wind resource, power production and forecast dataset.

  2. FieldSAFE: Dataset for Obstacle Detection in Agriculture.

    PubMed

    Kragh, Mikkel Fly; Christiansen, Peter; Laursen, Morten Stigaard; Larsen, Morten; Steen, Kim Arild; Green, Ole; Karstoft, Henrik; Jørgensen, Rasmus Nyholm

    2017-11-09

    In this paper, we present a multi-modal dataset for obstacle detection in agriculture. The dataset comprises approximately 2 h of raw sensor data from a tractor-mounted sensor system in a grass mowing scenario in Denmark, October 2016. Sensing modalities include stereo camera, thermal camera, web camera, 360° camera, LiDAR and radar, while precise localization is available from fused IMU and GNSS. Both static and moving obstacles are present, including humans, mannequin dolls, rocks, barrels, buildings, vehicles and vegetation. All obstacles have ground truth object labels and geographic coordinates.

  3. FieldSAFE: Dataset for Obstacle Detection in Agriculture

    PubMed Central

    Kragh, Mikkel Fly; Christiansen, Peter; Laursen, Morten Stigaard; Larsen, Morten; Steen, Kim Arild; Green, Ole; Karstoft, Henrik; Jørgensen, Rasmus Nyholm

    2017-01-01

    In this paper, we present a multi-modal dataset for obstacle detection in agriculture. The dataset comprises approximately 2 h of raw sensor data from a tractor-mounted sensor system in a grass mowing scenario in Denmark, October 2016. Sensing modalities include stereo camera, thermal camera, web camera, 360° camera, LiDAR and radar, while precise localization is available from fused IMU and GNSS. Both static and moving obstacles are present, including humans, mannequin dolls, rocks, barrels, buildings, vehicles and vegetation. All obstacles have ground truth object labels and geographic coordinates. PMID:29120383

  4. Fast randomization of large genomic datasets while preserving alteration counts.

    PubMed

    Gobbi, Andrea; Iorio, Francesco; Dawson, Kevin J; Wedge, David C; Tamborero, David; Alexandrov, Ludmil B; Lopez-Bigas, Nuria; Garnett, Mathew J; Jurman, Giuseppe; Saez-Rodriguez, Julio

    2014-09-01

    Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a 'mutually exclusive' manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive. We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in time requirements, with respect to existing implementations/bounds, for equivalent P-value computations. Thus, we propose BiRewire for studying statistical properties in genomic datasets, and other data that can be modeled as bipartite networks. BiRewire is available on BioConductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
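
    A single switching-step is easy to sketch: treat the patient-by-gene mutation matrix as a set of bipartite edges and swap the endpoints of two random edges whenever doing so neither collapses the swap nor duplicates an edge, preserving all patient- and gene-wise mutation counts. This is a plain-Python illustration, not BiRewire's optimized implementation.

      # One degree-preserving switching-step on a bipartite mutation network.
      import random

      def switching_step(edges: set[tuple[str, str]]) -> bool:
          """Try one degree-preserving swap; return True if the network changed."""
          (p1, g1), (p2, g2) = random.sample(sorted(edges), 2)
          if p1 == p2 or g1 == g2:
              return False                        # swap would be a no-op
          if (p1, g2) in edges or (p2, g1) in edges:
              return False                        # swap would duplicate an edge
          edges -= {(p1, g1), (p2, g2)}
          edges |= {(p1, g2), (p2, g1)}
          return True

      edges = {("patientA", "TP53"), ("patientB", "KRAS"), ("patientC", "EGFR")}
      for _ in range(100):   # many steps are needed to draw independent samples
          switching_step(edges)
      print(edges)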

  5. Inter-comparison of multiple statistically downscaled climate datasets for the Pacific Northwest, USA

    PubMed Central

    Jiang, Yueyang; Kim, John B.; Still, Christopher J.; Kerns, Becky K.; Kline, Jeffrey D.; Cunningham, Patrick G.

    2018-01-01

    Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest. PMID:29461513

  6. Inter-comparison of multiple statistically downscaled climate datasets for the Pacific Northwest, USA.

    PubMed

    Jiang, Yueyang; Kim, John B; Still, Christopher J; Kerns, Becky K; Kline, Jeffrey D; Cunningham, Patrick G

    2018-02-20

    Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest.

  7. An RNA-Seq based gene expression atlas of the common bean.

    PubMed

    O'Rourke, Jamie A; Iniguez, Luis P; Fu, Fengli; Bucciarelli, Bruna; Miller, Susan S; Jackson, Scott A; McClean, Philip E; Li, Jun; Dai, Xinbin; Zhao, Patrick X; Hernandez, Georgina; Vance, Carroll P

    2014-10-06

    Common bean (Phaseolus vulgaris) is grown throughout the world and comprises roughly 50% of the grain legumes consumed worldwide. Despite this, genetic resources for common beans have been lacking. Next-generation sequencing has facilitated our investigation of the gene expression profiles associated with biologically important traits in common bean. An increased understanding of gene expression in common bean will improve our understanding of gene expression patterns in other legume species. Combining recently developed genomic resources for Phaseolus vulgaris, including predicted gene calls, with RNA-Seq technology, we measured the gene expression patterns from 24 samples collected from seven tissues at developmentally important stages and from three nitrogen treatments. Gene expression patterns throughout the plant were analyzed to better understand changes due to nodulation, seed development, and nitrogen utilization. We have identified 11,010 genes differentially expressed with a fold change ≥ 2 and a P-value < 0.05 between different tissues at the same time point, 15,752 genes differentially expressed within a tissue due to changes in development, and 2,315 genes expressed only in a single tissue. These analyses identified 2,970 genes with expression patterns that appear to be directly dependent on the source of available nitrogen. Finally, we have assembled this data in a publicly available database, The Phaseolus vulgaris Gene Expression Atlas (Pv GEA), http://plantgrn.noble.org/PvGEA/ . Using the website, researchers can query gene expression profiles of their gene of interest, search for genes expressed in different tissues, or download the dataset in a tabular form. These data provide the basis for a gene expression atlas, which will facilitate functional genomic studies in common bean. Analysis of this dataset has identified genes important in regulating seed composition and has increased our understanding of nodulation and impact of the
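
    The differential-expression cutoff quoted above (fold change ≥ 2 at P < 0.05) corresponds to a simple table filter. A minimal sketch with hypothetical column names, not the Pv GEA schema:

      # Filter a toy expression table for differentially expressed genes:
      # at least two-fold change in either direction at P < 0.05.
      import pandas as pd

      df = pd.DataFrame({
          "gene":        ["Phvul.001G001100", "Phvul.002G123400", "Phvul.003G098700"],
          "fold_change": [2.8, 1.4, 0.3],
          "p_value":     [0.003, 0.20, 0.01],
      })

      mask = ((df["fold_change"] >= 2) | (df["fold_change"] <= 0.5)) & (df["p_value"] < 0.05)
      degs = df[mask]
      print(degs)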

  8. Integrated Analyses of Gene Expression Profiles Digs out Common Markers for Rheumatic Diseases

    PubMed Central

    Wang, Lan; Wu, Long-Fei; Lu, Xin; Mo, Xing-Bo; Tang, Zai-Xiang; Lei, Shu-Feng; Deng, Fei-Yan

    2015-01-01

    Objective Rheumatic diseases have some common symptoms. Extensive gene expression studies, accumulated thus far, have successfully identified signature molecules for each rheumatic disease, individually. However, whether there exist shared factors across rheumatic diseases has yet to be tested. Methods We collected and utilized 6 public microarray datasets covering 4 types of representative rheumatic diseases including rheumatoid arthritis, systemic lupus erythematosus, ankylosing spondylitis, and osteoarthritis. Then we detected overlaps of differentially expressed genes across datasets and performed a meta-analysis aiming at identifying common differentially expressed genes that discriminate between pathological cases and normal controls. To further gain insights into the functions of the identified common differentially expressed genes, we conducted gene ontology enrichment analysis and protein-protein interaction analysis. Results We identified a total of eight differentially expressed genes (TNFSF10, CX3CR1, LY96, TLR5, TXN, TIA1, PRKCH, PRF1), each associated with at least 3 of the 4 studied rheumatic diseases. Meta-analysis warranted the significance of the eight genes and highlighted the general significance of four genes (CX3CR1, LY96, TLR5, and PRF1). Protein-protein interaction and gene ontology enrichment analyses indicated that the eight genes interact with each other to exert functions related to immune response and immune regulation. Conclusion The findings support that there exist common factors underlying rheumatic diseases. For rheumatoid arthritis, systemic lupus erythematosus, ankylosing spondylitis and osteoarthritis diseases, those common factors include TNFSF10, CX3CR1, LY96, TLR5, TXN, TIA1, PRKCH, and PRF1. In-depth studies on these common factors may provide keys to understanding the pathogenesis and developing intervention strategies for rheumatic diseases. PMID:26352601

  9. Passive Containment DataSet

    EPA Pesticide Factsheets

    This data is for Figures 6 and 7 in the journal article. The data also includes the two EPANET input files used for the analysis described in the paper, one for the looped system and one for the block system. This dataset is associated with the following publication: Grayman, W., R. Murray, and D. Savic. Redesign of Water Distribution Systems for Passive Containment of Contamination. JOURNAL OF THE AMERICAN WATER WORKS ASSOCIATION. American Water Works Association, Denver, CO, USA, 108(7): 381-391, (2016).

  10. Metabarcoding of marine nematodes – evaluation of reference datasets used in tree-based taxonomy assignment approach

    PubMed Central

    2016-01-01

    Abstract Background Metabarcoding is becoming a common tool used to assess and compare the diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process, and several taxonomy assignment methods have been proposed to accomplish this task. This publication evaluates the quality of reference datasets, along with several alignment and phylogeny inference methods, used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. New information In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires a high-quality reference dataset. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have a detrimental effect on the resolution of cladograms used in the tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing the above-mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach. PMID:27932919

  11. The Lunar Source Disk: Old Lunar Datasets on a New CD-ROM

    NASA Astrophysics Data System (ADS)

    Hiesinger, H.

    1998-01-01

    A compilation of previously published datasets on CD-ROM is presented. This Lunar Source Disk is intended to be a first step in the improvement/expansion of the Lunar Consortium Disk, in order to create an "image-cube"-like data pool that can be easily accessed and might be useful for a variety of future lunar investigations. All datasets were transformed to a standard map projection that allows direct comparison of different types of information on a pixel-by-pixel basis. Lunar observations have a long history and have been important to mankind for centuries, notably since the work of Plutarch and Galileo. As a consequence of centuries of lunar investigations, knowledge of the characteristics and properties of the Moon has accumulated over time. However, a side effect of this accumulation is that it has become more and more complicated for scientists to review all the datasets obtained through different techniques, to interpret them properly, to recognize their weaknesses and strengths in detail, and to combine them synoptically in geologic interpretations. Such synoptic geologic interpretations are crucial for the study of planetary bodies through remote-sensing data in order to avoid misinterpretation. In addition, many of the modern datasets, derived from Earth-based telescopes as well as from spacecraft missions, are acquired at different geometric and radiometric conditions. These differences make it challenging to compare or combine datasets directly or to extract information from different datasets on a pixel-by-pixel basis. Also, as there is no convention for the presentation of lunar datasets, different authors choose different map projections, depending on the location of the investigated areas and their personal interests. Insufficient or incomplete information on the map parameters used by different authors further complicates the reprojection of these datasets to a standard geometry. The goal of our efforts was to transfer previously published lunar

  12. Lessons learned in the generation of biomedical research datasets using Semantic Open Data technologies.

    PubMed

    Legaz-García, María del Carmen; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2015-01-01

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources. Such heterogeneity makes difficult not only the generation of research-oriented datasets but also their exploitation. In recent years, the Open Data paradigm has proposed new ways of making data available such that sharing and integration are facilitated. Open Data approaches may pursue the generation of content readable only by humans or by both humans and machines; the latter is the focus of our work. The Semantic Web provides a natural technological space for data integration and exploitation and offers a range of technologies for generating not only Open Datasets but also Linked Datasets, that is, open datasets linked to other open datasets. According to Berners-Lee's classification, each open dataset can be given a rating between one and five stars according to its openness, machine-readability, and linkage to other datasets. In recent years, we have developed and applied our SWIT tool, which automates the generation of semantic datasets from heterogeneous data sources. SWIT produces four-star datasets; the fifth star can be obtained when the dataset is linked from external ones. In this paper, we describe how we have applied the tool in two projects related to health care records and orthology data, as well as the major lessons learned from such efforts.
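
    As an illustration of the kind of four-star output described here, a minimal RDF-generation sketch using rdflib follows; the vocabulary and URIs are invented for the example, and SWIT's actual mappings are not shown.

      # Generate a small four-star (open, structured, non-proprietary RDF)
      # dataset with rdflib; namespace and properties are hypothetical.
      from rdflib import Graph, Literal, Namespace, URIRef
      from rdflib.namespace import RDF

      EX = Namespace("http://example.org/ehr/")   # hypothetical project namespace

      g = Graph()
      patient = URIRef(EX["patient/001"])
      g.add((patient, RDF.type, EX.Patient))
      g.add((patient, EX.hasDiagnosis, Literal("essential hypertension")))

      # Turtle serialization is machine-readable RDF; the fifth star would
      # additionally require links from external datasets to these URIs.
      print(g.serialize(format="turtle"))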

  13. Global Precipitation Measurement: Methods, Datasets and Applications

    NASA Technical Reports Server (NTRS)

    Tapiador, Francisco; Turk, Francis J.; Petersen, Walt; Hou, Arthur Y.; Garcia-Ortega, Eduardo; Machado, Luiz A. T.; Angelis, Carlos F.; Salio, Paola; Kidd, Chris; Huffman, George J.

    2011-01-01

    This paper reviews the many aspects of precipitation measurement that are relevant to providing an accurate global assessment of this important environmental parameter. Methods discussed include ground data, satellite estimates and numerical models. First, the methods for measuring, estimating, and modeling precipitation are discussed. Then, the most relevant datasets gathering precipitation information from those three sources are presented. The third part of the paper illustrates a number of the many applications of those measurements and databases. The aim of the paper is to organize the many links and feedbacks between precipitation measurement, estimation and modeling, indicating the uncertainties and limitations of each technique in order to identify areas requiring further attention, and to show the limits within which datasets can be used.

  14. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  15. Annotating spatio-temporal datasets for meaningful analysis in the Web

    NASA Astrophysics Data System (ADS)

    Stasch, Christoph; Pebesma, Edzer; Scheider, Simon

    2014-05-01

    More and more environmental datasets that vary in space and time are available on the Web. This comes with the advantage that the data can be used for purposes other than those originally foreseen, but also with the danger that users may apply inappropriate analysis procedures due to a lack of important assumptions made during the data collection process. In order to guide users towards a meaningful (statistical) analysis of spatio-temporal datasets available on the Web, we have developed a Higher-Order-Logic formalism that captures some relevant assumptions in our previous work [1]. It allows proofs about meaningful spatial prediction and aggregation to be constructed in a semi-automated fashion. In this poster presentation, we will present a concept for annotating spatio-temporal datasets available on the Web with concepts defined in our formalism. To this end, we have defined a subset of the formalism as a Web Ontology Language (OWL) pattern. It allows capturing the distinction between the different spatio-temporal variable types, i.e. point patterns, fields, lattices and trajectories, that in turn determine whether a particular dataset can be interpolated or aggregated in a meaningful way using a certain procedure. The actual annotations that link spatio-temporal datasets with the concepts in the ontology pattern are provided as Linked Data. In order to allow data producers to add the annotations to their datasets, we have implemented a Web portal that uses a triple store at the backend to store the annotations and to make them available in the Linked Data cloud. Furthermore, we have implemented functions in the statistical environment R to retrieve the RDF annotations and, based on these annotations, to support a stronger typing of spatio-temporal datatypes, guiding towards a meaningful analysis in R. [1] Stasch, C., Scheider, S., Pebesma, E., Kuhn, W. (2014): "Meaningful spatial prediction and aggregation", Environmental Modelling & Software, 51, 149-165.

  16. A dataset of multiresolution functional brain parcellations in an elderly population with no or mild cognitive impairment.

    PubMed

    Tam, Angela; Dansereau, Christian; Badhwar, AmanPreet; Orban, Pierre; Belleville, Sylvie; Chertkow, Howard; Dagher, Alain; Hanganu, Alexandru; Monchi, Oury; Rosa-Neto, Pedro; Shmuel, Amir; Breitner, John; Bellec, Pierre

    2016-12-01

    We present group brain parcellations at eight resolutions, generated from clusters of resting-state functional magnetic resonance images of 99 cognitively normal elderly persons and 129 patients with mild cognitive impairment, pooled from four independent datasets. This dataset was generated as part of the following study: Common Effects of Amnestic Mild Cognitive Impairment on Resting-State Connectivity Across Four Independent Studies (Tam et al., 2015) [1]. The brain parcellations have been registered to both symmetric and asymmetric MNI brain templates and generated using a method called bootstrap analysis of stable clusters (BASC) (Bellec et al., 2010) [2]. We present two variants of these parcellations. One variant contains bihemispheric parcels (4, 6, 12, 22, 33, 65, 111, and 208 total parcels across eight resolutions). The second variant contains spatially connected regions of interest (ROIs) that span only one hemisphere (10, 17, 30, 51, 77, 199, and 322 total ROIs across eight resolutions). We also present maps illustrating functional connectivity differences between patients and controls for four regions of interest (striatum, dorsal prefrontal cortex, middle temporal lobe, and medial frontal cortex). The brain parcels and associated statistical maps have been publicly released as 3D volumes, available in .mnc and .nii file formats on figshare and on NeuroVault. Finally, the code used to generate this dataset is available on GitHub.

  17. Land cover trends dataset, 1973-2000

    USGS Publications Warehouse

    Soulard, Christopher E.; Acevedo, William; Auch, Roger F.; Sohl, Terry L.; Drummond, Mark A.; Sleeter, Benjamin M.; Sorenson, Daniel G.; Kambly, Steven; Wilson, Tamara S.; Taylor, Janis L.; Sayler, Kristi L.; Stier, Michael P.; Barnes, Christopher A.; Methven, Steven C.; Loveland, Thomas R.; Headley, Rachel; Brooks, Mark S.

    2014-01-01

    The U.S. Geological Survey Land Cover Trends Project is releasing a 1973–2000 time-series land-use/land-cover dataset for the conterminous United States. The dataset contains 5 dates of land-use/land-cover data for 2,688 sample blocks randomly selected within 84 ecological regions. The nominal dates of the land-use/land-cover maps are 1973, 1980, 1986, 1992, and 2000. The land-use/land-cover maps were classified manually from Landsat Multispectral Scanner, Thematic Mapper, and Enhanced Thematic Mapper Plus imagery using a modified Anderson Level I classification scheme. The resulting land-use/land-cover data has a 60-meter resolution and the projection is set to Albers Equal-Area Conic, North American Datum of 1983. The files are labeled using a standard file naming convention that contains the number of the ecoregion, sample block, and Landsat year. The downloadable files are organized by ecoregion, and are available in the ERDAS IMAGINE™ (.img) raster file format.

  18. DATS, the data tag suite to enable discoverability of datasets.

    PubMed

    Sansone, Susanna-Assunta; Gonzalez-Beltran, Alejandra; Rocca-Serra, Philippe; Alter, George; Grethe, Jeffrey S; Xu, Hua; Fore, Ian M; Lyle, Jared; Gururaj, Anupama E; Chen, Xiaoling; Kim, Hyeon-Eui; Zong, Nansu; Li, Yueling; Liu, Ruiling; Ozyurt, I Burak; Ohno-Machado, Lucila

    2017-06-06

    Today's science increasingly requires effective ways to find and access existing datasets that are distributed across a range of repositories. For researchers in the life sciences, discoverability of datasets may soon become as essential as identifying the latest publications via PubMed. Through an international collaborative effort funded by the National Institutes of Health (NIH)'s Big Data to Knowledge (BD2K) initiative, we have designed and implemented the DAta Tag Suite (DATS) model to support the DataMed data discovery index. DataMed's goal is to be for data what PubMed has been for the scientific literature. Akin to the Journal Article Tag Suite (JATS) used in PubMed, the DATS model enables submission of metadata on datasets to DataMed. DATS has a core set of elements, which are generic and applicable to any type of dataset, and an extended set that can accommodate more specialized data types. DATS is a platform-independent model also available as an annotated serialization in schema.org, which in turn is widely used by major search engines like Google, Microsoft, Yahoo and Yandex.

  19. A large dataset of protein dynamics in the mammalian heart proteome

    PubMed Central

    Lau, Edward; Cao, Quan; Ng, Dominic C.M.; Bleakley, Brian J.; Dincer, T. Umut; Bot, Brian M.; Wang, Ding; Liem, David A.; Lam, Maggie P.Y.; Ge, Junbo; Ping, Peipei

    2016-01-01

    Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding of the underlying mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegeneration. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-lives of 3,228 cardiac proteins and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems. PMID:26977904

  20. A large dataset of protein dynamics in the mammalian heart proteome.

    PubMed

    Lau, Edward; Cao, Quan; Ng, Dominic C M; Bleakley, Brian J; Dincer, T Umut; Bot, Brian M; Wang, Ding; Liem, David A; Lam, Maggie P Y; Ge, Junbo; Ping, Peipei

    2016-03-15

    Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding of the underlying mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegeneration. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-lives of 3,228 cardiac proteins and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems.
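
    For these two records, the core computation behind a half-life table of this kind can be sketched as a first-order labeling-kinetics fit: estimate the turnover rate constant from the label incorporation time course, then convert it to a half-life. The model and toy numbers below are illustrative assumptions, not the paper's pipeline.

      # Estimate a protein half-life from a metabolic labeling time course
      # assuming first-order kinetics; measurements are toy data.
      import numpy as np
      from scipy.optimize import curve_fit

      def incorporation(t, k):
          """Fraction of labeled protein after t days, first-order kinetics."""
          return 1.0 - np.exp(-k * t)

      days    = np.array([0.0, 1.0, 3.0, 7.0, 14.0])
      labeled = np.array([0.0, 0.18, 0.45, 0.75, 0.93])   # hypothetical fractions

      (k,), _ = curve_fit(incorporation, days, labeled, p0=[0.1])
      half_life = np.log(2.0) / k                          # t1/2 = ln 2 / k
      print(f"k={k:.3f}/day  t1/2={half_life:.1f} days")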

  1. A Dataset from TIMSS to Examine the Relationship between Computer Use and Mathematics Achievement

    ERIC Educational Resources Information Center

    Kadijevich, Djordje M.

    2015-01-01

    Because the relationship between computer use and achievement is still puzzling, there is a need to prepare and analyze good quality datasets on computer use and achievement. Such a dataset can be derived from TIMSS data. This paper describes how this dataset can be prepared. It also gives an example of how the dataset may be analyzed. The…

  2. The Development of a Noncontact Letter Input Interface “Fingual” Using Magnetic Dataset

    NASA Astrophysics Data System (ADS)

    Fukushima, Taishi; Miyazaki, Fumio; Nishikawa, Atsushi

    We have developed a noncontact letter input interface called “Fingual”. Fingual uses a glove mounted with inexpensive, small magnetic sensors. Using the glove, users can input letters by forming finger alphabets, a kind of sign language. The proposed method uses a dataset consisting of magnetic-field measurements and the corresponding letter information. In this paper, we show two recognition methods using this dataset. The first method uses the Euclidean norm; the second additionally uses a Gaussian function as a weighting function. We then conducted verification experiments on the recognition rate of each method in two situations: in one, subjects used their own dataset; in the other, they used another person's dataset. As a result, the proposed method could recognize letters at a high rate in both situations, although using one's own dataset works better than using another person's. Although Fingual needs to collect a magnetic dataset for each letter in advance, its strength is the ability to recognize letters without complicated calculations such as solving inverse problems. This paper reports the results of the recognition experiments and shows the utility of the proposed system “Fingual”.
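
    The two recognition rules described above can be sketched as template matching over the stored magnetic dataset: rank letters by Euclidean distance to stored readings, optionally converting distances to Gaussian weights. The data layout and sigma value below are assumptions for illustration, not the paper's actual sensor configuration.

      # Letter recognition by Euclidean-norm matching with optional Gaussian
      # weighting over a per-letter magnetic dataset; all values are toy data.
      import numpy as np

      dataset = {                      # letter -> stored magnetic-field templates
          "A": np.array([[0.10, 0.82, 0.33], [0.12, 0.80, 0.35]]),
          "B": np.array([[0.55, 0.20, 0.70], [0.53, 0.22, 0.68]]),
      }

      def recognize(reading, sigma=0.05):
          best_letter, best_score = None, -np.inf
          for letter, templates in dataset.items():
              d = np.linalg.norm(templates - reading, axis=1)     # Euclidean norms
              score = np.exp(-(d ** 2) / (2 * sigma ** 2)).sum()  # Gaussian weighting
              if score > best_score:
                  best_letter, best_score = letter, score
          return best_letter

      print(recognize(np.array([0.11, 0.81, 0.34])))   # -> "A"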

  3. A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie

    PubMed Central

    Hanke, Michael; Baumgartner, Florian J.; Ibe, Pierre; Kaule, Falko R.; Pollmann, Stefan; Speck, Oliver; Zinke, Wolf; Stadler, Jörg

    2014-01-01

    Here we present a high-resolution functional magnetic resonance imaging (fMRI) dataset – 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film (“Forrest Gump”). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, and social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures – from stimulus creation to data analysis. In order to facilitate replicative and derived works, only free and open-source software was utilized. PMID:25977761

  4. A new dataset validation system for the Planetary Science Archive

    NASA Astrophysics Data System (ADS)

    Manaud, N.; Zender, J.; Heather, D.; Martinez, S.

    2007-08-01

    The Planetary Science Archive is the official archive for the Mars Express mission. It received its first data at the end of 2004. These data are delivered by the PI teams to the PSA team as datasets, which are formatted in conformance with the Planetary Data System (PDS). The PI teams are responsible for analyzing and calibrating the instrument data as well as for the production of reduced and calibrated data. They are also responsible for the scientific validation of these data. ESA is responsible for the long-term data archiving and distribution to the scientific community and must ensure, in this regard, that all archived products meet quality standards. To do so, an archive peer review is used to control the quality of the Mars Express science data archiving process. However, a full validation of its content is missing. An independent review board recently recommended that the completeness of the archive as well as the consistency of the delivered data should be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system functionality. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall allow anomalies to be tracked and the completeness of datasets to be controlled. It shall ensure that PSA end-users: (1) can rely on the results of their queries, (2) will get data products that are suitable for scientific analysis, and (3) can find all science data acquired during a mission. We define dataset validation as the verification and assessment process that checks the dataset content against pre-defined top-level criteria, which represent the general characteristics of good-quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results and those interfacing with the PSA database. The validation software tool is a multi-mission tool that

  5. Data assimilation and model evaluation experiment datasets

    NASA Technical Reports Server (NTRS)

    Lai, Chung-Cheng A.; Qian, Wen; Glenn, Scott M.

    1994-01-01

    The Institute for Naval Oceanography, in cooperation with Naval Research Laboratories and universities, executed the Data Assimilation and Model Evaluation Experiment (DAMEE) for the Gulf Stream region during fiscal years 1991-1993. Enormous effort has gone into the preparation of several high-quality and consistent datasets for model initialization and verification. This paper describes the preparation process, the temporal and spatial scopes, the contents, the structure, etc., of these datasets. The goal of DAMEE and the need of data for the four phases of experiment are briefly stated. The preparation of DAMEE datasets consisted of a series of processes: (1) collection of observational data; (2) analysis and interpretation; (3) interpolation using the Optimum Thermal Interpolation System package; (4) quality control and re-analysis; and (5) data archiving and software documentation. The data products from these processes included a time series of 3D fields of temperature and salinity, 2D fields of surface dynamic height and mixed-layer depth, analysis of the Gulf Stream and rings system, and bathythermograph profiles. To date, these are the most detailed and high-quality data for mesoscale ocean modeling, data assimilation, and forecasting research. Feedback from ocean modeling groups who tested this data was incorporated into its refinement. Suggestions for DAMEE data usages include (1) ocean modeling and data assimilation studies, (2) diagnosis and theoretical studies, and (3) comparisons with locally detailed observations.

  6. Recent Development on the NOAA's Global Surface Temperature Dataset

    NASA Astrophysics Data System (ADS)

    Zhang, H. M.; Huang, B.; Boyer, T.; Lawrimore, J. H.; Menne, M. J.; Rennie, J.

    2016-12-01

    Global Surface Temperature (GST) is one of the most widely used indicators for climate trend and extreme analyses. A widely used GST dataset is the NOAA merged land-ocean surface temperature dataset known as NOAAGlobalTemp (formerly MLOST). NOAAGlobalTemp was recently updated from version 3.5.4 to version 4. The update includes a significant improvement in the ocean surface component (Extended Reconstructed Sea Surface Temperature, or ERSST, from version 3b to version 4), which resulted in increased temperature trends in recent decades. Since then, advances have been made in both the ocean component (ERSST) and the land component (GHCN-Monthly), including the inclusion of Argo float SSTs and expanded EOT modes in ERSST, and the use of the ISTI databank in GHCN-Monthly. In this presentation, we describe the impact of these improvements on the merged global temperature dataset, in terms of global trends and other aspects.

  7. Integrative Exploratory Analysis of Two or More Genomic Datasets.

    PubMed

    Meng, Chen; Culhane, Aedin

    2016-01-01

    Exploratory analysis is an essential step in the analysis of high-throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple types of omics or multi-assay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure within and between multiple high-dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower-dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can further facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Lastly, as an example of integrative analysis of multi-assay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from induced pluripotent stem (iPS) and embryonic stem (ES) cell lines.
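
    For readers unfamiliar with co-inertia analysis, the sketch below shows its basic linear-algebra form: with two column-centred matrices sharing the same samples, the co-inertia axes are the singular vectors of the cross-covariance matrix. This is a bare-bones Python rendition assuming uniform row weights, not the R workflow used in the chapter itself.

        import numpy as np

        def coinertia(X, Y, n_axes=2):
            """Sample scores of X and Y on shared co-inertia axes."""
            Xc = X - X.mean(axis=0)        # column-centre each dataset
            Yc = Y - Y.mean(axis=0)
            U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
            scores_x = Xc @ U[:, :n_axes]  # project samples of X
            scores_y = Yc @ Vt[:n_axes].T  # project samples of Y
            explained = s[:n_axes] ** 2 / (s ** 2).sum()
            return scores_x, scores_y, explained

        rng = np.random.default_rng(0)
        mrna = rng.normal(size=(60, 500))  # e.g. 60 cell lines x 500 genes
        prot = mrna[:, :200] + rng.normal(scale=0.5, size=(60, 200))
        sx, sy, ev = coinertia(mrna, prot)
        print("co-inertia explained by first 2 axes:", ev.round(3))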

  8. Status and Preliminary Evaluation for Chinese Re-Analysis Datasets

    NASA Astrophysics Data System (ADS)

    bin, zhao; chunxiang, shi; tianbao, zhao; dong, si; jingwei, liu

    2016-04-01

    Based on the operational T639L60 spectral model combined with the Hybrid_GSI assimilation system, and using meteorological observations including radiosondes, buoys, and satellites, a set of Chinese Re-Analysis (CRA) datasets is being developed by the National Meteorological Information Center (NMIC) of the China Meteorological Administration (CMA). The datasets are run at 30 km (0.28° latitude/longitude) resolution, which is higher than that of most existing reanalysis datasets. The reanalysis is undertaken in an effort to enhance the accuracy of historical synoptic analysis and to aid detailed investigation of various weather and climate systems. The reanalysis is currently at the stage of preliminary experimental analysis. One year of forecast data, covering June 2013 to May 2014, has been simulated and used in synoptic and climate evaluation. We first examine the model's prediction ability with the new assimilation system and find significant improvement in the Northern and Southern Hemispheres: owing to the addition of new satellite data, upper-level prediction is clearly improved relative to the operational T639L60 model, and overall prediction stability is enhanced. In the climatological analysis, comparison with the ERA-40, NCEP/NCAR and NCEP/DOE reanalyses shows that the simulated surface temperature is slightly too low over land and too high over the ocean, the 850-hPa specific humidity shows a weakened anomaly, and the zonal wind anomaly is concentrated in the equatorial tropics. Meanwhile, the reanalysis dataset reproduces various climate indices well, such as the subtropical high index and the East Asia Subtropical Summer Monsoon Index (ESMI), and especially the Indian and western North Pacific monsoon indices. We will next further improve the assimilation system and the dynamical simulation performance, and produce a 40-year (1979-2018) reanalysis dataset, providing a more comprehensive basis for synoptic and climate diagnosis.

  9. Realistic computer network simulation for network intrusion detection dataset generation

    NASA Astrophysics Data System (ADS)

    Payer, Garrett

    2015-05-01

    The KDD-99 Cup dataset is dead. While it can continue to be used as a toy example, the age of this dataset makes it all but useless for intrusion detection research and data mining. Many of the attacks used within the dataset are obsolete and do not reflect the features important for intrusion detection in today's networks. Creating a new dataset encompassing a large cross section of the attacks found on the Internet today could be useful, but would eventually fall to the same problem as the KDD-99 Cup; its usefulness would diminish after a period of time. To continue research into intrusion detection, the generation of new datasets needs to be as dynamic and as quick as the attacker. Simply examining existing network traffic and using domain experts such as intrusion analysts to label traffic is inefficient, expensive, and not scalable. The only viable methodology is simulation using technologies including virtualization, attack-toolsets such as Metasploit and Armitage, and sophisticated emulation of threat and user behavior. Simulating actual user behavior and network intrusion events dynamically not only allows researchers to vary scenarios quickly, but enables online testing of intrusion detection mechanisms by interacting with data as it is generated. As new threat behaviors are identified, they can be added to the simulation to make quicker determinations as to the effectiveness of existing and ongoing network intrusion technology, methodology and models.

  10. Datasets related to in-land water for limnology and remote sensing applications: distance-to-land, distance-to-water, water-body identifier and lake-centre co-ordinates.

    PubMed

    Carrea, Laura; Embury, Owen; Merchant, Christopher J

    2015-11-01

    Datasets containing information to locate and identify water bodies have been generated from data locating static water bodies with a resolution of about 300 m (1/360°), recently released by the Land Cover Climate Change Initiative (LC CCI) of the European Space Agency. The LC CCI water-bodies dataset was obtained from multi-temporal metrics based on time series of the backscattered intensity recorded by ASAR on Envisat between 2005 and 2010. The newly derived datasets coherently provide: distance to land, distance to water, water-body identifiers and lake-centre locations. The water-body identifier dataset locates the water bodies by assigning the identifiers of the Global Lakes and Wetlands Database (GLWD), and lake centres are defined for inland waters for which GLWD IDs were determined. The new datasets therefore link recent lake/reservoir/wetland extents to the GLWD, together with a set of coordinates which unambiguously locate the water bodies in the database. Information on distance to land for each water cell and distance to water for each land cell has many potential applications in remote sensing, where the applicability of geophysical retrieval algorithms may be affected by the presence of water or land within a satellite field of view (image pixel). During the generation and validation of the datasets, some limitations of the GLWD database and of the LC CCI water-bodies mask were found. Some examples of these inaccuracies/limitations are presented and discussed. Temporal change in water-body extent is common. Future versions of the LC CCI dataset are planned to represent temporal variation, and this will permit these derived datasets to be updated.
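
    The distance-to-land and distance-to-water layers can be understood as Euclidean distance transforms of a binary land/water mask. The sketch below reproduces the idea on a toy grid; the 300 m cell size matches the LC CCI mask, but the mask itself and the scipy-based implementation are illustrative assumptions, not the authors' production code.

        import numpy as np
        from scipy.ndimage import distance_transform_edt

        water = np.zeros((10, 10), dtype=bool)
        water[3:7, 4:8] = True          # a toy lake in a land grid
        cell_km = 0.3                   # ~300 m cells, as in the LC CCI mask

        # distance_transform_edt measures distance to the nearest zero cell:
        dist_to_water = distance_transform_edt(~water) * cell_km  # on land
        dist_to_land = distance_transform_edt(water) * cell_km    # on water

        print(dist_to_water.max(), dist_to_land.max())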

  11. Factor XIII Val34Leu polymorphism and the risk of myocardial infarction under the age of 36 years.

    PubMed

    Rallidis, Loukianos S; Politou, Marianna; Komporozos, Christoforos; Panagiotakos, Demosthenes B; Belessi, Chrisoula I; Travlou, Anthi; Lekakis, John; Kremastinos, Dimitrios T

    2008-06-01

    There are limited and controversial data regarding the impact of factor XIII (FXIII) Val34Leu polymorphism in the pathogenesis of premature myocardial infarction (MI). We examined whether FXIII Val34Leu polymorphism is associated with the development of early MI. We recruited 159 consecutive patients who had survived their first acute MI under the age of 36 years (mean age = 32.1 +/- 3.6 years, 138 were men). The control group consisted of 121 healthy individuals matched with cases for age and sex, without a family history of premature coronary heart disease (CHD). FXIII Val34Leu polymorphism was tested with polymerase chain reaction and reverse hybridization. There was a lower prevalence of carriers of the Leu34 allele in patients than in controls (30.2 vs. 47.1%, p = 0.006). FXIII Val34Leu polymorphism was associated with lower risk for acute MI after adjusting for major cardiovascular risk factors (odds ratio [OR] = 0.51, 95% confidence interval [CI] 0.27-0.95, p = 0.03). Subgroup analysis according to angiographic findings ("normal" coronary arteries [n = 29] or significant CHD [n = 130]) showed that only patients with MI and significant CHD had lower prevalence of carriers of the Leu34 allele compared to controls after adjusting for major cardiovascular risk factors (OR = 0.42, 95% CI 0.22-0.83, p = 0.01). Our data indicate that FXIII Val34Leu polymorphism has a protective effect against the development of MI under the age of 36 years, particularly in the setting of significant CHD.
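
    For orientation, the sketch below shows how an unadjusted odds ratio and its confidence interval are computed from a 2x2 carrier table. The counts are reconstructed from the percentages quoted above and the interval uses Woolf's method, so the output only approximates the study's covariate-adjusted estimates.

        import math

        def odds_ratio_ci(a, b, c, d, z=1.96):
            """OR and 95% CI from a 2x2 table: a/b = carrier/non-carrier
            cases, c/d = carrier/non-carrier controls."""
            or_ = (a * d) / (b * c)
            se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)  # Woolf's method
            lo = math.exp(math.log(or_) - z * se_log)
            hi = math.exp(math.log(or_) + z * se_log)
            return or_, (lo, hi)

        cases_carrier = round(0.302 * 159)     # 30.2% of 159 patients
        controls_carrier = round(0.471 * 121)  # 47.1% of 121 controls
        print(odds_ratio_ci(cases_carrier, 159 - cases_carrier,
                            controls_carrier, 121 - controls_carrier))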

  12. Isotherm ranking and selection using thirteen literature datasets involving hydrophobic organic compounds.

    PubMed

    Matott, L Shawn; Jiang, Zhengzheng; Rabideau, Alan J; Allen-King, Richelle M

    2015-01-01

    Numerous isotherm expressions have been developed for describing sorption of hydrophobic organic compounds (HOCs), including "dual-mode" approaches that combine nonlinear behavior with a linear partitioning component. Choosing among these alternative expressions for describing a given dataset is an important task that can significantly influence subsequent transport modeling and/or mechanistic interpretation. In this study, a series of numerical experiments were undertaken to identify "best-in-class" isotherms by refitting 10 alternative models to a suite of 13 previously published literature datasets. The corrected Akaike Information Criterion (AICc) was used for ranking these alternative fits and distinguishing between plausible and implausible isotherms for each dataset. The occurrence of multiple plausible isotherms was inversely correlated with dataset "richness", such that datasets with fewer observations and/or a narrow range of aqueous concentrations resulted in a greater number of plausible isotherms. Overall, only the Polanyi-partition dual-mode isotherm was classified as "plausible" across all 13 of the considered datasets, indicating substantial statistical support consistent with current advances in sorption theory. However, these findings are predicated on the use of the AICc measure as an unbiased ranking metric and the adoption of a subjective, but defensible, threshold for separating plausible and implausible isotherms. Copyright © 2015 Elsevier B.V. All rights reserved.
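
    The ranking machinery rests on the corrected Akaike Information Criterion; a minimal sketch of how alternative isotherm fits might be scored and separated into plausible and implausible sets is given below. The model names, residual sums of squares and the delta-AICc < 2 rule of thumb are illustrative assumptions; the study adopted its own defensible threshold.

        import numpy as np

        def aicc(rss, n, k):
            """Corrected AIC for a least-squares fit: n data points,
            k estimated parameters (including the error variance)."""
            return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

        def rank_models(fits, n):
            """fits: {name: (rss, n_isotherm_params)} -> sorted delta-AICc."""
            scores = {m: aicc(rss, n, k + 1) for m, (rss, k) in fits.items()}
            best = min(scores.values())
            return sorted((s - best, m) for m, s in scores.items())

        fits = {"Freundlich": (0.42, 2), "Langmuir": (0.40, 2),
                "Polanyi-partition": (0.31, 3)}
        for delta, model in rank_models(fits, n=20):
            print(f"{model:18s} dAICc = {delta:5.2f}",
                  "(plausible)" if delta < 2 else "(implausible)")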

  13. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    NASA Astrophysics Data System (ADS)

    Beck, Hylke E.; Vergopolan, Noemi; Pan, Ming; Levizzani, Vincenzo; van Dijk, Albert I. J. M.; Weedon, Graham P.; Brocca, Luca; Pappenberger, Florian; Huffman, George J.; Wood, Eric F.

    2017-12-01

    We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000-2016. Thirteen non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76 086 gauges worldwide. Another nine gauge-corrected datasets were evaluated using hydrological modeling, by calibrating the HBV conceptual model against streamflow records for each of 9053 small to medium-sized (<50 000 km²) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR) and the satellite- and reanalysis-based CHIRP V2.0 dataset, the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified, and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed the one indirectly incorporating gauge data through another multi-source dataset (PERSIANN-CDR V1R1). Our results highlight large differences in estimation accuracy, and hence the importance of P

  14. Reference datasets for 2-treatment, 2-sequence, 2-period bioequivalence studies.

    PubMed

    Schütz, Helmut; Labes, Detlew; Fuglsang, Anders

    2014-11-01

    It is difficult to validate statistical software used to assess bioequivalence since very few datasets with known results are in the public domain, and the few that are published are of moderate size and balanced. The purpose of this paper is therefore to introduce reference datasets of varying complexity in terms of dataset size and characteristics (balance, range, outlier presence, residual error distribution) for 2-treatment, 2-period, 2-sequence bioequivalence studies and to report their point estimates and 90% confidence intervals which companies can use to validate their installations. The results for these datasets were calculated using the commercial packages EquivTest, Kinetica, SAS and WinNonlin, and the non-commercial package R. The results of three of these packages mostly agree, but imbalance between sequences seems to provoke questionable results with one package, which illustrates well the need for proper software validation.
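
    A minimal sketch of the analysis these reference datasets are designed to validate is given below: a fixed-effects ANOVA on log(AUC) for a 2-treatment, 2-sequence, 2-period crossover, yielding the point estimate and 90% confidence interval for the test/reference ratio. Fitting subject as a fixed effect absorbs the sequence term for complete data; the column names and toy data are assumptions, not one of the published reference datasets.

        import numpy as np
        import pandas as pd
        from scipy import stats
        import statsmodels.formula.api as smf

        def be_90ci(df):
            """Point estimate (%) and 90% CI for the T/R geometric mean
            ratio; df columns: subject, period, treatment ('T'/'R'), auc."""
            df = df.assign(logauc=np.log(df["auc"]))
            fit = smf.ols("logauc ~ C(subject) + C(period) + C(treatment)",
                          data=df).fit()
            est = fit.params["C(treatment)[T.T]"]
            se = fit.bse["C(treatment)[T.T]"]
            t = stats.t.ppf(0.95, fit.df_resid)   # two-sided 90% interval
            return (np.exp(est) * 100,
                    (np.exp(est - t * se) * 100, np.exp(est + t * se) * 100))

        toy = pd.DataFrame({
            "subject": [1, 1, 2, 2, 3, 3, 4, 4],
            "period": [1, 2, 1, 2, 1, 2, 1, 2],
            "treatment": ["T", "R", "R", "T", "T", "R", "R", "T"],
            "auc": [100, 95, 88, 90, 120, 118, 105, 101],
        })
        print(be_90ci(toy))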

  15. Multi-facetted Metadata - Describing datasets with different metadata schemas at the same time

    NASA Astrophysics Data System (ADS)

    Ulbricht, Damian; Klump, Jens; Bertelmann, Roland

    2013-04-01

    Inspired by the wish to re-use research data, much work is being done to bring the data systems of the earth sciences together. Discovery metadata is disseminated to data portals to allow the building of customized indexes of catalogued dataset items. Data that were once acquired in the context of a scientific project are open for reappraisal and can now be used by scientists who were not part of the original research team. To make data re-use easier, measurement methods and measurement parameters must be documented in an application metadata schema and described in a written publication. Linking datasets to publications - as DataCite [1] does - again requires a specific metadata schema, and every new use context of the measured data may require yet another metadata schema, sharing only a subset of information with the meta-information already present. To cope with the problem of metadata schema diversity in our common data repository at GFZ Potsdam, we established a solution to store file-based research data and describe these with an arbitrary number of metadata schemas. The core component of the data repository is an eSciDoc infrastructure that provides versioned container objects, called eSciDoc [2] "items". The eSciDoc content model allows assigning files to "items" and adding any number of metadata records to these "items". The eSciDoc items can be submitted, revised, and finally published, which makes the data and metadata available through the internet worldwide. GFZ Potsdam uses eSciDoc to support its scientific publishing workflow, including mechanisms for data review in peer-review processes by providing temporary web links for external reviewers who do not have credentials to access the data. Based on the eSciDoc API, panMetaDocs [3] provides a web portal for data management in research projects. PanMetaDocs, which is based on panMetaWorks [4], is a PHP-based web application that allows data to be described with any XML-based schema. It uses the eSciDoc infrastructures

  16. Atlas Toolkit: Fast registration of 3D morphological datasets in the absence of landmarks

    PubMed Central

    Grocott, Timothy; Thomas, Paul; Münsterberg, Andrea E.

    2016-01-01

    Image registration is a gateway technology for Developmental Systems Biology, enabling computational analysis of related datasets within a shared coordinate system. Many registration tools rely on landmarks to ensure that datasets are correctly aligned; yet suitable landmarks are not present in many datasets. Atlas Toolkit is a Fiji/ImageJ plugin collection offering elastic group-wise registration of 3D morphological datasets, guided by segmentation of the interesting morphology. We demonstrate the method by combinatorial mapping of cell signalling events in the developing eyes of chick embryos, and use the integrated datasets to predictively enumerate Gene Regulatory Network states. PMID:26864723

  17. Atlas Toolkit: Fast registration of 3D morphological datasets in the absence of landmarks.

    PubMed

    Grocott, Timothy; Thomas, Paul; Münsterberg, Andrea E

    2016-02-11

    Image registration is a gateway technology for Developmental Systems Biology, enabling computational analysis of related datasets within a shared coordinate system. Many registration tools rely on landmarks to ensure that datasets are correctly aligned; yet suitable landmarks are not present in many datasets. Atlas Toolkit is a Fiji/ImageJ plugin collection offering elastic group-wise registration of 3D morphological datasets, guided by segmentation of the interesting morphology. We demonstrate the method by combinatorial mapping of cell signalling events in the developing eyes of chick embryos, and use the integrated datasets to predictively enumerate Gene Regulatory Network states.

  18. Correction of elevation offsets in multiple co-located lidar datasets

    USGS Publications Warehouse

    Thompson, David M.; Dalyander, P. Soupy; Long, Joseph W.; Plant, Nathaniel G.

    2017-04-07

    Topographic elevation data collected with airborne light detection and ranging (lidar) can be used to analyze short- and long-term changes to beach and dune systems. Analysis of multiple lidar datasets at Dauphin Island, Alabama, revealed systematic, island-wide elevation differences on the order of tens of centimeters that were not attributable to real-world change and therefore were likely to represent systematic sampling offsets. These offsets vary between the datasets but appear spatially consistent within a given survey. This report describes a method that was developed to identify and correct offsets between lidar datasets collected over the same site at different times, so that true elevation changes over time, associated with sediment accumulation or erosion, can be analyzed.
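
    The core of such an offset correction can be summarised in a few lines: difference two co-located grids over terrain presumed stable, take a robust statistic of the difference, and subtract it. The sketch below is a simplified stand-in for the report's procedure, with synthetic grids and a trivially defined stable-area mask.

        import numpy as np

        def systematic_offset(dem_a, dem_b, stable_mask):
            """Median elevation difference over stable (unchanged) terrain."""
            return np.nanmedian(dem_b[stable_mask] - dem_a[stable_mask])

        rng = np.random.default_rng(1)
        dem_a = rng.normal(2.0, 0.5, size=(100, 100))
        dem_b = dem_a + 0.23 + rng.normal(0, 0.05, size=(100, 100))  # 23 cm
        stable = np.ones_like(dem_a, dtype=bool)  # in practice: roads, etc.

        offset = systematic_offset(dem_a, dem_b, stable)
        dem_b_corrected = dem_b - offset
        print(f"estimated offset: {offset * 100:.1f} cm")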

  19. DNAism: exploring genomic datasets on the web with Horizon Charts.

    PubMed

    Rio Deiros, David; Gibbs, Richard A; Rogers, Jeffrey

    2016-01-27

    Computational biologists daily face the need to explore massive amounts of genomic data. New visualization techniques can help researchers navigate and understand these big data. Horizon Charts are a relatively new visualization method that, under the right circumstances, maximizes data density without losing graphical perception. Horizon Charts have been successfully applied to understand multi-metric time series data. We have adapted an existing JavaScript library (Cubism) that implements Horizon Charts for the time series domain so that it works effectively with genomic datasets. We call this new library DNAism. Horizon Charts can be an effective visual tool to explore complex and large genomic datasets. Researchers can use our library to leverage these techniques to extract additional insights from their own datasets.
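
    Independently of the JavaScript implementation, the transform behind a Horizon Chart is easy to state: slice the series into bands of equal height, overplot the bands with increasing opacity, and mirror negative values. The Python sketch below computes the band layers only (no drawing) and illustrates the technique in general, not DNAism code.

        import numpy as np

        def horizon_bands(values, n_bands=3):
            """Return per-band (positive, mirrored-negative) layers, each
            clipped to [0, h] with h = max|values| / n_bands."""
            v = np.asarray(values, dtype=float)
            h = np.max(np.abs(v)) / n_bands
            layers = []
            for i in range(n_bands):
                pos = np.clip(v - i * h, 0, h)   # i-th positive band
                neg = np.clip(-v - i * h, 0, h)  # mirrored negative band
                layers.append((pos, neg))
            return layers, h

        signal = np.sin(np.linspace(0, 6 * np.pi, 200)) * np.linspace(0, 3, 200)
        layers, h = horizon_bands(signal)
        print(len(layers), "bands of height", round(h, 2))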

  20. Synthetic ALSPAC longitudinal datasets for the Big Data VR project.

    PubMed

    Avraam, Demetris; Wilson, Rebecca C; Burton, Paul

    2017-01-01

    Three synthetic datasets - of observation size 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSPAC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (covariance matrices, as well as the means and variances of the variables) without including the original data itself or disclosing participant information. In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.
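
    The simulation idea is that drawing from a multivariate normal parameterised by the source study's means and covariance matrix preserves exactly the moments described above without exposing any real participant. A minimal sketch with placeholder (non-ALSPAC) parameters:

        import numpy as np

        rng = np.random.default_rng(42)
        means = np.array([170.0, 65.0, 72.0])      # e.g. height, weight, HR
        cov = np.array([[60.0, 35.0,  5.0],
                        [35.0, 90.0,  8.0],
                        [ 5.0,  8.0, 80.0]])

        synthetic = rng.multivariate_normal(means, cov, size=15_000)
        # The sample covariance closely reproduces the target matrix:
        print(np.cov(synthetic, rowvar=False).round(1))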

  1. Characterization and visualization of the accuracy of FIA's CONUS-wide tree species datasets

    Treesearch

    Rachel Riemann; Barry T. Wilson

    2014-01-01

    Modeled geospatial datasets have been created for 325 tree species across the contiguous United States (CONUS). Effective application of all geospatial datasets depends on their accuracy. Dataset error can be systematic (bias) or unsystematic (scatter), and their magnitude can vary by region and scale. Each of these characteristics affects the locations, scales, uses,...

  2. Evaluation of Greenland near surface air temperature datasets

    DOE PAGES

    Reeves Eyre, J. E. Jack; Zeng, Xubin

    2017-07-05

    Near-surface air temperature (SAT) over Greenland has important effects on mass balance of the ice sheet, but it is unclear which SAT datasets are reliable in the region. Here extensive in situ SAT measurements (∼1400 station-years) are used to assess monthly mean SAT from seven global reanalysis datasets, five gridded SAT analyses, one satellite retrieval and three dynamically downscaled reanalyses. Strengths and weaknesses of these products are identified, and their biases are found to vary by season and glaciological regime. MERRA2 reanalysis overall performs best with mean absolute error less than 2 °C in all months. Ice sheet-average annual mean SAT from different datasets are highly correlated in recent decades, but their 1901–2000 trends differ even in sign. Compared with the MERRA2 climatology combined with gridded SAT analysis anomalies, thirty-one earth system model historical runs from the CMIP5 archive reach ∼5 °C for the 1901–2000 average bias and have opposite trends for a number of sub-periods.

  3. Evaluation of Greenland near surface air temperature datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Reeves Eyre, J. E. Jack; Zeng, Xubin

    Near-surface air temperature (SAT) over Greenland has important effects on mass balance of the ice sheet, but it is unclear which SAT datasets are reliable in the region. Here extensive in situ SAT measurements (∼1400 station-years) are used to assess monthly mean SAT from seven global reanalysis datasets, five gridded SAT analyses, one satellite retrieval and three dynamically downscaled reanalyses. Strengths and weaknesses of these products are identified, and their biases are found to vary by season and glaciological regime. MERRA2 reanalysis overall performs best with mean absolute error less than 2 °C in all months. Ice sheet-average annual mean SAT from different datasets are highly correlated in recent decades, but their 1901–2000 trends differ even in sign. Compared with the MERRA2 climatology combined with gridded SAT analysis anomalies, thirty-one earth system model historical runs from the CMIP5 archive reach ∼5 °C for the 1901–2000 average bias and have opposite trends for a number of sub-periods.

  4. De-identification of health records using Anonym: effectiveness and robustness across datasets.

    PubMed

    Zuccon, Guido; Kotzur, Daniel; Nguyen, Anthony; Bergheim, Anton

    2014-07-01

    Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness. The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in type of health records, source of data, and quality, with one of the datasets containing optical character recognition errors. Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training. Findings show that Anonym is comparable to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations of training size, data type and quality in the presence of sufficient training data. Crown Copyright © 2014. Published by Elsevier B.V. All rights reserved.

  5. Differentially Private Histogram Publication For Dynamic Datasets: An Adaptive Sampling Approach

    PubMed Central

    Li, Haoran; Jiang, Xiaoqian; Xiong, Li; Liu, Jinfei

    2016-01-01

    Differential privacy has recently become a de facto standard for private statistical data release. Many algorithms have been proposed to generate differentially private histograms or synthetic data. However, most of them focus on “one-time” release of a static dataset and do not adequately address the increasing need of releasing series of dynamic datasets in real time. A straightforward application of existing histogram methods on each snapshot of such dynamic datasets will incur high accumulated error due to the composability of differential privacy and correlations or overlapping users between the snapshots. In this paper, we address the problem of releasing series of dynamic datasets in real time with differential privacy, using a novel adaptive distance-based sampling approach. Our first method, DSFT, uses a fixed distance threshold and releases a differentially private histogram only when the current snapshot is sufficiently different from the previous one, i.e., with a distance greater than a predefined threshold. Our second method, DSAT, further improves DSFT and uses a dynamic threshold adaptively adjusted by a feedback control mechanism to capture the data dynamics. Extensive experiments on real and synthetic datasets demonstrate that our approach achieves better utility than baseline methods and existing state-of-the-art methods. PMID:26973795
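
    A minimal sketch of the fixed-threshold variant (DSFT) is given below: each released histogram is perturbed with Laplace noise, and a new release happens only when the current snapshot is sufficiently far from the last one. For brevity the distance test here reads the raw counts directly, whereas a faithful implementation must also privatise that decision; all parameters are illustrative.

        import numpy as np

        def dsft_release(snapshots, epsilon_per_release, threshold):
            """Release a noisy histogram per snapshot, re-using the last
            release when the data have not moved past the threshold."""
            released, last_data, last_release = [], None, None
            for hist in snapshots:
                if last_data is None or np.abs(hist - last_data).sum() > threshold:
                    last_data = hist
                    last_release = hist + np.random.laplace(
                        0.0, 1.0 / epsilon_per_release, size=hist.shape)
                released.append(last_release)
            return released

        snaps = [np.array([10, 20, 30]), np.array([11, 19, 31]),
                 np.array([40, 5, 15])]
        print(dsft_release(snaps, epsilon_per_release=0.5, threshold=20))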

  6. Satellite-Based Precipitation Datasets

    NASA Astrophysics Data System (ADS)

    Munchak, S. J.; Huffman, G. J.

    2017-12-01

    Of the possible sources of precipitation data, those based on satellites provide the greatest spatial coverage. There is a wide selection of datasets, algorithms, and versions from which to choose, which can be confusing to non-specialists wishing to use the data. The International Precipitation Working Group (IPWG) maintains tables of the major publicly available, long-term, quasi-global precipitation data sets (http://www.isac.cnr.it/ipwg/data/datasets.html), and this talk briefly reviews the various categories. As examples, NASA provides two sets of quasi-global precipitation data sets: the older Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) and current Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (GPM) mission (IMERG). Both provide near-real-time and post-real-time products that are uniformly gridded in space and time. The TMPA products are 3-hourly 0.25°x0.25° on the latitude band 50°N-S for about 16 years, while the IMERG products are half-hourly 0.1°x0.1° on 60°N-S for over 3 years (with plans to go to 16+ years in Spring 2018). In addition to the precipitation estimates, each data set provides fields of other variables, such as the satellite sensor providing estimates and estimated random error. The discussion concludes with advice about determining suitability for use, the necessity of being clear about product names and versions, and the need for continued support for satellite- and surface-based observation.

  7. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    PubMed Central

    Yoon, Chun Hong; DeMirci, Hasan; Sierra, Raymond G.; Dao, E. Han; Ahmadi, Radman; Aksit, Fulya; Aquila, Andrew L.; Batyuk, Alexander; Ciftci, Halilibrahim; Guillet, Serge; Hayes, Matt J.; Hayes, Brandon; Lane, Thomas J.; Liang, Meng; Lundström, Ulf; Koglin, Jason E.; Mgbam, Paul; Rao, Yashas; Rendahl, Theodore; Rodriguez, Evan; Zhang, Lindsey; Wakatsuki, Soichi; Boutet, Sébastien; Holton, James M.; Hunter, Mark S.

    2017-01-01

    We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development. PMID:28440794

  8. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    NASA Astrophysics Data System (ADS)

    Yoon, Chun Hong; Demirci, Hasan; Sierra, Raymond G.; Dao, E. Han; Ahmadi, Radman; Aksit, Fulya; Aquila, Andrew L.; Batyuk, Alexander; Ciftci, Halilibrahim; Guillet, Serge; Hayes, Matt J.; Hayes, Brandon; Lane, Thomas J.; Liang, Meng; Lundström, Ulf; Koglin, Jason E.; Mgbam, Paul; Rao, Yashas; Rendahl, Theodore; Rodriguez, Evan; Zhang, Lindsey; Wakatsuki, Soichi; Boutet, Sébastien; Holton, James M.; Hunter, Mark S.

    2017-04-01

    We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development.

  9. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    DOE PAGES

    Yoon, Chun Hong; DeMirci, Hasan; Sierra, Raymond G.; ...

    2017-04-25

    We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development.

  10. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yoon, Chun Hong; DeMirci, Hasan; Sierra, Raymond G.

    We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development.

  11. Igloo-Plot: a tool for visualization of multidimensional datasets.

    PubMed

    Kuntal, Bhusan K; Ghosh, Tarini Shankar; Mande, Sharmila S

    2014-01-01

    Advances in science and technology have resulted in an exponential growth of multivariate (or multi-dimensional) datasets which are being generated from various research areas especially in the domain of biological sciences. Visualization and analysis of such data (with the objective of uncovering the hidden patterns therein) is an important and challenging task. We present a tool, called Igloo-Plot, for efficient visualization of multidimensional datasets. The tool addresses some of the key limitations of contemporary multivariate visualization and analysis tools. The visualization layout, not only facilitates an easy identification of clusters of data-points having similar feature compositions, but also the 'marker features' specific to each of these clusters. The applicability of the various functionalities implemented herein is demonstrated using several well studied multi-dimensional datasets. Igloo-Plot is expected to be a valuable resource for researchers working in multivariate data mining studies. Igloo-Plot is available for download from: http://metagenomics.atc.tcs.com/IglooPlot/. Copyright © 2014 Elsevier Inc. All rights reserved.

  12. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

    PubMed Central

    Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila

    2018-01-01

    The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374
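
    Of the components mentioned, query generation via expansion is the easiest to sketch: nearest neighbours of each query term in an embedding space are appended to the query. The toy vectors below stand in for embeddings trained on biomedical text; the vocabulary and thresholds are assumptions, not the team's actual pipeline.

        import numpy as np

        embeddings = {
            "rna": np.array([0.90, 0.10, 0.00]),
            "transcriptome": np.array([0.85, 0.15, 0.05]),
            "protein": np.array([0.10, 0.90, 0.10]),
            "genome": np.array([0.70, 0.20, 0.30]),
        }

        def cosine(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

        def expand(query_terms, k=1, min_sim=0.95):
            """Append up to k sufficiently similar neighbours per term."""
            expanded = list(query_terms)
            for term in query_terms:
                if term not in embeddings:
                    continue
                sims = sorted(((cosine(embeddings[term], v), w)
                               for w, v in embeddings.items() if w != term),
                              reverse=True)
                expanded += [w for s, w in sims[:k] if s >= min_sim]
            return expanded

        print(expand(["rna"]))   # ['rna', 'transcriptome']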

  13. Targeted inactivation of the mouse locus encoding coagulation factor XIII-A: hemostatic abnormalities in mutant mice and characterization of the coagulation deficit.

    PubMed

    Lauer, Peter; Metzner, Hubert J; Zettlmeissl, Gerd; Li, Meng; Smith, Austin G; Lathe, Richard; Dickneite, Gerhard

    2002-12-01

    Blood coagulation factor XIII (FXIII) promotes cross-linking of fibrin during blood coagulation; impaired clot stabilization in human genetic deficiency is associated with marked pathologies of major clinical impact, including bleeding symptoms and deficient wound healing. To investigate the role of FXIII we employed homologous recombination to generate a targeted deletion of the inferred exon 7 of the FXIII-A gene. FXIII transglutaminase activity in plasma was reduced to about 50% in mice heterozygous for the mutant allele, and was abolished in homozygous null mice. Plasma fibrin gamma-dimerization was also undetectable in the homozygous deficient animals, confirming the absence of activatable FXIII. Homozygous mutant mice were fertile, although reproduction was impaired. Bleeding episodes, hematothorax, hematoperitoneum and subcutaneous hemorrhage in mutant mice were associated with reduced survival. Arrest of tail-tip bleeding in FXIII-A deficient mice was markedly and significantly delayed; replacement of mutant mice with human plasma FXIII (Fibrogammin P) restored bleeding time to within the normal range. Thrombelastography (TEG) experiments demonstrated impaired clot stabilization in FXIII-A mutant mice; replacement with human FXIII led to dose-dependent TEG normalization. The mutant mice thus reiterate some key features of the human genetic disorder: they will be valuable in assessing the role of FXIII in other associated pathologies and the development of new therapies.

  14. Local activation of coagulation factor XIII reduces systemic complications and improves the survival of mice after Streptococcus pyogenes M1 skin infection.

    PubMed

    Deicke, Christin; Chakrakodi, Bhavya; Pils, Marina C; Dickneite, Gerhard; Johansson, Linda; Medina, Eva; Loof, Torsten G

    2016-11-01

    Coagulation is a mechanism for wound healing after injury. Several recent studies delineate an additional role of the intrinsic pathway of coagulation, also known as the contact system, in the early innate immune response against bacterial infections. In this study, we investigated the role of factor XIII (FXIII), which is activated upon coagulation induction, during Streptococcus pyogenes-mediated skin and soft tissue infections. FXIII has previously been shown to be responsible for the immobilization of bacteria within a fibrin network which may prevent systemic bacterial dissemination. In order to investigate whether the FXIII-mediated entrapment of S. pyogenes also influences the disease outcome, we used a murine S. pyogenes M1 skin and soft tissue infection model. Here, we demonstrate that a lack of FXIII leads to prolonged clotting times, increased signs of inflammation, and elevated bacterial dissemination. Moreover, FXIII-deficient mice show impaired survival when compared with wildtype animals. Additionally, local reconstitution of FXIII-deficient mice with a human FXIII concentrate (Fibrogammin® P) could reduce the systemic complications, suggesting a protective role for FXIII during early S. pyogenes skin infection. FXIII might therefore be a possible therapeutic option to support the early innate immune response during skin infections caused by S. pyogenes. Copyright © 2016 Elsevier GmbH. All rights reserved.

  15. UK surveillance: provision of quality assured information from combined datasets.

    PubMed

    Paiba, G A; Roberts, S R; Houston, C W; Williams, E C; Smith, L H; Gibbens, J C; Holdship, S; Lysons, R

    2007-09-14

    Surveillance information is most useful when provided within a risk framework, which is achieved by presenting results against an appropriate denominator. Often the datasets are captured separately and for different purposes, and will have inherent errors and biases that can be further confounded by the act of merging. The United Kingdom Rapid Analysis and Detection of Animal-related Risks (RADAR) system contains data from several sources and provides both data extracts for research purposes and reports for wider stakeholders. Considerable efforts are made to optimise the data in RADAR during the Extraction, Transformation and Loading (ETL) process. Despite efforts to ensure data quality, the final dataset inevitably contains some data errors and biases, most of which cannot be rectified during subsequent analysis. So, in order for users to establish the 'fitness for purpose' of data merged from more than one data source, Quality Statements are produced as defined within the overarching surveillance Quality Framework. These documents detail identified data errors and biases following ETL and report construction as well as relevant aspects of the datasets from which the data originated. This paper illustrates these issues using RADAR datasets, and describes how they can be minimised.

  16. Datasets, Technologies and Products from the NASA/NOAA Electronic Theater 2002

    NASA Technical Reports Server (NTRS)

    Hasler, A. Fritz; Starr, David (Technical Monitor)

    2001-01-01

    An in-depth look at the Earth Science datasets used in the Etheater visualizations will be presented. This will include the satellite orbits, platforms, scan patterns, and the size, temporal and spatial resolution, and compositing techniques used to obtain the datasets, as well as the spectral bands utilized.

  17. A biclustering algorithm for extracting bit-patterns from binary datasets.

    PubMed

    Rodriguez-Baena, Domingo S; Perez-Pulido, Antonio J; Aguilar-Ruiz, Jesus S

    2011-10-01

    Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. The source and binary codes, the datasets used in the experiments and the results can be found at http://www.upo.es/eps/bigs/BiBit.html. Contact: dsrodbae@upo.es. Supplementary data are available at Bioinformatics online.
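
    The bit-pattern trick at the heart of BiBit can be sketched briefly: encode each row of the binary matrix as an integer, AND pairs of rows to obtain a candidate column pattern, and collect every row that contains the pattern. The code below shows only this seed step, not the full algorithm or its preprocessing.

        import numpy as np

        def row_to_bits(row):
            return int("".join("1" if x else "0" for x in row), 2)

        def bibit_seeds(matrix, min_cols=2):
            rows = [row_to_bits(r) for r in matrix]
            biclusters = set()
            for i in range(len(rows)):
                for j in range(i + 1, len(rows)):
                    pattern = rows[i] & rows[j]   # shared 1s of the pair
                    if bin(pattern).count("1") < min_cols:
                        continue
                    members = tuple(k for k, r in enumerate(rows)
                                    if r & pattern == pattern)
                    biclusters.add((pattern, members))
            return biclusters

        m = np.array([[1, 1, 0, 1],
                      [1, 1, 0, 0],
                      [0, 1, 1, 0],
                      [1, 1, 0, 1]])
        for pattern, members in sorted(bibit_seeds(m)):
            print(f"columns={pattern:04b} rows={members}")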

  18. Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

    PubMed Central

    Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.

    2016-01-01

    An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777

  19. Residential load and rooftop PV generation: an Australian distribution network dataset

    NASA Astrophysics Data System (ADS)

    Ratnam, Elizabeth L.; Weller, Steven R.; Kellett, Christopher M.; Murray, Alan T.

    2017-09-01

    Despite the rapid uptake of small-scale solar photovoltaic (PV) systems in recent years, public availability of generation and load data at the household level remains very limited. Moreover, such data are typically measured using bi-directional meters recording only PV generation in excess of residential load rather than recording generation and load separately. In this paper, we report a publicly available dataset consisting of load and rooftop PV generation for 300 de-identified residential customers in an Australian distribution network, with load centres covering metropolitan Sydney and surrounding regional areas. The dataset spans a 3-year period, with separately reported measurements of load and PV generation at 30-min intervals. Following a detailed description of the dataset, we identify several means by which anomalous records (e.g. due to inverter failure) are identified and excised. With the resulting 'clean' dataset, we identify key customer-specific and aggregated characteristics of rooftop PV generation and residential load.

  20. An extensive dataset of eye movements during viewing of complex images.

    PubMed

    Wilming, Niklas; Onat, Selim; Ossandón, José P; Açık, Alper; Kietzmann, Tim C; Kaspar, Kai; Gameiro, Ricardo R; Vormberg, Alexandra; König, Peter

    2017-01-31

    We present a dataset of free-viewing eye-movement recordings that contains more than 2.7 million fixation locations from 949 observers on more than 1000 images from different categories. This dataset aggregates and harmonizes data from 23 different studies conducted at the Institute of Cognitive Science at Osnabrück University and the University Medical Center in Hamburg-Eppendorf. Trained personnel recorded all studies under standard conditions with homogeneous equipment and parameter settings. All studies allowed for free eye-movements, and differed in the age range of participants (~7-80 years), stimulus sizes, stimulus modifications (phase scrambled, spatial filtering, mirrored), and stimuli categories (natural and urban scenes, web sites, fractal, pink-noise, and ambiguous artistic figures). The size and variability of viewing behavior within this dataset presents a strong opportunity for evaluating and comparing computational models of overt attention, and furthermore, for thoroughly quantifying strategies of viewing behavior. This also makes the dataset a good starting point for investigating whether viewing strategies change in patient groups.

  1. Crowdsourcing a Normative Natural Language Dataset: A Comparison of Amazon Mechanical Turk and In-Lab Data Collection

    PubMed Central

    Bex, Peter J; Woods, Russell L

    2013-01-01

    of shared words showed (P<.001). Within both datasets, responses contained substantial relevant content, with more words in common with responses to the same clip than to other clips (P<.001). There was evidence that responses from female and older crowdsourced participants had more shared words (P=.004 and .01 respectively), whereas younger participants had higher numbers of shared words in the lab-sourced population (P=.01). Conclusions Crowdsourcing is an effective approach to quickly and economically collect a large reliable dataset of normative natural language responses. PMID:23689038

  2. Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset.

    PubMed

    Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert

    2018-02-13

    We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for 'target' versus 'non-target' (dataset A) and symbol 'O' versus symbol 'X' (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques.

  3. Improving average ranking precision in user searches for biomedical research datasets

    PubMed Central

    Gobeill, Julien; Gaudinat, Arnaud; Vachon, Thérèse; Ruch, Patrick

    2017-01-01

    Availability of research datasets is a keystone of health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participants' best submissions. Overall, it is ranked at top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed positive impact on the system's performance, increasing our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw

  4. Lineages with long durations are old and morphologically average: an analysis using multiple datasets.

    PubMed

    Liow, Lee Hsiang

    2007-04-01

    samples of their shorter duration relatives (a null individual morpho-duration distribution) for the majority of datasets studied. Contrary to the common idea that very persistent lineages are special or unique in some significant way, both the results from analyses of long-duration lineages as groups and individuals show that they are morphologically average. Persistent lineages often arise early in a group's history, even though there is no prior expectation for this tendency in datasets of extinct groups. The implications of these results for diversification histories and niche preemption are discussed.

  5. Who should take responsibility for decisions on internationally recommended datasets? The case of the mass concentration of mercury in air at saturation

    NASA Astrophysics Data System (ADS)

    Brown, Richard J. C.; Brewer, Paul J.; Ent, Hugo; Fisicaro, Paola; Horvat, Milena; Kim, Ki-Hyun; Quétel, Christophe R.

    2015-10-01

    This paper considers how decisions on internationally recommended datasets are made and implemented and, further, how the ownership of these decisions comes about. Examples are given of conventionally agreed data and values where the responsibility is clear and comes about through official designation or by common usage and practice over long time periods. The example of the dataset describing the mass concentration of mercury in air at saturation is discussed in detail. This is a case where there are now several competing datasets that are in disagreement with each other, some with historical authority and some more recent but, arguably, with more robust metrological traceability to the SI. Further, it is elaborated that there is no body charged with the responsibility to make a decision on an international recommendation for such a dataset. This has led to the situation where several competing datasets are in use simultaneously. Close parallels are drawn with the current debate over changes to the ozone absorption cross section, which has equal importance to the measurement of ozone amount fraction in air and to subsequent compliance with air quality legislation. It is noted that in the case of the ozone cross section there is already a committee appointed to deliberate over any change. We make the proposal that a similar committee, under the auspices of IUPAC or the CIPM’s CCQM (if it adopted a reference data function) could be formed to perform a similar role for the mass concentration of mercury in air at saturation.

  6. Content-level deduplication on mobile internet datasets

    NASA Astrophysics Data System (ADS)

    Hou, Ziyu; Chen, Xunxun; Wang, Yang

    2017-06-01

    Various systems and applications involve a large volume of duplicate items. Given the high data redundancy in real-world datasets, data deduplication can reduce storage capacity requirements and improve the utilization of network bandwidth. However, the chunks used by existing deduplication systems range in size from 4KB to over 16KB, so these systems are not applicable to datasets consisting of short records. In this paper, we propose a new framework called SF-Dedup which is able to implement the deduplication process on a large set of Mobile Internet records whose size can be smaller than 100B, or even smaller than 10B. SF-Dedup is a short-fingerprint, in-line, hash-collision-resolved deduplication scheme. Experimental results illustrate that SF-Dedup is able to reduce storage capacity and shorten query time on relational databases.
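
    The SF-Dedup implementation itself is not reproduced here, so the following is only a toy Python sketch of the general idea the abstract names: index records by a short fingerprint to save index space, and resolve hash collisions by comparing full record content.

        import hashlib

        def short_fp(record: bytes, nbytes: int = 4) -> bytes:
            # Truncate a strong hash to a short fingerprint; short
            # fingerprints save index space but can collide.
            return hashlib.sha1(record).digest()[:nbytes]

        class ShortFpDedup:
            """Toy in-line deduplicator: a short-fingerprint index whose
            collisions are resolved by comparing full record content."""
            def __init__(self):
                self.index = {}  # fingerprint -> list of unique records

            def add(self, record: bytes) -> bool:
                """Return True if the record was new, False if duplicate."""
                fp = short_fp(record)
                bucket = self.index.setdefault(fp, [])
                if record in bucket:   # same fingerprint, same content
                    return False
                bucket.append(record)  # new record (or a resolved collision)
                return True

        dedup = ShortFpDedup()
        for rec in [b"GET /a", b"GET /a", b"GET /b"]:
            print(dedup.add(rec))      # True, False, True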

  7. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    NASA Astrophysics Data System (ADS)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

    Here we present the first national scale flood risk analyses, using high resolution Facebook Connectivity Lab population data and data from a hyper resolution flood hazard model. In recent years the field of large scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodeled territories and from continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that for robust flood risk analysis to be undertaken both hazard and exposure data should sufficiently resolve local scale features. Global flood frameworks are enabling flood hazard data to be produced at 90m resolution, resulting in a mismatch with available population datasets, which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5m resolution, representing a resolution increase over previous countrywide datasets of multiple orders of magnitude. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
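
    A minimal sketch of the core exposure calculation described above, overlaying modelled water depths on a gridded population raster. The arrays, the depth threshold, and the assumption that both rasters are already co-registered on the same grid are illustrative only.

        import numpy as np

        # Hypothetical inputs: water depth (m) from a hazard model and a
        # co-registered gridded population raster (people per cell). In
        # practice the two grids differ in resolution and must be resampled.
        depth = np.array([[0.0, 0.2, 1.5],
                          [0.0, 0.9, 0.1],
                          [2.1, 0.0, 0.0]])
        population = np.array([[10, 250, 40],
                               [5, 120, 80],
                               [300, 60, 20]])

        THRESHOLD = 0.3                 # depth (m) counting as flooded
        flooded = depth > THRESHOLD     # boolean hazard mask
        exposed = population[flooded].sum()
        print(f"exposed population: {exposed}")   # 40 + 120 + 300 = 460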

  8. Efficient segmentation of 3D fluoroscopic datasets from mobile C-arm

    NASA Astrophysics Data System (ADS)

    Styner, Martin A.; Talib, Haydar; Singh, Digvijay; Nolte, Lutz-Peter

    2004-05-01

    The emerging mobile fluoroscopic 3D technology linked with a navigation system combines the advantages of CT-based and C-arm-based navigation. The intra-operative, automatic segmentation of 3D fluoroscopy datasets enables the combined visualization of surgical instruments and anatomical structures for enhanced planning, surgical eye-navigation and landmark digitization. We performed a thorough evaluation of several segmentation algorithms using a large set of data from different anatomical regions and man-made phantom objects. The analyzed segmentation methods include automatic thresholding, morphological operations, an adapted region growing method and an implicit 3D geodesic snake method. In regard to computational efficiency, all methods performed within acceptable limits on a standard desktop PC (30 s to 5 min). In general, the best results were obtained with datasets from long bones, followed by extremities. The segmentations of spine, pelvis and shoulder datasets were generally of poorer quality. As expected, the threshold-based methods produced the worst results. The combined thresholding and morphological operations method was considered appropriate for a smaller set of clean images. The region growing method performed much better in regard to computational efficiency and segmentation correctness, especially for datasets of joints, and lumbar and cervical spine regions. The less efficient implicit snake method was additionally able to remove wrongly segmented skin tissue regions. This study presents a step towards efficient intra-operative segmentation of 3D fluoroscopy datasets, but there is room for improvement. Next, we plan to study model-based approaches for datasets from the knee and hip joint region, which would then be applied to all anatomical regions in our continuing development of an ideal segmentation procedure for 3D fluoroscopic images.
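
    As a rough illustration of the simpler class of methods evaluated (thresholding followed by morphological clean-up), here is a Python/SciPy sketch. The threshold, the structuring element, and the toy volume are arbitrary stand-ins, not the study's actual parameters.

        import numpy as np
        from scipy import ndimage as ndi

        def threshold_morph_segment(volume: np.ndarray, threshold: float):
            """Binary-threshold a 3D volume, clean the mask with a
            morphological opening, and keep the largest connected component."""
            mask = volume > threshold
            mask = ndi.binary_opening(mask, structure=np.ones((3, 3, 3)))
            labels, n = ndi.label(mask)
            if n == 0:
                return mask
            sizes = ndi.sum(mask, labels, index=range(1, n + 1))
            return labels == (np.argmax(sizes) + 1)

        # Toy volume: a bright 'bone' block in a noisy background.
        rng = np.random.default_rng(0)
        vol = rng.normal(100, 10, (32, 32, 32))
        vol[8:24, 8:24, 8:24] += 400
        seg = threshold_morph_segment(vol, threshold=300)
        print(seg.sum(), "voxels segmented")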

  9. Evaluation of reanalysis datasets against observational soil temperature data over China

    NASA Astrophysics Data System (ADS)

    Yang, Kai; Zhang, Jingyong

    2018-01-01

    Soil temperature is a key land surface variable, and is a potential predictor for seasonal climate anomalies and extremes. Using observational soil temperature data in China for 1981-2005, we evaluate four reanalysis datasets, the land surface reanalysis of the European Centre for Medium-Range Weather Forecasts (ERA-Interim/Land), the second modern-era retrospective analysis for research and applications (MERRA-2), the National Center for Environmental Prediction Climate Forecast System Reanalysis (NCEP-CFSR), and version 2 of the Global Land Data Assimilation System (GLDAS-2.0), with a focus on the 40 cm soil layer. The results show that the reanalysis data broadly reproduce the spatial distributions of soil temperature in summer and winter, especially over eastern China, but generally underestimate their magnitudes. Owing to the influence of precipitation on soil temperature, the four datasets perform better in winter than in summer. The ERA-Interim/Land and GLDAS-2.0 produce spatial characteristics of the climatological mean that are similar to observations. The interannual variability of soil temperature is well reproduced by the ERA-Interim/Land dataset in summer and by the CFSR dataset in winter. The linear trend of soil temperature in summer is also well reproduced by the reanalysis datasets. We demonstrate that soil heat fluxes in April-June and in winter are highly correlated with the soil temperature in summer and winter, respectively. Different estimations of surface energy balance components can contribute to different behaviors of the reanalysis products in estimating soil temperature. In addition, the reanalysis datasets broadly reproduce the northwest-southeast gradient of soil temperature memory over China.
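
    A sketch of the kind of skill metrics such an evaluation typically rests on: mean bias, RMSE, and correlation (the latter measuring interannual variability skill). The numbers below are invented for illustration, not taken from the study.

        import numpy as np

        def evaluate(model: np.ndarray, obs: np.ndarray):
            """Common skill metrics for a reanalysis series vs observations."""
            bias = np.mean(model - obs)
            rmse = np.sqrt(np.mean((model - obs) ** 2))
            corr = np.corrcoef(model, obs)[0, 1]  # interannual variability
            return bias, rmse, corr

        obs = np.array([14.2, 15.1, 13.8, 14.9, 15.5])    # observed 40 cm soil T (deg C)
        model = np.array([13.5, 14.4, 13.1, 14.0, 14.8])  # a reanalysis that underestimates
        print("bias=%.2f rmse=%.2f r=%.2f" % evaluate(model, obs))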

  10. Mutual-information-based registration for ultrasound and CT datasets

    NASA Astrophysics Data System (ADS)

    Firle, Evelyn A.; Wesarg, Stefan; Dold, Christian

    2004-05-01

    In many applications for minimal invasive surgery the acquisition of intra-operative medical images is helpful if not absolutely necessary. Especially for Brachytherapy imaging is critically important to the safe delivery of the therapy. Modern computed tomography (CT) and magnetic resonance (MR) scanners allow minimal invasive procedures to be performed under direct imaging guidance. However, conventional scanners do not have real-time imaging capability and are expensive technologies requiring a special facility. Ultrasound (U/S) is a much cheaper and one of the most flexible imaging modalities. It can be moved to the application room as required and the physician sees what is happening as it occurs. Nevertheless it may be easier to interpret these 3D intra-operative U/S images if they are used in combination with less noisier preoperative data such as CT. The purpose of our current investigation is to develop a registration tool for automatically combining pre-operative CT volumes with intra-operatively acquired 3D U/S datasets. The applied alignment procedure is based on the information theoretic approach of maximizing the mutual information of two arbitrary datasets from different modalities. Since the CT datasets include a much bigger field of view we introduced a bounding box to narrow down the region of interest within the CT dataset. We conducted a phantom experiment using a CIRS Model 53 U/S Prostate Training Phantom to evaluate the feasibility and accuracy of the proposed method.
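
    A minimal NumPy sketch of the objective being maximized: mutual information estimated from a joint intensity histogram. The bin count and the toy images are placeholders; a real registration would evaluate this quantity inside an optimizer over the transform parameters.

        import numpy as np

        def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 32):
            """Mutual information of two co-registered images, estimated
            from their joint intensity histogram."""
            joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
            pxy = joint / joint.sum()
            px = pxy.sum(axis=1, keepdims=True)   # marginal of image a
            py = pxy.sum(axis=0, keepdims=True)   # marginal of image b
            nz = pxy > 0
            return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

        rng = np.random.default_rng(1)
        ct = rng.random((64, 64))
        us = 0.7 * ct + 0.3 * rng.random((64, 64))  # crude stand-in for an aligned U/S slice
        print(mutual_information(ct, us))            # higher when images align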

  11. Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.

    PubMed

    Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi

    2015-09-01

    Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world. It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the

  12. Evaluating satellite-derived long-term historical precipitation datasets for drought monitoring in Chile

    NASA Astrophysics Data System (ADS)

    Zambrano, Francisco; Wardlow, Brian; Tadesse, Tsegaye; Lillo-Saavedra, Mario; Lagos, Octavio

    2017-04-01

    Precipitation is a key parameter for the study of climate change and variability and for the detection and monitoring of natural disasters such as drought. Precipitation datasets that accurately capture the amount and spatial variability of rainfall are critical for drought monitoring and a wide range of other climate applications. This is challenging in many parts of the world, which often have a limited number of weather stations and/or short historical data records. Satellite-derived precipitation products offer a viable alternative, with several remotely sensed precipitation datasets now available with long historical records (30+ years), including the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) and the Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks-Climate Data Record (PERSIANN-CDR). This study presents a comparative analysis of three historical satellite-based precipitation datasets, Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) 3B43 version 7 (1998-2015), PERSIANN-CDR (1983-2015) and CHIRPS 2.0 (1981-2015), over Chile to assess their performance across the country; for the two long-term products, their applicability for agricultural drought monitoring was also evaluated through the calculation of a commonly used drought indicator, the Standardized Precipitation Index (SPI). In this analysis, in situ rainfall measurements from 278 weather stations across Chile were initially compared to the satellite data. The study area (Chile) was divided into five latitudinal zones: North, North-Central, Central, South-Central and South, to determine whether there were regional differences among these satellite products, and nine statistics were used to evaluate their performance in estimating the amount and spatial distribution of historical rainfall across Chile. Hierarchical cluster analysis, k-means and singular value decomposition were used to analyze
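
    A compact sketch of an SPI computation under common assumptions: a gamma distribution is fitted to the nonzero monthly totals, the zero-precipitation frequency is mixed into the cumulative probability, and the result is mapped to standard normal quantiles. Details such as the fitting method vary between implementations, and the toy totals are invented.

        import numpy as np
        from scipy import stats

        def spi(precip: np.ndarray) -> np.ndarray:
            """Standardized Precipitation Index for one calendar month/site."""
            precip = np.asarray(precip, dtype=float)
            nonzero = precip[precip > 0]
            q = 1.0 - nonzero.size / precip.size       # probability of zero rain
            a, _, scale = stats.gamma.fit(nonzero, floc=0)
            cdf = q + (1 - q) * stats.gamma.cdf(precip, a, loc=0, scale=scale)
            cdf = np.clip(cdf, 1e-6, 1 - 1e-6)         # keep quantiles finite
            return stats.norm.ppf(cdf)

        monthly_totals = np.array([55., 80., 0., 34., 120., 61., 15., 95., 42., 70.])
        print(spi(monthly_totals).round(2))            # negative values flag dry months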

  13. Data Discovery of Big and Diverse Climate Change Datasets - Options, Practices and Challenges

    NASA Astrophysics Data System (ADS)

    Palanisamy, G.; Boden, T.; McCord, R. A.; Frame, M. T.

    2013-12-01

    Developing data search tools is a very common, but often confusing, task for most data-intensive scientific projects. These search interfaces need to be continually improved to handle the ever-increasing diversity and volume of data collections. There are many aspects which determine the type of search tool a project needs to provide to its user community. These include: the number of datasets, the amount and consistency of discovery metadata, ancillary information such as the availability of quality information and provenance, and the availability of similar datasets from other distributed sources. The Environmental Data Science and Systems (EDSS) group within the Environmental Science Division at the Oak Ridge National Laboratory has a long history of successfully managing diverse and big observational datasets for various scientific programs via various data centers such as DOE's Atmospheric Radiation Measurement Program (ARM), DOE's Carbon Dioxide Information and Analysis Center (CDIAC), USGS's Core Science Analytics and Synthesis (CSAS) metadata Clearinghouse and NASA's Distributed Active Archive Center (ORNL DAAC). This talk will showcase some of the recent developments for improving data discovery within these centers. The DOE ARM program recently developed a data discovery tool which allows users to search and discover over 4000 observational datasets. These datasets are key to research efforts related to global climate change. The ARM discovery tool features many new functions such as filtered and faceted search logic, multi-pass data selection, filtering data based on data quality, graphical views of data quality and availability, direct access to data quality reports, and data plots. The ARM Archive also provides discovery metadata to other broader metadata clearinghouses such as ESGF, IASOA, and GOS. In addition to the new interface, ARM is also currently working on providing DOI metadata records to publishers such as Thomson Reuters and Elsevier. The ARM

  14. The Great Lakes Hydrography Dataset: Consistent, binational ...

    EPA Pesticide Factsheets

    Ecosystem-based management of the Laurentian Great Lakes, which span both the United States and Canada, is hampered by the lack of consistent binational watersheds for the entire Basin. Using comparable data sources and consistent methods we developed spatially equivalent watershed boundaries for the binational extent of the Basin to create the Great Lakes Hydrography Dataset (GLHD). The GLHD consists of 5,589 watersheds for the entire Basin, covering a total area of approximately 547,967 km2, or about twice the 247,003 km2 surface water area of the Great Lakes. The GLHD improves upon existing watershed efforts by delineating watersheds for the entire Basin using consistent methods; enhancing the precision of watershed delineation by using recently developed flow direction grids that have been hydrologically enforced and vetted by provincial and federal water resource agencies; and increasing the accuracy of watershed boundaries by enforcing embayments, delineating watersheds on islands, and delineating watersheds for all tributaries draining to connecting channels. In addition, the GLHD is packaged in a publicly available geodatabase that includes synthetic stream networks, reach catchments, watershed boundaries, a broad set of attribute data for each tributary, and metadata documenting methodology. The GLHD provides a common set of watersheds and associated hydrography data for the Basin that will enhance binational efforts to protect and restore the Great

  15. A multimodal dataset for authoring and editing multimedia content: The MAMEM project.

    PubMed

    Nikolopoulos, Spiros; Petrantonakis, Panagiotis C; Georgiadis, Kostas; Kalaganis, Fotis; Liaros, Georgios; Lazarou, Ioulietta; Adam, Katerina; Papazoglou-Chalikias, Anastasios; Chatzilari, Elisavet; Oikonomou, Vangelis P; Kumar, Chandan; Menges, Raphael; Staab, Steffen; Müller, Daniel; Sengupta, Korok; Bostantjopoulou, Sevasti; Katsarou, Zoe; Zeilig, Gabi; Plotnik, Meir; Gotlieb, Amihai; Kizoni, Racheli; Fountoukidou, Sofia; Ham, Jaap; Athanasiou, Dimitrios; Mariakaki, Agnes; Comanducci, Dario; Sabatini, Edoardo; Nistico, Walter; Plank, Markus; Kompatsiaris, Ioannis

    2017-12-01

    We present a dataset that combines multimodal biosignals and eye tracking information gathered under a human-computer interaction framework. The dataset was developed within the MAMEM project, which aims to endow people with motor disabilities with the ability to edit and author multimedia content through mental commands and gaze activity. The dataset includes EEG, eye-tracking, and physiological (GSR and heart rate) signals collected from 34 individuals (18 able-bodied and 16 motor-impaired). Data were collected during interaction with a specifically designed interface for web browsing and multimedia content manipulation, and during imaginary movement tasks. The presented dataset will contribute towards the development and evaluation of modern human-computer interaction systems that would foster the integration of people with severe motor impairments back into society.

  16. Comparative analysis and assessment of M. tuberculosis H37Rv protein-protein interaction datasets

    PubMed Central

    2011-01-01

    Background M. tuberculosis is a formidable bacterial pathogen. There is thus an increasing demand for understanding the function and relationships of proteins in various strains of M. tuberculosis. Protein-protein interaction (PPI) data are crucial for this kind of knowledge. However, the quality of the main available M. tuberculosis PPI datasets is unclear. This hampers the effectiveness of research works that rely on these PPI datasets. Here, we analyze the two main available M. tuberculosis H37Rv PPI datasets. The first dataset is the high-throughput B2H PPI dataset from Wang et al.'s recent paper in the Journal of Proteome Research. The second dataset is from the STRING database, version 8.3, consisting entirely of H37Rv PPIs predicted using various methods. We find that these two datasets have a surprisingly low level of agreement. We postulate the following causes for this low level of agreement: (i) the H37Rv B2H PPI dataset is of low quality; (ii) the H37Rv STRING PPI dataset is of low quality; and/or (iii) the H37Rv STRING PPIs are predictions of other forms of functional associations rather than direct physical interactions. Results To test the quality of these two datasets, we evaluate them based on correlated gene expression profiles, coherent informative GO term annotations, and conservation in other organisms. We observe a significantly greater portion of PPIs in the H37Rv STRING PPI dataset (with score ≥ 770) having correlated gene expression profiles and coherent informative GO term annotations in both interaction partners than in the H37Rv B2H PPI dataset. Predicted H37Rv interologs derived from non-M. tuberculosis experimental PPIs are much more similar to the H37Rv STRING functional associations dataset (with score ≥ 770) than to the H37Rv B2H PPI dataset. H37Rv predicted physical interologs from IntAct also show extremely low similarity with the H37Rv B2H PPI dataset; and this similarity level is much lower than that between the S. aureus MRSA252

  17. An innovative privacy preserving technique for incremental datasets on cloud computing.

    PubMed

    Aldeen, Yousra Abdul Alsahib S; Salleh, Mazleena; Aljeroudi, Yazan

    2016-08-01

    Cloud computing (CC) is a magnificent service-based delivery model with gigantic computer processing power and data storage across connected communications channels. It has imparted overwhelming technological impetus to the internet (web) mediated IT industry, where users can easily share private data for further analysis and mining. Furthermore, user-friendly CC services enable diverse applications to be deployed economically. Meanwhile, simple data sharing has also invited various phishing attacks and malware-assisted security threats. Some privacy-sensitive applications, like health services on the cloud, that are built with several economic and operational benefits necessitate enhanced security. Thus, absolute cyberspace security and mitigation against phishing attacks became mandatory to protect overall data privacy. Typically, diverse application datasets are anonymized with better privacy for their owners, but without providing all secrecy requirements for newly added records. Some proposed techniques have addressed this issue by re-anonymizing the datasets from scratch. Utmost privacy protection over incremental datasets on CC is therefore far from being achieved. Certainly, the distribution of huge dataset volumes across multiple storage nodes limits privacy preservation. In this view, we propose a new anonymization technique to attain better privacy protection with high data utility over distributed and incremental datasets on CC. The proficiency of data privacy preservation and improved confidentiality requirements is demonstrated through performance evaluation. Copyright © 2016 Elsevier Inc. All rights reserved.

  18. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

    PubMed Central

    Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene

    2018-01-01

    Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie PMID:29688379
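
    As a sketch of the keyword-boosting idea the authors found most effective, here is a toy BM25-style scorer in which each query term carries a weight. The weights, collection statistics, and documents below are invented, and the paper's actual model may differ in its details.

        import math
        from collections import Counter

        def bm25_score(query_weights, doc_tokens, df, n_docs, avg_len,
                       k1=1.2, b=0.75):
            """BM25 over a verbose query where each term carries a weight;
            boosting the weights of important keywords re-ranks results."""
            tf = Counter(doc_tokens)
            norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
            score = 0.0
            for term, w in query_weights.items():
                if term not in tf:
                    continue
                idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
                score += w * idf * tf[term] * (k1 + 1) / (tf[term] + norm)
            return score

        # 'expression' is judged the key concept, so its weight is boosted.
        query = {"gene": 1.0, "expression": 3.0, "dataset": 0.5}
        doc = "rna seq gene expression dataset for mouse brain".split()
        df = {"gene": 400, "expression": 150, "dataset": 700}
        print(bm25_score(query, doc, df, n_docs=1000, avg_len=8))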

  19. Experiences and lessons learned from creating a generalized workflow for data publication of field campaign datasets

    NASA Astrophysics Data System (ADS)

    Santhana Vannan, S. K.; Ramachandran, R.; Deb, D.; Beaty, T.; Wright, D.

    2017-12-01

    This paper summarizes the workflow challenges of curating and publishing data produced from disparate data sources and provides a generalized workflow solution to efficiently archive data generated by researchers. The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) for biogeochemical dynamics and the Global Hydrology Resource Center (GHRC) DAAC have been collaborating on the development of a generalized workflow solution to efficiently manage the data publication process. The generalized workflow presented here is built on lessons learned from implementations of the workflow system. Data publication consists of the following steps: accepting the data package from the data providers and ensuring the full integrity of the data files; identifying and addressing data quality issues; assembling standardized, detailed metadata and documentation, including file-level details, processing methodology, and characteristics of data files; setting up data access mechanisms; setting up the data in data tools and services for improved data dissemination and user experience; registering the dataset in online search and discovery catalogues; and preserving the data location through Digital Object Identifiers (DOIs). We will describe the steps taken to automate the above process and realize efficiencies. The goals of the workflow system are to reduce the time taken to publish a dataset, to increase the quality of documentation and metadata, and to track individual datasets through the data curation process. Utilities developed to achieve these goals will be described. We will also share the metrics-driven value of the workflow system and discuss future steps towards the creation of a common software framework.

  20. Linking Disparate Datasets of the Earth Sciences with the SemantEco Annotator

    NASA Astrophysics Data System (ADS)

    Seyed, P.; Chastain, K.; McGuinness, D. L.

    2013-12-01

    Use of Semantic Web technologies for data management in the Earth sciences (and beyond) has great potential but is still in its early stages, since the challenge of translating data into a more explicit or semantic form for immediate use within applications has not been fully addressed. In this abstract we help address this challenge by introducing the SemantEco Annotator, which enables anyone, regardless of expertise, to semantically annotate tabular Earth Science data and translate it into linked data format, while applying the logic inherent in community-standard vocabularies to guide the process. The Annotator was conceived out of a desire to unify dataset content from a variety of sources under common vocabularies, for use in semantically enabled web applications. Our current use case employs linked data generated by the Annotator for use in the SemantEco environment, which utilizes semantics to help users explore, search, and visualize water or air quality measurement and species occurrence data through a map-based interface. The generated data can also be used immediately to facilitate discovery and search capabilities within 'big data' environments. The Annotator provides a method for taking information about a dataset, which may only be known to its maintainers, and making it explicit, in a uniform and machine-readable fashion, such that a person or information system can more easily interpret the underlying structure and meaning. Its primary mechanism is to enable a user to formally describe how columns of a tabular dataset relate to and/or describe entities. For example, if a user identifies columns for latitude and longitude coordinates, we can infer the data refer to a point that can be plotted on a map. Further, it can be made explicit that measurements of 'nitrate' and 'NO3-' are of the same entity through vocabulary assignments, thus more easily utilizing datasets that use different nomenclatures. The Annotator provides an extensive and searchable
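
    A minimal sketch of the kind of translation the Annotator performs, written with rdflib and the W3C Basic Geo vocabulary. The example namespace, column names, and values are hypothetical, not the Annotator's actual output.

        from rdflib import Graph, Literal, Namespace, RDF
        from rdflib.namespace import XSD

        # W3C Basic Geo vocabulary; the example.org namespace is made up.
        GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
        EX = Namespace("http://example.org/obs/")

        # A tabular dataset whose columns were annotated as lat/long/nitrate.
        rows = [{"lat": 42.73, "long": -73.69, "nitrate_mg_l": 1.4}]

        g = Graph()
        g.bind("geo", GEO)
        for i, row in enumerate(rows):
            obs = EX[f"point{i}"]
            g.add((obs, RDF.type, GEO.Point))   # lat/long columns imply a point
            g.add((obs, GEO.lat, Literal(row["lat"], datatype=XSD.decimal)))
            g.add((obs, GEO.long, Literal(row["long"], datatype=XSD.decimal)))
            g.add((obs, EX.nitrate, Literal(row["nitrate_mg_l"], datatype=XSD.decimal)))

        print(g.serialize(format="turtle"))     # returns a str in rdflib >= 6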

  1. Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset

    PubMed Central

    Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert

    2018-01-01

    We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for ‘target’ versus ‘non-target’ (dataset A) and symbol ‘O’ versus symbol ‘X’ (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques. PMID:29437166

  2. Multiresolution persistent homology for excessively large biomolecular datasets

    NASA Astrophysics Data System (ADS)

    Xia, Kelin; Zhao, Zhixiong; Wei, Guo-Wei

    2015-10-01

    Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large-scale datasets with appropriate resolution. We utilize the flexibility-rigidity index to assess the topological connectivity of the data set and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273,780 atoms is successfully analyzed, which would otherwise be inaccessible to the normal point cloud method and unreliable by using coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to protein domain classification, which is, to our knowledge, the first time that persistent homology has been used for practical protein domain analysis. The proposed multiresolution topological method has potential applications in arbitrary data sets, such as social networks, biological networks, and graphs.
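
    A toy sketch of the multiresolution idea: a flexibility-rigidity-style Gaussian density whose width parameter sets the resolution, evaluated at two scales. The filtration step itself (e.g. via gudhi's cubical complexes over the density field) is left as a comment; all values are illustrative.

        import numpy as np

        def rigidity_density(atoms, grid, eta):
            """FRI-style rigidity density: a sum of Gaussians whose width
            eta sets the resolution of the subsequent filtration."""
            d2 = ((grid[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / eta ** 2).sum(axis=1)

        rng = np.random.default_rng(2)
        atoms = rng.random((50, 2)) * 10.0               # toy 2D 'molecule'
        xs = np.linspace(0, 10, 40)
        grid = np.array([[x, y] for x in xs for y in xs])

        fine = rigidity_density(atoms, grid, eta=0.5)    # resolves single atoms
        coarse = rigidity_density(atoms, grid, eta=3.0)  # resolves global shape
        # Persistent homology of each density field would then give
        # scale-matched barcodes, e.g. (assuming gudhi is installed):
        # gudhi.CubicalComplex(
        #     top_dimensional_cells=(-fine).reshape(40, 40)).persistence()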

  3. Parton Distributions based on a Maximally Consistent Dataset

    NASA Astrophysics Data System (ADS)

    Rojo, Juan

    2016-04-01

    The choice of data that enters a global QCD analysis can have a substantial impact on the resulting parton distributions and their predictions for collider observables. One of the main reasons for this has to do with the possible presence of inconsistencies, either internal within an experiment or external between different experiments. In order to assess the robustness of the global fit, different definitions of a conservative PDF set, that is, a PDF set based on a maximally consistent dataset, have been introduced. However, these approaches are typically affected by theory biases in the selection of the dataset. In this contribution, after a brief overview of recent NNPDF developments, we propose a new, fully objective definition of a conservative PDF set, based on the Bayesian reweighting approach. Using the new NNPDF3.0 framework, we produce various conservative sets, which turn out to be in mutual agreement within the respective PDF uncertainties, as well as with the global fit. We explore some of their implications for LHC phenomenology, finding good consistency with the global fit result. These results provide a non-trivial validation test of the new NNPDF3.0 fitting methodology, and indicate that possible inconsistencies in the fitted dataset do not substantially affect the global fit PDFs.
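
    A sketch of Bayesian reweighting as described in the NNPDF reweighting literature, under the assumption of the usual weights w_k proportional to (chi2_k)^((n-1)/2) exp(-chi2_k/2), where chi2_k is replica k's chi-square against the new data, together with the entropy-based effective number of replicas. The toy chi-square values are invented.

        import numpy as np

        def reweight(chi2: np.ndarray, n_data: int) -> np.ndarray:
            """Replica weights w_k ~ chi2_k^((n-1)/2) * exp(-chi2_k/2),
            normalized so the weights sum to the number of replicas N."""
            log_w = 0.5 * (n_data - 1) * np.log(chi2) - 0.5 * chi2
            log_w -= log_w.max()            # numerical stabilization
            w = np.exp(log_w)
            return w * len(w) / w.sum()

        def n_effective(w: np.ndarray) -> float:
            """Shannon-entropy measure of how many replicas stay effective."""
            n = len(w)
            nz = w > 0
            return float(np.exp(np.sum(w[nz] * np.log(n / w[nz])) / n))

        chi2 = np.random.default_rng(3).chisquare(df=20, size=100)  # toy values
        w = reweight(chi2, n_data=20)
        print(n_effective(w))   # << 100 would signal dataset inconsistency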

  4. A reference human genome dataset of the BGISEQ-500 sequencer.

    PubMed

    Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian

    2017-05-01

    BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinatorial probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared it to that of variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). But for insertions and deletions (indels), we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50 respectively, sensitivity = 88.52% and 70.93%) than for the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as a reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.

  5. Pooled assembly of marine metagenomic datasets: enriching annotation through chimerism.

    PubMed

    Magasin, Jonathan D; Gerloff, Dietlind L

    2015-02-01

    Despite advances in high-throughput sequencing, marine metagenomic samples remain largely opaque. A typical sample contains billions of microbial organisms from thousands of genomes and quadrillions of DNA base pairs. Its derived metagenomic dataset underrepresents this complexity by orders of magnitude because of the sparseness and shortness of sequencing reads. Read shortness and sequencing errors pose a major challenge to accurate species and functional annotation. This includes distinguishing known from novel species. Often the majority of reads cannot be annotated and thus cannot help our interpretation of the sample. Here, we demonstrate quantitatively how careful assembly of marine metagenomic reads within, but also across, datasets can alleviate this problem. For 10 simulated datasets, each with species complexity modeled on a real counterpart, chimerism remained within the same species for most contigs (97%). For 42 real pyrosequencing ('454') datasets, assembly increased the proportion of annotated reads, and even more so when datasets were pooled, by on average 1.6% (max 6.6%) for species, 9.0% (max 28.7%) for Pfam protein domains and 9.4% (max 22.9%) for PANTHER gene families. Our results outline exciting prospects for data sharing in the metagenomics community. While chimeric sequences should be avoided in other areas of metagenomics (e.g. biodiversity analyses), conservative pooled assembly is advantageous for annotation specificity and sensitivity. Intriguingly, our experiment also found potential prospects for (low-cost) discovery of new species in 'old' data. dgerloff@ffame.org Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  6. Classification of foods by transferring knowledge from ImageNet dataset

    NASA Astrophysics Data System (ADS)

    Heravi, Elnaz J.; Aghdam, Hamed H.; Puig, Domenec

    2017-03-01

    Automatic classification of foods is a way to control food intake and tackle obesity. However, it is a challenging problem since foods are highly deformable and complex objects. Results on the ImageNet dataset have revealed that Convolutional Neural Networks (ConvNets) have great expressive power for modeling natural objects. Nonetheless, it is not trivial to train a ConvNet from scratch for classification of foods. This is due to the fact that ConvNets require large datasets and, to our knowledge, there is no large public dataset of food for this purpose. An alternative solution is to transfer knowledge from trained ConvNets to the domain of foods. In this work, we study how transferable state-of-the-art ConvNets are to the task of food classification. We also propose a method for transferring knowledge from a bigger ConvNet to a smaller ConvNet while keeping its accuracy similar to that of the bigger ConvNet. Our experiments on the UECFood256 dataset show that GoogLeNet, VGG and residual networks produce comparable results if we start transferring knowledge from an appropriate layer. In addition, we show that our method is able to effectively transfer knowledge to the smaller ConvNet using unlabeled samples.
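
    A minimal PyTorch sketch of one standard way to transfer knowledge from a bigger ConvNet to a smaller one on unlabeled images: distillation on softened logits. The temperature, class count, and random logits are placeholders, and the paper's exact transfer method may differ.

        import torch
        import torch.nn.functional as F

        def distillation_loss(student_logits, teacher_logits, T: float = 4.0):
            """Soft-target loss for teacher-to-student transfer; T softens
            both distributions, and no ground-truth labels are needed, so
            it works on unlabeled food images."""
            p_teacher = F.softmax(teacher_logits / T, dim=1)
            log_p_student = F.log_softmax(student_logits / T, dim=1)
            return F.kl_div(log_p_student, p_teacher,
                            reduction="batchmean") * T * T

        student_logits = torch.randn(8, 256)   # e.g. 256 classes (UECFood256)
        teacher_logits = torch.randn(8, 256)
        print(distillation_loss(student_logits, teacher_logits))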

  7. National Hydropower Plant Dataset, Version 2 (FY18Q3)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Samu, Nicole; Kao, Shih-Chieh; O'Connor, Patrick

    The National Hydropower Plant Dataset, Version 2 (FY18Q3) is a geospatially comprehensive point-level dataset containing locations and key characteristics of U.S. hydropower plants that are currently either in the hydropower development pipeline (pre-operational), operational, withdrawn, or retired. These data are provided in GIS and tabular formats with corresponding metadata for each. In addition, we include access to download 2 versions of the National Hydropower Map, which was produced with these data (i.e. Map 1 displays the geospatial distribution and characteristics of all operational hydropower plants; Map 2 displays the geospatial distribution and characteristics of operational hydropower plants with pumped storage and mixed capabilities only). This dataset is a subset of ORNL's Existing Hydropower Assets data series, updated quarterly as part of ORNL's National Hydropower Asset Assessment Program.

  8. A randomised study of perioperative esmolol infusion for haemodynamic stability during major vascular surgery; rationale and design of DECREASE-XIII.

    PubMed

    Bakker, E J; Ravensbergen, N J; Voute, M T; Hoeks, S E; Chonchol, M; Klimek, M; Poldermans, D

    2011-09-01

    This article describes the rationale and design of the DECREASE-XIII trial, which aims to evaluate the potential of esmolol infusion, an ultra-short-acting beta-blocker, during surgery as an add-on to chronic low-dose beta-blocker therapy to maintain perioperative haemodynamic stability during major vascular surgery. A double-blind, placebo-controlled, randomised trial. A total of 260 vascular surgery patients will be randomised to esmolol or placebo as an add-on to standard medical care, including chronic low-dose beta-blockers. Esmolol is titrated to maintain a heart rate within a target window of 60-80 beats per minute for 24 h from the induction of anaesthesia. Heart rate and ischaemia are assessed by continuous 12-lead electrocardiographic monitoring for 72 h, starting 1 day prior to surgery. The primary outcome measure is duration of heart rate outside the target window during infusion of the study drug. Secondary outcome measures will be the efficacy parameters of occurrence of cardiac ischaemia, troponin T release, myocardial infarction and cardiac death within 30 days after surgery and safety parameters such as the occurrence of stroke and hypotension. This study will provide data on the efficacy of esmolol titration in chronic beta-blocker users for tight heart-rate control and reduction of ischaemia in patients undergoing vascular surgery as well as data on safety parameters. Copyright © 2011 European Society for Vascular Surgery. Published by Elsevier Ltd. All rights reserved.

  9. Securely Measuring the Overlap between Private Datasets with Cryptosets

    PubMed Central

    Swamidass, S. Joshua; Matlock, Matthew; Rozenblit, Leon

    2015-01-01

    Many scientific questions are best approached by sharing data—collected by different groups or across large collaborative networks—into a combined analysis. Unfortunately, some of the most interesting and powerful datasets—like health records, genetic data, and drug discovery data—cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset’s contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach “information-theoretic” security, the strongest type of security possible in cryptography, which cannot be cracked even with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure. PMID:25714898
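
    A toy sketch in the spirit of cryptosets: each party publishes only a histogram of salted hashes of its record IDs, and the overlap is estimated from the dot product of the two histograms. The estimator below follows from the collision probability 1/L for distinct items (the paper itself works with correlation-based estimates); L, the salt, and the identifiers are illustrative.

        import hashlib
        import numpy as np

        L = 1024  # summary length: small enough that bins stay crowded

        def cryptoset(ids, salt=b"public-protocol-salt"):
            """Publicly shareable summary: a histogram of hashed IDs."""
            counts = np.zeros(L, dtype=np.int64)
            for item in ids:
                h = hashlib.sha256(salt + item.encode()).digest()
                counts[int.from_bytes(h[:8], "big") % L] += 1
            return counts

        def estimate_overlap(a: np.ndarray, b: np.ndarray) -> float:
            """Distinct items collide with probability 1/L, so
            E[a.b] = s + (nA*nB - s)/L for s shared items; solve for s."""
            na, nb = a.sum(), b.sum()
            return float((a @ b - na * nb / L) / (1 - 1 / L))

        ids_a = [f"patient{i}" for i in range(2000)]
        ids_b = [f"patient{i}" for i in range(1500, 4000)]  # true overlap 500
        print(estimate_overlap(cryptoset(ids_a), cryptoset(ids_b)))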

  10. NASA Cold Land Processes Experiment (CLPX 2002/03): Atmospheric analyses datasets

    Treesearch

    Glen E. Liston; Daniel L. Birkenheuer; Christopher A. Hiemstra; Donald W. Cline; Kelly Elder

    2008-01-01

    This paper describes the Local Analysis and Prediction System (LAPS) and the 20-km horizontal grid version of the Rapid Update Cycle (RUC20) atmospheric analyses datasets, which are available as part of the Cold Land Processes Field Experiment (CLPX) data archive. The LAPS dataset contains spatially and temporally continuous atmospheric and surface variables over...

  11. MODFLOW-LGR: Practical application to a large regional dataset

    NASA Astrophysics Data System (ADS)

    Barnes, D.; Coulibaly, K. M.

    2011-12-01

    In many areas of the US, including southwest Florida, large regional-scale groundwater models have been developed to aid in decision making and water resources management. These models are subsequently used as a basis for site-specific investigations. Because the large scale of these regional models is not appropriate for local application, refinement is necessary to analyze the local effects of pumping wells and groundwater related projects at specific sites. The most commonly used approach to date is Telescopic Mesh Refinement or TMR. It allows the extraction of a subset of the large regional model with boundary conditions derived from the regional model results. The extracted model is then updated and refined for local use using a variable sized grid focused on the area of interest. MODFLOW-LGR, local grid refinement, is an alternative approach which allows model discretization at a finer resolution in areas of interest and provides coupling between the larger "parent" model and the locally refined "child." In the present work, these two approaches are tested on a mining impact assessment case in southwest Florida using a large regional dataset (The Lower West Coast Surficial Aquifer System Model). Various metrics for performance are considered. They include: computation time, water balance (as compared to the variable sized grid), calibration, implementation effort, and application advantages and limitations. The results indicate that MODFLOW-LGR is a useful tool to improve local resolution of regional scale models. While performance metrics, such as computation time, are case-dependent (model size, refinement level, stresses involved), implementation effort, particularly when regional models of suitable scale are available, can be minimized. The creation of multiple child models within a larger scale parent model makes it possible to reuse the same calibrated regional dataset with minimal modification. In cases similar to the Lower West Coast model, where a

  12. Efficient genotype compression and analysis of large genetic variation datasets

    PubMed Central

    Layer, Ryan M.; Kindlon, Neil; Karczewski, Konrad J.; Quinlan, Aaron R.

    2015-01-01

    Genotype Query Tools (GQT) is a new indexing strategy that expedites analyses of genome variation datasets in VCF format based on sample genotypes, phenotypes and relationships. GQT’s compressed genotype index minimizes decompression for analysis, and its performance relative to existing methods improves with cohort size. We show substantial (up to 443-fold) performance gains over existing methods and demonstrate GQT’s utility for exploring massive datasets involving thousands to millions of genomes. PMID:26550772
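
    GQT itself is more elaborate (compressed, word-aligned indexes), but a toy NumPy sketch conveys the underlying genotype-centric bit-index idea, where queries become bitwise operations over packed genotype planes instead of decompressing full records. The matrix and query are invented.

        import numpy as np

        # Toy genotype matrix: rows = variants, cols = samples;
        # values 0/1/2 = alternate-allele count.
        G = np.array([[0, 1, 2, 0],
                      [1, 1, 0, 2],
                      [0, 0, 0, 1]], dtype=np.uint8)

        # One boolean plane per genotype value, packed into bytes so
        # queries can run as bitwise operations on machine words.
        planes = {g: np.packbits(G == g, axis=1) for g in (0, 1, 2)}

        # Query: at how many variants are samples 0 and 1 both heterozygous?
        het = np.unpackbits(planes[1], axis=1, count=G.shape[1]).astype(bool)
        both_het = het[:, 0] & het[:, 1]
        print(int(both_het.sum()))   # -> 1 (the second variant)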

  13. Pantheon 1.0, a manually verified dataset of globally famous biographies.

    PubMed

    Yu, Amy Zhao; Ronen, Shahar; Hu, Kevin; Lu, Tiffany; Hidalgo, César A

    2016-01-05

    We present the Pantheon 1.0 dataset: a manually verified dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually verified demographic information (place and date of birth, gender); (ii) a taxonomy of occupations classifying each biography at three levels of aggregation; and (iii) two measures of global popularity: the number of languages in which a biography is present in Wikipedia (L), and the Historical Popularity Index (HPI), a metric that combines information on L, time since birth, and page-views (2008-2013). We compare the Pantheon 1.0 dataset to data from the 2003 book, Human Accomplishments, and also to external measures of accomplishment in individual games and sports: Tennis, Swimming, Car Racing, and Chess. In all of these cases we find that measures of popularity (L and HPI) correlate highly with individual accomplishment, suggesting that measures of global popularity proxy the historical impact of individuals.

  14. Pantheon 1.0, a manually verified dataset of globally famous biographies

    PubMed Central

    Yu, Amy Zhao; Ronen, Shahar; Hu, Kevin; Lu, Tiffany; Hidalgo, César A.

    2016-01-01

    We present the Pantheon 1.0 dataset: a manually verified dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually verified demographic information (place and date of birth, gender); (ii) a taxonomy of occupations classifying each biography at three levels of aggregation; and (iii) two measures of global popularity: the number of languages in which a biography is present in Wikipedia (L), and the Historical Popularity Index (HPI), a metric that combines information on L, time since birth, and page-views (2008–2013). We compare the Pantheon 1.0 dataset to data from the 2003 book, Human Accomplishments, and also to external measures of accomplishment in individual games and sports: Tennis, Swimming, Car Racing, and Chess. In all of these cases we find that measures of popularity (L and HPI) correlate highly with individual accomplishment, suggesting that measures of global popularity proxy the historical impact of individuals. PMID:26731133

  15. Developing a new global network of river reaches from merged satellite-derived datasets

    NASA Astrophysics Data System (ADS)

    Lion, C.; Allen, G. H.; Beighley, E.; Pavelsky, T.

    2015-12-01

    In 2020, the Surface Water and Ocean Topography satellite (SWOT), a joint mission of NASA/CNES/CSA/UK, will be launched. One of its major products will be measurements of continental water extent, including the width, height, and slope of rivers and the surface area and elevations of lakes. The mission will improve the monitoring of continental water and also our understanding of the interactions between different hydrologic reservoirs. For rivers, SWOT measurements of slope must be carried out over predefined river reaches. As such, an a priori dataset for rivers is needed in order to facilitate analysis of the raw SWOT data. The information required to produce this dataset includes measurements of river width, elevation, slope, planform, river network topology, and flow accumulation. To produce this product, we have linked two existing global datasets: the Global River Widths from Landsat (GRWL) database, which contains river centerline locations, widths, and a braiding index derived from Landsat imagery, and a modified version of the HydroSHEDS hydrologically corrected digital elevation product, which contains heights and flow accumulation measurements for streams at 3 arcsecond spatial resolution. Merging these two datasets requires considerable care. The difficulties, among others, lie in the difference in resolution (30 m versus 3 arcseconds) and the age of the datasets (2000 versus ~2010; some rivers have moved, and the braided sections are different). As such, we have developed custom software to merge the two datasets, taking into account the spatial proximity of river channels in the two datasets and ensuring that flow accumulation in the final dataset always increases downstream. Here, we present our preliminary results for a portion of South America and demonstrate the strengths and weaknesses of the method.

  16. Evaluation and inter-comparison of modern day reanalysis datasets over Africa and the Middle East

    NASA Astrophysics Data System (ADS)

    Shukla, S.; Arsenault, K. R.; Hobbins, M.; Peters-Lidard, C. D.; Verdin, J. P.

    2015-12-01

    Reanalysis datasets are potentially very valuable for otherwise data-sparse regions such as Africa and the Middle East. They are potentially useful for long-term climate and hydrologic analyses and, given their availability in real time, they are particularly attractive for real-time hydrologic monitoring purposes (e.g. to monitor flood and drought events). Generally in data-sparse regions, reanalysis variables such as precipitation, temperature, radiation and humidity are used in conjunction with in-situ and/or satellite-based datasets to generate long-term gridded atmospheric forcing datasets. These atmospheric forcing datasets are used to drive offline land surface models and simulate soil moisture and runoff, which are natural indicators of hydrologic conditions. Therefore, any uncertainty or bias in the reanalysis datasets contributes to uncertainties in hydrologic monitoring estimates. In this presentation, we report on a comprehensive analysis that evaluates several modern-day reanalysis products (such as NASA's MERRA-1 and -2, ECMWF's ERA-Interim and NCEP's CFS Reanalysis) over Africa and the Middle East region. We compare the precipitation and temperature from the reanalysis products with other independent gridded datasets such as the GPCC, CRU, and USGS/UCSB CHIRPS precipitation datasets, and the CRU temperature datasets. The evaluations are conducted at a monthly time scale, since some of these independent datasets are only available at this temporal resolution. The evaluations range from comparison of the monthly mean climatology to interannual variability and long-term changes. Finally, we also present the results of inter-comparisons of radiation and humidity variables from the different reanalysis datasets.

  17. The in vivo effect of fibrinogen and factor XIII on clot formation and fibrinolysis in Glanzmann's thrombasthenia.

    PubMed

    Shenkman, Boris; Livnat, Tami; Misgav, Mudi; Budnik, Ivan; Einav, Yulia; Martinowitz, Uriel

    2012-01-01

    Glanzmann's thrombasthenia (GT) is characterized by increased bleeding risk. The treatment options in GT are limited. The aim of this study was to test the effect of supplementing GT blood with fibrinogen and factor XIII on thrombin generation, blood clotting, and fibrinolysis. Whole blood samples of GT patients and of normal donors treated with eptifibatide (GT model) were subjected to clotting by CaCl(2) and tissue factor. Thrombin generation was measured in platelet-rich plasma. Clot formation and tPA-induced fibrinolysis were evaluated in whole blood by rotation thromboelastometry (ROTEM). Blood was supplemented with fibrinogen (3 g/L) and/or FXIII (2 IU/mL). Thrombin generation analysis of blood derived from the GT model and from GT patients revealed decreased endogenous thrombin potential and peak height and extended lag time compared to control. However, this method was not sensitive to blood spiking with fibrinogen and FXIII. ROTEM revealed lower maximum clot firmness (MCF) and area under curve (AUC) in the blood of the GT model and GT patients. In the absence of exogenous tPA, blood spiking with fibrinogen markedly enhanced clot quality while FXIII had no effect. The combination of fibrinogen and FXIII did not add to the effect of fibrinogen. In contrast, with the addition of tPA, both fibrinogen and FXIII separately and, to a greater extent, in combination enhanced clot quality as well as resistance against tPA-induced fibrinolysis (increasing MCF, AUC, and lysis onset time). In conclusion, fibrinogen and FXIII stimulated blood clotting and inhibited fibrinolysis. Treating normal blood with eptifibatide mimics the coagulopathic changes of GT blood.

  18. MVIRI/SEVIRI TOA Radiation Datasets within the Climate Monitoring SAF

    NASA Astrophysics Data System (ADS)

    Urbain, Manon; Clerbaux, Nicolas; Ipe, Alessandro; Baudrez, Edward; Velazquez Blazquez, Almudena; Moreels, Johan

    2016-04-01

    Within CM SAF, Interim Climate Data Records (ICDR) of Top-Of-Atmosphere (TOA) radiation products from the Geostationary Earth Radiation Budget (GERB) instruments on the Meteosat Second Generation (MSG) satellites were released in 2013. These datasets (referred to as CM-113 and CM-115, resp. for shortwave (SW) and longwave (LW) radiation) are based on the instantaneous TOA fluxes from the GERB Edition-1 dataset. They cover the time period 2004-2011. Extending these datasets backward in the past is not possible as no GERB instruments were available on the Meteosat First Generation (MFG) satellites. As an alternative, it is proposed to rely on the Meteosat Visible and InfraRed Imager (MVIRI - from 1982 until 2004) and the Spinning Enhanced Visible and Infrared Imager (SEVIRI - from 2004 onward) to generate a long Thematic Climate Data Record (TCDR) from Meteosat instruments. Combining MVIRI and SEVIRI allows an unprecedented temporal (30 minutes / 15 minutes) and spatial (2.5 km / 3 km) resolution compared to the Clouds and the Earth's Radiant Energy System (CERES) products. This is a step forward as it helps to increase the knowledge of the diurnal cycle and the small-scale spatial variations of radiation. The MVIRI/SEVIRI datasets (referred to as CM-23311 and CM-23341, resp. for SW and LW radiation) will provide daily and monthly averaged TOA Reflected Solar (TRS) and Emitted Thermal (TET) radiation in "all-sky" conditions (no clear-sky conditions for this first version of the datasets), as well as monthly averages of the hourly integrated values. The SEVIRI Solar Channels Calibration (SSCC) and the operational calibration have been used resp. for the SW and LW channels. For MFG, it is foreseen to replace the latter by the EUMETSAT/GSICS recalibration of MVIRI using HIRS. The CERES TRMM angular dependency models have been used to compute TRS fluxes while theoretical models have been used for TET fluxes. The CM-23311 and CM-23341 datasets will cover a 32-year period

  19. An analysis of early oncologic head and neck free flap reoperations from the 2005-2012 ACS-NSQIP dataset.

    PubMed

    Ligh, Cassandra A; Nelson, Jonas A; Wink, Jason D; Gerety, Patrick A; Fischer, John P; Wu, Liza C; Kanchwala, Suhail K

    2016-01-01

    There are limited population-based studies that examine perioperative factors that influence postoperative surgical take-backs to the OR following free flap (FF) reconstruction for head/neck cancer extirpation. The purpose of this study was to critically analyse head/neck free flaps (HNFF) captured in the ACS-NSQIP dataset, with a specific focus on postoperative complications and the incidence of factors associated with reoperation. The 2005-2012 ACS-NSQIP datasets were accessed to identify patients undergoing FF reconstruction after a diagnosis of head/neck cancer. Patient demographics, comorbidities, and perioperative risk factors were examined as covariates, and the primary outcome was return to the OR within 30 days of surgery. A multivariate regression was performed to determine independent preoperative factors associated with this complication. In total, 855 patients underwent FF for head/neck reconstruction, most commonly of the tongue (24.7%) and mouth/floor of the oral cavity (25.0%). Of these, 153 patients (17.9%) returned to the OR within 30 days of surgery. Patients in this cohort had higher rates of wound infections and dehiscence (p < 0.01). Medical complications were significantly higher and included pneumonia (12.4% vs 5.0%, p < 0.01), prolonged ventilation (16.3% vs 4.8%, p < 0.01), myocardial infarction (2.6% vs 0.6%, p = 0.017), and sepsis (7.2% vs 3.4%, p = 0.033). Regression analysis demonstrated that visceral flaps (OR = 9.7, p = 0.012) and hypoalbuminemia (OR = 2.4, p = 0.009) were significant predictors of a return to the OR. Based on data from the nationwide NSQIP dataset, up to 17% of HNFF patients return to the OR within 30 days. Although this dataset has some significant limitations, these results can cautiously help to improve preoperative patient optimisation and surgical decision-making.
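
    The covariate-adjusted analysis described above is a standard multivariate logistic regression. Below is a minimal sketch of that kind of model; the CSV file and column names (return_to_or, visceral_flap, hypoalbuminemia, age) are hypothetical stand-ins, not the actual NSQIP schema.

    ```python
    # Hedged sketch: logistic regression for 30-day return to the OR.
    # File and column names are hypothetical placeholders.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("hnff_cohort.csv")  # hypothetical one-row-per-patient extract
    X = sm.add_constant(df[["visceral_flap", "hypoalbuminemia", "age"]])
    fit = sm.Logit(df["return_to_or"], X).fit()

    # Odds ratios with 95% confidence intervals
    ci = fit.conf_int()
    for name in fit.params.index:
        print(name, np.exp(fit.params[name]), np.exp(ci.loc[name].to_numpy()))
    ```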

  20. Transfer Kernel Common Spatial Patterns for Motor Imagery Brain-Computer Interface Classification

    PubMed Central

    Dai, Mengxi; Liu, Shucong; Zhang, Pengju

    2018-01-01

    Motor-imagery-based brain-computer interfaces (BCIs) commonly use the common spatial pattern (CSP) as a preprocessing step before classification. The CSP method is a supervised algorithm and therefore requires a large amount of training data, which is time-consuming to collect. To address this issue, one promising approach is transfer learning, which generalizes a learning model so that it can extract discriminative information from other subjects for the target classification task. To this end, we propose a transfer kernel CSP (TKCSP) approach to learn a domain-invariant kernel by directly matching the distributions of source subjects and target subjects. Dataset IVa of BCI Competition III is used to demonstrate the validity of the proposed method. In the experiment, we compare the classification performance of TKCSP against CSP, CSP for subject-to-subject transfer (CSP SJ-to-SJ), regularized CSP (RCSP), stationary subspace CSP (ssCSP), multitask CSP (mtCSP), and the combined mtCSP and ssCSP (ss + mtCSP) method. The results indicate that TKCSP achieves the best mean classification performance, 81.14%, especially when source subjects have few training samples. Comprehensive experimental evidence on the dataset verifies the effectiveness and efficiency of the proposed TKCSP approach over several state-of-the-art methods. PMID:29743934
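
    For context, classical CSP (the step that TKCSP generalizes) finds spatial filters maximizing the variance ratio between the two motor-imagery classes. A minimal sketch follows, assuming trials shaped (n_trials, n_channels, n_samples) and binary labels; it illustrates the standard algorithm, not the authors' TKCSP code.

    ```python
    # Minimal CSP sketch: filters from a generalized eigendecomposition of
    # class covariance matrices, features as normalized log-variance.
    import numpy as np
    from scipy.linalg import eigh

    def csp_filters(trials, labels, n_filters=3):
        covs = []
        for c in (0, 1):
            X = trials[labels == c]
            covs.append(np.mean([x @ x.T / np.trace(x @ x.T) for x in X], axis=0))
        # Solve covs[0] w = lambda (covs[0] + covs[1]) w; eigenvalues ascend
        eigvals, W = eigh(covs[0], covs[0] + covs[1])
        picks = np.r_[np.arange(n_filters), np.arange(len(eigvals) - n_filters, len(eigvals))]
        return W[:, picks]  # filters from both ends of the eigenvalue spectrum

    def csp_features(trials, W):
        feats = []
        for x in trials:
            v = (W.T @ x).var(axis=1)
            feats.append(np.log(v / v.sum()))  # the usual CSP log-variance feature
        return np.array(feats)
    ```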

  1. Transfer Kernel Common Spatial Patterns for Motor Imagery Brain-Computer Interface Classification.

    PubMed

    Dai, Mengxi; Zheng, Dezhi; Liu, Shucong; Zhang, Pengju

    2018-01-01

    Motor-imagery-based brain-computer interfaces (BCIs) commonly use the common spatial pattern (CSP) as a preprocessing step before classification. The CSP method is a supervised algorithm and therefore requires a large amount of training data, which is time-consuming to collect. To address this issue, one promising approach is transfer learning, which generalizes a learning model so that it can extract discriminative information from other subjects for the target classification task. To this end, we propose a transfer kernel CSP (TKCSP) approach to learn a domain-invariant kernel by directly matching the distributions of source subjects and target subjects. Dataset IVa of BCI Competition III is used to demonstrate the validity of the proposed method. In the experiment, we compare the classification performance of TKCSP against CSP, CSP for subject-to-subject transfer (CSP SJ-to-SJ), regularized CSP (RCSP), stationary subspace CSP (ssCSP), multitask CSP (mtCSP), and the combined mtCSP and ssCSP (ss + mtCSP) method. The results indicate that TKCSP achieves the best mean classification performance, 81.14%, especially when source subjects have few training samples. Comprehensive experimental evidence on the dataset verifies the effectiveness and efficiency of the proposed TKCSP approach over several state-of-the-art methods.

  2. A multimodal MRI dataset of professional chess players.

    PubMed

    Li, Kaiming; Jiang, Jing; Qiu, Lihua; Yang, Xun; Huang, Xiaoqi; Lui, Su; Gong, Qiyong

    2015-01-01

    Chess is a good model to study high-level human brain functions such as spatial cognition, memory, planning, learning and problem solving. Recent studies have demonstrated that non-invasive MRI techniques are valuable for researchers to investigate the underlying neural mechanism of playing chess. For professional chess players (e.g., chess grandmasters and masters, or GM/Ms), the structural and functional alterations caused by long-term professional practice, and how these alterations relate to behavior, remain largely unknown. Here, we report a multimodal MRI dataset from 29 professional Chinese chess players (most of whom are GM/Ms) and 29 age-matched novices. We hope that this dataset will provide researchers with new materials to further explore high-level human brain functions.

  3. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic, with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history.

  4. Real-world datasets for portfolio selection and solutions of some stochastic dominance portfolio models.

    PubMed

    Bruni, Renato; Cesarone, Francesco; Scozzari, Andrea; Tardella, Fabio

    2016-09-01

    A large number of portfolio selection models have appeared in the literature since the pioneering work of Markowitz. However, even when computational and empirical results are described, they are often hard to replicate and compare due to the unavailability of the datasets used in the experiments. We provide here several datasets for portfolio selection generated using real-world price values from several major stock markets. The datasets contain weekly return values, adjusted for dividends and for stock splits, which are cleaned from errors as much as possible. The datasets are available in different formats, and can be used as benchmarks for testing the performances of portfolio selection models and for comparing the efficiency of the algorithms used to solve them. We also provide, for these datasets, the portfolios obtained by several selection strategies based on Stochastic Dominance models (see "On Exact and Approximate Stochastic Dominance Strategies for Portfolio Selection" (Bruni et al. [2])). We believe that testing portfolio models on publicly available datasets greatly simplifies the comparison of the different portfolio selection strategies.
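
    A minimal sketch of how such a weekly-returns benchmark might be consumed is shown below; the file name and layout (rows = weeks, columns = assets) are assumptions for illustration, not the authors' published format.

    ```python
    # Hedged sketch: evaluate an equally weighted portfolio on a dataset of
    # weekly returns. File name and layout are hypothetical.
    import numpy as np
    import pandas as pd

    returns = pd.read_csv("weekly_returns.csv", index_col=0)  # weeks x assets
    w = np.full(returns.shape[1], 1.0 / returns.shape[1])     # equal weights
    port = returns.to_numpy() @ w                             # weekly portfolio returns
    print("annualized mean return:", port.mean() * 52)
    print("annualized volatility:", port.std(ddof=1) * np.sqrt(52))
    ```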

  5. Selecting AGN through Variability in SN Datasets

    NASA Astrophysics Data System (ADS)

    Boutsia, K.; Leibundgut, B.; Trevese, D.; Vagnetti, F.

    2010-07-01

    Variability is a main property of Active Galactic Nuclei (AGN), and it has been adopted as a selection criterion in multi-epoch surveys conducted for the detection of supernovae (SNe). We have used two SN datasets. First, we selected the AXAF field of the STRESS project, centered on the Chandra Deep Field South, where, besides deep X-ray surveys, various optical catalogs also exist. Our method yielded 132 variable AGN candidates. We then extended our method to include the dataset of the ESSENCE project, which has been active for 6 years, producing high-quality light curves in the R and I bands. We obtained a sample of ˜4800 variable sources, down to R=22, in the whole 12 deg² ESSENCE field. Among them, a subsample of ˜500 high-priority AGN candidates was created using the shape of the structure function as a secondary criterion. In a pilot spectroscopic run we have confirmed the AGN nature of nearly all of our candidates.
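
    The structure function used above as a secondary criterion measures how the typical magnitude difference grows with time lag. A minimal sketch, assuming a single light curve sampled at times t (days) with magnitudes mag:

    ```python
    # First-order structure function: mean |delta mag| in bins of time lag.
    import numpy as np

    def structure_function(t, mag, bin_edges):
        dt = np.abs(t[:, None] - t[None, :])
        dm = np.abs(mag[:, None] - mag[None, :])
        iu = np.triu_indices(len(t), k=1)  # count each pair once
        dt, dm = dt[iu], dm[iu]
        sf = []
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            sel = dm[(dt >= lo) & (dt < hi)]
            sf.append(sel.mean() if sel.size else np.nan)
        return np.array(sf)

    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 1000, 60))   # toy observation epochs
    mag = rng.normal(20.0, 0.1, 60)         # toy (non-variable) light curve
    print(structure_function(t, mag, np.array([1.0, 10.0, 100.0, 1000.0])))
    ```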

  6. Selection-Fusion Approach for Classification of Datasets with Missing Values

    PubMed Central

    Ghannad-Rezaie, Mostafa; Soltanian-Zadeh, Hamid; Ying, Hao; Dong, Ming

    2010-01-01

    This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values, where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. Then, it combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade-off, a numerical criterion is proposed for predicting the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (a total of eight datasets). Experimental results show that the classification accuracy of the proposed method is superior to that of the widely used multiple imputations method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values. PMID:20212921
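
    A minimal sketch of the subset idea follows, assuming X is a samples-by-features array with NaN marking missing values; the per-pattern classifiers and probability-averaging fusion are a simplification of the paper's clustering-based formulation.

    ```python
    # Hedged sketch: one classifier per missing-value pattern, fused by
    # averaging predicted probabilities. A simplification, not the paper's method.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_pattern_classifiers(X, y, min_samples=20):
        patterns = np.isnan(X)
        models = []
        for pat in np.unique(patterns, axis=0):
            rows = (patterns == pat).all(axis=1)  # samples sharing this pattern
            cols = ~pat                           # features available to them
            if rows.sum() >= min_samples and cols.any():
                clf = LogisticRegression(max_iter=1000)
                clf.fit(X[np.ix_(rows, cols)], y[rows])
                models.append((cols, clf))
        return models

    def predict_fused(models, x):
        probs = [clf.predict_proba(x[cols].reshape(1, -1))[0, 1]
                 for cols, clf in models if not np.isnan(x[cols]).any()]
        return float(np.mean(probs)) if probs else float("nan")
    ```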

  7. Dataset for Testing Contamination Source Identification Methods for Water Distribution Networks

    EPA Pesticide Factsheets

    This dataset includes the results of a simulation study using the source inversion techniques available in the Water Security Toolkit. The data was created to test the different techniques for accuracy, specificity, false positive rate, and false negative rate. The tests examined different parameters including measurement error, modeling error, injection characteristics, time horizon, network size, and sensor placement. The water distribution system network models that were used in the study are also included in the dataset. This dataset is associated with the following publication: Seth, A., K. Klise, J. Siirola, T. Haxton, and C. Laird. Testing Contamination Source Identification Methods for Water Distribution Networks. Journal of Environmental Division, Proceedings of the American Society of Civil Engineers, ASCE, Reston, VA, USA (2016).

  8. Dataset variability leverages white-matter lesion segmentation performance with convolutional neural network

    NASA Astrophysics Data System (ADS)

    Ravnik, Domen; Jerman, Tim; Pernuš, Franjo; Likar, Boštjan; Špiclin, Žiga

    2018-03-01

    Performance of a convolutional neural network (CNN) based white-matter lesion segmentation in magnetic resonance (MR) brain images was evaluated under various conditions involving different levels of image preprocessing and augmentation and different compositions of the training dataset. On images of sixty multiple sclerosis patients, half acquired on one scanner and half on another scanner from a different vendor, we first created highly accurate multi-rater consensus-based lesion segmentations, which were used in several experiments to evaluate the CNN segmentation results. First, the CNN was trained and tested without preprocessing the images and with various combinations of preprocessing techniques, namely histogram-based intensity standardization, normalization by whitening, and training dataset augmentation by flipping the images across the midsagittal plane. Then, the CNN was trained and tested on images of the same, different or interleaved scanner datasets using a cross-validation approach. The results indicate that image preprocessing has little impact on performance in a same-scanner situation, while between-scanner performance benefits most from intensity standardization and normalization, and further from incorporating heterogeneous multi-scanner datasets in the training phase. Under such conditions the between-scanner performance of the CNN approaches that of the ideal situation, in which the CNN is trained and tested on the same scanner dataset.

  9. Nanocubes for real-time exploration of spatiotemporal datasets.

    PubMed

    Lins, Lauro; Klosowski, James T; Scheidegger, Carlos

    2013-12-01

    Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Many relational databases implement the well-known data cube aggregation operation, which in a sense precomputes every possible aggregate query over the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk storage. In contrast, we show how to construct a data cube that fits in a modern laptop's main memory, even for billions of entries; we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets, and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are dominated by network and user-interaction latencies.
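
    As a toy illustration of the precomputed-aggregation idea (not of the nanocube index itself), the following shows the data-cube pattern with pandas: any coarser aggregate, such as a spatial heatmap, is a partial sum over a fine-grained cube.

    ```python
    # Toy data-cube illustration with pandas; not the nanocube data structure.
    import pandas as pd

    events = pd.DataFrame({
        "lat_bin": [10, 10, 11, 11], "lon_bin": [20, 21, 20, 21],
        "hour":    [0, 0, 1, 1],     "device":  ["a", "b", "a", "b"],
        "count":   [3, 1, 4, 2],
    })
    # Finest level: one precomputed sum per (space, time, attribute) cell
    cube = events.groupby(["lat_bin", "lon_bin", "hour", "device"])["count"].sum()
    # A heatmap over space is a roll-up (partial sum) of the cube
    print(cube.groupby(level=["lat_bin", "lon_bin"]).sum())
    ```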

  10. Increasing consistency of disease biomarker prediction across datasets.

    PubMed

    Chikina, Maria D; Sealfon, Stuart C

    2014-01-01

    Microarray studies with human subjects often have limited sample sizes, which hampers the ability to detect reliable biomarkers associated with disease and motivates the need to aggregate data across studies. However, human gene expression measurements may be influenced by many non-random factors such as genetics, sample preparation, and tissue heterogeneity. These factors can contribute to a lack of agreement among related studies, limiting the utility of their aggregation. We show that it is feasible to carry out an automatic correction of individual datasets to reduce the effect of such 'latent variables' (without prior knowledge of the variables) in such a way that datasets addressing the same condition show better agreement once each is corrected. We build our approach on the method of surrogate variable analysis (SVA), but we demonstrate that the original algorithm is unsuitable for the analysis of human tissue samples that are mixtures of different cell types. We propose a modification to SVA that is crucial to obtaining the improvement in agreement that we observe. We develop our method on a compendium of multiple sclerosis data and verify it on an independent compendium of Parkinson's disease datasets. In both cases, we show that our method is able to improve agreement across varying study designs, platforms, and tissues. This approach has the potential for wide applicability to any field where lack of inter-study agreement has been a concern.
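
    A heavily simplified sketch of the latent-variable idea (not the authors' modified SVA) is given below: estimate surrogate variables from the principal components of residual expression variation and regress them out together with the known design.

    ```python
    # Hedged sketch of surrogate-variable-style correction; the real SVA
    # algorithm (and the authors' modification) is considerably more involved.
    import numpy as np

    def remove_latent(expr, design, n_sv=2):
        """expr: genes x samples; design: samples x known covariates."""
        beta, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
        resid = expr.T - design @ beta                  # residual variation
        u, s, vt = np.linalg.svd(resid, full_matrices=False)
        sv = u[:, :n_sv]                                # surrogate variables
        aug = np.hstack([design, sv])
        beta2, *_ = np.linalg.lstsq(aug, expr.T, rcond=None)
        return (expr.T - sv @ beta2[design.shape[1]:]).T  # corrected genes x samples
    ```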

  11. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    NASA Astrophysics Data System (ADS)

    Beck, H.; Vergopolan, N.; Pan, M.; Levizzani, V.; van Dijk, A.; Weedon, G. P.; Brocca, L.; Huffman, G. J.; Wood, E. F.; William, L.

    2017-12-01

    We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000-2016. Twelve non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76,086 gauges worldwide. Another ten gauge-corrected ones were evaluated using hydrological modeling, by calibrating the conceptual model HBV against streamflow records for each of 9053 small to medium-sized (<50,000 km²) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR), the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely gauged or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed those indirectly incorporating gauge data through other multi-source datasets (PERSIANN-CDR V1R1 and PGF). Our results highlight large differences in estimation accuracy, and hence the importance of P dataset selection in both research and operational applications.
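
    A minimal sketch of the gauge-based half of such an evaluation, assuming each gauge provides an aligned pair of daily series (gauge observations and co-located grid-cell estimates, NaN for missing days):

    ```python
    # Per-gauge temporal (Pearson) correlation between gauge and gridded P.
    import numpy as np

    def gauge_correlations(gauges, min_overlap=30):
        """gauges: {gauge_id: (obs, est)}; min_overlap is an arbitrary threshold."""
        out = {}
        for gid, (obs, est) in gauges.items():
            ok = ~(np.isnan(obs) | np.isnan(est))
            if ok.sum() >= min_overlap:
                out[gid] = np.corrcoef(obs[ok], est[ok])[0, 1]
        return out
    ```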

  12. Evaluation of Factor V G1691A, prothrombin G20210A, Factor XIII V34L, MTHFR A1298C, MTHFR C677T and PAI-1 4G/5G genotype frequencies of patients subjected to cardiovascular disease (CVD) panel in south-east region of Turkey.

    PubMed

    Oztuzcu, Serdar; Ergun, Sercan; Ulaşlı, Mustafa; Nacarkahya, Gülper; Iğci, Yusuf Ziya; Iğci, Mehri; Bayraktar, Recep; Tamer, Ali; Çakmak, Ecir Ali; Arslan, Ahmet

    2014-06-01

    Cardiovascular disease (CVD) risk factors, such as arterial hypertension, obesity, dyslipidemia or diabetes mellitus, as well as CVDs, including myocardial infarction, coronary artery disease or stroke, are the most prevalent diseases and account for the major causes of death worldwide. In the present study, 4,709 unrelated patients subjected to the CVD panel in the south-east part of Turkey between the years 2010 and 2013 were enrolled, and DNA was isolated from the blood samples of these patients. Mutation analyses were conducted using the real-time polymerase chain reaction method to screen six common mutations (Factor V G1691A, PT G20210A, Factor XIII V34L, MTHFR A1298C and C677T, and PAI-1 -675 4G/5G) found in the CVD panel. The prevalence of these mutations was 0.57, 0.25, 2.61, 13.78, 9.34 and 24.27% in homozygous form, respectively. Similarly, their frequencies in heterozygous form were 7.43, 3.44, 24.91, 44.94, 41.09 and 45.66%, respectively. No mutation was detected in 92 (1.95%) patients in total. As this is the first study to screen these six common mutations in the CVD panel in the south-east region of Turkey, it has considerable value for the diagnosis and treatment of these diseases. Based on the results of the present and previous studies, a careful examination for these genetic variants should be carried out in thrombophilia screening programs, particularly in the Turkish population.

  13. Amino acid changes in disease-associated variants differ radically from variants observed in the 1000 genomes project dataset.

    PubMed

    de Beer, Tjaart A P; Laskowski, Roman A; Parks, Sarah L; Sipos, Botond; Goldman, Nick; Thornton, Janet M

    2013-01-01

    The 1000 Genomes Project data provides a natural background dataset for amino acid germline mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric and the mutabilities of the different amino acids are very different. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes arginine mutability very much higher than other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites which are more exposed and less well conserved than random. Mutations to functional residues occur about half as often as expected by chance. The disease-associated amino acid variant distributions in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. The disease-associated variants preferentially occur in more conserved sites, compared to 1000 Genomes mutations. Many of the amino acid exchange profiles appear to exhibit an anti-correlation, with common exchanges in one dataset being rare in the other. Disease-associated variants exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to an improved prediction of the effects of specific variants in humans.
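
    Because the direction of mutation is known here, the exchange matrix is a simple asymmetric tally. A minimal sketch, assuming variants arrive as (reference, variant) one-letter amino acid pairs:

    ```python
    # Asymmetric amino acid exchange tally; rows = reference, columns = variant.
    import numpy as np

    AAS = "ACDEFGHIKLMNPQRSTVWY"
    IDX = {a: i for i, a in enumerate(AAS)}

    def exchange_matrix(pairs):
        m = np.zeros((20, 20), dtype=int)
        for ref, var in pairs:
            m[IDX[ref], IDX[var]] += 1
        return m

    m = exchange_matrix([("R", "H"), ("R", "C"), ("A", "V")])  # toy variant list
    print("arginine row total (mutability tally):", m[IDX["R"]].sum())
    ```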

  14. A longitudinal dataset of five years of public activity in the Scratch online community.

    PubMed

    Hill, Benjamin Mako; Monroy-Hernández, Andrés

    2017-01-31

    Scratch is a programming environment and an online community where young people can create, share, learn, and communicate. In collaboration with the Scratch Team at MIT, we created a longitudinal dataset of public activity in the Scratch online community during its first five years (2007-2012). The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more. To help researchers understand this dataset, and to establish the validity of the data, we also include the source code of every version of the software that operated the website, as well as the software used to generate this dataset. We believe this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.

  15. Systematic chemical-genetic and chemical-chemical interaction datasets for prediction of compound synergism

    PubMed Central

    Wildenhain, Jan; Spitzer, Michaela; Dolma, Sonam; Jarvik, Nick; White, Rachel; Roy, Marcia; Griffiths, Emma; Bellows, David S.; Wright, Gerard D.; Tyers, Mike

    2016-01-01

    The network structure of biological systems suggests that effective therapeutic intervention may require combinations of agents that act synergistically. However, a dearth of systematic chemical combination datasets has limited the development of predictive algorithms for chemical synergism. Here, we report two large datasets of linked chemical-genetic and chemical-chemical interactions in the budding yeast Saccharomyces cerevisiae. We screened 5,518 unique compounds against 242 diverse yeast gene deletion strains to generate an extended chemical-genetic matrix (CGM) of 492,126 chemical-gene interaction measurements. This CGM dataset contained 1,434 genotype-specific inhibitors, termed cryptagens. We selected 128 structurally diverse cryptagens and tested all pairwise combinations to generate a benchmark dataset of 8,128 pairwise chemical-chemical interaction tests for synergy prediction, termed the cryptagen matrix (CM). An accompanying database resource called ChemGRID was developed to enable analysis, visualisation and downloads of all data. The CGM and CM datasets will facilitate the benchmarking of computational approaches for synergy prediction, as well as chemical structure-activity relationship models for anti-fungal drug discovery. PMID:27874849

  16. Thesaurus Dataset of Educational Technology in Chinese

    ERIC Educational Resources Information Center

    Wu, Linjing; Liu, Qingtang; Zhao, Gang; Huang, Huan; Huang, Tao

    2015-01-01

    The thesaurus dataset of educational technology is a knowledge description of educational technology in Chinese. The aims of this thesaurus were to collect the subject terms in the domain of educational technology, facilitate the standardization of terminology and promote the communication between Chinese researchers and scholars from various…

  17. Statistical tests and identifiability conditions for pooling and analyzing multisite datasets

    PubMed Central

    Zhou, Hao Henry; Singh, Vikas; Johnson, Sterling C.; Wahba, Grace

    2018-01-01

    When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer’s disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies. PMID:29386387

  18. Statistical tests and identifiability conditions for pooling and analyzing multisite datasets.

    PubMed

    Zhou, Hao Henry; Singh, Vikas; Johnson, Sterling C; Wahba, Grace

    2018-02-13

    When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer's disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies. Copyright © 2018 the Author(s). Published by PNAS.

  19. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation

    PubMed Central

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B.

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation, both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing are now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and using digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware implementations.

  20. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation.

    PubMed

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation, both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing are now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and using digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware implementations.
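
    Of the encoding techniques listed, rate-based Poisson spike generation is the simplest to sketch: each pixel's intensity sets the rate of an independent spike train, approximated below with one Bernoulli draw per time bin. This is an illustration, not the benchmark's reference implementation.

    ```python
    # Rate-based Poisson-like spike generation from pixel intensities.
    import numpy as np

    def poisson_spike_trains(image, duration_ms=1000, max_rate_hz=100.0, dt_ms=1.0, seed=0):
        rng = np.random.default_rng(seed)
        rates = image.ravel() / image.max() * max_rate_hz  # Hz, one unit per pixel
        p_spike = rates * dt_ms / 1000.0                   # spike probability per bin
        n_bins = int(duration_ms / dt_ms)
        return rng.random((n_bins, rates.size)) < p_spike  # boolean spike raster

    digit = np.random.default_rng(1).random((28, 28))      # stand-in for an MNIST digit
    raster = poisson_spike_trains(digit)
    print(raster.shape, raster.mean())                     # (1000, 784), ~rate * dt
    ```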

  1. A collection of Australian Drosophila datasets on climate adaptation and species distributions.

    PubMed

    Hangartner, Sandra B; Hoffmann, Ary A; Smith, Ailie; Griffin, Philippa C

    2015-11-24

    The Australian Drosophila Ecology and Evolution Resource (ADEER) collates Australian datasets on drosophilid flies, which are aimed at investigating questions around climate adaptation, species distribution limits and population genetics. Australian drosophilid species are diverse in climatic tolerance, geographic distribution and behaviour. Many species are restricted to the tropics, a few are temperate specialists, and some have broad distributions across climatic regions. Whereas some species show adaptability to climate change through genetic and plastic changes, other species have limited adaptive capacity. This knowledge has been used to identify traits and genetic polymorphisms involved in climate change adaptation and to build predictive models of responses to climate change. ADEER brings together 103 datasets from 39 studies published between 1982 and 2013 in a single online resource. All datasets can be downloaded freely in full, along with maps and other visualisations. These historical datasets are preserved for future studies, which will be especially useful for assessing climate-related changes over time.

  2. Boosting association rule mining in large datasets via Gibbs sampling.

    PubMed

    Qian, Guoqi; Rao, Calyampudi Radhakrishna; Sun, Xiaoying; Wu, Yuehua

    2016-05-03

    Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. A general rule-importance measure is also proposed to direct the stochastic search so that, because the randomly generated association rules constitute an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In a simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.
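
    As a toy contrast to deterministic enumeration (and far cruder than the paper's Gibbs-sampling Markov chain), the following randomly samples candidate itemsets and keeps those meeting a support threshold:

    ```python
    # Toy stochastic itemset search; not the paper's Gibbs sampler.
    import random

    def support(itemset, transactions):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def sample_frequent_itemsets(transactions, items, k=2, n_draws=1000, min_support=0.5):
        random.seed(1)
        found = set()
        for _ in range(n_draws):
            cand = frozenset(random.sample(items, k))
            if support(cand, transactions) >= min_support:
                found.add(cand)
        return found

    tx = [frozenset(t) for t in [("a", "b"), ("a", "b", "c"), ("b", "c"), ("a", "c")]]
    print(sample_frequent_itemsets(tx, items=["a", "b", "c"]))
    ```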

  3. A conceptual prototype for the next-generation national elevation dataset

    USGS Publications Warehouse

    Stoker, Jason M.; Heidemann, Hans Karl; Evans, Gayla A.; Greenlee, Susan K.

    2013-01-01

    In 2012 the U.S. Geological Survey's (USGS) National Geospatial Program (NGP) funded a study to develop a conceptual prototype for a new National Elevation Dataset (NED) design with expanded capabilities to generate and deliver a suite of bare earth and above ground feature information over the United States. This report details the research on identifying operational requirements based on prior research, evaluation of what is needed for the USGS to meet these requirements, and development of a possible conceptual framework that could potentially deliver the kinds of information that are needed to support NGP's partners and constituents. This report provides an initial proof-of-concept demonstration using an existing dataset, and recommendations for the future, to inform NGP's ongoing and future elevation program planning and management decisions. The demonstration shows that this type of functional process can robustly create derivatives from lidar point cloud data; however, more research needs to be done to see how well it extends to multiple datasets.

  4. Kernel-based discriminant feature extraction using a representative dataset

    NASA Astrophysics Data System (ADS)

    Li, Honglin; Sancho Gomez, Jose-Luis; Ahalt, Stanley C.

    2002-07-01

    Discriminant Feature Extraction (DFE) is widely recognized as an important pre-processing step in classification applications. Most DFE algorithms are linear and thus can only explore the linear discriminant information among the different classes. Recently, there have been several promising attempts to develop nonlinear DFE algorithms, among which is Kernel-based Feature Extraction (KFE). The efficacy of KFE has been experimentally verified by both synthetic data and real problems. However, KFE has some known limitations. First, KFE does not work well for strongly overlapped data. Second, KFE employs all of the training set samples during the feature extraction phase, which can result in significant computation when applied to very large datasets. Finally, KFE can result in overfitting. In this paper, we propose a substantial improvement to KFE that overcomes the above limitations by using a representative dataset, which consists of critical points that are generated from data-editing techniques and centroid points that are determined by using the Frequency Sensitive Competitive Learning (FSCL) algorithm. Experiments show that this new KFE algorithm performs well on significantly overlapped datasets, and it also reduces computational complexity. Further, by controlling the number of centroids, the overfitting problem can be effectively alleviated.

  5. A Comparative Analysis of Five Cropland Datasets in Africa

    NASA Astrophysics Data System (ADS)

    Wei, Y.; Lu, M.; Wu, W.

    2018-04-01

    Food security, particularly in Africa, is a challenge still to be resolved. The cropland area and spatial distribution obtained from remote sensing imagery are vital information. In this paper, we compare five global cropland datasets for Africa circa 2010 (CCI Land Cover, GlobCover, MODIS Collection 5, GlobeLand30 and Unified Cropland) in terms of cropland area and spatial location. The accuracy of the cropland area calculated from the five datasets was analyzed against statistical data. Based on validation samples, the accuracy of spatial location for the five cropland products was assessed by error matrices. The results show that GlobeLand30 fits the statistics best, followed by MODIS Collection 5 and Unified Cropland; GlobCover and CCI Land Cover have lower accuracies. For the spatial location of cropland, GlobeLand30 reaches the highest accuracy, followed by Unified Cropland, MODIS Collection 5 and GlobCover, while CCI Land Cover has the lowest accuracy. The spatial location accuracy of the five datasets in the Csa climate zone, with its suitable farming conditions, is generally higher than in the Bsk zone.
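
    A minimal sketch of the error-matrix assessment, assuming binary cropland/non-cropland labels (1 = cropland, 0 = other) at the validation sample locations:

    ```python
    # Error (confusion) matrix with overall, producer's and user's accuracy.
    import numpy as np

    def error_matrix(reference, mapped):
        m = np.zeros((2, 2), dtype=int)   # rows: reference, columns: mapped
        for r, p in zip(reference, mapped):
            m[r, p] += 1
        overall = np.trace(m) / m.sum()
        producers = np.diag(m) / m.sum(axis=1)  # omission side
        users = np.diag(m) / m.sum(axis=0)      # commission side
        return m, overall, producers, users

    ref = np.array([1, 1, 0, 0, 1, 0])  # toy validation samples
    mp = np.array([1, 0, 0, 0, 1, 1])
    print(error_matrix(ref, mp))
    ```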

  6. Mean Tide Level Data in the PSMSL Mean Sea Level Dataset

    NASA Astrophysics Data System (ADS)

    Matthews, Andrew; Bradshaw, Elizabeth; Gordon, Kathy; Jevrejeva, Svetlana; Rickards, Lesley; Tamisiea, Mark; Williams, Simon; Woodworth, Philip

    2016-04-01

    The Permanent Service for Mean Sea Level (PSMSL) is the internationally recognised global sea level data bank for long term sea level change information from tide gauges. Established in 1933, the PSMSL continues to be responsible for the collection, publication, analysis and interpretation of sea level data. The PSMSL operates under the auspices of the International Council for Science (ICSU), is a regular member of the ICSU World Data System and is associated with the International Association for the Physical Sciences of the Oceans (IAPSO) and the International Association of Geodesy (IAG). The PSMSL continues to work closely with other members of the sea level community through the Intergovernmental Oceanographic Commission's Global Sea Level Observing System (GLOSS). Currently, the PSMSL data bank holds over 67,000 station-years of monthly and annual mean sea level data from over 2250 tide gauge stations. Data from each site are quality controlled and, wherever possible, reduced to a common datum, whose stability is monitored through a network of geodetic benchmarks. PSMSL also distributes a data bank of measurements taken from in-situ ocean bottom pressure recorders. Most of the records in the main PSMSL dataset indicate mean sea level (MSL), derived from high-frequency tide gauge data, with sampling typically once per hour or higher. However, some of the older data is based on mean tide level (MTL), which is obtained from measurements taken at high and low tide only. While usually very close, MSL and MTL can occasionally differ by many centimetres, particularly in shallow water locations. As a result, care must be taken when using long sea level records that contain periods of MTL data. Previously, periods during which the values indicated MTL rather than MSL were noted in the documentation, and sometimes suggested corrections were supplied. However, these comments were easy to miss, particularly in large scale studies that used multiple stations from across
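
    The MSL/MTL distinction is easy to see on a synthetic tide: MSL averages the full hourly record, MTL averages only the high and low waters, and a shallow-water overtide drives the two apart. A minimal illustration:

    ```python
    # Synthetic illustration of MSL vs MTL on an asymmetric (shallow-water) tide.
    import numpy as np
    from scipy.signal import find_peaks

    t = np.arange(30 * 24)                              # one month, hourly samples
    w = 2 * np.pi / 12.42                               # semidiurnal frequency (1/h)
    eta = 100 * np.sin(w * t) + 20 * np.cos(2 * w * t)  # tide + overtide, in cm
    msl = eta.mean()                                    # mean of the full record
    highs, _ = find_peaks(eta)                          # high waters
    lows, _ = find_peaks(-eta)                          # low waters
    mtl = (eta[highs].mean() + eta[lows].mean()) / 2
    print(f"MSL = {msl:.1f} cm, MTL = {mtl:.1f} cm")    # differ by ~20 cm here
    ```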

  7. Topographical effects of climate dataset and their impacts on the estimation of regional net primary productivity

    NASA Astrophysics Data System (ADS)

    Sun, L. Qing; Feng, Feng X.

    2014-11-01

    In this study, we first built and compared two different climate datasets for the Wuling mountainous area in 2010: one that considered topographical effects during the ANUSPLIN interpolation (referred to as the terrain-based climate dataset) and one that did not (the ordinary climate dataset). Then, we quantified the topographical effects of climatic inputs on NPP estimation by feeding the two different climate datasets to the same ecosystem model, the Boreal Ecosystem Productivity Simulator (BEPS), to evaluate the importance of considering relief when estimating NPP. Finally, we identified the variables contributing most to the topographical effects through a series of experiments, given an overall accuracy of the model output for NPP. The results showed that: (1) The terrain-based climate dataset presented more reliable topographic information and agreed more closely with the station dataset than the ordinary climate dataset did, in terms of daily mean values over the 365-day time series. (2) On average, the ordinary climate dataset underestimated NPP by 12.5% compared with the terrain-based climate dataset over the whole study area. (3) The primary climate variables contributing to the topographical effects in the Wuling mountainous area were temperatures, which suggests that temperature differences must be corrected to estimate NPP accurately in such complex terrain.

  8. Assessment of NASA's Physiographic and Meteorological Datasets as Input to HSPF and SWAT Hydrological Models

    NASA Technical Reports Server (NTRS)

    Alarcon, Vladimir J.; Nigro, Joseph D.; McAnally, William H.; O'Hara, Charles G.; Engman, Edwin Ted; Toll, David

    2011-01-01

    This paper documents the use of simulated Moderate Resolution Imaging Spectroradiometer land use/land cover (MODIS-LULC), NASA-LIS generated precipitation and evapo-transpiration (ET), and Shuttle Radar Topography Mission (SRTM) datasets (in conjunction with standard land use, topographical and meteorological datasets) as input to hydrological models routinely used by the watershed hydrology modeling community. The study is focused on coastal watersheds in the Mississippi Gulf Coast, although one of the test cases focuses on an inland watershed located in the northeastern State of Mississippi, USA. The decision support tools (DSTs) into which the NASA datasets were assimilated were the Soil and Water Assessment Tool (SWAT) and the Hydrological Simulation Program FORTRAN (HSPF). These DSTs are endorsed by several US government agencies (EPA, FEMA, USGS) for water resources management strategies. These models use physiographic and meteorological data extensively. Precipitation gages and USGS gage stations in the region were used to calibrate several HSPF and SWAT model applications. Land use and topographical datasets were swapped to assess model output sensitivities. NASA-LIS meteorological data were introduced in the calibrated model applications for simulation of watershed hydrology for a time period in which no weather data were available (1997-2006). The performance of the NASA datasets in the context of hydrological modeling was assessed through comparison of measured and model-simulated hydrographs. Overall, NASA datasets were as useful as standard land use, topographical, and meteorological datasets. Moreover, NASA datasets were used to perform analyses that the standard datasets could not make possible, e.g., the introduction of land use dynamics into hydrological simulations.

  9. Toward a complete dataset of drug-drug interaction information from publicly available sources.

    PubMed

    Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D

    2015-06-01

    Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources, including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and by specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. Future work will focus on improving the methods for mapping between PDDI information sources, and on identifying methods to improve the use of the merged dataset in PDDI NLP algorithms.
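
    A minimal sketch of the overlap analysis, assuming each source is reduced to a set of normalized, order-insensitive drug pairs; the source names and pairs below are placeholders, not real coverage data.

    ```python
    # Pairwise Jaccard overlap between PDDI sources.
    from itertools import combinations

    def norm_pair(a, b):
        return tuple(sorted((a.lower(), b.lower())))  # order-insensitive pair

    sources = {  # hypothetical sources and pairs
        "source_x": {norm_pair("warfarin", "aspirin"),
                     norm_pair("amiodarone", "simvastatin")},
        "source_y": {norm_pair("aspirin", "warfarin"),
                     norm_pair("clarithromycin", "colchicine")},
    }
    for (n1, s1), (n2, s2) in combinations(sources.items(), 2):
        jaccard = len(s1 & s2) / len(s1 | s2)
        print(f"{n1} vs {n2}: Jaccard overlap = {jaccard:.2f}")
    ```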

  10. SDCLIREF - A sub-daily gridded reference dataset

    NASA Astrophysics Data System (ADS)

    Wood, Raul R.; Willkofer, Florian; Schmid, Franz-Josef; Trentini, Fabian; Komischke, Holger; Ludwig, Ralf

    2017-04-01

    Climate change is expected to impact the intensity and frequency of hydrometeorological extreme events. In order to adequately capture and analyze extreme rainfall events, in particular when assessing flood and flash flood situations, data is required at high spatial and sub-daily resolution, which is often not available in sufficient density and over extended time periods. The ClimEx project (Climate Change and Hydrological Extreme Events) addresses the alteration of hydrological extreme events under climate change conditions. In order to differentiate between a clear climate change signal and the limits of natural variability, unique Single-Model Regional Climate Model Ensembles (CRCM5 driven by CanESM2, RCP8.5) were created for a European and North-American domain, each comprising 50 members of 150 years (1951-2100). In combination with the CORDEX-Database, this newly created ClimEx-Ensemble is a one-of-a-kind model dataset to analyze changes of sub-daily extreme events. For the purpose of bias-correcting the regional climate model ensembles as well as for the baseline calibration and validation of hydrological catchment models, a new sub-daily (3 h) high-resolution (500 m) gridded reference dataset (SDCLIREF) was created for a domain covering the Upper Danube and Main watersheds (~100,000 km²). As the sub-daily observations lack a continuous time series for the reference period 1980-2010, the need arose for a suitable method to bridge the gaps in the discontinuous time series. The Method of Fragments (Sharma and Srikanthan (2006); Westra et al. (2012)) was applied to transform daily observations to sub-daily rainfall events to extend the time series and densify the station network. Prior to applying the Method of Fragments and creating the gridded dataset using rigorous interpolation routines, data collection of observations, operated by several institutions in three countries (Germany, Austria, Switzerland), and the subsequent quality control of the observations
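
    A toy version of the Method of Fragments: a daily total is disaggregated to 3 h values by borrowing the sub-daily pattern of the donor day whose daily total is most similar. Real applications condition the donor choice on season and nearby stations; this sketch does not.

    ```python
    # Toy Method-of-Fragments disaggregation of a daily total into 3 h values.
    import numpy as np

    def disaggregate_daily(daily_total, donor_days):
        """donor_days: (n_days, 8) observed 3-hourly rainfall at nearby stations."""
        totals = donor_days.sum(axis=1)
        donor = donor_days[np.argmin(np.abs(totals - daily_total))]
        frac = donor / donor.sum() if donor.sum() > 0 else np.full(8, 1 / 8)
        return daily_total * frac  # eight 3 h values preserving the daily sum

    donors = np.random.default_rng(0).gamma(0.5, 2.0, size=(100, 8))  # toy donor pool
    print(disaggregate_daily(12.0, donors).round(2))
    ```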

  11. Combining global land cover datasets to quantify agricultural expansion into forests in Latin America: Limitations and challenges

    PubMed Central

    Persson, U. Martin

    2017-01-01

    While we know that deforestation in the tropics is increasingly driven by commercial agriculture, most tropical countries still lack recent and spatially-explicit assessments of the relative importance of pasture and cropland expansion in causing forest loss. Here we present a spatially explicit quantification of the extent to which cultivated land and grassland expanded at the expense of forests across Latin America in 2001–2011, by combining two “state-of-the-art” global datasets (Global Forest Change forest loss and GlobeLand30-2010 land cover). We further evaluate some of the limitations and challenges in doing this. We find that this approach does capture some of the major patterns of land cover following deforestation, with GlobeLand30-2010’s Grassland class (which we interpret as pasture) being the most common land cover replacing forests across Latin America. However, our analysis also reveals some major limitations to combining these land cover datasets for quantifying pasture and cropland expansion into forest. First, a simple one-to-one translation between GlobeLand30-2010’s Cultivated land and Grassland classes into cropland and pasture respectively, should not be made without caution, as GlobeLand30-2010 defines its Cultivated land to include some pastures. Comparisons with the TerraClass dataset over the Brazilian Amazon and with previous literature indicate that Cultivated land in GlobeLand30-2010 includes notable amounts of pasture and other vegetation (e.g. in Paraguay and the Brazilian Amazon). This further suggests that the approach taken here generally leads to an underestimation (of up to ~60%) of the role of pasture in replacing forest. Second, a large share (~33%) of the Global Forest Change forest loss is found to still be forest according to GlobeLand30-2010, and our analysis suggests that the accuracy of the combined datasets, especially for areas with heterogeneous land cover and/or small-scale forest loss, is still too poor for

  12. Combining global land cover datasets to quantify agricultural expansion into forests in Latin America: Limitations and challenges.

    PubMed

    Pendrill, Florence; Persson, U Martin

    2017-01-01

    While we know that deforestation in the tropics is increasingly driven by commercial agriculture, most tropical countries still lack recent and spatially-explicit assessments of the relative importance of pasture and cropland expansion in causing forest loss. Here we present a spatially explicit quantification of the extent to which cultivated land and grassland expanded at the expense of forests across Latin America in 2001-2011, by combining two "state-of-the-art" global datasets (Global Forest Change forest loss and GlobeLand30-2010 land cover). We further evaluate some of the limitations and challenges in doing this. We find that this approach does capture some of the major patterns of land cover following deforestation, with GlobeLand30-2010's Grassland class (which we interpret as pasture) being the most common land cover replacing forests across Latin America. However, our analysis also reveals some major limitations to combining these land cover datasets for quantifying pasture and cropland expansion into forest. First, a simple one-to-one translation between GlobeLand30-2010's Cultivated land and Grassland classes into cropland and pasture respectively, should not be made without caution, as GlobeLand30-2010 defines its Cultivated land to include some pastures. Comparisons with the TerraClass dataset over the Brazilian Amazon and with previous literature indicate that Cultivated land in GlobeLand30-2010 includes notable amounts of pasture and other vegetation (e.g. in Paraguay and the Brazilian Amazon). This further suggests that the approach taken here generally leads to an underestimation (of up to ~60%) of the role of pasture in replacing forest. Second, a large share (~33%) of the Global Forest Change forest loss is found to still be forest according to GlobeLand30-2010, and our analysis suggests that the accuracy of the combined datasets, especially for areas with heterogeneous land cover and/or small-scale forest loss, is still too poor for deriving

  13. Soil Erosion map of Europe based on high resolution input datasets

    NASA Astrophysics Data System (ADS)

    Panagos, Panos; Borrelli, Pasquale; Meusburger, Katrin; Ballabio, Cristiano; Alewell, Christine

    2015-04-01

    Modelling soil erosion in the European Union is of major importance for agro-environmental policies. Soil erosion estimates are important inputs for the Common Agricultural Policy (CAP) and the implementation of the Soil Thematic Strategy. The findings of a recent pan-European data collection through the EIONET network showed that most Member States apply the empirical Revised Universal Soil Loss Equation (RUSLE) for modelling soil erosion at the national level. This model was therefore chosen for the pan-European soil erosion risk assessment; it is based on six input factors. Compared to past approaches, each of the factors is modelled using the latest pan-European datasets, expertise and data from Member States, and high resolution remote sensing data. The soil erodibility (K-factor) is modelled using the recently published LUCAS topsoil database with 20,000 point measurements, incorporating the surface stone cover, which can reduce the K-factor by 15%. The rainfall erosivity dataset (R-factor) has been implemented using high temporal resolution rainfall data from more than 1,500 precipitation stations well distributed across Europe. The cover-management factor (C-factor) incorporates crop statistics and management practices such as cover crops, tillage practices and plant residues. The slope length and steepness (combined LS-factor) is based on the first ever 25 m Digital Elevation Model (DEM) of Europe. Finally, the support practices factor (P-factor) is modelled for the first time at this scale, taking into account the 270,000 LUCAS earth observations and the Good Agricultural and Environmental Condition (GAEC) standards that farmers have to follow in Europe. The high resolution input layers produce the final soil erosion risk map at 100 m resolution and allow policy makers to run future land use, management and climate change scenarios.
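
    RUSLE combines the six factors multiplicatively, A = R × K × L × S × C × P (with L and S often merged into a single LS raster), so the pan-European map is essentially a per-cell product of the factor layers. A minimal sketch with illustrative values, not the actual European datasets:

    ```python
    # Sketch: per-cell RUSLE soil loss, A = R * K * LS * C * P.
    # The factor rasters are assumed to be co-registered numpy arrays in the
    # usual RUSLE units (e.g. R in MJ mm ha^-1 h^-1 yr^-1); the numbers
    # below are illustrative only.
    import numpy as np

    R  = np.array([[700.0, 650.0]])   # rainfall erosivity
    K  = np.array([[0.030, 0.028]])   # soil erodibility
    LS = np.array([[1.2,   0.8]])     # slope length and steepness
    C  = np.array([[0.20,  0.05]])    # cover-management
    P  = np.array([[1.0,   0.9]])     # support practices

    A = R * K * LS * C * P            # soil loss, t ha^-1 yr^-1
    print(A)
    ```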

  14. The distance function effect on k-nearest neighbor classification for medical datasets.

    PubMed

    Hu, Li-Yu; Huang, Min-Wei; Ke, Shih-Wen; Tsai, Chih-Fong

    2016-01-01

    K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier, which has been used as the baseline classifier in many pattern classification problems. It is based on measuring the distances between the test data and each of the training data points to decide the final classification output. Although the Euclidean distance function is the most widely used distance metric in k-NN, few studies have examined the classification performance of k-NN under different distance functions, especially for medical domain problems. Therefore, the aim of this paper is to investigate whether the distance function can affect k-NN performance over different medical datasets. Our experiments are based on three different types of medical datasets containing categorical, numerical, and mixed types of data, and four different distance functions (Euclidean, cosine, Chi-square, and Minkowski) are used in k-NN classification individually. The experimental results show that the Chi-square distance function is the best choice for all three types of datasets, whereas the cosine, Euclidean, and Minkowski distance functions perform the worst over the mixed type of datasets. In this paper, we demonstrate that the chosen distance function can affect the classification accuracy of the k-NN classifier. For the medical domain datasets including the categorical, numerical, and mixed types of data, k-NN based on the Chi-square distance function performs the best.
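
    For readers who want to reproduce this kind of comparison, scikit-learn's k-NN classifier accepts a user-supplied distance function. The sketch below wires in a chi-square distance of the common form Σ(x−y)²/(x+y), which assumes non-negative features; the dataset used is a stand-in, not one of the paper's medical datasets.

    ```python
    # Sketch: k-NN with a chi-square distance in scikit-learn.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def chi_square(x, y, eps=1e-10):
        # eps guards against division by zero when both features are 0.
        return np.sum((x - y) ** 2 / (x + y + eps))

    X, y = load_breast_cancer(return_X_y=True)   # non-negative features
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # A callable metric requires the brute-force neighbor search.
    knn = KNeighborsClassifier(n_neighbors=5, metric=chi_square,
                               algorithm="brute")
    knn.fit(X_tr, y_tr)
    print("accuracy:", knn.score(X_te, y_te))
    ```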

  15. Heart rate and sentiment experimental data with common timeline.

    PubMed

    Salamon, Jaromír; Mouček, Roman

    2017-12-01

    Sentiment extraction and analysis using spoken utterances or written corpora, as well as the collection and analysis of human heart rate data using sensors, are commonly used techniques and methods. They have, however, not yet been combined. The collected data can be used, for example, to investigate the mutual dependence of human physical and emotional activity. The paper describes the procedure of parallel acquisition of heart rate sensor data and tweets expressing sentiment, and the difficulties related to this procedure. The obtained datasets are described in detail and further discussed to provide as much information as possible for subsequent analyses and conclusions; analyses and conclusions themselves are not included in this paper. The presented experiment and the provided datasets serve as a first basis for further studies in which all four presented data sources can be used independently, combined in a reasonable way, or used all together. For instance, when the data are used all together, it could be beneficial to compare human sensor data, acquired noninvasively from the surface of the human body and considered more objective, with human written data expressing sentiment, which is at least partly cognitively interpreted and thus considered more subjective.
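
    Placing the two streams on a common timeline amounts to timestamp alignment. A minimal pandas sketch, with hypothetical timestamps, column names, and sentiment scores:

    ```python
    # Sketch: aligning heart rate samples and tweet sentiment in time.
    # merge_asof pairs each tweet with the nearest earlier heart rate sample.
    import pandas as pd

    hr = pd.DataFrame({
        "time": pd.to_datetime(["2017-06-01 10:00:00", "2017-06-01 10:00:05",
                                "2017-06-01 10:00:10"]),
        "bpm": [72, 75, 74],
    })
    tweets = pd.DataFrame({
        "time": pd.to_datetime(["2017-06-01 10:00:07"]),
        "sentiment": [0.6],   # e.g. polarity in [-1, 1]
    })

    aligned = pd.merge_asof(tweets.sort_values("time"), hr.sort_values("time"),
                            on="time", direction="backward")
    print(aligned)
    ```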

  16. Quality Controlling CMIP datasets at GFDL

    NASA Astrophysics Data System (ADS)

    Horowitz, L. W.; Radhakrishnan, A.; Balaji, V.; Adcroft, A.; Krasting, J. P.; Nikonov, S.; Mason, E. E.; Schweitzer, R.; Nadeau, D.

    2017-12-01

    As GFDL makes the switch from model development to production in light of the Coupled Model Intercomparison Project (CMIP), GFDL's efforts are shifting to testing and, more importantly, establishing guidelines and protocols for quality control and semi-automated data publishing. Every CMIP cycle introduces key challenges, and the upcoming CMIP6 is no exception. The new CMIP experimental design comprises multiple MIPs facilitating research in different focus areas. This paradigm has implications not only for the groups that develop the models and conduct the runs, but also for the groups that monitor, analyze, and quality control the datasets before data publishing, and before the resulting knowledge makes its way into reports like the IPCC (Intergovernmental Panel on Climate Change) Assessment Reports. In this talk, we discuss some of the paths taken at GFDL to quality control the CMIP-ready datasets, including: Jupyter notebooks; PrePARE; and a LAMP (Linux, Apache, MySQL, PHP/Python/Perl) technology-driven tracker system to monitor the status of experiments qualitatively and quantitatively and to provide additional metadata and analysis services, along with some built-in controlled-vocabulary validations in the workflow. In addition, we discuss the integration of community-based model evaluation software (ESMValTool, PCMDI Metrics Package, and ILAMB) as part of our CMIP6 workflow.
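
    To give a flavor of the simplest kind of check in such a workflow, the sketch below verifies that a CMIP-style NetCDF file carries a set of required global attributes. The attribute list is an illustrative subset, not the full CMIP6 controlled vocabulary, which tools like PrePARE validate properly:

    ```python
    # Sketch: a minimal metadata spot-check of a CMIP-style NetCDF file.
    from netCDF4 import Dataset

    # Illustrative subset of required CMIP6 global attributes.
    REQUIRED_GLOBALS = ["experiment_id", "source_id", "variant_label",
                        "table_id", "grid_label"]

    def check_file(path):
        with Dataset(path) as nc:
            present = set(nc.ncattrs())
        return [a for a in REQUIRED_GLOBALS if a not in present]

    missing = check_file("tas_Amon_ExampleModel_historical.nc")  # placeholder
    if missing:
        print("missing attributes:", missing)
    else:
        print("all required attributes present")
    ```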

  17. MicroRNA array normalization: an evaluation using a randomized dataset as the benchmark.

    PubMed

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumptions key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data, but the false discovery rate remained as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We conclude the paper with some insights on possible causes of false discoveries, to shed light on how to improve normalization for microRNA arrays.
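
    As a concrete example of the kind of normalization step being evaluated, the sketch below implements quantile normalization, one standard array normalization method (the paper compares several; this is not necessarily its exact choice), on a hypothetical probes-by-arrays matrix:

    ```python
    # Sketch: quantile normalization of an expression matrix X
    # (rows = probes, columns = arrays/samples). Each column is forced to
    # share the same distribution: the mean of the sorted columns.
    import numpy as np

    def quantile_normalize(X):
        order = np.argsort(X, axis=0)            # sort order per column
        ranks = np.argsort(order, axis=0)        # rank of each value
        mean_sorted = np.sort(X, axis=0).mean(axis=1)
        return mean_sorted[ranks]                # replace rank -> mean value

    rng = np.random.default_rng(0)
    X = rng.lognormal(size=(1000, 6))            # 1000 probes x 6 arrays
    Xn = quantile_normalize(X)
    print(Xn.mean(axis=0))                       # columns now agree
    ```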

  18. Sinfonevada: Dataset of Floristic diversity in Sierra Nevada forests (SE Spain)

    PubMed Central

    Pérez-Luque, Antonio Jesús; Bonet, Francisco Javier; Pérez-Pérez, Ramón; Aspizua, Rut; Lorite, Juan; Zamora, Regino

    2014-01-01

    The Sinfonevada database is a forest inventory that contains information on the forest ecosystems of the Sierra Nevada mountains (SE Spain). The Sinfonevada dataset contains more than 7,500 occurrence records belonging to 270 taxa (24 of them threatened), drawn from the floristic inventories of the Sinfonevada forest inventory. Expert field workers collected the information, and the whole dataset underwent quality control by botanists with broad expertise in Sierra Nevada flora. This floristic inventory was created to gather information useful for the proper management of Pinus plantations in Sierra Nevada, and it is the only dataset that gives a comprehensive view of the forest flora of Sierra Nevada. For this reason it is being used to assess the biodiversity of the very dense pine plantations on this massif. With this dataset, managers have improved their ability to decide where to apply forest treatments in order to avoid biodiversity loss. The dataset forms part of the Sierra Nevada Global Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area. PMID:24843285

  19. OpenCL based machine learning labeling of biomedical datasets

    NASA Astrophysics Data System (ADS)

    Amoros, Oscar; Escalera, Sergio; Puig, Anna

    2011-03-01

    In this paper, we propose a two-stage labeling method for large biomedical datasets through a parallel approach on a single GPU. Diagnostic methods, structure volume measurements, and visualization systems are of major importance for surgery planning, intra-operative imaging, and image-guided surgery. In all cases, providing an automatic and interactive method to label or tag the different structures contained in the input data becomes imperative. Several approaches to label or segment biomedical datasets have been proposed to discriminate different anatomical structures in an output tagged dataset. Among existing methods, supervised learning methods for segmentation have been devised so that biomedical datasets can easily be analyzed by a non-expert user. However, they still have some problems concerning practical application, such as slow learning and testing speeds. In addition, recent technological developments have led to the widespread availability of multi-core CPUs and GPUs, as well as new software languages, such as NVIDIA's CUDA and OpenCL, allowing parallel programming paradigms to be applied on conventional personal computers. The Adaboost classifier is one of the most widely applied methods for labeling in the machine learning community. In a first stage, Adaboost trains a binary classifier from a set of pre-labeled samples described by a set of features; this binary classifier is defined as a weighted combination of weak classifiers, each of which is a simple decision function estimated on a single feature value. Then, at the testing stage, each weak classifier is independently applied to the features of a set of unlabeled samples. In this work, we propose an alternative representation of the Adaboost binary classifier and use it to define a new GPU-based, parallelized Adaboost testing stage using OpenCL. We provide numerical experiments based on large available datasets and compare our results to CPU-based strategies in terms of time and …
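
    The key point is that the testing stage is embarrassingly parallel: every (weak classifier, sample) pair can be evaluated independently. The sketch below expresses that stage with numpy broadcasting rather than OpenCL, using hypothetical decision-stump parameters; it illustrates the structure of the computation, not the authors' GPU implementation.

    ```python
    # Sketch: data-parallel AdaBoost testing stage with decision stumps.
    import numpy as np

    # Each weak classifier is a stump: (feature index, threshold, polarity)
    # with a weight alpha learned during training. Values are hypothetical.
    feat   = np.array([0, 2, 1])          # feature used by each stump
    thresh = np.array([0.5, 1.2, -0.3])   # decision threshold
    polar  = np.array([1, -1, 1])         # direction of the inequality
    alpha  = np.array([0.8, 0.5, 0.3])    # stump weights

    X = np.random.default_rng(0).normal(size=(10, 4))  # 10 unlabeled samples

    # Evaluate all stumps on all samples at once: h[i, j] in {-1, +1}.
    h = np.where(polar * (X[:, feat] - thresh) > 0, 1, -1)

    # Weighted vote; the sign gives the binary label.
    labels = np.sign(h @ alpha)
    print(labels)
    ```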

  20. Development of a video tampering dataset for forensic investigation.

    PubMed

    Ismael Al-Sanjary, Omar; Ahmed, Ahmed Abdullah; Sulong, Ghazali

    2016-09-01

    Forgery is an act of modifying a document, product, image or video, among other media. Video tampering detection research requires an inclusive database of modified videos. This paper discusses a comprehensive proposal to create a dataset composed of modified videos for forensic investigation, in order to standardize existing techniques for detecting video tampering. The primary purpose of developing and designing this new video library is its use in video forensics, where it can support reliable verification using dynamic and static camera recognition. To the best of the authors' knowledge, no similar library exists among the research community. Videos were sourced from YouTube and from an extensive exploration of social networking sites, by observing posted videos and their feedback ratings. The video tampering dataset (VTD) comprises a total of 33 videos, divided among three categories of video tampering: (1) copy-move, (2) splicing, and (3) frame-swapping. Compared to existing datasets, this is a higher number of tampered videos, with longer durations. The duration of every video is 16 s, with a 1280×720 resolution and a frame rate of 30 frames per second. Moreover, all videos possess the same format and quality (720p(HD).avi). Both temporal and spatial video features were considered carefully during selection of the videos, and complete information on the doctored regions is provided for every modified video in the VTD dataset. The database has been made publicly available for research on splicing, frame-swapping, and copy-move tampering, providing ground truth for various video tampering detection problems, and has been utilised by many international researchers and research groups. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
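
    A downloaded clip's conformance to the stated specs (1280×720, 30 fps, 16 s) can be spot-checked with OpenCV; the file name below is a placeholder, not an actual VTD file:

    ```python
    # Sketch: verifying resolution, frame rate, and duration of a clip.
    import cv2

    cap = cv2.VideoCapture("vtd_copy_move_01.avi")  # hypothetical clip name
    fps    = cap.get(cv2.CAP_PROP_FPS)
    width  = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()

    print(f"{width}x{height} @ {fps:.0f} fps, {frames / fps:.1f} s")
    ```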